GAN-Inspired Agent Architecture - Generator-Evaluator Loops
The self-evaluation problem
When agents evaluate their own work, they reliably skew positive. Even when output quality is obviously mediocre to a human observer, the agent will confidently praise its results. This is especially pronounced for subjective tasks like visual design, where there is no binary pass/fail equivalent to a software test.
Even on tasks with verifiable outcomes, agents exhibit poor judgment about their own work. The underlying dynamic: an LLM generating output and then assessing that same output within the same context is structurally biased toward approval.
The architectural fix
Separate the roles into distinct agents:
| Agent | Role | Key behavior |
|---|---|---|
| Generator | Produces output (code, design, content) | Iterates based on evaluator feedback |
| Evaluator | Grades output against criteria, writes critiques | Tuned for skepticism and thoroughness |
The separation does not eliminate leniency on its own; the evaluator is still an LLM inclined to be generous. But tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work. Once external feedback exists, the generator has something concrete to iterate against.
Source: Harness design for long-running application development (Prithvi Rajasekaran, Anthropic, March 2026)
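A minimal sketch of the generator-evaluator loop described above (not taken from the cited talk): it assumes a generic `complete(prompt)` LLM call, and the function names, prompts, and scoring format are all illustrative.

```python
# Minimal generator-evaluator loop. `complete` is a placeholder for any LLM
# client call; prompts, names, and the scoring format are illustrative.
from dataclasses import dataclass

@dataclass
class Review:
    score: float   # grade against the criteria, 0-10
    critique: str  # specific, actionable feedback for the generator

def complete(prompt: str) -> str:
    """Stand-in for an LLM API call."""
    raise NotImplementedError

def generate(task: str, feedback: str | None) -> str:
    prompt = f"Task: {task}\n"
    if feedback:
        prompt += f"Revise your previous attempt. Evaluator critique:\n{feedback}\n"
    return complete(prompt)

def evaluate(task: str, output: str, criteria: str) -> Review:
    # The evaluator runs as a separate agent, prompted for skepticism:
    # grade against explicit criteria and list concrete defects.
    prompt = (
        "You are a strict reviewer. Grade the output against the criteria.\n"
        f"Criteria:\n{criteria}\nTask: {task}\nOutput:\n{output}\n"
        "Reply with 'SCORE: <0-10>' on the first line, then a critique."
    )
    text = complete(prompt)
    first_line, _, critique = text.partition("\n")
    return Review(score=float(first_line.split(":", 1)[1]), critique=critique.strip())

def run_loop(task: str, criteria: str, max_iters: int = 5) -> str:
    output, feedback = "", None
    for _ in range(max_iters):
        output = generate(task, feedback)
        review = evaluate(task, output, criteria)
        if review.score >= 9:        # good enough against the criteria: stop
            break
        feedback = review.critique   # otherwise feed the critique back
    return output
```

The key property is structural: `evaluate` never sees the generator's reasoning, only its output, so the critique comes from outside the generating context.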
Iteration dynamics
Across runs, evaluator assessments improve over iterations before plateauing, with headroom remaining. The pattern is not always linear:
- Later implementations tend to be better as a whole
- Middle iterations sometimes beat the final one on specific dimensions
- Implementation complexity increases across rounds as the generator reaches for more ambitious solutions
- Even the first iteration outperforms a no-prompting baseline, suggesting the criteria themselves steer the model before any feedback loop begins
In one example, a Dutch art museum website went through nine iterations as a polished dark-themed landing page. On the tenth cycle, the generator scrapped the approach entirely and reimagined the site as a 3D spatial experience with CSS perspective rendering and doorway-based navigation, a creative leap not seen from single-pass generation.
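Because later rounds are not strictly better on every dimension, one option is to keep every candidate, stop once the overall score plateaus, and select the best iteration per dimension at the end. A rough sketch under the assumption that the evaluator returns per-dimension scores; the names and thresholds are illustrative, not from the source harness:

```python
# Plateau detection plus per-dimension selection over iteration history.
# Assumes each entry maps dimension name -> evaluator score; all names and
# thresholds are illustrative assumptions.
def pick_candidates(scores_by_iter: list[dict[str, float]],
                    patience: int = 2, min_gain: float = 0.25):
    best_overall, stale, last = float("-inf"), 0, 0
    for i, scores in enumerate(scores_by_iter):
        last = i
        overall = sum(scores.values()) / len(scores)
        if overall > best_overall + min_gain:
            best_overall, stale = overall, 0
        else:
            stale += 1
        if stale >= patience:   # scores have plateaued: stop paying for rounds
            break
    # The final round may not win every dimension, so pick the best per dimension.
    best_per_dim = {
        dim: max(range(last + 1), key=lambda j: scores_by_iter[j][dim])
        for dim in scores_by_iter[0]
    }
    return best_overall, best_per_dim
```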
Scaling to three agents
For full-stack development, the pattern extends to a planner-generator-evaluator triad, sketched below:
- Planner - expands a short prompt into a full product spec (ambitious scope and high-level technical design, without granular implementation details that could cascade errors)
- Generator - implements features sprint by sprint, self-evaluates before QA handoff
- Evaluator - uses browser automation to interact with the running application, grades against sprint contracts, files specific bugs
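A sketch of that orchestration, again with placeholder agent calls; the sprint and bug structures, function names, and QA-round limit are illustrative assumptions rather than the harness's actual interfaces.

```python
# Planner -> generator -> evaluator triad for sprint-based full-stack builds.
# All structures and calls are stubs; nothing here is from the source harness.
from dataclasses import dataclass

@dataclass
class Sprint:
    name: str
    contract: str   # what "done" means for this sprint

@dataclass
class Spec:
    summary: str
    sprints: list[Sprint]

def planner(short_prompt: str) -> Spec:
    """Expand a short prompt into an ambitious, high-level spec; leave
    granular implementation details to the generator to avoid cascading errors."""
    raise NotImplementedError   # LLM call goes here

def generator_build(sprint: Sprint, bugs: list[str]) -> None:
    """Implement the sprint's features (or fix filed bugs), then self-evaluate
    before handing off to QA."""
    raise NotImplementedError   # LLM + coding-tool calls go here

def evaluator_qa(sprint: Sprint) -> list[str]:
    """Drive the running app via browser automation, grade against the sprint
    contract, and return specific bugs (empty list means the contract is met)."""
    raise NotImplementedError   # LLM + browser-automation calls go here

def run_triad(short_prompt: str, max_qa_rounds: int = 3) -> Spec:
    spec = planner(short_prompt)
    for sprint in spec.sprints:
        bugs: list[str] = []
        for _ in range(max_qa_rounds):
            generator_build(sprint, bugs)
            bugs = evaluator_qa(sprint)
            if not bugs:        # sprint contract satisfied, move to the next
                break
    return spec
```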
The three-agent version produced applications that were dramatically more functional than single-agent baselines: the solo agent's core feature was broken, while the harness version's core features worked.
Cost and tradeoffs
The harness is expensive. A retro game maker comparison:
| Approach | Duration | Cost |
|---|---|---|
| Solo agent | 20 min | $9 |
| Full harness | 6 hr | $200 |
Over 20x more expensive, but the quality gap justified it: the solo run produced a broken core feature while the harness run produced a working, polished application with AI integration.
Related Notes
- Agent Harnesses - parent concept; the harness is the orchestration layer where this pattern lives
- Criteria-Based Grading for Subjective Agent Output - the evaluation criteria that make the evaluator effective
- Sprint Contracts - Negotiated Agent Agreements - how generator and evaluator align on "done" before work begins
- Harness Simplification as Models Improve - how to strip scaffolding as model capabilities increase
- Context Engineering - context resets were essential for earlier harness versions; compaction replaced them as models improved