GAN-Inspired Agent Architecture - Generator-Evaluator Loops
The self-evaluation problem
When agents evaluate their own work, they reliably skew positive. Even when output quality is obviously mediocre to a human observer, the agent will confidently praise its results. This is especially pronounced for subjective tasks like visual design, where there is no binary pass/fail equivalent to a software test.
Even on tasks with verifiable outcomes, agents exhibit poor judgment about their own work. The underlying dynamic: an LLM generating output and then assessing that same output within the same context is structurally biased toward approval.
The architectural fix
Separate the roles into distinct agents:
| Agent | Role | Key behavior |
|---|---|---|
| Generator | Produces output (code, design, content) | Iterates based on evaluator feedback |
| Evaluator | Grades output against criteria, writes critiques | Tuned for skepticism and thoroughness |
The separation does not eliminate leniency on its own; the evaluator is still an LLM inclined to be generous. But tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work. Once external feedback exists, the generator has something concrete to iterate against.
Source: Harness design for long-running application development (Prithvi Rajasekaran, Anthropic, March 2026)
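A minimal sketch of the generator-evaluator loop described above (not taken from the cited talk): it assumes a generic `complete(prompt)` LLM call, and the function names, prompts, and scoring format are all illustrative.

```python
# Minimal generator-evaluator loop. `complete` is a placeholder for any LLM
# client call; prompts, names, and the scoring format are illustrative.
from dataclasses import dataclass

@dataclass
class Review:
    score: float   # grade against the criteria, 0-10
    critique: str  # specific, actionable feedback for the generator

def complete(prompt: str) -> str:
    """Stand-in for an LLM API call."""
    raise NotImplementedError

def generate(task: str, feedback: str | None) -> str:
    prompt = f"Task: {task}\n"
    if feedback:
        prompt += f"Revise your previous attempt. Evaluator critique:\n{feedback}\n"
    return complete(prompt)

def evaluate(task: str, output: str, criteria: str) -> Review:
    # The evaluator runs as a separate agent, prompted for skepticism:
    # grade against explicit criteria and list concrete defects.
    prompt = (
        "You are a strict reviewer. Grade the output against the criteria.\n"
        f"Criteria:\n{criteria}\nTask: {task}\nOutput:\n{output}\n"
        "Reply with 'SCORE: <0-10>' on the first line, then a critique."
    )
    text = complete(prompt)
    first_line, _, critique = text.partition("\n")
    return Review(score=float(first_line.split(":", 1)[1]), critique=critique.strip())

def run_loop(task: str, criteria: str, max_iters: int = 5) -> str:
    output, feedback = "", None
    for _ in range(max_iters):
        output = generate(task, feedback)
        review = evaluate(task, output, criteria)
        if review.score >= 9:        # good enough against the criteria: stop
            break
        feedback = review.critique   # otherwise feed the critique back
    return output
```

The key property is structural: `evaluate` never sees the generator's reasoning, only its output, so the critique comes from outside the generating context.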
Iteration dynamics
Across runs, evaluator assessments improve over iterations before plateauing, with headroom remaining. The pattern is not always linear:
- Later implementations tend to be better as a whole
- Middle iterations sometimes beat the final one on specific dimensions
- Implementation complexity increases across rounds as the generator reaches for more ambitious solutions
- Even the first iteration outperforms a no-prompting baseline, suggesting the criteria themselves steer the model before any feedback loop begins
In one example, a Dutch art museum website went through nine iterations as a polished dark-themed landing page. On the tenth cycle, the generator scrapped the approach entirely and reimagined the site as a 3D spatial experience with CSS perspective rendering and doorway-based navigation, a creative leap not seen from single-pass generation.
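Because later rounds are not strictly better on every dimension, one option is to keep every candidate, stop once the overall score plateaus, and select the best iteration per dimension at the end. A rough sketch under the assumption that the evaluator returns per-dimension scores; the names and thresholds are illustrative, not from the source harness:

```python
# Plateau detection plus per-dimension selection over iteration history.
# Assumes each entry maps dimension name -> evaluator score; all names and
# thresholds are illustrative assumptions.
def pick_candidates(scores_by_iter: list[dict[str, float]],
                    patience: int = 2, min_gain: float = 0.25):
    best_overall, stale, last = float("-inf"), 0, 0
    for i, scores in enumerate(scores_by_iter):
        last = i
        overall = sum(scores.values()) / len(scores)
        if overall > best_overall + min_gain:
            best_overall, stale = overall, 0
        else:
            stale += 1
        if stale >= patience:   # scores have plateaued: stop paying for rounds
            break
    # The final round may not win every dimension, so pick the best per dimension.
    best_per_dim = {
        dim: max(range(last + 1), key=lambda j: scores_by_iter[j][dim])
        for dim in scores_by_iter[0]
    }
    return best_overall, best_per_dim
```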
Scaling to three agents
For full-stack development, the pattern extends to a planner-generator-evaluator triad, sketched below:
- Planner - expands a short prompt into a full product spec (ambitious scope and high-level technical design, without granular implementation details that could cascade errors)
- Generator - implements features sprint by sprint, self-evaluates before QA handoff
- Evaluator - uses browser automation to interact with the running application, grades against sprint contracts, files specific bugs
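A sketch of that orchestration, again with placeholder agent calls; the sprint and bug structures, function names, and QA-round limit are illustrative assumptions rather than the harness's actual interfaces.

```python
# Planner -> generator -> evaluator triad for sprint-based full-stack builds.
# All structures and calls are stubs; nothing here is from the source harness.
from dataclasses import dataclass

@dataclass
class Sprint:
    name: str
    contract: str   # what "done" means for this sprint

@dataclass
class Spec:
    summary: str
    sprints: list[Sprint]

def planner(short_prompt: str) -> Spec:
    """Expand a short prompt into an ambitious, high-level spec; leave
    granular implementation details to the generator to avoid cascading errors."""
    raise NotImplementedError   # LLM call goes here

def generator_build(sprint: Sprint, bugs: list[str]) -> None:
    """Implement the sprint's features (or fix filed bugs), then self-evaluate
    before handing off to QA."""
    raise NotImplementedError   # LLM + coding-tool calls go here

def evaluator_qa(sprint: Sprint) -> list[str]:
    """Drive the running app via browser automation, grade against the sprint
    contract, and return specific bugs (empty list means the contract is met)."""
    raise NotImplementedError   # LLM + browser-automation calls go here

def run_triad(short_prompt: str, max_qa_rounds: int = 3) -> Spec:
    spec = planner(short_prompt)
    for sprint in spec.sprints:
        bugs: list[str] = []
        for _ in range(max_qa_rounds):
            generator_build(sprint, bugs)
            bugs = evaluator_qa(sprint)
            if not bugs:        # sprint contract satisfied, move to the next
                break
    return spec
```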
The three-agent version produced applications that were dramatically more functional than single-agent baselines: the solo agent's core feature was broken, while the harness version's core features worked.
Cost and tradeoffs
The harness is expensive. A retro game maker comparison:
| Approach | Duration | Cost |
|---|---|---|
| Solo agent | 20 min | $9 |
| Full harness | 6 hr | $200 |
Over 20x more expensive, but the quality gap justified it: the solo run produced a broken core feature while the harness run produced a working, polished application with AI integration.
Related Notes
- Agent Harnesses - parent concept; the harness is the orchestration layer where this pattern lives
- Criteria-Based Grading for Subjective Agent Output - the evaluation criteria that make the evaluator effective
- Sprint Contracts - Negotiated Agent Agreements - how generator and evaluator align on "done" before work begins
- Harness Simplification as Models Improve - how to strip scaffolding as model capabilities increase
- Context Engineering - context resets were essential for earlier harness versions; compaction replaced them as models improved