🌰 seedling
Criteria-Based Grading for Subjective Agent Output

The problem

“Is this design beautiful?” is hard to answer consistently. LLMs default to safe, predictable outputs that are technically functional but unremarkable. Without intervention, Claude gravitates toward template layouts, library defaults, and generic patterns.

The solution: decomposed criteria

Replace the vague question with concrete dimensions. For frontend design, Anthropic’s harness used four criteria:

| Criterion | What it measures | Default model performance |
|---|---|---|
| Design quality | Coherent whole vs collection of parts — colors, typography, layout combine to create a distinct mood | Weak — bland outputs |
| Originality | Evidence of custom decisions vs template/library/AI defaults | Weak — “AI slop” patterns |
| Craft | Technical execution — typography hierarchy, spacing, contrast | Strong by default |
| Functionality | Usability — can users find actions and complete tasks? | Strong by default |

The key insight: weight dimensions where the model is weakest. Design quality and originality were weighted more heavily than craft and functionality because the model already scored well on the technical dimensions. The criteria explicitly penalized generic “AI slop” patterns (purple gradients over white cards, unmodified stock components), pushing the model toward aesthetic risk-taking.
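To make the weighting concrete, here is a minimal sketch of how a weighted rubric might combine per-criterion grades. The weights, score scale, and field names are illustrative, not values from the Anthropic harness.

```python
# Hypothetical weighted rubric: the dimensions where the model is weakest
# (design quality, originality) carry the largest weights, so a design
# cannot coast on craft and functionality alone.
from dataclasses import dataclass

@dataclass
class CriterionScore:
    name: str
    score: float   # evaluator's 0-10 grade for this dimension
    weight: float  # relative importance in the overall grade

def overall_grade(scores: list[CriterionScore]) -> float:
    """Weighted average of per-criterion grades."""
    total_weight = sum(c.weight for c in scores)
    return sum(c.score * c.weight for c in scores) / total_weight

grades = [
    CriterionScore("design_quality", score=6.0, weight=0.35),
    CriterionScore("originality",    score=5.5, weight=0.35),
    CriterionScore("craft",          score=8.5, weight=0.15),
    CriterionScore("functionality",  score=9.0, weight=0.15),
]
print(f"overall: {overall_grade(grades):.2f}")  # dominated by the weak dimensions
```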

Source: Harness design for long-running application development (Prithvi Rajasekaran, Anthropic, March 2026)

Calibration through few-shot examples

The evaluator was calibrated using few-shot examples with detailed score breakdowns. This served two purposes:

  1. Aligned the evaluator’s judgment with the designer’s preferences
  2. Reduced score drift across iterations

Without calibration, the evaluator’s grades wander. With it, the scoring becomes consistent enough to drive meaningful iteration.
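As a minimal sketch of what that calibration could look like, assuming the evaluator is an LLM prompted with scored examples: the example designs, score breakdowns, and prompt structure below are invented for illustration, not the harness’s actual prompt.

```python
# Illustrative calibration examples with detailed score breakdowns. Folding
# these into the evaluator prompt anchors its judgment to the designer's
# preferences and reduces score drift across iterations.
CALIBRATION_EXAMPLES = [
    {
        "summary": "Dashboard with purple gradient over white cards, unmodified stock components",
        "scores": {"design_quality": 3, "originality": 2, "craft": 7, "functionality": 8},
        "rationale": "Technically clean but generic 'AI slop'; no distinct mood.",
    },
    {
        "summary": "Editorial landing page with custom type pairing and asymmetric grid",
        "scores": {"design_quality": 8, "originality": 8, "craft": 7, "functionality": 7},
        "rationale": "Deliberate, cohesive visual direction; minor spacing issues.",
    },
]

def build_evaluator_prompt(criteria: str) -> str:
    """Prepend scored few-shot examples to the grading criteria."""
    shots = "\n\n".join(
        f"Example: {ex['summary']}\nScores: {ex['scores']}\nWhy: {ex['rationale']}"
        for ex in CALIBRATION_EXAMPLES
    )
    return f"{criteria}\n\nCalibration examples:\n\n{shots}\n\nNow grade the new design the same way."
```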

Criteria as steering

The wording of criteria directly shaped output character in ways that were not fully anticipated. Including phrases like “the best designs are museum quality” pushed designs toward a particular visual convergence. This suggests criteria are not just measurement instruments — they are prompts that steer the generator’s creative direction.

Even on the first iteration, outputs with criteria were noticeably better than a no-prompting baseline. The criteria steered the model away from generic defaults before any evaluator feedback occurred.

Adapting criteria across domains

For full-stack development, the same pattern was adapted to a different set of dimensions:

  • Product depth — feature richness and completeness
  • Functionality — does it actually work when tested?
  • Visual design — coherent design language
  • Code quality — maintainability and correctness

Each criterion had a hard threshold. If any one fell below it, the sprint failed and the generator received detailed feedback on the specific failure. The evaluator used browser automation (Playwright) to interact with the live application, testing UI features, API endpoints, and database states.
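A sketch of how the hard-threshold gate and a browser-driven functionality probe could fit together. The criterion names, thresholds, selectors, and URL are placeholders; the Playwright calls are the standard Python sync API, not the harness’s actual test code.

```python
# Hypothetical gate: any criterion below its threshold fails the sprint and
# returns feedback naming only the failing dimensions, so the generator gets
# specific direction rather than a single overall grade.
THRESHOLDS = {
    "product_depth": 6.0,
    "functionality": 7.0,
    "visual_design": 6.0,
    "code_quality": 6.0,
}

def gate_sprint(scores: dict[str, float], notes: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (passed, feedback) for one sprint's evaluation."""
    failures = []
    for name, minimum in THRESHOLDS.items():
        value = scores.get(name, 0.0)
        if value < minimum:
            failures.append(f"{name}: scored {value:.1f}, threshold {minimum:.1f}. {notes.get(name, '')}")
    return (not failures, failures)

# Sketch of a browser-driven functionality probe using Playwright's Python
# sync API; the URL and selectors are made up for illustration.
from playwright.sync_api import sync_playwright

def smoke_check(url: str) -> bool:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.click("text=New item")              # exercise a core UI action
        page.fill("input[name='title']", "hello")
        page.click("button[type='submit']")
        ok = page.locator("text=hello").count() > 0  # did the item appear?
        browser.close()
        return ok
```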

Tuning the evaluator

Getting an evaluator to perform well required significant iteration. Out of the box, Claude as a QA agent:

  • Identified legitimate issues, then talked itself into dismissing them as unimportant
  • Tested superficially rather than probing edge cases
  • Approved work that a human would flag

The tuning loop: read evaluator logs, find examples where its judgment diverged from human judgment, update the evaluator’s prompt to address those divergences. Several rounds of this were needed before the evaluator graded reasonably.
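One way such a loop could be supported in code: a small helper that reads evaluator records from a log and surfaces the cases where the evaluator’s verdict disagrees with a human spot-check. The JSONL log format and field names here are assumptions.

```python
# Hypothetical helper for the tuning loop: find evaluations where the
# evaluator's verdict diverged from a human label, so those cases can be
# folded back into the evaluator's prompt.
import json

def find_divergences(log_path: str, human_labels: dict[str, str]) -> list[dict]:
    """Return evaluation records whose verdict disagrees with the human label."""
    divergences = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)          # one evaluation per JSONL line
            human = human_labels.get(record["run_id"])
            if human is not None and record["verdict"] != human:
                divergences.append(record)     # e.g. evaluator "pass", human "fail"
    return divergences
```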


Connected Notes