Criteria-Based Grading for Subjective Agent Output
The problem
“Is this design beautiful?” is hard to answer consistently. LLMs default to safe, predictable outputs that are technically functional but unremarkable. Without intervention, Claude gravitates toward template layouts, library defaults, and generic patterns.
The solution: decomposed criteria
Replace the vague question with concrete dimensions. For frontend design, Anthropic’s harness used four criteria:
| Criterion | What it measures | Default model performance |
|---|---|---|
| Design quality | Coherent whole vs collection of parts — colors, typography, layout combine to create distinct mood | Weak — bland outputs |
| Originality | Evidence of custom decisions vs template/library/AI defaults | Weak — “AI slop” patterns |
| Craft | Technical execution — typography hierarchy, spacing, contrast | Strong by default |
| Functionality | Usability — can users find actions and complete tasks? | Strong by default |
The key insight: weight dimensions where the model is weakest. Design quality and originality were weighted more heavily than craft and functionality because the model already scored well on the technical dimensions. The criteria explicitly penalized generic “AI slop” patterns (purple gradients over white cards, unmodified stock components), pushing the model toward aesthetic risk-taking.
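As a rough illustration, a decomposed rubric like this can be expressed as weighted criteria that roll up into a single grade. The criterion names below follow the table above, but the weights, the 0–10 scale, and the helper names are assumptions, not the harness's actual values.

```python
# Illustrative sketch of a weighted rubric. Criterion names follow the table
# above; the weights, 0-10 scale, and helper names are assumptions.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float  # weighted more heavily where the model is weakest by default
    description: str

RUBRIC = [
    Criterion("design_quality", 0.35, "colors, typography, layout form a coherent whole with a distinct mood"),
    Criterion("originality",    0.35, "custom decisions; penalize generic 'AI slop' patterns"),
    Criterion("craft",          0.15, "typography hierarchy, spacing, contrast"),
    Criterion("functionality",  0.15, "users can find actions and complete tasks"),
]

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10) into one overall grade."""
    return sum(c.weight * scores[c.name] for c in RUBRIC)
```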
Source: Harness design for long-running application development (Prithvi Rajasekaran, Anthropic, March 2026)
Calibration through few-shot examples
The evaluator was calibrated using few-shot examples with detailed score breakdowns. This served two purposes:
- Aligned the evaluator’s judgment with the designer’s preferences
- Reduced score drift across iterations
Without calibration, the evaluator’s grades wander. With it, the scoring becomes consistent enough to drive meaningful iteration.
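A minimal sketch of what that calibration could look like: few-shot examples with per-criterion score breakdowns embedded in the evaluator prompt. The example artifacts, scores, and function names here are invented for illustration, not taken from the source.

```python
# Hypothetical calibration setup: few-shot examples with detailed score
# breakdowns are included in the evaluator prompt to anchor its judgment.
CALIBRATION_EXAMPLES = """\
Example A (portfolio landing page):
  design_quality 8/10 - muted palette and editorial type create a distinct mood
  originality    7/10 - custom grid and motifs, no stock component defaults
  craft          9/10 - consistent spacing scale, strong contrast
  functionality  9/10 - primary action obvious, navigation discoverable

Example B (admin dashboard):
  design_quality 4/10 - purple gradient over white cards, no unifying mood
  originality    3/10 - unmodified library defaults throughout
  craft          8/10 - clean hierarchy and spacing
  functionality  9/10 - core tasks completable
"""

def build_evaluator_prompt(rubric_text: str, artifact_description: str) -> str:
    """Assemble the grading prompt with calibration examples included."""
    return (
        "Grade the design below against each rubric criterion on a 0-10 scale, "
        "with a one-sentence justification per score.\n\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"Calibration examples:\n{CALIBRATION_EXAMPLES}\n"
        f"Design to grade:\n{artifact_description}\n"
    )
```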
Criteria as steering
The wording of criteria directly shaped output character in ways that were not fully anticipated. Including phrases like “the best designs are museum quality” pushed outputs to converge on a particular visual style. This suggests criteria are not just measurement instruments: they are prompts that steer the generator’s creative direction.
Even on the first iteration, outputs with criteria were noticeably better than a no-prompting baseline. The criteria steered the model away from generic defaults before any evaluator feedback occurred.
Adapting criteria across domains
For full-stack development, the same pattern adapted to different dimensions:
- Product depth — feature richness and completeness
- Functionality — does it actually work when tested?
- Visual design — coherent design language
- Code quality — maintainability and correctness
Each criterion had a hard threshold. If any one fell below it, the sprint failed and the generator received detailed feedback on the specific failure. The evaluator used browser automation (Playwright) to interact with the live application, testing UI features, API endpoints, and database states.
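A sketch of how hard-threshold gating and a browser probe might fit together, assuming integer scores on a 0–10 scale. The thresholds, base URL, and selectors are illustrative assumptions; the real evaluator's checks are more extensive.

```python
# Sketch of per-criterion hard thresholds plus one Playwright functionality
# probe against the live app. Thresholds, URL, and selectors are assumptions.
from playwright.sync_api import sync_playwright

THRESHOLDS = {"product_depth": 6, "functionality": 7, "visual_design": 6, "code_quality": 6}

def sprint_passes(scores: dict[str, int]) -> tuple[bool, list[str]]:
    """Fail the sprint if any single criterion falls below its threshold."""
    failures = [
        f"{name}: scored {scores[name]}, below threshold {minimum}"
        for name, minimum in THRESHOLDS.items()
        if scores[name] < minimum
    ]
    return (not failures, failures)

def probe_login_flow(base_url: str) -> bool:
    """One example of driving the live application in a browser."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        try:
            page.goto(f"{base_url}/login")
            page.fill("#email", "qa@example.com")        # hypothetical selectors
            page.fill("#password", "test-password")
            page.click("button[type=submit]")
            page.wait_for_url(f"{base_url}/dashboard", timeout=5000)
            return True
        except Exception:
            return False
        finally:
            browser.close()
```

In this sketch, the `failures` list is the raw material for the detailed feedback the generator receives when a sprint fails.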
Tuning the evaluator
Getting an evaluator to perform well required significant iteration. Out of the box, Claude as a QA agent:
Identified legitimate issues, then talked itself into dismissing them as minor
- Tested superficially rather than probing edge cases
- Approved work that a human would flag
The tuning loop: read evaluator logs, find examples where its judgment diverged from human judgment, update the evaluator’s prompt to address those divergences. Several rounds of this were needed before the evaluator graded reasonably.
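A sketch of the divergence-finding step in that loop, assuming evaluator and human judgments are reduced to pass/fail verdicts per artifact. The data shapes and names are hypothetical.

```python
# Sketch: surface the artifacts where the evaluator's verdict disagrees with
# a human reviewer's, e.g. work the evaluator approved but a human would flag.
def find_divergences(evaluator_verdicts: dict[str, bool],
                     human_verdicts: dict[str, bool]) -> list[str]:
    """Return artifact ids where evaluator and human verdicts disagree."""
    return [
        artifact_id
        for artifact_id, human_ok in human_verdicts.items()
        if artifact_id in evaluator_verdicts
        and evaluator_verdicts[artifact_id] != human_ok
    ]
```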
Related Notes
- GAN-Inspired Agent Architecture - Generator Evaluator Loops — the architectural pattern these criteria serve
- Agent Harnesses — harness design is where evaluation criteria live
- AXI Principles - Agent-Ergonomic Interface Design — another approach to principled agent interface design, focused on tool interfaces rather than output quality