🌰 seedling
Criteria-Based Grading for Subjective Agent Output

The problem

“Is this design beautiful?” is hard to answer consistently. LLMs default to safe, predictable outputs that are technically functional but unremarkable. Without intervention, Claude gravitates toward template layouts, library defaults, and generic patterns.

The solution: decomposed criteria

Replace the vague question with concrete dimensions. For frontend design, Anthropic’s harness used four criteria:

| Criterion | What it measures | Default model performance |
|---|---|---|
| Design quality | Coherent whole vs collection of parts — colors, typography, layout combine to create a distinct mood | Weak — bland outputs |
| Originality | Evidence of custom decisions vs template/library/AI defaults | Weak — “AI slop” patterns |
| Craft | Technical execution — typography hierarchy, spacing, contrast | Strong by default |
| Functionality | Usability — can users find actions and complete tasks? | Strong by default |

The key insight: weight dimensions where the model is weakest. Design quality and originality were weighted more heavily than craft and functionality because the model already scored well on the technical dimensions. The criteria explicitly penalized generic “AI slop” patterns (purple gradients over white cards, unmodified stock components), pushing the model toward aesthetic risk-taking.
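To make the weighting concrete, here is a minimal sketch of how a weighted rubric might combine per-criterion grades. The weights, score scale, and field names are illustrative, not values from the Anthropic harness.

```python
# Hypothetical weighted rubric: the dimensions where the model is weakest
# (design quality, originality) carry the largest weights, so a design
# cannot coast on craft and functionality alone.
from dataclasses import dataclass

@dataclass
class CriterionScore:
    name: str
    score: float   # evaluator's 0-10 grade for this dimension
    weight: float  # relative importance in the overall grade

def overall_grade(scores: list[CriterionScore]) -> float:
    """Weighted average of per-criterion grades."""
    total_weight = sum(c.weight for c in scores)
    return sum(c.score * c.weight for c in scores) / total_weight

grades = [
    CriterionScore("design_quality", score=6.0, weight=0.35),
    CriterionScore("originality",    score=5.5, weight=0.35),
    CriterionScore("craft",          score=8.5, weight=0.15),
    CriterionScore("functionality",  score=9.0, weight=0.15),
]
print(f"overall: {overall_grade(grades):.2f}")  # dominated by the weak dimensions
```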

Source: Harness design for long-running application development (Prithvi Rajasekaran, Anthropic, March 2026)

Calibration through few-shot examples

The evaluator was calibrated using few-shot examples with detailed score breakdowns. This served two purposes:

  1. Aligned the evaluator’s judgment with the designer’s preferences
  2. Reduced score drift across iterations

Without calibration, the evaluator’s grades wander. With it, the scoring becomes consistent enough to drive meaningful iteration.
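As a minimal sketch of what that calibration could look like, assuming the evaluator is an LLM prompted with scored examples: the example designs, score breakdowns, and prompt structure below are invented for illustration, not the harness’s actual prompt.

```python
# Illustrative calibration examples with detailed score breakdowns. Folding
# these into the evaluator prompt anchors its judgment to the designer's
# preferences and reduces score drift across iterations.
CALIBRATION_EXAMPLES = [
    {
        "summary": "Dashboard with purple gradient over white cards, unmodified stock components",
        "scores": {"design_quality": 3, "originality": 2, "craft": 7, "functionality": 8},
        "rationale": "Technically clean but generic 'AI slop'; no distinct mood.",
    },
    {
        "summary": "Editorial landing page with custom type pairing and asymmetric grid",
        "scores": {"design_quality": 8, "originality": 8, "craft": 7, "functionality": 7},
        "rationale": "Deliberate, cohesive visual direction; minor spacing issues.",
    },
]

def build_evaluator_prompt(criteria: str) -> str:
    """Prepend scored few-shot examples to the grading criteria."""
    shots = "\n\n".join(
        f"Example: {ex['summary']}\nScores: {ex['scores']}\nWhy: {ex['rationale']}"
        for ex in CALIBRATION_EXAMPLES
    )
    return f"{criteria}\n\nCalibration examples:\n\n{shots}\n\nNow grade the new design the same way."
```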

Criteria as steering

The wording of criteria directly shaped output character in ways that were not fully anticipated. Including phrases like “the best designs are museum quality” pushed designs toward a particular visual convergence. This suggests criteria are not just measurement instruments — they are prompts that steer the generator’s creative direction.

Even on the first iteration, outputs with criteria were noticeably better than a no-prompting baseline. The criteria steered the model away from generic defaults before any evaluator feedback occurred.

Adapting criteria across domains

For full-stack development, the same pattern was adapted to a different set of dimensions:

  • Product depth — feature richness and completeness
  • Functionality — does it actually work when tested?
  • Visual design — coherent design language
  • Code quality — maintainability and correctness

Each criterion had a hard threshold. If any one fell below it, the sprint failed and the generator received detailed feedback on the specific failure. The evaluator used browser automation (Playwright) to interact with the live application, testing UI features, API endpoints, and database states.
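A sketch of how the hard-threshold gate and a browser-driven functionality probe could fit together. The criterion names, thresholds, selectors, and URL are placeholders; the Playwright calls are the standard Python sync API, not the harness’s actual test code.

```python
# Hypothetical gate: any criterion below its threshold fails the sprint and
# returns feedback naming only the failing dimensions, so the generator gets
# specific direction rather than a single overall grade.
THRESHOLDS = {
    "product_depth": 6.0,
    "functionality": 7.0,
    "visual_design": 6.0,
    "code_quality": 6.0,
}

def gate_sprint(scores: dict[str, float], notes: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (passed, feedback) for one sprint's evaluation."""
    failures = []
    for name, minimum in THRESHOLDS.items():
        value = scores.get(name, 0.0)
        if value < minimum:
            failures.append(f"{name}: scored {value:.1f}, threshold {minimum:.1f}. {notes.get(name, '')}")
    return (not failures, failures)

# Sketch of a browser-driven functionality probe using Playwright's Python
# sync API; the URL and selectors are made up for illustration.
from playwright.sync_api import sync_playwright

def smoke_check(url: str) -> bool:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.click("text=New item")              # exercise a core UI action
        page.fill("input[name='title']", "hello")
        page.click("button[type='submit']")
        ok = page.locator("text=hello").count() > 0  # did the item appear?
        browser.close()
        return ok
```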

Tuning the evaluator

Getting an evaluator to perform well required significant iteration. Out of the box, Claude as a QA agent:

  • Identified legitimate issues, then talked itself into dismissing them as unimportant
  • Tested superficially rather than probing edge cases
  • Approved work that a human would flag

The tuning loop: read evaluator logs, find examples where its judgment diverged from human judgment, update the evaluator’s prompt to address those divergences. Several rounds of this were needed before the evaluator graded reasonably.
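One way such a loop could be supported in code: a small helper that reads evaluator records from a log and surfaces the cases where the evaluator’s verdict disagrees with a human spot-check. The JSONL log format and field names here are assumptions.

```python
# Hypothetical helper for the tuning loop: find evaluations where the
# evaluator's verdict diverged from a human label, so those cases can be
# folded back into the evaluator's prompt.
import json

def find_divergences(log_path: str, human_labels: dict[str, str]) -> list[dict]:
    """Return evaluation records whose verdict disagrees with the human label."""
    divergences = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)          # one evaluation per JSONL line
            human = human_labels.get(record["run_id"])
            if human is not None and record["verdict"] != human:
                divergences.append(record)     # e.g. evaluator "pass", human "fail"
    return divergences
```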


Connected Notes