Backtranslation as Benchmark Design
The Core Idea
Given a seed document, define a forward instruction and its inverse. A perfect model that applies the forward instruction and then the inverse recovers the original exactly. Measuring the similarity between the seed and the reconstructed document removes the need for expert-annotated reference solutions.
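The round trip described above can be sketched in a few lines of Python. This is a minimal illustration, not DELEGATE-52's implementation: the `Editor` callable, the instruction names, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example, and `toy_editor` stands in for a real model call.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical interface: (document, instruction) -> edited document.
Editor = Callable[[str, str], str]


def round_trip_score(seed: str, forward: str, backward: str, editor: Editor) -> float:
    """Apply the forward instruction, then its inverse, and compare with the seed."""
    intermediate = editor(seed, forward)
    reconstructed = editor(intermediate, backward)
    # Generic character-level similarity in [0, 1]; 1.0 means exact recovery.
    return SequenceMatcher(None, seed, reconstructed).ratio()


def toy_editor(document: str, instruction: str) -> str:
    """Stand-in for a model: a deterministic, perfectly invertible edit pair."""
    if instruction == "commas_to_tabs":
        return document.replace(",", "\t")
    if instruction == "tabs_to_commas":
        return document.replace("\t", ",")
    raise ValueError(f"unknown instruction: {instruction}")


def lossy_editor(document: str, instruction: str) -> str:
    """Same edits, but the backward step silently drops the header line."""
    edited = toy_editor(document, instruction)
    if instruction == "tabs_to_commas":
        edited = edited.split("\n", 1)[1]
    return edited


seed = "date,account,amount\n2024-01-05,rent,1200\n"
perfect_score = round_trip_score(seed, "commas_to_tabs", "tabs_to_commas", toy_editor)
lossy_score = round_trip_score(seed, "commas_to_tabs", "tabs_to_commas", lossy_editor)
print(perfect_score, lossy_score)
```

The perfect editor recovers the seed exactly, while the lossy one scores lower, and no reference solution was needed for either measurement.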
This technique originated in machine translation as a data augmentation and quality evaluation method. Microsoft Research's DELEGATE-52 repurposes it to evaluate LLM document editing across 52 professional domains.
Why This Matters
Traditional benchmarks require expensive expert annotation, which limits scale. Backtranslation-based evaluation requires only:
- Seed documents: real documents from the target domain
- Invertible task pairs: forward and backward instructions
- Domain-specific similarity functions: custom parsers that compare structured representations
This makes it possible to build benchmarks in niche professional domains (crystallography, textile patterns, accounting ledgers) that would otherwise be prohibitively expensive to annotate.
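To make the third ingredient concrete, here is a hedged sketch of what a domain-specific similarity function might look like for a simple accounting ledger. The CSV layout, the column names `account` and `amount`, and the scoring rule are assumptions chosen for illustration; the source does not specify how any particular DELEGATE-52 parser works.

```python
import csv
import io
from collections import Counter
from decimal import Decimal


def parse_ledger(text: str) -> Counter:
    """Parse a CSV ledger into a multiset of (account, amount) entries."""
    reader = csv.DictReader(io.StringIO(text))
    return Counter(
        (row["account"].strip().lower(), Decimal(row["amount"].strip()))
        for row in reader
    )


def ledger_similarity(seed: str, reconstructed: str) -> float:
    """Fraction of ledger entries shared by both documents, ignoring order and formatting."""
    a, b = parse_ledger(seed), parse_ledger(reconstructed)
    overlap = sum((a & b).values())  # multiset intersection
    total = max(sum(a.values()), sum(b.values()))
    return overlap / total if total else 1.0


seed = "account,amount\nrent,1200\nutilities,85.50\n"
# Same entries, different row order, casing, and number formatting.
reformatted = "account,amount\nutilities,85.5\nRent,1200.00\n"
# Looks almost identical as text, but one amount has transposed digits.
corrupted = "account,amount\nrent,1200\nutilities,58.50\n"

print(ledger_similarity(seed, reformatted))
print(ledger_similarity(seed, corrupted))
```

Comparing parsed entries rather than raw text means harmless reformatting is not penalized, while a two-character digit swap that a character-level metric would barely notice costs half the score in this two-entry ledger.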
Composability Into Long Simulations
Because each round trip should return to the seed document, round trips can be chained into relays: the output of one round trip becomes the input of the next. Ten consecutive round trips create a 20-interaction simulation of extended delegated work. This composability reveals compounding effects that single-turn evaluations miss entirely.
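A relay can be sketched as a loop that feeds each reconstructed document into the next round trip and scores it against the original seed. Everything here is a simplified assumption for illustration: `run_relay`, the instruction names, the line-recall similarity, and the `lossy_editor` that simulates a model losing one line per round trip.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (document, instruction) -> edited document.
Editor = Callable[[str, str], str]


def run_relay(
    seed: str,
    task_pairs: List[Tuple[str, str]],
    editor: Editor,
    similarity: Callable[[str, str], float],
) -> Tuple[List[float], int]:
    """Chain round trips; return the score after each one and the interaction count."""
    document = seed
    scores: List[float] = []
    interactions = 0
    for forward, backward in task_pairs:
        document = editor(document, forward)
        document = editor(document, backward)
        interactions += 2
        # Always compare against the original seed, not the previous step.
        scores.append(similarity(seed, document))
    return scores, interactions


def lossy_editor(document: str, instruction: str) -> str:
    """Simulated model: reverses line order, but loses one line when restoring."""
    lines = document.splitlines()
    if instruction == "reverse_lines":
        return "\n".join(reversed(lines))
    if instruction == "restore_lines":
        restored = list(reversed(lines))
        return "\n".join(restored[:-1])  # simulated error: final line dropped
    raise ValueError(f"unknown instruction: {instruction}")


def line_recall(seed: str, document: str) -> float:
    """Fraction of seed lines still present in the document."""
    seed_lines = seed.splitlines()
    kept = set(document.splitlines())
    return sum(line in kept for line in seed_lines) / len(seed_lines)


seed = "\n".join(f"row {i}" for i in range(10))
pairs = [("reverse_lines", "restore_lines")] * 10
scores, interactions = run_relay(seed, pairs, lossy_editor, line_recall)
print(interactions, scores)
```

In this toy setup a single round trip loses only 10% of the document, which might look acceptable in a single-turn evaluation, yet nothing of the seed survives after ten round trips.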
Limitations
The method assumes editing tasks are genuinely invertible and that models attempt the instructions rather than taking shortcuts (e.g., outputting the input unchanged). DELEGATE-52 validates both assumptions experimentally. Generic similarity measures, including LLM-as-a-judge with GPT 5.4, capture at most 25% of the variance that domain-specific parsers detect, making the parsing layer essential.
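The shortcut risk is easy to see in code: a model that returns its input unchanged at both steps trivially "recovers" the seed. One possible guard, sketched below, is to check that the forward step actually changed the document before trusting the round-trip result. This is an assumed illustration only; the source does not describe how DELEGATE-52 performs its validation, and `guarded_round_trip` and the editors are hypothetical names.

```python
from typing import Callable, Optional

# Hypothetical interface: (document, instruction) -> edited document.
Editor = Callable[[str, str], str]


def guarded_round_trip(
    seed: str, forward: str, backward: str, editor: Editor
) -> Optional[str]:
    """Return the reconstructed document, or None if the forward step was a no-op."""
    intermediate = editor(seed, forward)
    if intermediate.strip() == seed.strip():
        # The forward instruction was not attempted, so a perfect
        # reconstruction would be meaningless.
        return None
    return editor(intermediate, backward)


def identity_editor(document: str, instruction: str) -> str:
    """Simulated shortcut: ignores every instruction."""
    return document


def honest_editor(document: str, instruction: str) -> str:
    """Simulated model that actually performs the invertible edits."""
    if instruction == "to_upper":
        return document.upper()
    if instruction == "to_lower":
        return document.lower()
    raise ValueError(f"unknown instruction: {instruction}")


seed = "knit two, purl two"
shortcut_result = guarded_round_trip(seed, "to_upper", "to_lower", identity_editor)
honest_result = guarded_round_trip(seed, "to_upper", "to_lower", honest_editor)
print(shortcut_result, honest_result)
```

A check this simple only catches the exact no-op case; partial shortcuts would need a domain-aware test that the intermediate document satisfies the forward instruction.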
Key Takeaways
- Backtranslation converts any domain with invertible tasks into an annotation-free benchmark
- Round-trip composability enables long-horizon evaluation that reveals compounding degradation
- Domain-specific similarity functions are critical: generic metrics miss most of the signal
Related Notes
- LLM Document Degradation: the degradation patterns this methodology revealed
- Delegation Readiness and the Jagged Frontier: how benchmark results map to real-world trust decisions
- Evaluation Design Lifecycle - From Purpose to Metrics: the lifecycle process where backtranslation fits as a Phase 5 metric/method choice