RL Scaling Follows Pre-Training — The Generalization Inflection Ahead
The analogy
Pre-training’s trajectory:
- GPT-1 (2018) — narrow corpus, poor generalization
- GPT-2 (2019) — broad internet data, sudden generalization across tasks
- GPT-3 onward — scaling laws hold, capability improves log-linearly with compute
RL’s current trajectory:
- Narrow RL (2024-2025) — trained on math competitions, coding challenges, specific verifiable tasks
- Broad RL (expected) — training on diverse task distributions, sudden generalization across domains
- RL scaling laws (emerging) — log-linear improvement with training duration, already visible on AIME benchmarks
The inflection (the jump from narrow to broad) happened for pre-training when the data distribution broadened. Amodei expects the same transition for RL as task distributions broaden beyond math and coding.
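To make the "log-linear" claim concrete, here is a minimal sketch of fitting a scaling law of the form score ≈ a + b·log10(compute). All data points are made up for illustration; they are not from the interview or any benchmark.

```python
import math

# Hypothetical data (illustrative only): RL training compute vs. score,
# chosen to follow the log-linear form score ≈ a + b * log10(compute).
compute = [1e20, 1e21, 1e22, 1e23]   # FLOPs (made up)
score   = [32.0, 41.5, 50.8, 60.1]   # e.g. a pass rate in % (made up)

# Ordinary least squares on x = log10(compute).
xs = [math.log10(c) for c in compute]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(score) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, score)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Under a log-linear law, each 10x of compute adds ~b points,
# so extrapolation is a straight line in log space.
print(f"slope b = {b:.2f} points per 10x compute")
print(f"predicted score at 1e24 FLOPs = {a + b * 24:.1f}")
```

The key property this sketch shows: if the law holds, capability gains are linear in the *exponent* of compute, which is why labs reason in orders of magnitude rather than raw FLOPs.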
Why this matters
Pre-training scaling gave us general language capability. RL scaling promises general action capability — the ability to pursue goals, make multi-step decisions, and recover from errors across diverse domains. The combination would produce agents that both understand context (pre-training) and execute toward objectives (RL).
For verifiable tasks (math, coding, formal proofs): RL scaling already produces state-of-the-art results. Amodei predicts full end-to-end software engineering capability within 1-2 years.
For non-verifiable tasks (writing, strategy, planning): more uncertainty, but the generalization pattern from pre-training suggests that once the RL task distribution broadens enough, capability extends to these domains too.
The sample efficiency question
Rich Sutton’s objection: if models had a “true core of human learning,” they wouldn’t need billions of dollars of compute to learn simple tasks. Amodei’s response: models start from random weights and must do the equivalent of both evolutionary learning and individual learning during training. Humans start with brains shaped by millions of years of evolution. Once trained, models with million-token context windows show genuine in-context adaptation comparable to weeks of human reading.
The implication: high upfront training cost, low marginal deployment cost. Each new model amortizes its training across all users. This favors a few large labs running expensive training with cheap inference, consistent with the oligopoly structure described in Two Exponentials - AI Capability vs Economic Diffusion.
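The amortization argument can be sketched in a few lines. All dollar figures and query counts below are hypothetical placeholders, not estimates from the interview.

```python
# Hedged illustration (all numbers made up): a fixed training cost
# amortized over inference queries drives average cost per query
# toward the marginal inference floor as usage grows.
def cost_per_query(training_cost: float, inference_cost: float,
                   n_queries: float) -> float:
    """Average cost per query = amortized training + marginal inference."""
    return training_cost / n_queries + inference_cost

TRAINING = 1e9      # $1B training run (hypothetical)
INFERENCE = 0.002   # $0.002 marginal cost per query (hypothetical)

for n in (1e9, 1e11, 1e13):
    avg = cost_per_query(TRAINING, INFERENCE, n)
    print(f"{n:.0e} queries -> ${avg:.4f}/query")
```

The design point: average cost falls hyperbolically with query volume while marginal cost stays flat, which is why scale favors a few large labs — the amortization term only vanishes for whoever serves enough traffic.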
Related Notes
- Two Exponentials - AI Capability vs Economic Diffusion — the capability exponential that RL scaling feeds
- Three Waves of AI Opportunity - Unhobbling, Physical Interface, Robotics — Wave 1 tasks are where RL scaling hits first
- Token Throughput as the New Coding Bottleneck — coding as the first domain where RL scaling produces full automation
- Inference Cost Collapse and Frontier Model Margin Expansion — the economics of high training cost + cheap inference
- Dario Amodei on Dwarkesh Patel — source