Auto Research — Agents as Overnight Experimentation Engines
The setup
The pattern is deceptively simple:
- Objective: Define a measurable goal. Minimize validation loss on this benchmark. Maximize code pass rate on this test suite. Shrink the model without losing accuracy below threshold X.
- Boundaries: Define what the agent can and cannot change. You may modify hyperparameters, loss weights, and optimizer config. You may not change the architecture or training data.
- Budget: Define compute, time, and cost limits.
- Leave it running. Overnight, over a weekend, or however long the budget allows.
- Review at the end. The agent returns a ranked list of candidate improvements with evidence.
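The objective/boundaries/budget triple can be made concrete as a small spec the harness reads before the run. A minimal sketch, assuming a hypothetical `AutoResearchSpec` (all names here are illustrative, not any specific tool's API):

```python
# Hypothetical spec for an overnight auto-research run. Nothing here is a
# real tool's API; it just makes the objective/boundaries/budget pattern concrete.
from dataclasses import dataclass, field

@dataclass
class AutoResearchSpec:
    objective: str                        # the single measurable metric
    direction: str                        # "minimize" or "maximize"
    mutable: list = field(default_factory=list)   # what the agent may change
    frozen: list = field(default_factory=list)    # what it may not touch
    max_gpu_hours: float = 0.0            # hard compute ceiling
    max_wall_clock_hours: float = 0.0     # hard time ceiling

spec = AutoResearchSpec(
    objective="val_loss",
    direction="minimize",
    mutable=["learning_rate", "weight_decay", "adam_betas", "loss_weights"],
    frozen=["architecture", "training_data"],
    max_gpu_hours=64,
    max_wall_clock_hours=12,
)
```

Everything the human cares about lives in the spec; everything after it is the agent's loop.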
The discipline is refusing to micromanage. Every time the human intervenes to nudge the agent toward a “promising” direction, the experimentation loop slows down and inherits the human’s biases about what will work.
The Karpathy GPT-2 example
Karpathy applied this to his own GPT-2 training repo, code he had hand-tuned for years. Left to run, the agent found:

- Weight decay on value embeddings had not been correctly applied (the agent discovered this by ablating)
- Adam beta parameters were improperly tuned for his particular setup
- Joint interactions between hyperparameters that he had missed because he’d only tuned one at a time
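The first finding came from ablation: flip one setting at a time, rerun, and compare the metric. A minimal sketch of that loop, where `train_and_eval` is a hypothetical stand-in for a real training run returning validation loss:

```python
# Minimal ablation loop: toggle one boolean config flag at a time and
# record the change in the metric relative to the baseline run.
def ablate(base_config: dict, flags: list, train_and_eval) -> dict:
    baseline = train_and_eval(base_config)
    deltas = {}
    for flag in flags:
        variant = dict(base_config)
        variant[flag] = not variant[flag]          # flip exactly one knob
        deltas[flag] = train_and_eval(variant) - baseline
    return deltas  # negative delta => flipping the flag reduced val loss
```

A flag whose ablation unexpectedly improves the metric is evidence the original setting (here, weight decay on value embeddings) was never doing what the author assumed.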
The embarrassing-in-a-good-way result: code that had been the product of years of careful manual work still had low-hanging fruit an agent could find in one overnight run. The implication is that most hand-tuned research code has more of this hiding in it than researchers want to admit.
His framing: “I shouldn’t be a bottleneck.”
What auto research needs to work
- Clean metrics. If the objective can’t be measured in a single number (or a small vector), the agent can’t know which experiments won. Research domains with noisy, subjective, or multi-dimensional outcomes are harder to automate.
- Fast iteration cycle. If each experiment takes a week, auto research becomes scheduled-batch research. The payoff scales with how many cycles you can run per budget.
- Reproducibility. The agent needs to trust that a given config produces the same result twice, or it can’t reason about which change caused which improvement.
- Bounded search space. Open-ended “go improve this repo” tasks fail because the agent wanders. Constrained “tune these 12 hyperparameters” tasks succeed because the search is small enough to sweep.
- Safety/cost guardrails. Compute budgets must be enforced by the harness, not by asking the agent nicely.
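The last two requirements combine naturally: the harness, not the agent, owns the stopping condition, and the agent only proposes the next trial from a bounded space. A sketch under those assumptions (`propose_experiment` and `run_experiment` are hypothetical callables, not a real framework):

```python
# Harness-enforced budget loop: the while-condition, not the agent's
# goodwill, decides when the run stops.
import time

def run_loop(propose_experiment, run_experiment, max_seconds, max_experiments):
    start = time.monotonic()
    results = []
    while (time.monotonic() - start < max_seconds
           and len(results) < max_experiments):
        config = propose_experiment(results)   # agent sees history, proposes next trial
        if config is None:                     # bounded space exhausted
            break
        score = run_experiment(config)
        results.append((config, score))
    # ranked candidate list for human review, lowest (best) score first
    return sorted(results, key=lambda r: r[1])
```

Wall-clock and trial-count ceilings live in the loop condition, so even a misbehaving agent cannot exceed the budget; the return value is the ranked evidence list the pattern ends with.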
What changes when this works
- Human hours per experiment approach zero. The researcher's role shifts from running experiments to specifying them and reviewing results.
- Research throughput becomes compute-bound. The question is no longer “how fast can a researcher iterate” but “how much compute can we throw at the loop.”
- Hand-tuned artifacts become suspicious. If an agent can find improvements to an expert’s careful work in one night, any artifact that wasn’t subjected to auto-research probably has the same kind of hidden low-hanging fruit.
- Researchers specialize up the stack. Instead of tuning models, researchers define objectives, curate evaluation sets, and design the search spaces. The mechanical work moves below them.
The limits to watch
- Metric hacking. Agents optimize what you measure. If the metric has a loophole, the agent finds it. Requires careful metric design and adversarial review.
- Compute asymmetry. Auto research favors whoever has more compute. This may accelerate the gap between well-resourced labs and everyone else — though distributed approaches like Bittensor-style training may counter this.
- Non-verifiable domains. Any task without a clean metric (writing quality, research judgment, taste) stays human-led even as the mechanical experimentation around it is automated.
Related Notes
- Research Org as Tunable Program
- Token Throughput as the New Coding Bottleneck
- Distributed Open Source AI Training as Orthogonal Threat
- Karpathy - No Priors Code Agents Autoresearch