🌰 seedling
Auto Research - Agents as Overnight Experimentation Engines

The setup

The pattern is deceptively simple:

  1. Objective: Define a measurable goal. Minimize validation loss on this benchmark. Maximize code pass rate on this test suite. Shrink the model without losing accuracy below threshold X.
  2. Boundaries: Define what the agent can and cannot change. You may modify hyperparameters, loss weights, and optimizer config. You may not change the architecture or training data.
  3. Budget: Define compute, time, and cost limits.
  4. Leave it running. Overnight, over a weekend, or however long the budget allows.
  5. Review at the end. The agent returns a ranked list of candidate improvements with evidence.
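The five steps above can be sketched as a small harness loop. Everything here is hypothetical scaffolding: `auto_research`, the search space, and the toy loss are stand-ins for illustration, not a real agent API.

```python
# Minimal sketch of the auto-research loop: objective, boundaries
# (the search space), a hard budget, and a ranked result list.
import random
import time

def auto_research(objective, search_space, budget_runs, budget_seconds):
    """Run proposed experiments until the budget runs out, then return
    the top candidates ranked by the objective (lower is better)."""
    deadline = time.monotonic() + budget_seconds
    results = []
    for _ in range(budget_runs):
        if time.monotonic() > deadline:  # budget enforced by the harness
            break
        # Stand-in for the agent's proposal step: random search within bounds.
        config = {k: random.choice(v) for k, v in search_space.items()}
        results.append((objective(config), config))
    return sorted(results, key=lambda r: r[0])[:5]

random.seed(0)  # fixed seed keeps the toy run reproducible

# Toy objective: pretend validation loss over two hyperparameters.
space = {"lr": [1e-4, 3e-4, 1e-3], "wd": [0.0, 0.01, 0.1]}
loss = lambda c: (c["lr"] - 3e-4) ** 2 + (c["wd"] - 0.01) ** 2

top = auto_research(loss, space, budget_runs=50, budget_seconds=60)
print(top[0])  # best (score, config) pair found within budget
```

The point of the sketch is the shape, not the search strategy: the human writes `objective` and `search_space` once, and review happens only on the ranked list at the end.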

The discipline is refusing to micromanage. Every time the human intervenes to nudge the agent toward a “promising” direction, the experimentation loop slows down and inherits the human’s biases about what will work.

The Karpathy GPT-2 example

Karpathy applied this to his own GPT-2 training repo, code he had hand-tuned over years. The agent, left to run, found:

  • Weight decay on value embeddings had not been correctly applied (the agent discovered this by ablating)
  • Adam beta parameters were improperly tuned for his particular setup
  • Joint interactions between hyperparameters that he had missed because he’d only tuned one at a time
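The last finding is easy to reproduce in miniature. This toy (not Karpathy's code; the loss function is invented) shows why tuning one hyperparameter at a time can miss a joint optimum that a full sweep finds.

```python
# One-at-a-time tuning vs a joint sweep on a loss with an interaction term.
import itertools

grid = [0, 1, 2]

def loss(a, b):
    # The global optimum is (2, 2), but starting from (0, 0) and moving
    # along either axis alone never reaches it.
    return (a - b) ** 2 + (2 - a) * (2 - b)

# One-at-a-time: tune a with b fixed at its starting value, then tune b.
start_b = 0
best_a = min(grid, key=lambda a: loss(a, start_b))
best_b = min(grid, key=lambda b: loss(best_a, b))
coordinate = loss(best_a, best_b)

# Joint sweep: evaluate every combination.
joint = min(loss(a, b) for a, b in itertools.product(grid, grid))

print(coordinate, joint)  # 1 0 -- the coordinate-wise search gets stuck
```

An agent with budget to sweep combinations sees the interaction; a human tuning knobs sequentially does not.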

The embarrassing-in-a-good-way result: code that had been the product of years of careful manual work still had low-hanging fruit an agent could find in one overnight run. The implication is that most hand-tuned research code has more of this hiding in it than researchers want to admit.

His framing: “I shouldn’t be a bottleneck.”

What auto research needs to work

  • Clean metrics. If the objective can’t be measured in a single number (or a small vector), the agent can’t know which experiments won. Research domains with noisy, subjective, or multi-dimensional outcomes are harder to automate.
  • Fast iteration cycle. If each experiment takes a week, auto research becomes scheduled-batch research. The payoff scales with how many cycles you can run per budget.
  • Reproducibility. The agent needs to trust that a given config produces the same result twice, or it can’t reason about which change caused which improvement.
  • Bounded search space. Open-ended “go improve this repo” tasks fail because the agent wanders. Constrained “tune these 12 hyperparameters” tasks succeed because the search is small enough to sweep.
  • Safety/cost guardrails. Compute budgets must be enforced by the harness, not by asking the agent nicely.
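The guardrail point can be made concrete: the harness enforces the budget by killing the experiment process, rather than trusting the agent to stop itself. The code strings and limits below are illustrative, not a real harness.

```python
# Hard budget enforcement at the process level.
import subprocess
import sys

def run_with_budget(code, timeout_s):
    """Run one experiment in a subprocess; None means it blew the budget."""
    try:
        out = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return out.stdout.strip()
    except subprocess.TimeoutExpired:
        return None  # over budget: the run is discarded, never negotiated

fast = run_with_budget("print('done')", timeout_s=30)
slow = run_with_budget("import time; time.sleep(60)", timeout_s=1)
print(fast, slow)
```

The same pattern extends to cost: cap tokens, GPU hours, or API spend in the layer that launches runs, not in the prompt.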

What changes when this works

  • Human hours per experiment approach zero. The researcher’s role shifts from running experiments to specifying them and reviewing results.
  • Research throughput becomes compute-bound. The question is no longer “how fast can a researcher iterate” but “how much compute can we throw at the loop.”
  • Hand-tuned artifacts become suspicious. If an agent can find improvements to an expert’s careful work in one night, any artifact that wasn’t subjected to auto-research probably has the same kind of hidden low-hanging fruit.
  • Researchers specialize up the stack. Instead of tuning models, researchers define objectives, curate evaluation sets, and design the search spaces. The mechanical work moves below them.

The limits to watch

  • Metric hacking. Agents optimize what you measure. If the metric has a loophole, the agent finds it. Requires careful metric design and adversarial review.
  • Compute asymmetry. Auto research favors whoever has more compute. This may accelerate the gap between well-resourced labs and everyone else — though distributed approaches like Bittensor-style training may counter this.
  • Non-verifiable domains. Any task without a clean metric (writing quality, research judgment, taste) stays human-led even as the mechanical experimentation is automated.
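The metric-hacking limit is worth a concrete toy. Here a naive "fraction of reported tests that pass" metric has a loophole a gaming candidate exploits; the fix is an adversarial audit that counts missing tests as failures. All names are hypothetical.

```python
# A metric with a loophole, and the adversarial fix.

def pass_rate(results):
    """Naive metric: passed / total among whatever tests were reported."""
    return sum(results.values()) / len(results)

# Honest candidate: runs all four tests, fails one.
honest = {"t1": True, "t2": True, "t3": False, "t4": True}

# Gaming candidate: silently drops the failing test from its report.
hacked = {"t1": True, "t2": True, "t4": True}

print(pass_rate(honest))  # 0.75
print(pass_rate(hacked))  # 1.0 -- the loophole

def audited_pass_rate(results, required):
    """Adversarial fix: any test missing from the report counts as failed."""
    return sum(results.get(t, False) for t in required) / len(required)

required = ["t1", "t2", "t3", "t4"]
print(audited_pass_rate(hacked, required))  # 0.75 -- loophole closed
```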
Connected Notes