Runtime Shift: A New Form of Distributional Shift for Agents
The mechanism
An agent does not learn an abstract concept of "shell." It learns the specific shell it trained on: how ls formats output, how the browser resolves redirects, which commands block versus return instantly, which snapshots exist. The model bakes all of this into the policy during training (or the optimizer bakes it into the harness and prompts).
Move to a different runtime and:
- Tools that were idempotent become flaky
- Commands that were instant now block
- Snapshots that existed disappear
- Error messages change format
This is not data shift, the change in input distribution that the MLOps community has worried about for a decade. It is environment shift, and it produces silent quality regressions that standard evals never catch, because the eval runs in the training environment.
Three honest mitigations
- Co-locate train and prod on the same runtime. Pick one sandbox provider, run RL rollouts on it, run production sessions on it, accept the lock-in. The most disciplined teams do this.
- Define a runtime contract and enforce it on both sides. A small, versioned interface ("this is the shell, these are the tools, these are the latencies, these are the failure modes") implemented once on training infrastructure and once on production infrastructure. The contract must cover timing and failure semantics, not just API surface.
- Train against production noise. Inject latency, errors, and tool failures during training so the policy is robust to runtime variance. Step-DeepResearch reports tangible gains from 5–10% tool errors during training. This is the runtime analog of dropout.
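As an illustration of the third mitigation, here is a minimal sketch of injecting failures and latency around a tool call during training rollouts. The names `with_runtime_noise`, `InjectedToolError`, and `list_files` are hypothetical and not from the source; the default 5% error rate is only chosen to sit inside the 5–10% range mentioned above, and a real setup would wrap the actual sandbox tools.

```python
import random
import time
from typing import Callable, List


class InjectedToolError(RuntimeError):
    """Synthetic failure raised in place of a real tool result (hypothetical)."""


def with_runtime_noise(
    tool: Callable,
    rng: random.Random,
    error_rate: float = 0.05,
    max_extra_latency_s: float = 0.0,
) -> Callable:
    """Wrap a tool so that a fraction of calls fail and calls may be delayed."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")

    def noisy(*args, **kwargs):
        # Optional extra latency, drawn uniformly from [0, max_extra_latency_s].
        if max_extra_latency_s > 0:
            time.sleep(rng.uniform(0.0, max_extra_latency_s))
        # Fail before calling the tool with probability error_rate.
        if rng.random() < error_rate:
            raise InjectedToolError(f"injected failure in {tool.__name__}")
        return tool(*args, **kwargs)

    return noisy


def list_files(path: str) -> List[str]:
    """Stand-in tool; a real rollout would call the sandbox here."""
    return [f"{path}/a.txt", f"{path}/b.txt"]


if __name__ == "__main__":
    noisy_ls = with_runtime_noise(list_files, random.Random(0), error_rate=0.1)
    failures = 0
    for _ in range(1000):
        try:
            noisy_ls("/tmp")
        except InjectedToolError:
            failures += 1
    print(f"{failures} of 1000 calls failed")
```

Passing an explicit `random.Random` keeps rollouts reproducible, so a regression can be replayed with the same injected failures.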
The wrong answer, and the one most teams unconsciously pick, is to choose a sandbox for production as a pure software engineering decision, ignoring the ML requirements, then spend quarters chasing agent flakiness.
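The runtime contract from the second mitigation could be written down as a small versioned data structure plus a conformance check that runs against both training and production infrastructure. This is only a sketch under assumptions: `ToolSpec`, `RuntimeContract`, `Observation`, and `violations` are hypothetical names, and the fields are one possible reading of "the tools, the latencies, the failure modes."

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ToolSpec:
    name: str
    max_latency_s: float            # timing is part of the contract
    failure_modes: FrozenSet[str]   # failure kinds the policy is allowed to see


@dataclass(frozen=True)
class RuntimeContract:
    version: str
    tools: Tuple[ToolSpec, ...]


@dataclass(frozen=True)
class Observation:
    """One measured tool call on a concrete runtime."""
    tool: str
    latency_s: float
    failure: Optional[str] = None


def violations(contract: RuntimeContract, observations: Iterable[Observation]) -> List[str]:
    """Return a human-readable list of ways the observations break the contract."""
    specs = {spec.name: spec for spec in contract.tools}
    problems: List[str] = []
    for obs in observations:
        spec = specs.get(obs.tool)
        if spec is None:
            problems.append(f"{obs.tool}: not in contract {contract.version}")
            continue
        if obs.latency_s > spec.max_latency_s:
            problems.append(
                f"{obs.tool}: latency {obs.latency_s:.2f}s exceeds {spec.max_latency_s:.2f}s"
            )
        if obs.failure is not None and obs.failure not in spec.failure_modes:
            problems.append(f"{obs.tool}: undeclared failure mode {obs.failure!r}")
    return problems


if __name__ == "__main__":
    contract = RuntimeContract(
        version="1.0.0",
        tools=(ToolSpec("shell", max_latency_s=2.0, failure_modes=frozenset({"timeout"})),),
    )
    prod_calls = [Observation("shell", 3.1), Observation("shell", 0.2, failure="oom")]
    for problem in violations(contract, prod_calls):
        print(problem)
```

Running the same check against probes collected on both sides gives an early signal that the two runtimes have drifted apart, before the drift shows up as agent flakiness.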
Source: Hidden Technical Debt of AI Systems: Agent Runtime (Lee Hanchung, April 2026)
Related Notes
- Agent Harnesses: the harness is what the runtime hosts; runtime shift means the harness trained in one environment may not transfer cleanly
- Context Engineering: context management assumes stable tool behavior; runtime shift breaks that assumption
- Hidden Technical Debt of AI Systems - Agent Runtime: source clipping