🌰 seedling
Agentic vs Human Inference - Workload Profile Differences



Two Demand Profiles

| Dimension | Human prompts | Agent tasks |
|---|---|---|
| Pattern | Bursty, unpredictable | Continuous, programmable, 24/7 |
| Length | Short, one session | Long, multistep, hours-to-days |
| Concurrency | One at a time | Many parallel sessions |
| Memory needed | Short context, fits in HBM | Long context + tool history, overflows to DRAM |
| Care about | Speed (first-token latency) | Reliability, precision, persistent memory |
| Error tolerance | Forgiving | Compounding: a 2% error rate across 50 tool calls fails the job most of the time |

Human inference peaks at 9am Monday and dies at 3am Sunday. Agentic inference does not care about the clock.
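
The error-tolerance row can be checked with a few lines of arithmetic. This is a minimal sketch that assumes each tool call fails independently with the same probability and that any single failure sinks the whole job. Both are simplifying assumptions, and the function name is illustrative only.

```python
def job_success_probability(per_call_error_rate: float, num_calls: int) -> float:
    """Chance that every call in a chain succeeds, assuming independent errors."""
    return (1.0 - per_call_error_rate) ** num_calls


# The figures from the table: a 2% error rate across 50 tool calls.
p_success = job_success_probability(0.02, 50)
print(f"success: {p_success:.1%}, failure: {1 - p_success:.1%}")
# prints "success: 36.4%, failure: 63.6%"
```

Under those assumptions the job fails a little under two times in three, which is what "most of the time" amounts to here. Retries or error recovery inside the agent would change the figure.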

How Each Stresses Hardware

The four hardware components have distinct roles:

  • CPU: sequential generalist; handles decisions, routing, tool calls, internet access
  • GPU: parallel math; matrix calculations only, no access to the outside world
  • HBM (High Bandwidth Memory): stacked in the same package as the GPU, extremely fast, capacity limited by how much memory can be physically stacked and bonded there
  • DRAM (host memory): off-package, much larger capacity, slower to access

Each workload stresses these differently:

| Workload | CPU | GPU | HBM | DRAM |
|---|---|---|---|---|
| Training | spectator | dominant | dominant | unused |
| Human inference | route only | dominant | dominant | mostly idle |
| Agentic inference | primary player | dominant | overflowing | active |

Training pins the GPU and HBM. Human inference relies almost entirely on HBM for fast first-token loads. Agentic inference is the only workload that stresses all four components simultaneously: HBM overflows into DRAM as long contexts and tool histories accumulate, and the CPU is in the loop on every external tool call, parsing results, formatting tokens, and deciding what to write back to DRAM before handing control back to the GPU.

The result is constant CPU↔GPU and HBM↔DRAM handoff. That cross-component dance makes agentic inference the most demanding workload to serve.
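
To make the HBM overflow concrete, here is a rough sketch of how context length turns into memory. It assumes the context is held as a transformer key/value cache, and every number in it (the model shape and the 20 GiB HBM budget) is hypothetical, chosen only to keep the arithmetic readable.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of key/value cache for `tokens` of context (factor 2 = keys + values)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape and HBM budget, for illustration only.
MODEL = dict(layers=80, kv_heads=8, head_dim=128)
HBM_BUDGET_FOR_KV = 20 * 2**30  # assumed: 20 GiB of HBM left after weights

per_token = kv_cache_bytes(1, **MODEL)          # bytes of cache per token
tokens_in_hbm = HBM_BUDGET_FOR_KV // per_token  # context that still fits in HBM

for context in (8_192, 131_072):
    total = kv_cache_bytes(context, **MODEL)
    spill = max(0, total - HBM_BUDGET_FOR_KV)
    print(f"{context:>7} tokens: {total / 2**30:5.1f} GiB KV cache, "
          f"{spill / 2**30:5.1f} GiB beyond the HBM budget")
```

With these assumed numbers a short chat-sized context uses about 2.5 GiB, while a 131,072-token agent session needs 40 GiB, half of which has to live somewhere other than HBM. Real deployments vary with quantization, batching, and cache-sharing schemes.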

Utilization Patterns

  • Training: flat, sustained utilization curve
  • Human inference: spiky; provision-for-peak means idle GPUs at off-hours, large opportunity cost
  • Agentic inference: high and continuous but pulsed by CPU handoffs; GPU goes momentarily idle whenever a tool call is in flight

The human profile leaves expensive silicon underutilized between peaks. The agentic profile keeps utilization high, but only by forcing constant handoffs between components that were not designed to share work this tightly.
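
The pulsed agentic pattern can be sketched with a toy model. It assumes each agent step is a burst of GPU generation followed by a blocking tool call, and that several sessions can take turns on the GPU with no batching or scheduling overhead. The timings and function names are hypothetical.

```python
def single_session_gpu_busy(gpu_s: float, tool_s: float) -> float:
    """Fraction of wall-clock time the GPU works when one session alternates
    between GPU generation (gpu_s) and a blocking tool call (tool_s)."""
    return gpu_s / (gpu_s + tool_s)


def interleaved_gpu_busy(gpu_s: float, tool_s: float, sessions: int) -> float:
    """Idealized upper bound when `sessions` agents take turns on the GPU
    while the others wait on tools. Ignores batching and scheduling cost."""
    return min(1.0, sessions * single_session_gpu_busy(gpu_s, tool_s))


# Hypothetical timings: 2 s of generation, then a 6 s tool call.
for n in (1, 2, 4):
    print(f"{n} session(s): GPU busy {interleaved_gpu_busy(2.0, 6.0, n):.0%}")
```

In this model a single session leaves the GPU idle three quarters of the time, and it takes four interleaved sessions to fill the gaps. That is one way to read why agentic serving leans on many parallel sessions, although each extra session also adds context that has to be held in memory.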

The Blur

In practice the line between the two profiles is blurring. The moment a human session involves a web search, an API call, or a document retrieval, the CPU activates and DRAM fills. The cleanly separated table above is more of an analytical aid than an observed dichotomy.

Why It Matters

Current GPU architectures, networking topologies, memory hierarchies, and serving stacks were all optimized for the human profile: high concurrency of short, bursty requests with low first-token latency. The agentic profile inverts most of those design constraints. Long-running, memory-hungry, tool-calling workloads on hardware tuned for the opposite create:

  • HBM spillover at scale: long contexts and tool histories no longer fit in HBM
  • Idle GPU cycles during CPU-bound tool calls
  • Open silicon opportunity for inference-specific chips designed around the agentic workload profile rather than retrofitted from training or human-inference designs

The chip market that exists today is the wrong shape for the workload that’s growing fastest.

