Agentic vs Human Inference — Workload Profile Differences

Two Demand Profiles

	Human prompts	Agent tasks
Pattern	Bursty, unpredictable	Continuous, programmable, 24/7
Length	Short, one session	Long, multistep, hours-to-days
Concurrency (per user)	One at a time	Many parallel sessions
Memory needed	Short context, fits in HBM	Long context + tool history, overflows to DRAM
Care about	Speed (first-token latency)	Reliability, precision, persistent memory
Error tolerance	Forgiving	Compounding: at a 2% per-call error rate, 50 independent tool calls finish cleanly only about 36% of the time (0.98^50 ≈ 0.36), so the job fails roughly two times in three
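
The compounding claim in the last row can be checked with a few lines of arithmetic. This is a minimal sketch that assumes each tool call fails independently with the same probability; the function name `clean_run_probability` is made up for illustration, and real agents may retry or recover, which this ignores.

```python
def clean_run_probability(per_call_error: float, num_calls: int) -> float:
    """Probability that every call succeeds, assuming independent failures."""
    return (1.0 - per_call_error) ** num_calls


if __name__ == "__main__":
    # 2% per-call error rate, as in the table above.
    for calls in (1, 10, 50, 100):
        p = clean_run_probability(0.02, calls)
        print(f"{calls:>3} calls: {p:.1%} clean runs, {1 - p:.1%} failed jobs")
```

At 50 calls the clean-run probability is about 36%, which is where "fails most of the time" comes from. A single human prompt is the one-call case, so the same 2% barely registers.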

Human inference peaks at 9am Monday and dies at 3am Sunday. Agentic inference does not care about the clock.

How Each Stresses Hardware

The four hardware components have distinct roles:

CPU — sequential generalist; handles decisions, routing, tool calls, internet access
GPU — parallel math; matrix calculations only, no outside world
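HBM (High Bandwidth Memory) — stacked memory packaged beside the GPU die (on-package rather than on-die); extremely high bandwidth, with capacity limited by how many dies can be stacked and bonded
DRAM — host system memory outside the GPU package; far larger capacity, but much slower to reach from the GPU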

Each workload stresses these differently:

Workload	CPU	GPU	HBM	DRAM
Training	spectator	dominant	dominant	unused
Human inference	route only	dominant	dominant	mostly idle
Agentic inference	primary player	dominant	overflowing	active

Training pins the GPU and HBM. Human inference leans almost as heavily on the same pair, because low first-token latency depends on keeping the weights and the short context resident in HBM. Agentic inference is the only workload that stresses all four components simultaneously: HBM overflows into DRAM as long contexts and tool histories accumulate, and the CPU is in the loop on every external tool call, parsing results, formatting tokens, and deciding what to write back to DRAM before handing control back to the GPU.

The result is constant CPU↔GPU and HBM↔DRAM handoff. That cross-component dance makes agentic inference the most demanding workload to serve.
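
To make the HBM overflow concrete, here is a rough sizing sketch for the attention key/value cache, the per-session state that grows with context length. The formula (two tensors, keys and values, per layer per token) is standard for transformer decoders. Every number below is a hypothetical assumption, not taken from the source: the model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values) and the 40 GiB of HBM assumed to be left over after weights.

```python
GIB = 1024 ** 3

# Hypothetical model shape and memory budget (assumptions for illustration).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
HBM_FOR_KV_GIB = 40  # HBM assumed free for KV cache after loading weights


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: keys + values (the factor 2) for every layer and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def sessions_that_fit(tokens_per_session: int) -> int:
    """How many sessions of this context length fit in the HBM budget."""
    per_session = kv_cache_bytes(tokens_per_session, LAYERS, KV_HEADS, HEAD_DIM)
    return (HBM_FOR_KV_GIB * GIB) // per_session


if __name__ == "__main__":
    for label, tokens in (("short human chat", 4_096),
                          ("long agent session", 200_000)):
        size_gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / GIB
        print(f"{label}: {size_gib:.2f} GiB per session, "
              f"{sessions_that_fit(tokens)} fit in HBM")
```

Under these assumptions dozens of short chats share the HBM budget, while a single 200,000-token agent session does not fit at all, so part of its state has to live in host DRAM.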

Utilization Patterns

Training — flat, sustained utilization curve
Human inference — spiky; provision-for-peak means idle GPUs at off-hours, large opportunity cost
Agentic inference — high and continuous but pulsed by CPU handoffs; GPU goes momentarily idle whenever a tool call is in flight
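
The pulsed pattern in the agentic case can be approximated with a simple ratio. This sketch assumes a single agent session that alternates between GPU generation and a blocking tool call, with nothing else scheduled on the GPU in the gap; the function name and the timings are hypothetical. Serving stacks that interleave other sessions during the stall would see higher utilization than this.

```python
def gpu_busy_fraction(gpu_seconds_per_step: float,
                      tool_seconds_per_call: float,
                      calls_per_step: float = 1.0) -> float:
    """Share of wall-clock time the GPU is working in a generate/tool-call loop."""
    stall = tool_seconds_per_call * calls_per_step
    return gpu_seconds_per_step / (gpu_seconds_per_step + stall)


if __name__ == "__main__":
    # Hypothetical timings: 3 s of generation per step, varying tool latency.
    for tool_latency in (0.1, 1.0, 5.0):
        busy = gpu_busy_fraction(3.0, tool_latency)
        print(f"tool call {tool_latency:>4.1f} s -> GPU busy {busy:.0%}")
```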

The human profile leaves expensive silicon underutilized between peaks. The agentic profile keeps utilization high, but only through constant handoffs between components that the hardware was not designed to coordinate this tightly.
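
The cost of provisioning for peak can be expressed as average load divided by peak load. The sketch below uses two invented 24-hour load curves (arbitrary units, not measured data): a human-style curve with a daytime peak and a flat agent-style curve.

```python
def peak_provisioned_utilization(hourly_load: list[float]) -> float:
    """Average utilization when capacity is sized to the busiest hour."""
    peak = max(hourly_load)
    return sum(hourly_load) / (len(hourly_load) * peak)


# Hypothetical load curves, one value per hour (assumptions for illustration).
human_load = [10] * 8 + [100] * 8 + [40] * 8   # quiet night, busy day, evening
agent_load = [90] * 24                          # steady around the clock

if __name__ == "__main__":
    print(f"human profile: {peak_provisioned_utilization(human_load):.0%}")
    print(f"agent profile: {peak_provisioned_utilization(agent_load):.0%}")
```

With these made-up curves the human fleet averages half of its provisioned capacity, while the flat agent curve uses all of it. The gap is the opportunity cost described above.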

The Blur

In practice the line between the two profiles is blurring. The moment a human session involves a web search, an API call, or a document retrieval, the CPU activates and DRAM fills. The cleanly separated table above is more of an analytical aid than an observed dichotomy.

Why It Matters

Current GPU architectures, networking topologies, memory hierarchies, and serving stacks were all optimized for the human profile: many independent users sending short, bursty requests that need low first-token latency. The agentic profile inverts most of those design constraints. Running long-lived, memory-hungry, tool-calling workloads on hardware tuned for the opposite creates:

HBM spillover at scale: long contexts and tool histories no longer fit in HBM
Idle GPU cycles during CPU-bound tool calls
Open silicon opportunity for inference-specific chips designed around the agentic workload profile rather than retrofitted from training or human-inference designs

The chip market that exists today is the wrong shape for the workload that’s growing fastest.

Related Notes and Source

AI is power and compute constrained — parent constraint thesis
AI Memory Crowding - HBM Eats Consumer Device Budgets — the memory chokepoint that the agentic workload makes worse
CPU Bottleneck - The Hidden Constraint on AI Scaling — a parallel CPU shortage thesis from a different angle (RL environments + inference diffusion)
KV Cache Compression - The Primary Bottleneck in Long-Context Inference — the specific memory bottleneck inside long agentic sessions
Edge vs Server Model Architecture - Why One DNA Cannot Serve Both — same logic at a different layer: one workload profile cannot be served well by hardware designed for another
CUDA Programmability Moat - Why Flexibility Beats Optimization — programmability matters more for diverse workloads like agents than for narrow training jobs
Inference Cost Collapse and Frontier Model Margin Expansion — the cost trajectory of inference depends heavily on which workload profile dominates
Sriram Krishnan — AI Economics Part 2 — source