AI Memory Crowding — HBM Eats Consumer Device Budgets
Why HBM, not commodity DRAM
The binding constraint on inference throughput is memory bandwidth, not compute (FLOPs) or memory capacity.
| Metric | HBM | DDR DRAM |
|---|---|---|
| Bandwidth | ~2.5 TB/s per stack | ~64-128 GB/s |
| Wafer area per bit | 3-4x more | 1x (baseline) |
| Cost per bit | Much higher | Lower |
| Value per bit in AI | Orders of magnitude higher | N/A for AI accelerators |
Switching to commodity DRAM would increase capacity per chip but leave compute cores idle waiting for data. Total tokens per dollar gets worse, not better.
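A minimal sketch of the arithmetic behind that claim, assuming the simplest bandwidth-bound model: each decoded token streams the full weight footprint from memory once. The 16 GB footprint is an illustrative assumption (roughly an 8B-parameter model at 16 bits), the bandwidth figures come from the table above, and the function name is hypothetical.

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode rate in the bandwidth-bound regime.

    Assumes every generated token reads all weights from memory once and
    ignores compute time, batching, and KV cache traffic.
    """
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 16.0  # assumption: ~8B parameters at 16 bits each

hbm_rate = decode_tokens_per_sec(2500.0, WEIGHTS_GB)  # ~2.5 TB/s, HBM row of the table
ddr_rate = decode_tokens_per_sec(128.0, WEIGHTS_GB)   # top of the DDR range in the table

print(f"HBM: {hbm_rate:.1f} tok/s")
print(f"DDR: {ddr_rate:.1f} tok/s")
print(f"ratio: {hbm_rate / ddr_rate:.1f}x")
```

Under these assumptions the throughput gap equals the bandwidth gap, which is why cheaper bits do not translate into cheaper tokens when the compute cores sit idle.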
The crowding mechanism
- DRAM vendors lost money in 2023 → delayed fab investment
- Prices recovered in 2024 when reasoning models + KV cache scaling made long-context mainstream
- New fabs take 2 years → meaningful capacity arrives late 2027-2028
- In the interim: AI demand claims an increasing share of fixed memory supply
- Consumer devices get squeezed — prices rise, volumes fall
Projected impact: smartphone volumes fall from 1.4B to 500-600M units. Xiaomi and Oppo are already cutting low-end volumes by half. Memory vendors prefer AI contracts (longer terms, higher margins, more value per bit).
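The crowding mechanism can be sketched as a toy allocation model. It uses the 3-4x wafer-area-per-bit figure from the table above (3.5x here); the wafer supply and the HBM demand path are hypothetical, normalized numbers chosen only to show the shape of the squeeze, not forecasts.

```python
def consumer_bits(total_wafers: float, hbm_bits_demanded: float,
                  hbm_area_penalty: float = 3.5) -> float:
    """Commodity DRAM bits left after HBM claims its share of a fixed wafer supply.

    Units are normalized: one wafer yields 1.0 unit of commodity DRAM bits,
    or 1 / hbm_area_penalty units of HBM bits.
    """
    hbm_wafers = hbm_bits_demanded * hbm_area_penalty
    return max(total_wafers - hbm_wafers, 0.0)


TOTAL_WAFERS = 100.0  # hypothetical: supply is fixed until new fabs come online

# Hypothetical demand path: HBM bits demanded per year, in the same normalized units.
for year, hbm_bits in [(2025, 5.0), (2026, 10.0), (2027, 15.0)]:
    left = consumer_bits(TOTAL_WAFERS, hbm_bits)
    print(f"{year}: consumer bits available = {left:.1f}")
```

The point of the sketch is the multiplier: with supply fixed, every additional bit of HBM removes roughly 3-4 bits of commodity DRAM from the consumer pool.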
Investment implications
Memory vendors (SK Hynix, Samsung, Micron) benefit from the shift to HBM — higher margins per bit, longer contract terms, more predictable demand. Consumer electronics companies face BOM inflation that compresses margins or forces price increases. The transition is structural, not cyclical — AI’s memory appetite grows faster than new supply comes online.
2026-05 Update — The Dual Supercycle and CXL 3.0
Sriram Krishnan (AI Economics Part 2) frames HBM demand as occurring in two distinct supercycles, each with a different root cause:
- First supercycle: driven by training. Frontier models needed thousands of HBM-fed GPUs running uninterrupted, weeks-long jobs. This was the supercycle that the existing fab buildout (and the consumer crowding-out covered above) was responding to.
- Second supercycle: driven by agentic inference. Long context windows and growing task/tool histories overflow HBM and force constant spillover into DRAM. Where human-driven inference sessions fit in HBM and are discarded quickly, agentic sessions hold growing state for hours. That workload profile is the structural pull behind a second HBM demand wave on top of the first.
This makes the HBM supply problem more durable than a single-wave training-cycle story would suggest. There is no end-of-training demand peak to wait out; agents are a structurally higher steady-state consumer of HBM than humans were.
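To make the "growing state" point concrete, here is a rough KV cache sizing sketch using the standard per-token formula (keys plus values, per layer, per KV head). The model shape (80 layers, 8 KV heads, head dimension 128, 2 bytes per element) and the two session lengths are assumptions for illustration, not figures from the source.

```python
def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence.

    The factor of 2 covers keys and values; the defaults describe a
    hypothetical large model with grouped-query attention at 16-bit precision.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


GIB = 1024 ** 3

# Assumed session lengths: a short human chat vs. a long-running agent.
for label, tokens in [("short chat", 8_192), ("long agent session", 1_048_576)]:
    size_gib = kv_cache_bytes(tokens) / GIB
    print(f"{label}: {tokens} tokens, {size_gib:.1f} GiB of KV cache")
```

With this hypothetical shape the cache costs 320 KiB per token, so state grows linearly with session length and a single long agent session can exceed the HBM attached to one accelerator.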
The physical bonding constraint. HBM stacks are bonded into the same package as the GPU die during advanced packaging. Only so many stacks fit around a GPU before the package runs out of physical space, so capacity per accelerator is bounded by package geometry, not just fab capacity. This is also why commodity DRAM cannot substitute for HBM despite being much cheaper per bit: it sits off-package, where capacity scales freely but bandwidth does not.
CXL 3.0 as the near-term fix. The most promising architectural workaround is Compute Express Link (CXL) 3.0, which lets the CPU and GPU share a coherent, pooled memory space instead of copying data explicitly across PCIe. CXL 3.0 itself runs over the PCIe 6.0 physical layer, so what it removes is the copy-based PCIe path as a bottleneck, not the link. This would relax the HBM constraint by giving agentic workloads coherent access to a larger memory pool rather than forcing spillover through slow PCIe transfers. Commercial deployment at scale is 2-3 years out, too far away to ease the current crunch.
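A rough sketch of why spillover hurts so much, and why the bandwidth of the second tier matters: when a fraction of bytes comes from a slower tier, time per byte adds up, so the blended bandwidth is a weighted harmonic mean. The 2,500 GB/s figure is the HBM number from the table earlier in this note; the 64 GB/s slow path is an assumed value, roughly one direction of a PCIe 5.0 x16 link. The sketch does not model CXL itself.

```python
def effective_bandwidth(fast_gb_s: float, slow_gb_s: float,
                        spill_fraction: float) -> float:
    """Blended bandwidth when spill_fraction of bytes is served by the slow tier.

    Transfer times add across tiers, so this is a weighted harmonic mean,
    not an arithmetic average of the two bandwidths.
    """
    if not 0.0 <= spill_fraction <= 1.0:
        raise ValueError("spill_fraction must be between 0 and 1")
    time_per_gb = (1.0 - spill_fraction) / fast_gb_s + spill_fraction / slow_gb_s
    return 1.0 / time_per_gb


HBM_GB_S = 2500.0  # ~2.5 TB/s, from the table earlier in this note
SLOW_GB_S = 64.0   # assumption: slow spill path, about PCIe 5.0 x16 one-way

for spill in (0.0, 0.01, 0.10):
    bw = effective_bandwidth(HBM_GB_S, SLOW_GB_S, spill)
    print(f"spill {spill:.0%}: ~{bw:.0f} GB/s effective")
```

Under these assumptions, spilling 10% of the traffic cuts effective bandwidth to roughly 520 GB/s, about a fifth of the HBM figure. Raising the slow-tier bandwidth, which is what a pooled-memory fabric aims to do, moves the blended number far more than adding HBM capacity at the margin.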
Related Notes
- EUV Lithography as the Binding Constraint on AI Scaling — the other hardware bottleneck, at the logic layer
- CUDA Programmability Moat - Why Flexibility Beats Optimization — why bandwidth matters more than raw FLOPs
- Inference Cost Collapse and Frontier Model Margin Expansion — inference economics depend on memory bandwidth
- AI and Investing Thesis — parent hub
- AI Stack Value Accrual - Chip, Infra, Intelligence, App — HBM crowding is the memory constraint on the chip layer of the value accrual stack
- Dylan Patel / SemiAnalysis — source