KV Cache Compression — The Primary Bottleneck in Long-Context Inference
Why the KV Cache Is the Bottleneck
Cache size scales as: 2 (K and V) x layers x KV_heads x head_dim x sequence_length x bytes_per_element
At 128K context, this term dominates the memory budget before the model weights even load. On an 8 GB phone, it is fatal without compression.
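A back-of-envelope calculator makes the scaling concrete. The config values below are illustrative (a mid-size dense-attention model, not any specific release):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2 accounts for storing both K and V; bytes_per_elem=2 assumes fp16/bf16
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative config: 32 layers, 32 KV heads (no GQA), head_dim 128, fp16
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=128_000)
print(f"{size / 2**30:.1f} GiB")  # 62.5 GiB -- larger than the weights themselves
```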
Four Compression Strategies
1. Grouped-Query Attention (GQA)
Multiple query heads share a single KV head. An 8:1 ratio cuts the cache by 8x. GQA can be applied uniformly (edge) or selectively by layer type (server).
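A minimal PyTorch sketch of the mechanism; head counts and shapes are illustrative:

```python
import torch

def gqa_attention(q, k, v):
    """Grouped-query attention: q has more heads than k/v; each KV head
    serves a group of query heads, so the cache stores only kv_heads."""
    q_heads, kv_heads = q.shape[1], k.shape[1]
    group = q_heads // kv_heads              # e.g. 8 query heads per KV head
    k = k.repeat_interleave(group, dim=1)    # broadcast each KV head to its group
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# 32 query heads share 4 KV heads: an 8:1 ratio, so the cache shrinks 8x
q = torch.randn(1, 32, 16, 64)   # (batch, q_heads, seq, head_dim)
k = torch.randn(1, 4, 16, 64)    # (batch, kv_heads, seq, head_dim)
v = torch.randn(1, 4, 16, 64)
out = gqa_attention(q, k, v)     # (1, 32, 16, 64)
```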
2. Cross-Layer KV Sharing
Later layers in a trained network produce converging representations. Instead of computing fresh K/V projections at every layer, some layers reuse K/V from earlier layers. Gemma 4 E2B shares across 20 of 35 layers, eliminating cache for those layers entirely.
Trade-off: shared layers lose the ability to sharpen their own retrieval targets. Compensating requires doubling MLP width, trading compute (cheap on phones) for memory (scarce on phones).
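A sketch of the bookkeeping. The share map and the `project_kv` callable are hypothetical stand-ins, not Gemma's actual layout:

```python
# Illustrative share map: which layers reuse K/V from an earlier layer.
# (Gemma 4 E2B reportedly shares 20 of 35 layers; this layout is made up.)
SHARE_SOURCE = {20: 19, 21: 19, 25: 24, 26: 24}

kv_cache = {}  # layer index -> (K, V); only layers that own a cache appear here

def get_kv(layer_idx, hidden, project_kv):
    """project_kv(layer_idx, hidden) -> (K, V) stands in for a layer's real projections."""
    src = SHARE_SOURCE.get(layer_idx)
    if src is not None:
        return kv_cache[src]          # reuse earlier K/V: no projection, no cache growth
    k, v = project_kv(layer_idx, hidden)
    kv_cache[layer_idx] = (k, v)
    return k, v
```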
3. Multi-Head Latent Attention (DeepSeek MLA)
Instead of caching full K and V, each token's information is down-projected into a compressed latent vector, and only that latent is stored in the cache; full K/V are reconstructed on the fly during attention computation. DeepSeek reported a 93.3% cache reduction with this approach.
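A rough PyTorch sketch of the latent-cache idea. Dimensions are illustrative, and details of MLA such as the decoupled RoPE path are omitted:

```python
import torch.nn as nn

class LatentKV(nn.Module):
    """Sketch of the MLA idea: cache one small latent per token instead of
    full K/V, and up-project to K/V only when attention runs."""
    def __init__(self, d_model=4096, d_latent=512, kv_dim=4096):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)  # compression: this output is cached
        self.up_k = nn.Linear(d_latent, kv_dim)   # K reconstructed at attention time
        self.up_v = nn.Linear(d_latent, kv_dim)   # V reconstructed at attention time

    def compress(self, hidden):
        return self.down(hidden)                  # (batch, seq, 512): ~8x smaller here

    def expand(self, latent):
        return self.up_k(latent), self.up_v(latent)  # computed per step, never stored
```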
4. Interleaved Local-Global Attention
Most layers use sliding-window attention, whose cache is capped at the window size. Only a fraction of layers compute full global attention, which requires cache across the entire context. Gemma 4 uses a 5:1 local-to-global ratio.
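A quick accounting of cached positions under an interleaved layout; the window size and layer count are assumptions:

```python
def interleaved_cache_tokens(layers, context_len, window=1024, local_per_global=5):
    """Cached positions per KV head: local layers cap at the window size,
    global layers cache the full context."""
    n_global = layers // (local_per_global + 1)
    n_local = layers - n_global
    return n_local * min(window, context_len) + n_global * context_len

full = 36 * 128_000                            # every layer global
mixed = interleaved_cache_tokens(36, 128_000)  # 30 local + 6 global layers
print(full / mixed)                            # ~5.8x fewer cached positions
```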
Combined Effect
Stacking these techniques produces order-of-magnitude reductions. Gemma 4 E2B combines cross-layer sharing + 8:1 GQA + interleaved attention to shrink cache from tens of gigabytes to hundreds of megabytes at 128K context. This is the only reason long-context models run on 8 GB phones.
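A stacked back-of-envelope, treating the three reductions as independent multipliers (a simplification, since the shared and local layers overlap):

```python
baseline_gib = 62.5                      # dense MHA at 128K, from the calculator above
after_gqa = baseline_gib / 8             # 8:1 grouped-query attention
after_share = after_gqa * (15 / 35)      # 20 of 35 layers reuse earlier K/V
after_interleave = after_share / 5.8     # 5:1 local:global with a small window
print(f"{after_interleave * 1024:.0f} MiB")  # ~590 MiB: tens of GB down to hundreds of MB
```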
Industry Convergence
Different labs attack the same target from different angles:
- DeepSeek: latent compression within layers (MLA)
- Gemma 4: cross-layer sharing across layers
- Mamba: selective state updates (avoiding attention entirely)
- Ring Attention: partitioning across devices
The architecture that figures out exactly which tokens need expensive attention wins.
Related Notes
- Edge vs Server Model Architecture - Why One DNA Cannot Serve Both — why compression strategies differ by deployment target
- Per-Layer Embeddings - Trading Flash for DRAM on Edge Models — another edge memory optimization
- AI Memory Crowding - HBM Eats Consumer Device Budgets — the economics of memory scarcity
- Google Gemma 4 Will Change How AI Is Deployed — source article