KV Cache Compression — The Primary Bottleneck in Long-Context Inference
Why the KV Cache Is the Bottleneck
Cache size scales as: 2 (K and V) x layers x KV_heads x head_dim x sequence_length x bytes_per_element
At 128K context, this term dominates the memory budget before the model weights even load. On an 8 GB phone, it is fatal without compression.
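A back-of-envelope calculator makes the scaling concrete. The config values below are illustrative (a mid-size dense-attention model, not any specific release):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2 accounts for storing both K and V; bytes_per_elem=2 assumes fp16/bf16
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative config: 32 layers, 32 KV heads (no GQA), head_dim 128, fp16
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=128_000)
print(f"{size / 2**30:.1f} GiB")  # 62.5 GiB -- larger than the weights themselves
```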
Four Compression Strategies
1. Grouped-Query Attention (GQA)
Multiple query heads share a single KV head. An 8:1 ratio cuts the cache by 8x. GQA can be applied uniformly (edge) or selectively by layer type (server).
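A minimal PyTorch sketch of the mechanism; head counts and shapes are illustrative:

```python
import torch

def gqa_attention(q, k, v):
    """Grouped-query attention: q has more heads than k/v; each KV head
    serves a group of query heads, so the cache stores only kv_heads."""
    q_heads, kv_heads = q.shape[1], k.shape[1]
    group = q_heads // kv_heads              # e.g. 8 query heads per KV head
    k = k.repeat_interleave(group, dim=1)    # broadcast each KV head to its group
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# 32 query heads share 4 KV heads: an 8:1 ratio, so the cache shrinks 8x
q = torch.randn(1, 32, 16, 64)   # (batch, q_heads, seq, head_dim)
k = torch.randn(1, 4, 16, 64)    # (batch, kv_heads, seq, head_dim)
v = torch.randn(1, 4, 16, 64)
out = gqa_attention(q, k, v)     # (1, 32, 16, 64)
```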
2. Cross-Layer KV Sharing
Later layers in a trained network produce converging representations. Instead of computing fresh K/V projections at every layer, some layers reuse K/V from earlier layers. Gemma 4 E2B shares across 20 of 35 layers, eliminating cache for those layers entirely.
Trade-off: shared layers lose the ability to sharpen their own retrieval targets. Compensating requires doubling MLP width, trading compute (cheap on phones) for memory (scarce on phones).
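A sketch of the bookkeeping. The share map and the `project_kv` callable are hypothetical stand-ins, not Gemma's actual layout:

```python
# Illustrative share map: which layers reuse K/V from an earlier layer.
# (Gemma 4 E2B reportedly shares 20 of 35 layers; this layout is made up.)
SHARE_SOURCE = {20: 19, 21: 19, 25: 24, 26: 24}

kv_cache = {}  # layer index -> (K, V); only layers that own a cache appear here

def get_kv(layer_idx, hidden, project_kv):
    """project_kv(layer_idx, hidden) -> (K, V) stands in for a layer's real projections."""
    src = SHARE_SOURCE.get(layer_idx)
    if src is not None:
        return kv_cache[src]          # reuse earlier K/V: no projection, no cache growth
    k, v = project_kv(layer_idx, hidden)
    kv_cache[layer_idx] = (k, v)
    return k, v
```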
3. Multi-Head Latent Attention (DeepSeek MLA)
Instead of caching full K and V, each token's information is down-projected into a compressed latent vector, and only that latent is stored in the cache; full K/V are reconstructed on the fly during attention computation. DeepSeek reported a 93.3% cache reduction with this approach.
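A rough PyTorch sketch of the latent-cache idea. Dimensions are illustrative, and details of MLA such as the decoupled RoPE path are omitted:

```python
import torch.nn as nn

class LatentKV(nn.Module):
    """Sketch of the MLA idea: cache one small latent per token instead of
    full K/V, and up-project to K/V only when attention runs."""
    def __init__(self, d_model=4096, d_latent=512, kv_dim=4096):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)  # compression: this output is cached
        self.up_k = nn.Linear(d_latent, kv_dim)   # K reconstructed at attention time
        self.up_v = nn.Linear(d_latent, kv_dim)   # V reconstructed at attention time

    def compress(self, hidden):
        return self.down(hidden)                  # (batch, seq, 512): ~8x smaller here

    def expand(self, latent):
        return self.up_k(latent), self.up_v(latent)  # computed per step, never stored
```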
4. Interleaved Local-Global Attention
Most layers use sliding-window attention, whose cache is capped at the window size. Only a fraction of layers compute full global attention, which requires cache across the entire context. Gemma 4 uses a 5:1 local-to-global ratio.
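A quick accounting of cached positions under an interleaved layout; the window size and layer count are assumptions:

```python
def interleaved_cache_tokens(layers, context_len, window=1024, local_per_global=5):
    """Cached positions per KV head: local layers cap at the window size,
    global layers cache the full context."""
    n_global = layers // (local_per_global + 1)
    n_local = layers - n_global
    return n_local * min(window, context_len) + n_global * context_len

full = 36 * 128_000                            # every layer global
mixed = interleaved_cache_tokens(36, 128_000)  # 30 local + 6 global layers
print(full / mixed)                            # ~5.8x fewer cached positions
```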
Combined Effect
Stacking these techniques produces order-of-magnitude reductions. Gemma 4 E2B combines cross-layer sharing + 8:1 GQA + interleaved attention to shrink cache from tens of gigabytes to hundreds of megabytes at 128K context. This is the only reason long-context models run on 8 GB phones.
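A stacked back-of-envelope, treating the three reductions as independent multipliers (a simplification, since the shared and local layers overlap):

```python
baseline_gib = 62.5                      # dense MHA at 128K, from the calculator above
after_gqa = baseline_gib / 8             # 8:1 grouped-query attention
after_share = after_gqa * (15 / 35)      # 20 of 35 layers reuse earlier K/V
after_interleave = after_share / 5.8     # 5:1 local:global with a small window
print(f"{after_interleave * 1024:.0f} MiB")  # ~590 MiB: tens of GB down to hundreds of MB
```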
Industry Convergence
Different labs attack the same target from different angles:
- DeepSeek: latent compression within layers (MLA)
- Gemma 4: cross-layer sharing across layers
- Mamba: selective state updates (avoiding attention entirely)
- Ring Attention: partitioning across devices
The architecture that figures out exactly which tokens need expensive attention wins.
Related Notes
- Edge vs Server Model Architecture - Why One DNA Cannot Serve Both — why compression strategies differ by deployment target
- Per-Layer Embeddings - Trading Flash for DRAM on Edge Models — another edge memory optimization
- AI Memory Crowding - HBM Eats Consumer Device Budgets — the economics of memory scarcity
- Google Gemma 4 Will Change How AI Is Deployed — source article