🌰 seedling
Edge vs Server Model Architecture - Why One DNA Cannot Serve Both



The Constraint Flip

| Resource | Edge (Phone) | Server (H100) |
| --- | --- | --- |
| Storage | 128 GB flash (cheap) | Not a factor |
| DRAM | 8 GB shared (scarce) | 80 GB HBM (abundant) |
| Compute | Battery-limited (scarce) | Pay-per-hour (abundant) |
| Binding constraint | Memory | Compute cost |

Every architectural choice is a trade between memory, storage, and compute. When the scarcity flips, you pull opposite levers.
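The flip is easy to see with back-of-envelope KV-cache arithmetic. The sketch below uses the standard KV-cache size formula; both configurations (layer counts, head counts, context lengths) are illustrative assumptions, not published specs of any model:

```python
# Back-of-envelope KV-cache sizing. Same formula, opposite binding
# constraints. All configs below are illustrative assumptions.

def kv_cache_gb(layers, kv_heads, head_dim, ctx_len, dtype_bytes=2):
    """GB needed to hold K and V across all layers at a given context length."""
    return 2 * layers * kv_heads * head_dim * ctx_len * dtype_bytes / 1e9

# Hypothetical edge config: weights AND cache must share ~8 GB of phone DRAM.
edge = kv_cache_gb(layers=30, kv_heads=4, head_dim=128, ctx_len=32_000)

# Hypothetical server config: 80 GB HBM makes a fat cache acceptable
# if it avoids recompute.
server = kv_cache_gb(layers=60, kv_heads=16, head_dim=128, ctx_len=128_000)

print(f"edge cache:   {edge:.2f} GB")    # ~2 GB: a quarter of phone DRAM
print(f"server cache: {server:.2f} GB")  # ~63 GB: fits comfortably in HBM
```

Even a modest edge config burns a quarter of the phone's DRAM on cache alone, which is why edge designs attack cache size first while server designs happily spend it.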

How This Plays Out in Practice

Google’s Gemma 4 family demonstrates this clearly. Under one model family name, the edge pair (E2B, E4B) and server pair (26B, 31B) share almost no architectural DNA:

  • Edge models park 46% of parameters in flash via Per-Layer Embeddings, compress the KV cache by 83%+ via cross-layer sharing, and double MLP width to compensate, spending compute to save memory.
  • Server models skip PLE entirely, use selective GQA compression only on coarse global layers, and keep all layers computing independent KV, spending memory to save compute.
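The two edge levers above reduce to simple fractions. In this sketch the 46% and 83% figures come from the note itself; the total parameter count and layer split are assumed numbers chosen only to reproduce those percentages:

```python
# Sketch of the two edge levers: PLE and cross-layer KV sharing.
# Percentages are from the note; the configs are assumptions.

def dram_resident_params(total_params, ple_fraction):
    # Per-Layer Embeddings: ple_fraction of weights stream from flash
    # on demand, so only the remainder must stay resident in DRAM.
    return total_params * (1 - ple_fraction)

def kv_kept_fraction(layers, layers_with_own_kv):
    # Cross-layer sharing: only some layers compute fresh KV;
    # the rest reuse an earlier layer's cache.
    return layers_with_own_kv / layers

params_in_dram = dram_resident_params(total_params=4e9, ple_fraction=0.46)
kv_kept = kv_kept_fraction(layers=30, layers_with_own_kv=5)

print(f"DRAM-resident params: {params_in_dram / 1e9:.2f}B of 4B")
print(f"KV cache kept: {kv_kept:.0%} (an ~{1 - kv_kept:.0%} reduction)")
```

Both levers trade something else away: streaming embeddings from flash costs latency and energy, and shared KV costs some modeling flexibility, which is the "spending compute to save memory" half of the bargain.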

The Broader Trend

Uniform scaling (pushing the same architecture at different sizes) is giving way to hardware-specialized design. This pattern is emerging across the AI stack:

  • Chips: inference and training silicon are bifurcating
  • Tokenization: text-style tokenizers are failing for vision, requiring purpose-built approaches
  • Reasoning: monolithic reasoning is splitting into specialized subsystems

The labs that win the next cycle will not just scale brute compute. They will engineer architectures that exploit the specific physics of their target hardware.


Connected Notes