CUDA Programmability Moat — Why Flexibility Beats Optimization
The structural argument
A TPU or custom ASIC takes a known computation pattern (matrix multiplies in predictable shapes) and optimizes the silicon for exactly that pattern. When the workload matches, the ASIC wins on efficiency. When the workload changes — and in AI, workloads change every 6-12 months as researchers invent new methods — the ASIC becomes a stranded asset.
CUDA GPUs trade peak efficiency for programmability. Any researcher with a new idea can write CUDA kernels and test them on existing hardware. The ecosystem compounds: CUDA libraries get optimized, frameworks build on those libraries, researchers build on those frameworks. Each layer is a switching cost.
Jensen Huang frames the 30-50x Hopper→Blackwell improvement as proof:
- Moore’s Law contributed roughly 1.75x (a ~75% transistor-level gain) over 3 years
- The remaining ~17-29x came from Mixture of Experts, new numerics, and co-design across processors, fabric, and libraries, all of which require CUDA’s programmability
- An ASIC optimized for Hopper-era workloads would miss most of that gain
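As a sanity check, the co-design residual is just the headline generational gain divided by the Moore's Law factor. A minimal sketch using the figures quoted in this note (the split is implied arithmetic, not Nvidia's official accounting):

```python
# Sanity check on the Hopper -> Blackwell speedup decomposition.
# Figures are the ones quoted in this note; the split is illustrative.

total_speedup_low, total_speedup_high = 30.0, 50.0  # claimed generational gain
moores_law_factor = 1.75  # ~75% transistor-level gain over 3 years

# Residual factor attributable to co-design (MoE, numerics, fabric, libraries)
residual_low = total_speedup_low / moores_law_factor
residual_high = total_speedup_high / moores_law_factor

print(f"Co-design residual: {residual_low:.0f}x to {residual_high:.0f}x")
# An ASIC frozen on Hopper-era workloads keeps the 1.75x transistor gain
# but forfeits most of this residual.
```

The point of the decomposition: most of the generational gain sits in the residual, which is exactly the part an ASIC fixed to the previous generation's workload shapes cannot capture.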
Why the ASIC cost advantage is narrower than it appears
Three factors compress the theoretical savings:
- ASIC margins are nearly as high as Nvidia’s. Google’s TPUs and Amazon’s Trainium aren’t sold at cost. The silicon savings don’t flow entirely to the customer.
- Nvidia engineers embedded at AI labs. Nvidia places engineers at frontier labs to optimize their stacks, sometimes unlocking 2-3x speedups. This services layer doesn’t show up in hardware benchmarks.
- InferenceMAX benchmark challenge. Huang points to Dylan Patel’s open benchmark as a standing invitation for TPU and Trainium teams to demonstrate their cost advantage. Neither has accepted publicly.
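The compression of the headline saving can be sketched numerically. All figures below are hypothetical assumptions chosen for illustration, not sourced numbers:

```python
# Illustrative sketch: how a nominal ASIC silicon-cost advantage shrinks.
# Every number here is a hypothetical assumption, not a sourced figure.

nvidia_price = 1.00          # normalized GPU price
asic_silicon_cost = 0.50     # suppose silicon is half as expensive to make
asic_vendor_margin = 1.8     # factor 1: the ASIC is not sold at cost
asic_price = asic_silicon_cost * asic_vendor_margin  # 0.90

# Factor 2: embedded Nvidia engineers unlock software speedups,
# lowering the GPU's effective cost per unit of work.
software_speedup = 2.0
nvidia_cost_per_work = nvidia_price / software_speedup  # 0.50

# Effective cost ratio (ASIC / GPU) per unit of work.
ratio = asic_price / nvidia_cost_per_work
print(f"ASIC costs {ratio:.1f}x the GPU per unit of work")
```

Under these assumed numbers, a 2x silicon-cost advantage inverts into a per-unit-of-work disadvantage once vendor margin and software optimization are counted, which is the note's compression argument in miniature.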
The ecosystem lock-in
CUDA’s moat compounds through layers:
- Libraries (cuBLAS, cuDNN, TensorRT) — optimized over a decade
- Frameworks (PyTorch, JAX) — built on CUDA primitives
- Researcher muscle memory — every ML PhD trained on CUDA
- Annual cadence (Vera Rubin → Vera Rubin Ultra → Feynman) — predictable upgrade path at any scale from single rack to $100B AI factory
The comparison Huang draws: only TSMC in the foundry world offers the same reliability-at-any-scale guarantee.
Connection to the stack value question
AI Stack Value Accrual - Chip, Infra, Intelligence, App asks where value concentrates in the AI stack. The CUDA moat argument says the chip layer captures durable value because:
- Programmability makes the moat algorithmic, not just manufacturing
- The annual cadence makes the advantage compounding, not static
- The ecosystem makes switching costs systemic, not just technical
This is consistent with Nvidia’s revenue trajectory and margin expansion despite competition from Google, Amazon, and Huawei.
The Jevons connection
Huang’s “agents multiply tool users” argument is a specific Jevons prediction: AI agents won’t replace software tools, they’ll run more instances of them. Just as ATMs didn’t kill bank tellers (branches got cheaper → more branches), AI agents operating Synopsys Design Compiler will mean more licenses running, not fewer. This directly supports the Jevons side of Jevons Paradox vs Cognitive Displacement - The Unresolved Tension for at least the tool-usage layer.
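The Jevons claim reduces to an elasticity condition: if demand for tool-hours has price elasticity greater than 1, a cost reduction raises total spend. A minimal sketch with a constant-elasticity demand curve (the elasticity value is a hypothetical assumption):

```python
# Jevons-style sketch: constant-elasticity demand for tool usage.
# demand scales as price**(-elasticity); total spend = price * demand.
# The elasticity value below is a hypothetical assumption.

def total_spend(price: float, elasticity: float, baseline_demand: float = 1.0) -> float:
    demand = baseline_demand * price ** (-elasticity)
    return price * demand

ELASTICITY = 1.5  # > 1: the "agents multiply tool users" regime
for price in (1.0, 0.5, 0.1):  # agents drive the cost of tool operation down
    print(f"price {price:.1f} -> total spend {total_spend(price, ELASTICITY):.2f}")
# With elasticity > 1, spend rises as price falls (the Jevons case);
# with elasticity < 1, the same price drop would shrink spend instead.
```

The unresolved tension in the linked note is precisely whether the relevant elasticity stays above 1 once agents, rather than humans, are the marginal tool users.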
Related Notes
- AI Stack Value Accrual - Chip, Infra, Intelligence, App — where the chip layer fits in the stack
- Inference Cost Collapse and Frontier Model Margin Expansion — the economics of the inference layer Nvidia serves
- Jevons Paradox vs Cognitive Displacement - The Unresolved Tension — Huang’s “agents multiply tool users” supports the Jevons side
- Two Exponentials - AI Capability vs Economic Diffusion — Nvidia’s supply-chain persuasion is about aligning upstream suppliers with the capability exponential
- Provincial Competition Explains China’s Execution Edge — the China/Huawei dynamic behind Huang’s case against export controls
- EUV Lithography as the Binding Constraint on AI Scaling — design-layer moat (CUDA) sits on top of manufacturing-layer constraint (EUV)
- AI Memory Crowding - HBM Eats Consumer Device Budgets — CUDA’s flexibility is useless if memory bandwidth is the bottleneck
- Jensen Huang interview — source