Nukez
Technical Note · Inference Economics

Pricing the Quadratic

How LLM inference stays linearly priced when the underlying compute is O(n²) — and where the math gets fragile.

Published April 2026 · Reading time: 9 min · Topics: Transformers, Pricing, Systems

Abstract

Frontier-model inference is priced linearly per token, while the dominant compute term — self-attention — scales as O(n²) in sequence length. Doubling the context window doubles the bill but quadruples the raw arithmetic.

The gap between linear pricing and quadratic compute is not absorbed loss. It is engineering: FlashAttention, GQA, sliding window, sparse and mixture-of-experts architectures, ring parallelism, prompt caching, and continuous batching collectively bend the effective curve toward sub-linear at the scales that matter for production traffic. What remains exposed — long uncached context with long generation — is the pricing surface most likely to see future repricing, throttling, or architectural displacement.

[Figure 1: log-log chart of cost per call (USD, $0.10 to $10K) against context length (1K to 1M tokens), with three curves: naive O(n²) compute, effective compute, and linear pricing. Callouts mark $12,024 vs $5.05 at the 1M-token point; the shaded gap is labeled "closed by engineering".]
Figure 1. Cost per call as input context grows from 1K → 1M tokens, holding output at 2K. Linear pricing (cyan) reflects what users actually pay. The dashed line approximates effective compute under modern serving stacks. The magenta curve is the hypothetical bill if attention were billed at unoptimized O(n²) compute. The shaded region — roughly 2,400× at the long tail — is the value created by inference engineering.

§ 01The Apparent Paradox

A user calling a flagship model with 100,000 input tokens pays about fifty cents. A user calling with 200,000 tokens pays roughly a dollar. The pricing is linear: double the context, double the bill.

The compute did not double. Self-attention requires every token in a sequence to attend to every other token, producing an n×n attention matrix. Arithmetic cost scales as O(n²). Doubling context did not double GPU work — it quadrupled it.

The pattern continues. Going from 100K to 1M tokens multiplies the bill 10×. It multiplies raw attention compute 100×. At the long tail — a 1M-context call on a flagship model — the user pays around five dollars for a workload whose unoptimized FLOP count would, if billed proportionally to compute, cost something on the order of thousands of dollars.

The price doubles. The arithmetic quadruples. This shouldn't work. It does. Understanding why is the entire question.

§ 02The True Cost Function

For a transformer call with n input tokens and m output tokens, attention compute scales along three terms:

FLOPs  ∝  n²          // prefill: input attends to itself
        +  n · m         // each output token attends to all n input tokens
        +  ½ m²         // output tokens attend to growing output
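
The same three terms in runnable form, as a minimal sketch (units are relative; absolute FLOP counts also scale with hidden size and layer count, per the Notes):

def attention_compute_units(n: int, m: int) -> dict:
    """Relative attention compute for n input and m output tokens (arbitrary units)."""
    prefill = n * n              # input attends to itself
    cross_term = n * m           # each output token attends to all n input tokens
    decode_self = 0.5 * m * m    # output tokens attend to the growing output
    return {"prefill": prefill, "cross_term": cross_term,
            "decode_self": decode_self,
            "total": prefill + cross_term + decode_self}

# A 200K-context call generating 2K tokens: prefill exceeds the cross-term by ~100x.
print(attention_compute_units(200_000, 2_000))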

The 5–6× output price multiplier that every vendor charges — $25/MTok output vs $5/MTok input on a flagship Anthropic model, $30 vs $5 on a comparable OpenAI tier — captures decode being harder per token than prefill. Decode is sequential, memory-bandwidth bound, and gets poor GPU utilization. The multiplier handles that.

What it does not handle is the cross-term n·m: that each output token also performs O(n) attention work over the entire input. Output is priced linearly in m. The cross-term is invisible to the bill.

Three regimes follow from this asymmetry:

  • m « n (RAG, document Q&A) — prefill dominates; n² is the binding constraint.
  • m ≈ n (dialogue, code completion) — all three terms matter; n·m and ½ m² are roughly balanced.
  • m » n (essay generation, code synthesis from short prompts) — output self-attention ½ m² dominates.

Long-context generation is the worst case in both directions simultaneously.

§ 03What Naive Pricing Would Cost

Anchor a hypothetical “honest” price at the level of current pricing for a small call (n=1K, m=1K), then scale that anchor by the true compute formula. The result is the bill that would be charged if attention were billed proportionally to unoptimized FLOPs — a counterfactual, not a forecast.

Context (n)    Output (m)    Linear price    Naive O(n²)    Gap
     10,000         2,000          $0.10          $1.46         15×
     50,000         2,000          $0.30         $31.22        104×
    100,000         2,000          $0.55        $122.42        222×
    200,000         2,000          $1.05        $484.82        462×
    500,000         2,000          $2.55         $3,012      1,181×
  1,000,000         2,000          $5.05        $12,024      2,381×

Flagship-tier list pricing, single call, no caching. The gap is not a loss the vendor is eating. It is the value created by an engineering stack that has been compounding in production since roughly 2022. Naive O(n²) attention has not been the implementation in a frontier serving system for several years.
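
The anchoring is easy to reproduce. A minimal sketch of how the naive column is built, assuming the flagship list rates quoted in §02 ($5/MTok input, $25/MTok output) for the linear price:

IN_RATE, OUT_RATE = 5 / 1e6, 25 / 1e6      # USD per token, flagship list rates from §02

def linear_price(n, m):
    return n * IN_RATE + m * OUT_RATE

def compute_units(n, m):
    return n * n + n * m + 0.5 * m * m     # the cost formula from §02

# Anchor: the hypothetical per-unit rate at which the 1K/1K call costs its actual linear price.
UNIT_RATE = linear_price(1_000, 1_000) / compute_units(1_000, 1_000)

for n in (10_000, 50_000, 100_000, 200_000, 500_000, 1_000_000):
    lin = linear_price(n, 2_000)
    naive = compute_units(n, 2_000) * UNIT_RATE
    print(f"{n:>9,}  linear ${lin:5.2f}   naive ${naive:9.2f}   gap {naive / lin:5.0f}x")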

§ 04The Optimization Stack

What actually closes the gap, in roughly the order each mechanism reached production:

FlashAttention (2022, v1–v3)
  Effect: Tiles attention to fit in GPU SRAM; never materializes the full n×n matrix in HBM.
  Saving: Same FLOPs, ~3–10× wall-clock; converts the workload from memory-bound to compute-bound.

Multi-Query / Grouped-Query Attention (2022–2023)
  Effect: Shares K and V projections across query heads; the KV cache shrinks 8–32×.
  Saving: Massive memory reduction; significant decode speedup.

Sliding-window attention (2023)
  Effect: Each token attends to a fixed local window of size w instead of the full sequence.
  Saving: O(n·w) — genuinely sub-quadratic for window-bounded layers.

Sparse / block-sparse attention (2023+)
  Effect: Attention is computed only for selected token pairs (learned, structured, or routed).
  Saving: FLOPs reduced proportionally to the sparsity ratio.

Mixture of Experts (2023+)
  Effect: Activates only a fraction of model parameters per token via a router.
  Saving: Active parameters cut 4–8×; per-token FLOPs drop accordingly.

Hybrid SSM / attention (2024+)
  Effect: Replaces some attention layers with state-space or linear-recurrence blocks.
  Saving: Genuinely linear scaling for those layers (Mamba, Jamba, hybrid stacks).

Ring / context parallelism (2024+)
  Effect: Shards the n×n attention computation across many GPUs in a ring.
  Saving: Bounds wall-clock time at the cost of more total hardware.

Speculative decoding (2023+)
  Effect: A small draft model proposes tokens; the large model verifies them in parallel.
  Saving: 2–3× decode throughput.

Continuous batching (2023+)
  Effect: Multiple users' calls share GPU cycles dynamically rather than blocking each other.
  Saving: Marginal cost per call « dedicated cost; amortization at the serving layer.

Prompt caching (2024+)
  Effect: The KV cache from a prefill is persisted and reused across calls with shared prefixes.
  Saving: The 10× cache-read discount is the explicit price for "we already paid the n²."

No single mechanism is responsible. Modern inference stacks layer most of these simultaneously. The combined effect is that effective compute per call grows much closer to linear-with-n than quadratic-with-n at the scales where production traffic actually lives.
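
As a rough illustration of how much a single lever bends the curve, a sliding window replaces the prefill n² with n·w. The window size below (w = 4,096) is a hypothetical choice, and production models mix windowed and full-attention layers, so the real saving sits somewhere in between:

W = 4_096   # hypothetical window size

for n in (8_000, 32_000, 128_000, 512_000, 1_000_000):
    full = n * n                 # dense prefill attention
    windowed = n * min(n, W)     # sliding-window prefill attention
    print(f"n = {n:>9,}   dense / windowed = {full / windowed:6.1f}x")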

The 10× discount on cache reads relative to fresh input is the most explicit signal in the entire pricing surface. It is a literal price knob for "this is what the work costs when the prefill quadratic is skipped." A 10× ratio implies prefill compute accounts for roughly 90% of the cost of a fresh input token at typical context lengths, which lines up with what you'd expect from a system where the n² term dominates, since prefill is precisely the work that caching skips.
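
A worked example of what that knob is worth, with hypothetical workload numbers (a 150K-token shared prefix, a 2K-token fresh suffix, 20 calls) and the rates implied above ($5/MTok fresh input, $0.50/MTok cache reads); cache-write premiums and output costs are ignored:

FRESH, CACHED = 5 / 1e6, 0.5 / 1e6          # USD per token: fresh input vs cache read
prefix, suffix, calls = 150_000, 2_000, 20  # hypothetical agent workload

# Every call re-sends the full prefix as fresh input.
uncached = calls * (prefix + suffix) * FRESH
# First call pays fresh prefill; subsequent calls read the prefix from cache.
cached = (prefix + suffix) * FRESH + (calls - 1) * (prefix * CACHED + suffix * FRESH)

print(f"uncached ${uncached:.2f}  vs  cached ${cached:.2f}  ({uncached / cached:.1f}x cheaper)")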

§ 05Where the Pricing Surface is Fragile

The optimization stack does not eliminate the quadratic. It bends the curve and pushes the elbow further out. Past the elbow, the math is still ugly.

Fragile workloads share three traits:

  1. Long context, uncached. A fresh 500K-token input has no cache to reuse. The prefill quadratic is paid in full.
  2. Long generation. Each output token does O(n) work, and ½ m² of self-attention compounds on top.
  3. Sequential, latency-sensitive. Decode cannot be batched within a single request, and tight SLAs prevent backfilling idle GPU capacity from elsewhere.

Workloads that exhibit all three — full-codebase analysis with long output, deep research synthesis, agent loops with growing scratchpads — are the surface where vendor margins compress hardest. They are also the surface most likely to see future repricing, throttling under load, quality degradation, or aggressive push toward paid caching tiers.

Note what is not fragile: short prompts with short responses (most chat traffic), heavily cached agent calls (most production agent traffic), batched async workloads (data pipelines, summarization fleets, evaluation harnesses). These dominate volume. Inference is gross-margin positive at scale because the volume lives in the optimized regime, not the quadratic tail.


§ 06The Macro Picture

Three corrections to the popular narrative that “AI inference loses money”:

  1. Inference unit economics are positive on the median call. Industry estimates put gross margins on flagship-tier API serving in roughly the 50–80% range at typical workload mixes. The wider distribution shows margin compression on free consumer products, where serving is partly subsidized by paid API revenue. The median paid call is profitable.
  2. Training amortization is the loss center, not inference. Frontier model training runs cost on the order of $100M–$1B+ in compute alone. Those costs are amortized across the inference revenue earned over the model's deployment lifetime. A lab can be cash-flow negative at the company level while running margin-positive per call. The two questions are routinely conflated; they should not be.
  3. Tokens per dollar improves roughly 10× per year. Despite frontier models trending larger, inference-side efficiency improvements — architectural, hardware, serving-stack — have outpaced model size scaling. The capability now served at 2026 per-token prices was unreachable at any price in 2022. The trajectory is toward cheaper, not more expensive.

§ 07Implications

For builders.

Design agent and pipeline architectures to stay in the optimized regime. Cache aggressively. Avoid uncached long-context generation as a load-bearing pattern. Treat n·m and ½ m² as real cost terms even though the bill hides them. Stateless routing layers, short-context lead agents, and prefix-stable prompts all map directly onto cheap zones of the curve.
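
A back-of-the-envelope version of that advice, under the same illustrative rates: a hypothetical agent loop whose scratchpad grows by about 1,500 tokens per turn, billed once with no caching and once with a prefix-stable, cached design.

FRESH, CACHED = 5 / 1e6, 0.5 / 1e6          # illustrative: fresh input vs 10x-discounted cache read
system_prompt, per_turn, turns = 5_000, 1_500, 40

ctx, seen = system_prompt, 0
uncached = cached = 0.0
for _ in range(turns):
    uncached += ctx * FRESH                  # re-prefill the entire context every turn
    new = ctx - seen                         # tokens added since the previous call
    cached += seen * CACHED + new * FRESH    # cached reads plus fresh new tokens
    seen = ctx
    ctx += per_turn                          # scratchpad grows

print(f"input bill over {turns} turns: uncached ${uncached:.2f}, cached ${cached:.2f}")

Cumulative input billing grows quadratically with turn count either way; caching changes the constant by roughly the discount ratio, which is why prefix stability is load-bearing for agent economics.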

For pricing.

Expect tier differentiation, not uniform repricing. The 200K and 1M context tiers will continue to be priced and provisioned separately from base tiers. Caching discounts will deepen. Long-output workloads will increasingly route through batch APIs at meaningful discounts. The headline per-token rates may drift down even as the effective price for the fragile workload mix drifts up.

For the field.

Hybrid architectures — state-space models, sliding-window-dominant designs, sparse mixtures — will continue to displace dense attention at the long-context tail. The next-generation flagship that genuinely solves O(n²) at long context, rather than engineering around it, will reset the pricing surface again. Each such architectural transition compresses the gap between linear bill and quadratic compute by closing it from the bottom up.

The math is not broken. It is engineered around. The engineering is the product.

Notes & Method

  1. Pricing figures use 2026 list rates for representative flagship, mid, and small tiers across major API vendors. The “Naive O(n²)” column is a counterfactual: total per-call compute is computed as n² + n·m + ½ m², anchored such that the small-call point (n=1K, m=1K) matches its actual linear price. The exercise is illustrative of scale, not predictive.
  2. The “effective compute” curve in Figure 1 is illustrative. Real production stacks combine many of the optimizations in §4; an exact functional form is workload-, vendor-, and version-dependent and is not publicly disclosed by any frontier lab.
  3. Output pricing's 5–6× multiplier captures decode being harder per token than prefill (sequential, memory-bandwidth bound). It does not capture per-output-token attention work over the input — the cross-term n·m. This is the most underappreciated cost surface in the current pricing structure.
  4. FLOP estimates assume dense attention over d-dimensional hidden states, scaling as 4·n²·d per attention layer. Cross-vendor head counts, hidden dimensions, and layer counts vary; the analytical scaling law does not (a worked instance appears after these notes).
  5. Margin estimates are summarized from publicly available analyst commentary and infrastructure cost disclosures; no proprietary vendor figures are used or implied.
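
A worked instance of note 4, with a hypothetical hidden size and layer count (placeholders, not drawn from any specific model):

def prefill_attention_flops(n, d=8_192, layers=80):
    # 4 * n^2 * d per attention layer (note 4), summed over layers
    return layers * 4 * n * n * d

print(f"{prefill_attention_flops(1_000_000):.2e} attention FLOPs for a 1M-token prefill")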

Nukez — Trustless agent infrastructure.

nukez.xyz · api.nukez.xyz · github

On-demand, permissionless infrastructure for autonomous agents. Memory agents can own.