KV-cache scaling problem in LLM inference
Core idea
During autoregressive decoding, the KV-cache introduces a second fundamental scaling limit, independent of the model weights.
While weight streaming imposes a constant bandwidth cost per token,
the KV-cache is state that grows linearly with sequence length and the number of concurrent users.
At sufficiently long contexts, KV-cache becomes the dominant constraint in:
- memory capacity
- memory bandwidth
- inter-GPU communication
- achievable throughput (user and system tokens per second, UTPS / STPS)
Mechanism
In transformer decoding, every generated token produces:
- a new Key vector
- a new Value vector
for every attention layer.
These vectors must be stored because future tokens attend to all previous tokens.
Therefore KV-cache grows with:
- context length (L)
- number of layers (N)
- model hidden size
- number of concurrent users (B)
KV-cache scaling formula
Approximate memory usage:
KV_bytes ≈ 2 × B × L × N × H_kv × d_kv × bytes_per_element
Where:
- 2 → Keys + Values
- B → number of users (batch)
- L → sequence length
- N → number of transformer layers
- H_kv → number of KV heads
- d_kv → head dimension
- bytes_per_element → precision (FP16 = 2, INT8 = 1, INT4 ≈ 0.5)
Simplified approximation:
KV_bytes ≈ 2 × B × L × N × d_model × bytes
Because, for standard multi-head attention:
H_kv × d_kv ≈ d_model
(with grouped-query attention, H_kv is smaller than the query head count, and the cache shrinks proportionally)
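The formula can be checked numerically. A minimal Python sketch; the function name and the example shapes (80 layers with 64×128 attention heads for a ~70B-class model, 32 layers with 32×128 heads for a ~7B-class model, both full multi-head attention) are illustrative assumptions, not the specs of any particular model:

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    """KV_bytes ≈ 2 (K+V) × B × L × N × H_kv × d_kv × bytes_per_element."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * d_head * bytes_per_elem

# ~70B-class (assumed: 80 layers, 64 KV heads × 128 dim, FP16), 64k context, 1 user:
big = kv_cache_bytes(batch=1, seq_len=64 * 1024, n_layers=80,
                     n_kv_heads=64, d_head=128, bytes_per_elem=2)
print(f"{big / 1e9:.0f} GB")   # → 172 GB

# ~7B-class (assumed: 32 layers, 32 KV heads × 128 dim, FP16), 64k context, 1 user:
small = kv_cache_bytes(batch=1, seq_len=64 * 1024, n_layers=32,
                       n_kv_heads=32, d_head=128, bytes_per_elem=2)
print(f"{small / 1e9:.0f} GB")  # → 34 GB
```

These land inside the 150–180 GB and 30–35 GB ranges quoted below.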
Example intuition
Large model (~70B)
- context = 64k
- FP16 KV
- 1 user
→ KV-cache ≈ 150–180 GB (assumes full multi-head attention; GQA reduces this by the ratio of query heads to KV heads)
Small model (~7B)
- context = 64k
- FP16 KV
- 1 user
→ KV-cache ≈ ~30–35 GB
Concurrency scaling
KV-cache scales linearly with users:
- 8 users → ~8× memory
- 32 users → ~32× memory
This makes multi-user inference extremely memory-intensive.
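The linear scaling can be inverted to ask how many concurrent users fit in a fixed memory budget. A sketch using the simplified MHA approximation, with assumed numbers (80 GB HBM, ~14 GB of resident FP16 weights, 7B-class shapes, 8k context per user):

```python
def max_users(free_bytes, seq_len, n_layers, d_model, bytes_per_elem=2):
    """How many users' KV-caches fit in free_bytes (simplified MHA form)."""
    per_user = 2 * seq_len * n_layers * d_model * bytes_per_elem
    return free_bytes // per_user

# Assumed: 80 GB HBM minus ~14 GB FP16 weights, 32 layers, d_model = 4096.
print(max_users(80 * 10**9 - 14 * 10**9, 8192, 32, 4096))  # → 15
```

Roughly 4.3 GB of cache per user at 8k context, so the budget caps out near 15 users before a single token of extra context is possible.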
Why KV-cache becomes dominant
Phase 1 - weights dominate
At short context:
- weight streaming bandwidth cost dominates
- KV-cache is small
- throughput limited by HBM bandwidth
Phase 2 - crossover point
At longer context:
- KV-cache ≈ weight size
- memory pressure increases
- scheduling becomes harder
Phase 3 - KV-cache dominates
At very long context:
- KV-cache is main memory consumer
- bandwidth spent on cache reads/writes
- cross-device communication increases
- throughput collapses
This creates a second scaling wall.
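The Phase 2 crossover can be estimated directly: at what context length does the KV-cache (simplified MHA form) equal the weight footprint? A sketch under assumed 7B-class numbers (14 GB FP16 weights, 32 layers, d_model = 4096):

```python
def crossover_context(weight_bytes, n_layers, d_model, bytes_per_elem=2, batch=1):
    """Context length where KV_bytes = weight_bytes, per the simplified formula."""
    kv_bytes_per_token = 2 * batch * n_layers * d_model * bytes_per_elem
    return weight_bytes // kv_bytes_per_token

# Assumed ~7B FP16 model: 14 GB weights, 32 layers, d_model = 4096, 1 user.
print(crossover_context(14 * 10**9, 32, 4096))  # → 26702 (~27k tokens)
```

With 8 concurrent users the crossover drops 8×, to a few thousand tokens each.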
Bandwidth impact
KV-cache increases:
- attention read traffic
- write traffic per token
- GPU memory fragmentation
- NUMA traffic
- NVLink / PCIe / InfiniBand usage
Attention complexity becomes:
O(L)
per generated token.
Thus:
- longer context → higher latency
- longer context → lower tokens/sec
- longer context → worse scaling efficiency
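The O(L) cost can be made concrete: each generated token must stream the entire cache from memory, so a bandwidth-only lower bound on per-token attention time grows linearly with context. A sketch with assumed 7B-class shapes and ~3 TB/s of HBM bandwidth:

```python
def kv_read_time_ms(seq_len, n_layers, d_model, bw_bytes_per_s, bytes_per_elem=2):
    """Lower bound on per-token time spent just reading the KV-cache (ignores compute)."""
    kv_bytes = 2 * seq_len * n_layers * d_model * bytes_per_elem
    return kv_bytes / bw_bytes_per_s * 1e3

# Assumed: 32 layers, d_model = 4096, FP16 cache, 3 TB/s memory bandwidth.
for L in (4096, 32768, 131072):
    print(f"L={L}: {kv_read_time_ms(L, 32, 4096, 3e12):.2f} ms/token")
```

The floor rises from well under a millisecond at 4k context to tens of milliseconds per token at 128k, before any compute or communication cost is counted.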
Distributed inference impact
When KV-cache is sharded across GPUs:
- each token requires cross-device attention data
- all-gather / reduce operations increase
- synchronization latency becomes dominant
On high-bandwidth hardware:
interconnect latency - not compute - becomes the bottleneck
This explains why:
- adding GPUs does not scale linearly
- ultra-fast memory alone does not solve inference scaling
Relationship to main bandwidth limit
Inference has two fundamental limits:
Limit 1 - weight streaming
tokens/sec ≈ memory_bandwidth / bytes_weights_per_token
Constant per token.
Limit 2 - KV-cache growth
KV_bytes ∝ B × L
Dynamic over time.
Together they define:
- maximum context
- maximum concurrency
- achievable throughput
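The two limits combine into a back-of-the-envelope serving estimate. A sketch only; the hardware figures (3.35 TB/s, 80 GB HBM) and 7B-class shapes are assumptions, and compute, overlap, and batching effects are ignored:

```python
def serving_envelope(mem_bw, hbm_bytes, weight_bytes, batch, n_layers, d_model,
                     bytes_per_elem=2):
    """Limit 1: weight-streaming tokens/sec. Limit 2: max context per user."""
    tokens_per_sec = mem_bw / weight_bytes                      # constant per token
    free = hbm_bytes - weight_bytes                             # memory left for KV
    max_context = free // (2 * batch * n_layers * d_model * bytes_per_elem)
    return tokens_per_sec, max_context

tps, max_ctx = serving_envelope(3.35e12, 80 * 10**9, 14 * 10**9,
                                batch=8, n_layers=32, d_model=4096)
print(f"{tps:.0f} tok/s upper bound, {max_ctx} tokens max context at 8 users")
```

Note how the first limit is static while the second shrinks as users or context grow.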
System design implications
KV-cache scaling motivates:
- KV quantization
- KV compression (low-rank / eviction)
- sliding window attention
- grouped query attention (GQA)
- mixture-of-experts routing
- speculative decoding
- state-space models (SSM)
- retrieval instead of long context
- SRAM-heavy accelerators
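Several of these mitigations compose multiplicatively, because they act on different factors of the size formula. A sketch with assumed shapes (baseline: 32 FP16 KV heads; GQA variant: 8 KV heads; plus INT8 KV quantization):

```python
def kv_bytes(n_kv_heads, d_head, bytes_per_elem, seq_len=65536, n_layers=32):
    """Same formula as above, holding L and N fixed to isolate the mitigations."""
    return 2 * seq_len * n_layers * n_kv_heads * d_head * bytes_per_elem

baseline = kv_bytes(32, 128, 2)   # full MHA, FP16 cache
gqa      = kv_bytes(8, 128, 2)    # grouped-query attention: 4x fewer KV heads
gqa_int8 = kv_bytes(8, 128, 1)    # + INT8 quantization: half the bytes/element
print(baseline // gqa, baseline // gqa_int8)  # → 4 8
```

An 8× reduction turns the ~34 GB example above into ~4 GB at the same context length.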
Key insight
LLM inference scaling is limited not by math, but by state.
Weights are static.
KV-cache is growing state.
In large systems:
managing state movement becomes harder than performing computation.
Connections
- llm_inference_bandwidth_limit
- transformer_autoregressive_decoding
- attention_complexity_scaling
- multi_gpu_collective_latency
- kv_cache_quantization
- speculative_decoding
- context_window_tradeoffs
- local_llm_memory_architecture
Sources
- https://arxiv.org/html/2507.14397v2
- https://docs.anyscale.com/llm/serving/benchmarking/metrics
- https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html
- https://bentoml.com/llm/inference-optimization/kv-cache-offloading
- https://mbrenndoerfer.com/writing/kv-cache-memory-calculation-llm-inference-gpu
- llm_inference_speed_is_fundamentally_memory_bandwidth_bound