KV-cache scaling problem in LLM inference

Core idea

During autoregressive decoding, the KV-cache introduces a second fundamental scaling limit - independent of model weights.

While weight streaming imposes a constant bandwidth cost per token,
the KV-cache is state that grows linearly with sequence length and with the number of concurrent users.

At sufficiently long contexts, KV-cache becomes the dominant constraint in:

  • memory capacity
  • memory bandwidth
  • inter-GPU communication
  • achievable throughput (user / system tokens per second, UTPS / STPS)

Mechanism

In transformer decoding, every generated token produces:

  • a new Key vector
  • a new Value vector

for every attention layer.

These vectors must be stored because future tokens attend to all previous tokens.

Therefore KV-cache grows with:

  • context length (L)
  • number of layers (N)
  • model hidden size
  • number of concurrent users (B)

KV-cache scaling formula

Approximate memory usage:

KV_bytes ≈ 2 × B × L × N × H_kv × d_kv × bytes_per_element

Where:

  • 2 → Keys + Values
  • B → number of users (batch)
  • L → sequence length
  • N → number of transformer layers
  • H_kv → number of KV heads
  • d_kv → head dimension
  • bytes_per_element → precision (FP16=2, INT8=1, INT4≈0.5)

Simplified approximation:

KV_bytes ≈ 2 × B × L × N × d_model × bytes

Because, for standard multi-head attention:

H_kv × d_kv = d_model

(With GQA or MQA, H_kv × d_kv is much smaller than d_model, so the full formula should be used.)
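The two formulas can be sketched as a small helper; the function and parameter names below are illustrative, not from the text:

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_element=2):
    """Exact form: 2 (K and V) x B x L x N x H_kv x d_kv x precision."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_element

def kv_cache_bytes_simple(batch, seq_len, n_layers, d_model,
                          bytes_per_element=2):
    """Simplified form, valid when H_kv x d_kv = d_model (full MHA)."""
    return 2 * batch * seq_len * n_layers * d_model * bytes_per_element

# The two forms agree when H_kv * d_kv == d_model:
exact = kv_cache_bytes(1, 4096, 32, 32, 128)       # 32 KV heads x 128 dim = 4096
simple = kv_cache_bytes_simple(1, 4096, 32, 4096)
print(exact == simple)  # True (both give 2 GiB here)
```

With GQA shapes (H_kv much smaller than d_model / d_kv), only the exact form is correct.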


Example intuition

Large model (~70B; assume 80 layers, d_model ≈ 8192, full multi-head attention)

  • context = 64k
  • FP16 KV
  • 1 user

→ KV-cache ≈ ~150–180 GB

Small model (~7B; assume 32 layers, d_model ≈ 4096)

  • context = 64k
  • FP16 KV
  • 1 user

→ KV-cache ≈ ~30–35 GB
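Both estimates can be reproduced from the simplified formula, assuming Llama-style shapes (80 layers / d_model 8192 for the ~70B model, 32 layers / d_model 4096 for the ~7B model) and full multi-head attention; those shapes are assumptions for illustration:

```python
def kv_gb(seq_len, n_layers, d_model, batch=1, bytes_per_element=2):
    # Simplified KV-cache size: 2 x B x L x N x d_model x precision
    return 2 * batch * seq_len * n_layers * d_model * bytes_per_element / 1e9

print(kv_gb(65_536, 80, 8192))  # ~70B-class model: ≈172 GB
print(kv_gb(65_536, 32, 4096))  # ~7B-class model:  ≈34 GB
```

With GQA (common in modern ~70B models) the 70B figure drops by roughly the head-group ratio, which is why the exact formula matters in practice.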

Concurrency scaling

KV-cache scales linearly with users:

  • 8 users → ~8× memory
  • 32 users → ~32× memory

This makes multi-user inference extremely memory-intensive.


Why KV-cache becomes dominant

Phase 1 - weights dominate

At short context:

  • weight streaming bandwidth cost dominates
  • KV-cache is small
  • throughput limited by HBM bandwidth

Phase 2 - crossover point

At longer context:

  • KV-cache ≈ weight size
  • memory pressure increases
  • scheduling becomes harder

Phase 3 - KV-cache dominates

At very long context:

  • KV-cache is main memory consumer
  • bandwidth spent on cache reads/writes
  • cross-device communication increases
  • throughput collapses

This creates a second scaling wall.
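The Phase 2 crossover can be estimated by solving KV_bytes = weight_bytes for L. The shapes and the ~70B / FP16 numbers below are assumptions for illustration:

```python
def crossover_context(weight_bytes, n_layers, n_kv_heads, head_dim,
                      batch=1, bytes_per_element=2):
    # Context length at which per-batch KV-cache equals the weight footprint.
    kv_bytes_per_position = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return weight_bytes // (batch * kv_bytes_per_position)

weights = 70e9 * 2  # ~70B params in FP16

# Full multi-head attention (64 KV heads of dim 128, 80 layers):
print(crossover_context(weights, 80, 64, 128))  # ≈53k tokens

# With GQA (8 KV heads), the crossover moves ~8x further out:
print(crossover_context(weights, 80, 8, 128))   # ≈427k tokens
```

More users (larger batch) pull the crossover in proportionally, which is why multi-user serving hits Phase 3 far earlier than single-user benchmarks suggest.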


Bandwidth impact

KV-cache increases:

  • attention read traffic
  • write traffic per token
  • GPU memory fragmentation
  • NUMA traffic
  • NVLink / PCIe / InfiniBand usage

Attention complexity becomes:

O(L)

per generated token.

Thus:

  • longer context → higher latency
  • longer context → lower tokens/sec
  • longer context → worse scaling efficiency
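The O(L) cost shows up directly as memory traffic: each generated token must read every cached K and V vector once. A rough per-token estimate, with all shapes assumed for illustration (70B-class, full MHA):

```python
def kv_read_gb_per_token(seq_len, n_layers, n_kv_heads, head_dim,
                         bytes_per_element=2):
    # Each new token reads the entire cache: 2 x N x H_kv x d_kv x precision x L
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * seq_len / 1e9

# Assumed shapes: 80 layers, 64 KV heads, head dim 128, FP16
for L in (4_096, 16_384, 65_536):
    print(L, kv_read_gb_per_token(L, 80, 64, 128))
```

The read traffic grows linearly with L, so per-token latency grows and tokens/sec falls even when compute is idle.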

Distributed inference impact

When KV-cache is sharded across GPUs:

  • each token requires cross-device attention data
  • all-gather / reduce operations increase
  • synchronization latency becomes dominant

On high-bandwidth hardware:

interconnect latency - not compute - becomes the bottleneck

This explains why:

  • adding GPUs does not scale linearly
  • ultra-fast memory alone does not solve inference scaling

Relationship to main bandwidth limit

Inference has two fundamental limits:

Limit 1 - weight streaming

tokens/sec ≈ memory_bandwidth / bytes_weights_per_token

Constant per token.

Limit 2 - KV-cache growth

memory_usage ∝ B × L

Dynamic over time.

Together they define:

  • maximum context
  • maximum concurrency
  • achievable throughput

System design implications

KV-cache scaling motivates:

  • KV quantization
  • KV compression (low-rank / eviction)
  • sliding window attention
  • grouped query attention (GQA)
  • mixture-of-experts routing
  • speculative decoding
  • state-space models (SSM)
  • retrieval instead of long context
  • SRAM-heavy accelerators

Key insight

LLM inference scaling is limited not by math, but by state.

Weights are static.

KV-cache is growing state.

In large systems:

managing state movement becomes harder than performing computation.


Connections


Sources