KV-cache scaling problem in LLM inference

Core idea

During autoregressive decoding, the KV-cache introduces a second fundamental scaling limit - independent of model weights.

While weight streaming imposes a constant bandwidth cost per token,
the KV-cache is state that grows linearly with sequence length and with the number of concurrent users.

At sufficiently long contexts, KV-cache becomes the dominant constraint in:

  • memory capacity
  • memory bandwidth
  • inter-GPU communication
  • achievable throughput (user / system tokens per second, UTPS / STPS)

Mechanism

In transformer decoding, every generated token produces:

  • a new Key vector
  • a new Value vector

for every attention layer.

These vectors must be stored because future tokens attend to all previous tokens.

Therefore KV-cache grows with:

  • context length (L)
  • number of layers (N)
  • model hidden size
  • number of concurrent users (B)

KV-cache scaling formula

Approximate memory usage:

KV_bytes ≈ 2 × B × L × N × H_kv × d_kv × bytes_per_element

Where:

  • 2 → Keys + Values
  • B → number of users (batch)
  • L → sequence length
  • N → number of transformer layers
  • H_kv → number of KV heads
  • d_kv → head dimension
  • bytes_per_element → precision (FP16=2, INT8=1, INT4≈0.5)

Simplified approximation:

KV_bytes ≈ 2 × B × L × N × d_model × bytes

Because, for standard multi-head attention:

H_kv × d_kv = d_model

(With GQA or MQA, H_kv × d_kv is much smaller than d_model, so the full formula should be used.)
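The two formulas can be sketched as a small helper; the function and parameter names below are illustrative, not from the text:

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_element=2):
    """Exact form: 2 (K and V) x B x L x N x H_kv x d_kv x precision."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_element

def kv_cache_bytes_simple(batch, seq_len, n_layers, d_model,
                          bytes_per_element=2):
    """Simplified form, valid when H_kv x d_kv = d_model (full MHA)."""
    return 2 * batch * seq_len * n_layers * d_model * bytes_per_element

# The two forms agree when H_kv * d_kv == d_model:
exact = kv_cache_bytes(1, 4096, 32, 32, 128)       # 32 KV heads x 128 dim = 4096
simple = kv_cache_bytes_simple(1, 4096, 32, 4096)
print(exact == simple)  # True (both give 2 GiB here)
```

With GQA shapes (H_kv much smaller than d_model / d_kv), only the exact form is correct.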


Example intuition

Large model (~70B; assume 80 layers, d_model ≈ 8192, full multi-head attention)

  • context = 64k
  • FP16 KV
  • 1 user

→ KV-cache ≈ ~150–180 GB

Small model (~7B; assume 32 layers, d_model ≈ 4096)

  • context = 64k
  • FP16 KV
  • 1 user

→ KV-cache ≈ ~30–35 GB
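Both estimates can be reproduced from the simplified formula, assuming Llama-style shapes (80 layers / d_model 8192 for the ~70B model, 32 layers / d_model 4096 for the ~7B model) and full multi-head attention; those shapes are assumptions for illustration:

```python
def kv_gb(seq_len, n_layers, d_model, batch=1, bytes_per_element=2):
    # Simplified KV-cache size: 2 x B x L x N x d_model x precision
    return 2 * batch * seq_len * n_layers * d_model * bytes_per_element / 1e9

print(kv_gb(65_536, 80, 8192))  # ~70B-class model: ≈172 GB
print(kv_gb(65_536, 32, 4096))  # ~7B-class model:  ≈34 GB
```

With GQA (common in modern ~70B models) the 70B figure drops by roughly the head-group ratio, which is why the exact formula matters in practice.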

Concurrency scaling

KV-cache scales linearly with users:

  • 8 users → ~8× memory
  • 32 users → ~32× memory

This makes multi-user inference extremely memory-intensive.


Why KV-cache becomes dominant

Phase 1 - weights dominate

At short context:

  • weight streaming bandwidth cost dominates
  • KV-cache is small
  • throughput limited by HBM bandwidth

Phase 2 - crossover point

At longer context:

  • KV-cache ≈ weight size
  • memory pressure increases
  • scheduling becomes harder

Phase 3 - KV-cache dominates

At very long context:

  • KV-cache is main memory consumer
  • bandwidth spent on cache reads/writes
  • cross-device communication increases
  • throughput collapses

This creates a second scaling wall.
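The Phase 2 crossover can be estimated by solving KV_bytes = weight_bytes for L. The shapes and the ~70B / FP16 numbers below are assumptions for illustration:

```python
def crossover_context(weight_bytes, n_layers, n_kv_heads, head_dim,
                      batch=1, bytes_per_element=2):
    # Context length at which per-batch KV-cache equals the weight footprint.
    kv_bytes_per_position = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return weight_bytes // (batch * kv_bytes_per_position)

weights = 70e9 * 2  # ~70B params in FP16

# Full multi-head attention (64 KV heads of dim 128, 80 layers):
print(crossover_context(weights, 80, 64, 128))  # ≈53k tokens

# With GQA (8 KV heads), the crossover moves ~8x further out:
print(crossover_context(weights, 80, 8, 128))   # ≈427k tokens
```

More users (larger batch) pull the crossover in proportionally, which is why multi-user serving hits Phase 3 far earlier than single-user benchmarks suggest.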


Bandwidth impact

KV-cache increases:

  • attention read traffic
  • write traffic per token
  • GPU memory fragmentation
  • NUMA traffic
  • NVLink / PCIe / InfiniBand usage

Attention complexity becomes:

O(L)

per generated token.

Thus:

  • longer context → higher latency
  • longer context → lower tokens/sec
  • longer context → worse scaling efficiency
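The O(L) cost shows up directly as memory traffic: each generated token must read every cached K and V vector once. A rough per-token estimate, with all shapes assumed for illustration (70B-class, full MHA):

```python
def kv_read_gb_per_token(seq_len, n_layers, n_kv_heads, head_dim,
                         bytes_per_element=2):
    # Each new token reads the entire cache: 2 x N x H_kv x d_kv x precision x L
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * seq_len / 1e9

# Assumed shapes: 80 layers, 64 KV heads, head dim 128, FP16
for L in (4_096, 16_384, 65_536):
    print(L, kv_read_gb_per_token(L, 80, 64, 128))
```

The read traffic grows linearly with L, so per-token latency grows and tokens/sec falls even when compute is idle.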

Distributed inference impact

When KV-cache is sharded across GPUs:

  • each token requires cross-device attention data
  • all-gather / reduce operations increase
  • synchronization latency becomes dominant

On high-bandwidth hardware:

interconnect latency - not compute - becomes the bottleneck

This explains why:

  • adding GPUs does not scale linearly
  • ultra-fast memory alone does not solve inference scaling

Relationship to main bandwidth limit

Inference has two fundamental limits:

Limit 1 - weight streaming

tokens/sec ≈ memory_bandwidth / bytes_weights_per_token

Constant per token.

Limit 2 - KV-cache growth

memory_usage ∝ B × L

Dynamic over time.

Together they define:

  • maximum context
  • maximum concurrency
  • achievable throughput

System design implications

KV-cache scaling motivates:

  • KV quantization
  • KV compression (low-rank / eviction)
  • sliding window attention
  • grouped query attention (GQA)
  • mixture-of-experts routing
  • speculative decoding
  • state-space models (SSM)
  • retrieval instead of long context
  • SRAM-heavy accelerators

Key insight

LLM inference scaling is limited not by math, but by state.

Weights are static.

KV-cache is growing state.

In large systems:

managing state movement becomes harder than performing computation.


Connections


Sources