LLM inference speed is fundamentally memory-bandwidth bound
Core idea
During autoregressive decoding, LLM inference speed is primarily limited by memory bandwidth, not compute.
Each generated token requires reloading most model weights from memory.
Because modern accelerators already have excess compute capacity, data movement dominates latency.
This makes token generation speed approximately:
tokens_per_second ≈ memory_bandwidth / bytes_required_per_token
Mechanism
Weight streaming during decoding
In transformer decoding:
- Batch size is often small (interactive usage)
- Matrix multiplications are narrow
- Compute units become underutilized
- Weights must be re-read from HBM for each token
Typical assumptions:
- Precision: FP16 / BF16 (~2 bytes per parameter); FP8 / INT4 quantization reduces this to ~1 / ~0.5 bytes
- Active parameters: hundreds of billions
Example:
- Model: ~400B active parameters
- Memory transfer per token:
~400B × 2 bytes ≈ 800 GB per token
Therefore:
- On 4 TB/s memory bandwidth hardware
- Maximum theoretical speed:
~5 tokens/sec per chip (single-user decoding)
Scaling requires tensor parallelism across many chips.
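The roofline above reduces to a few lines of arithmetic. A minimal sketch (the 400B-parameter and 4 TB/s figures are the note's own assumptions):

```python
def decode_tokens_per_second(params: float, bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound decoding roofline: every generated token
    re-reads all active weights from memory."""
    bytes_per_token = params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token

# ~400B active parameters at 2 bytes each on a 4 TB/s chip
tps = decode_tokens_per_second(400e9, 2, 4e12)
print(f"{tps:.1f} tokens/sec")  # ≈ 5 tokens/sec, single-user decoding
```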
KV-cache as second constraint
Inference memory usage is not only weights.
KV-cache grows with:
- context length
- number of users
- hidden size
- number of layers
Typical numbers:
- 64K context
- Large model
→ ~385–881 GB KV cache for 1–32 users
This introduces:
- memory capacity pressure
- memory bandwidth contention
- NUMA / cross-device synchronization overhead
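A back-of-the-envelope KV-cache estimator over the growth factors listed above. All dimensions are illustrative assumptions for a 405B-class model (126 layers, grouped-query attention with 8 KV heads × 128 head dim, FP16 cache); dense attention without GQA would land in a much higher range:

```python
def kv_cache_bytes(n_layers: int, context_len: int, n_users: int,
                   kv_width: int, bytes_per_elem: int = 2) -> int:
    """K and V each store a [context_len, kv_width] tensor
    per layer, per user -- hence the leading factor of 2."""
    return 2 * n_layers * context_len * n_users * kv_width * bytes_per_elem

# Hypothetical dims: 126 layers, 64K context, GQA KV width 8 * 128 = 1024
per_user = kv_cache_bytes(126, 64 * 1024, 1, 8 * 128)
print(f"{per_user / 1e9:.1f} GB per user")  # ~33.8 GB; x32 users ≈ 1.1 TB
```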
Synchronization bottleneck
Large-scale inference requires:
- tensor parallelism
- pipeline parallelism
- collective operations (all-reduce)
Typical latency:
- ~1–10 microseconds across large GPU clusters
At very high bandwidth:
- communication latency becomes dominant
- further scaling requires sub-microsecond synchronization
This creates a hard scaling wall.
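The wall can be made concrete with a toy model: tensor parallelism divides the weight read across chips, but the per-token collective latency is a fixed cost. The counts below are assumptions (800 GB/token, 4 TB/s chips, ~250 all-reduces per token at 2 µs each), not measurements:

```python
def utps_with_sync(bytes_per_token: float, bw_per_chip: float,
                   n_chips: int, syncs_per_token: int,
                   sync_latency_s: float) -> float:
    """Per-token time = sharded weight read + fixed collective latency.
    As n_chips grows, the latency term dominates and UTPS plateaus."""
    read_time = bytes_per_token / (bw_per_chip * n_chips)
    comm_time = syncs_per_token * sync_latency_s
    return 1.0 / (read_time + comm_time)

# UTPS saturates near 1 / (250 * 2e-6) = 2000 no matter how many chips
for n in (8, 64, 512, 4096):
    print(n, round(utps_with_sync(800e9, 4e12, n, 250, 2e-6)))
```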
Observed hardware scaling limits
| Memory tech | Bandwidth per chip | Expected UTPS (405B model, multi-chip system) |
|---|---|---|
| HBM3e | ~4 TB/s | ~750–800 |
| HBM4 | ~18 TB/s | ~1500 |
| 3D stacked DRAM | ~30 TB/s | ~2000–2800 |
| SRAM-only accelerator | ~117 TB/s | ~2500+ (capacity limited) |
Beyond ~3000 UTPS, pure hardware scaling yields diminishing returns.
Reaching >10k UTPS likely requires algorithmic changes.
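Speculative decoding is one such algorithmic change: a cheap draft model proposes k tokens and the full model verifies them in a single bandwidth-limited pass. A hedged sketch of the standard expectation (acceptance rate and draft length are assumptions, and draft-model cost is ignored):

```python
def expected_tokens_per_verify(k: int, accept_rate: float) -> float:
    """With a k-token draft and per-token acceptance probability a,
    one full-model pass emits 1 + a + a^2 + ... + a^k tokens on average."""
    a = accept_rate
    return sum(a ** i for i in range(k + 1))

# One ~5 tok/s roofline pass can now yield several user-visible tokens
base_utps = 5.0
multiplier = expected_tokens_per_verify(4, 0.8)  # ≈ 3.36 tokens per pass
print(f"effective UTPS ≈ {base_utps * multiplier:.1f}")
```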
Important insight
Modern LLM inference is a data-movement problem disguised as a compute problem.
Compute utilization during decoding can be below 1%.
This is counter-intuitive for engineers trained in HPC or deep learning training.
Implications
For local LLM setups
The most important factors are:
- VRAM bandwidth
- RAM bandwidth (for CPU inference)
- memory topology
- interconnect latency
Clock speed or FLOPS matter less than expected.
This explains:
- why fast RAM improves tokens/sec
- why multi-GPU scaling is difficult
- why quantization helps mostly by reducing bandwidth pressure
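Applying the same roofline to local hardware illustrates all three bullets. The model size and bandwidth figures are illustrative assumptions, not benchmarks:

```python
def roofline_tps(params: float, bytes_per_param: float, bw: float) -> float:
    # Upper bound on decode speed: bandwidth over bytes moved per token
    return bw / (params * bytes_per_param)

# Hypothetical 8B-parameter model on two illustrative memory systems
for name, bw in [("dual-channel DDR5 CPU", 80e9),
                 ("consumer GPU VRAM", 1000e9)]:
    fp16 = roofline_tps(8e9, 2.0, bw)  # ~16 GB moved per token
    int4 = roofline_tps(8e9, 0.5, bw)  # ~4 GB per token: 4x fewer bytes
    print(f"{name}: {fp16:.0f} tok/s FP16, {int4:.0f} tok/s INT4")
```

INT4 is ~4x faster than FP16 at the same bandwidth: quantization helps by shrinking the bytes moved, not by saving compute.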
For future model design
To scale inference speed, the industry may move toward:
- sparse MoE routing
- speculative decoding
- state-space models
- smaller context windows
- KV-cache compression
- weight streaming optimizations
- on-chip SRAM accelerators
Metrics
UTPS - User Tokens Per Second
Measures:
- perceived token generation speed for a single user
Approximation:
UTPS ≈ 1 / inter_token_latency
STPS - System Tokens Per Second
Measures:
- total throughput across all users
Trade-off:
- increasing concurrency → lower UTPS
- but higher STPS (until saturation)
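The trade-off follows from the bandwidth model: one decode step reads the weights once for the whole batch, but each user adds their own KV-cache read. A sketch with assumed figures (800 GB weights, 34 GB KV cache per user, 4 TB/s):

```python
def step_time(weight_bytes: float, kv_bytes_per_user: float,
              batch: int, bw: float) -> float:
    """One decode step: weights read once, KV cache read per user."""
    return (weight_bytes + batch * kv_bytes_per_user) / bw

for batch in (1, 8, 32):
    t = step_time(800e9, 34e9, batch, 4e12)
    utps, stps = 1 / t, batch / t
    print(f"batch={batch}: UTPS {utps:.1f}, STPS {stps:.1f}")
```

Growing the batch amortizes the weight read (STPS rises) while each step gets slower (UTPS falls).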
Connections
- KV-cache scaling problem in LLM inference
- transformer_autoregressive_decoding
- llm_quantization_bandwidth_tradeoffs
- multi_gpu_collective_latency
- speculative_decoding
- moe_vs_dense_models
- local_llm_hardware_requirements