LLM inference speed is fundamentally memory-bandwidth bound

Core idea

During autoregressive decoding, LLM inference speed is primarily limited by memory bandwidth, not compute.

Each generated token requires reloading most model weights from memory.
Because modern accelerators already have excess compute capacity, data movement dominates latency.

This makes token generation speed approximately:

tokens_per_second ≈ memory_bandwidth / bytes_required_per_token


Mechanism

Weight streaming during decoding

In transformer decoding:

  • Batch size is often small (interactive usage)
  • Matrix multiplications are narrow
  • Compute units become underutilized
  • Weights must be re-read from HBM for each token

Typical assumptions:

  • Precision: FP16 / BF16 (~2 bytes per parameter); FP8 (~1 byte) or INT4 (~0.5 bytes) quantization reduces this proportionally
  • Active parameters: hundreds of billions

Example:

  • Model: ~400B active parameters
  • Memory transfer per token:

~400B × 2 bytes ≈ 800 GB per token

Therefore:

  • On 4 TB/s memory bandwidth hardware
  • Maximum theoretical speed:

~5 tokens/sec per chip (single-user decoding)
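The arithmetic above can be checked in a few lines, assuming a purely bandwidth-bound roofline (figures taken from the example):

```python
# Reproduce the worked example (assumed figures from the text).
active_params = 400e9          # ~400B active parameters
bytes_per_param = 2            # ~2 bytes per parameter
bandwidth = 4e12               # 4 TB/s memory bandwidth

bytes_per_token = active_params * bytes_per_param   # 800 GB streamed per token
max_tps = bandwidth / bytes_per_token               # roofline upper bound
print(f"{bytes_per_token/1e9:.0f} GB per token, "
      f"{max_tps:.1f} tokens/sec per chip")
# → 800 GB per token, 5.0 tokens/sec per chip
```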

Scaling requires tensor parallelism across many chips.


KV-cache as second constraint

Weights are not the only consumer of inference memory.

KV-cache grows with:

  • context length
  • number of users
  • hidden size
  • number of layers

Typical numbers:

  • 64K context
  • Large model

~385–881 GB KV cache for 1–32 users
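A sketch of how these factors multiply into cache size. The layer count, head shape, and FP16 cache entries below are illustrative assumptions, not taken from any particular model:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, n_users: int,
                   bytes_per_elt: int = 2) -> float:
    """K and V tensors (factor 2) stored per layer, per token, per user."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * n_users * bytes_per_elt)

# Hypothetical large-model shape: 120 layers, 16 KV heads of dim 128,
# 64K context, FP16 cache entries, 8 concurrent users.
gb = kv_cache_bytes(120, 16, 128, 64 * 1024, n_users=8) / 1e9
print(f"~{gb:.0f} GB")  # → ~515 GB
```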

This introduces:

  • memory capacity pressure
  • memory bandwidth contention
  • NUMA / cross-device synchronization overhead

Synchronization bottleneck

Large-scale inference requires:

  • tensor parallelism
  • pipeline parallelism
  • collective operations (all-reduce)

Typical all-reduce latency:

  • ~1–10 microseconds across large GPU clusters

At very high bandwidth:

  • communication latency becomes dominant
  • further scaling requires sub-microsecond synchronization

This creates a hard scaling wall.
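A toy latency model shows why collectives start to dominate as bandwidth grows. All figures here are illustrative assumptions (10 GB of weights per chip after tensor-parallel sharding, two all-reduces per transformer layer):

```python
# Toy model: inter-token latency = weight-streaming time + collective time.
def inter_token_latency(weight_bytes: float, bandwidth: float,
                        n_layers: int, allreduce_latency: float,
                        allreduces_per_layer: int = 2) -> float:
    memory_time = weight_bytes / bandwidth
    comm_time = n_layers * allreduces_per_layer * allreduce_latency
    return memory_time + comm_time

# 10 GB weights/chip at 4 TB/s, 120 layers, 5 us per all-reduce:
t = inter_token_latency(10e9, 4e12, 120, 5e-6)
mem_ms = 10e9 / 4e12 * 1e3
print(f"memory {mem_ms:.1f} ms, total {t*1e3:.2f} ms")
# → memory 2.5 ms, total 3.70 ms
# The 1.2 ms of communication is fixed: raising bandwidth shrinks only
# the memory term, so collectives eventually dominate.
```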


Observed hardware scaling limits

Memory tech              Bandwidth / chip   Expected UTPS (405B model)
HBM3e                    ~4 TB/s            ~750–800
HBM4                     ~18 TB/s           ~1500
3D stacked DRAM          ~30 TB/s           ~2000–2800
SRAM-only accelerator    ~117 TB/s          ~2500+ (capacity limited)

Beyond ~3000 UTPS, pure hardware scaling yields diminishing returns.

Reaching >10k UTPS likely requires algorithmic changes.


Important insight

Modern LLM inference is a data-movement problem disguised as a compute problem.

Compute utilization during decoding can be below 1%.

This is counter-intuitive for engineers trained in HPC or deep learning training.


Implications

For local LLM setups

The most important factors are:

  • VRAM bandwidth
  • RAM bandwidth (for CPU inference)
  • memory topology
  • interconnect latency

Clock speed and peak FLOPS matter less than commonly assumed.

This explains:

  • why fast RAM improves tokens/sec
  • why multi-GPU scaling is difficult
  • why quantization helps mostly by reducing bandwidth pressure
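The quantization point can be illustrated directly: fewer bytes per parameter means fewer bytes streamed per token, hence more tokens per second at the same bandwidth. Model size and bandwidth below are hypothetical:

```python
# Quantization speeds up decoding mainly by shrinking bytes per token,
# not by making arithmetic faster. Hypothetical 70B model on a 1 TB/s card:
params, bandwidth = 70e9, 1e12
for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    tps = bandwidth / (params * bytes_per_param)
    print(f"{name}: ~{tps:.1f} tokens/sec")
# → FP16: ~7.1, FP8: ~14.3, INT4: ~28.6 tokens/sec
```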

For future model design

To scale inference speed, the industry may move toward:

  • sparse MoE routing
  • speculative decoding
  • state-space models
  • smaller context windows
  • KV-cache compression
  • weight streaming optimizations
  • on-chip SRAM accelerators

Metrics

UTPS - User Tokens Per Second

Measures:

perceived token generation speed for a single user

Approximation:

UTPS ≈ 1 / inter_token_latency

STPS - System Tokens Per Second

Measures:

  • total throughput across all users

Trade-off:

  • increasing concurrency → lower UTPS
  • but higher STPS (until saturation)
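A minimal sketch of this trade-off, assuming weights are streamed once per decoding step and shared across the batch while each user adds its own KV-cache traffic (all figures hypothetical):

```python
# Toy UTPS/STPS model under a fixed bandwidth budget.
def utps_stps(bandwidth: float, weight_bytes: float,
              kv_bytes_per_user: float, n_users: int):
    # One decoding step streams the weights once (shared by the batch)
    # plus each user's KV cache; every step yields one token per user.
    bytes_per_step = weight_bytes + n_users * kv_bytes_per_user
    steps_per_sec = bandwidth / bytes_per_step
    return steps_per_sec, steps_per_sec * n_users   # (UTPS, STPS)

# 4 TB/s, 800 GB of weights, 20 GB of KV cache per user:
for n in (1, 8, 32):
    u, s = utps_stps(4e12, 800e9, 20e9, n)
    print(f"{n:2d} users: UTPS {u:.2f}, STPS {s:.1f}")
# UTPS falls with concurrency while STPS rises, until memory saturates.
```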

Connections


Sources