LLM inference speed is fundamentally memory-bandwidth bound
Core idea
During autoregressive decoding, LLM inference speed is primarily limited by memory bandwidth, not compute.
Each generated token requires reloading most model weights from memory.
Because modern accelerators already have excess compute capacity, data movement dominates latency.
This makes token generation speed approximately:
tokens_per_second ≈ memory_bandwidth / bytes_required_per_token
Mechanism
Weight streaming during decoding
In transformer decoding:
- Batch size is often small (interactive usage)
- Matrix multiplications are narrow
- Compute units become underutilized
- Weights must be re-read from HBM for each token
Typical assumptions:
- Precision: FP16 / BF16 (~2 bytes per parameter); FP8 / INT4 quantization reduces this to ~1 / ~0.5 bytes
- Active parameters: hundreds of billions
Example:
- Model: ~400B active parameters
- Memory transfer per token:
~400B × 2 bytes ≈ 800 GB per token
Therefore:
- On 4 TB/s memory bandwidth hardware
- Maximum theoretical speed:
~5 tokens/sec per chip (single-user decoding)
Scaling requires tensor parallelism across many chips.
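The roofline above reduces to a few lines of arithmetic. A minimal sketch (the 400B-parameter and 4 TB/s figures are the note's own assumptions):

```python
def decode_tokens_per_second(params: float, bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound decoding roofline: every generated token
    re-reads all active weights from memory."""
    bytes_per_token = params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token

# ~400B active parameters at 2 bytes each on a 4 TB/s chip
tps = decode_tokens_per_second(400e9, 2, 4e12)
print(f"{tps:.1f} tokens/sec")  # ≈ 5 tokens/sec, single-user decoding
```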
KV-cache as second constraint
Inference memory usage is not only weights.
KV-cache grows with:
- context length
- number of users
- hidden size
- number of layers
Typical numbers:
- 64K context
- Large model
→ ~385–881 GB KV cache for 1–32 users
This introduces:
- memory capacity pressure
- memory bandwidth contention
- NUMA / cross-device synchronization overhead
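A back-of-the-envelope KV-cache estimator over the growth factors listed above. All dimensions are illustrative assumptions for a 405B-class model (126 layers, grouped-query attention with 8 KV heads × 128 head dim, FP16 cache); dense attention without GQA would land in a much higher range:

```python
def kv_cache_bytes(n_layers: int, context_len: int, n_users: int,
                   kv_width: int, bytes_per_elem: int = 2) -> int:
    """K and V each store a [context_len, kv_width] tensor
    per layer, per user -- hence the leading factor of 2."""
    return 2 * n_layers * context_len * n_users * kv_width * bytes_per_elem

# Hypothetical dims: 126 layers, 64K context, GQA KV width 8 * 128 = 1024
per_user = kv_cache_bytes(126, 64 * 1024, 1, 8 * 128)
print(f"{per_user / 1e9:.1f} GB per user")  # ~33.8 GB; x32 users ≈ 1.1 TB
```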
Synchronization bottleneck
Large-scale inference requires:
- tensor parallelism
- pipeline parallelism
- collective operations (all-reduce)
Typical latency:
- ~1–10 microseconds across large GPU clusters
At very high bandwidth:
- communication latency becomes dominant
- further scaling requires sub-microsecond synchronization
This creates a hard scaling wall.
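The wall can be made concrete with a toy model: tensor parallelism divides the weight read across chips, but the per-token collective latency is a fixed cost. The counts below are assumptions (800 GB/token, 4 TB/s chips, ~250 all-reduces per token at 2 µs each), not measurements:

```python
def utps_with_sync(bytes_per_token: float, bw_per_chip: float,
                   n_chips: int, syncs_per_token: int,
                   sync_latency_s: float) -> float:
    """Per-token time = sharded weight read + fixed collective latency.
    As n_chips grows, the latency term dominates and UTPS plateaus."""
    read_time = bytes_per_token / (bw_per_chip * n_chips)
    comm_time = syncs_per_token * sync_latency_s
    return 1.0 / (read_time + comm_time)

# UTPS saturates near 1 / (250 * 2e-6) = 2000 no matter how many chips
for n in (8, 64, 512, 4096):
    print(n, round(utps_with_sync(800e9, 4e12, n, 250, 2e-6)))
```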
Observed hardware scaling limits
| Memory tech | Bandwidth per chip | Expected UTPS (405B model, multi-chip system) |
|---|---|---|
| HBM3e | ~4 TB/s | ~750–800 |
| HBM4 | ~18 TB/s | ~1500 |
| 3D stacked DRAM | ~30 TB/s | ~2000–2800 |
| SRAM-only accelerator | ~117 TB/s | ~2500+ (capacity limited) |
Beyond ~3000 UTPS, pure hardware scaling yields diminishing returns.
Reaching >10k UTPS likely requires algorithmic changes.
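Speculative decoding is one such algorithmic change: a cheap draft model proposes k tokens and the full model verifies them in a single bandwidth-limited pass. A hedged sketch of the standard expectation (acceptance rate and draft length are assumptions, and draft-model cost is ignored):

```python
def expected_tokens_per_verify(k: int, accept_rate: float) -> float:
    """With a k-token draft and per-token acceptance probability a,
    one full-model pass emits 1 + a + a^2 + ... + a^k tokens on average."""
    a = accept_rate
    return sum(a ** i for i in range(k + 1))

# One ~5 tok/s roofline pass can now yield several user-visible tokens
base_utps = 5.0
multiplier = expected_tokens_per_verify(4, 0.8)  # ≈ 3.36 tokens per pass
print(f"effective UTPS ≈ {base_utps * multiplier:.1f}")
```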
Important insight
Modern LLM inference is a data-movement problem disguised as a compute problem.
Compute utilization during decoding can be below 1%.
This is counter-intuitive for engineers trained in HPC or deep learning training.
Implications
For local LLM setups
The most important factors are:
- VRAM bandwidth
- RAM bandwidth (for CPU inference)
- memory topology
- interconnect latency
Clock speed or FLOPS matter less than expected.
This explains:
- why fast RAM improves tokens/sec
- why multi-GPU scaling is difficult
- why quantization helps mostly by reducing bandwidth pressure
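Applying the same roofline to local hardware illustrates all three bullets. The model size and bandwidth figures are illustrative assumptions, not benchmarks:

```python
def roofline_tps(params: float, bytes_per_param: float, bw: float) -> float:
    # Upper bound on decode speed: bandwidth over bytes moved per token
    return bw / (params * bytes_per_param)

# Hypothetical 8B-parameter model on two illustrative memory systems
for name, bw in [("dual-channel DDR5 CPU", 80e9),
                 ("consumer GPU VRAM", 1000e9)]:
    fp16 = roofline_tps(8e9, 2.0, bw)  # ~16 GB moved per token
    int4 = roofline_tps(8e9, 0.5, bw)  # ~4 GB per token: 4x fewer bytes
    print(f"{name}: {fp16:.0f} tok/s FP16, {int4:.0f} tok/s INT4")
```

INT4 is ~4x faster than FP16 at the same bandwidth: quantization helps by shrinking the bytes moved, not by saving compute.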
For future model design
To scale inference speed, the industry may move toward:
- sparse MoE routing
- speculative decoding
- state-space models
- smaller context windows
- KV-cache compression
- weight streaming optimizations
- on-chip SRAM accelerators
Metrics
UTPS - User Tokens Per Second
Measures:
- perceived token generation speed for a single user
Approximation:
UTPS ≈ 1 / inter_token_latency
STPS - System Tokens Per Second
Measures:
- total throughput across all users
Trade-off:
- increasing concurrency → lower UTPS
- but higher STPS (until saturation)
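The trade-off follows from the bandwidth model: one decode step reads the weights once for the whole batch, but each user adds their own KV-cache read. A sketch with assumed figures (800 GB weights, 34 GB KV cache per user, 4 TB/s):

```python
def step_time(weight_bytes: float, kv_bytes_per_user: float,
              batch: int, bw: float) -> float:
    """One decode step: weights read once, KV cache read per user."""
    return (weight_bytes + batch * kv_bytes_per_user) / bw

for batch in (1, 8, 32):
    t = step_time(800e9, 34e9, batch, 4e12)
    utps, stps = 1 / t, batch / t
    print(f"batch={batch}: UTPS {utps:.1f}, STPS {stps:.1f}")
```

Growing the batch amortizes the weight read (STPS rises) while each step gets slower (UTPS falls).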
Connections
- KV-cache scaling problem in LLM inference
- transformer_autoregressive_decoding
- llm_quantization_bandwidth_tradeoffs
- multi_gpu_collective_latency
- speculative_decoding
- moe_vs_dense_models
- local_llm_hardware_requirements