VedhaAI Research · Technical Note No. 2

LLM Inference at Scale: Continuous Batching and Speculative Decoding

VedhaAI Engineering

VedhaAI Inc. · Toronto, Ontario, Canada

Abstract

Serving large language models at production scale demands a qualitatively different approach to compute resource management than any prior generation of ML inference workload. The autoregressive nature of LLM decoding creates unique bottlenecks around memory, batching efficiency, and latency that conventional serving infrastructure does not address. This paper provides a systematic treatment of the techniques that constitute state-of-the-art LLM inference: continuous batching, PagedAttention-based KV cache management, speculative decoding, quantisation formats, and tensor parallelism. We present empirical throughput and latency measurements from production deployments and analyse trade-offs at each layer. Our results demonstrate that a well-tuned serving stack achieves 4–8× higher throughput than naive static batching at equivalent latency budgets, with corresponding reductions in per-token inference cost of 60–89%.

§ 1

The Inference Cost Problem

The economics of LLM deployment at scale are defined by inference cost. Training a large language model is expensive but amortised across millions of inference calls. Inference is the ongoing operational expense — every query, every day, on production hardware. For organisations serving LLMs at scale, inference cost accounts for 60–80% of total AI infrastructure spend.

The problem is structural. Unlike discriminative models that produce a fixed-size output in a single forward pass, LLMs generate variable-length outputs autoregressively: each output token requires a full forward pass through the model, and the sequence of all prior output tokens must be attended over. A 200-token response requires 200 sequential forward passes. This sequential dependency is the fundamental constraint that makes LLM inference qualitatively harder to optimise than conventional ML inference.

Core Tension

LLM inference optimisation is a memory-compute trade-off problem. GPU memory limits batch size and thus throughput; compute limits per-token latency. Every major technique — continuous batching, KV cache management, quantisation, speculative decoding — addresses one or both dimensions of this tension.

§ 2

Anatomy of LLM Inference

2.1 Prefill and Decode Phases

LLM inference has two distinct phases with different computational characteristics. The prefill phase processes the entire input prompt in a single forward pass, computing attention over all input tokens in parallel — it is compute-bound and GPU utilisation is high. The decode phase generates output tokens one at a time, with each step requiring a forward pass that attends over the growing sequence. This phase is memory-bandwidth-bound: model weights and the KV cache must be read from GPU memory at every step, but only a trivial amount of new computation is performed.
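A back-of-envelope roofline makes the bandwidth bound concrete. The sketch below is illustrative, not a measurement: the weight size, KV cache size, and HBM bandwidth figures are assumed round numbers for a 70B-class model on A100-class hardware.

```python
def decode_tokens_per_sec(weight_bytes: float, kv_bytes: float,
                          hbm_bw_bytes_per_s: float) -> float:
    """Upper bound on single-sequence decode rate: every decode step must
    stream the full weight set plus the sequence's KV cache from HBM."""
    return hbm_bw_bytes_per_s / (weight_bytes + kv_bytes)

# Assumed figures: 70B params in bfloat16 (~140 GB of weights), ~1.3 GB of
# KV cache for a long sequence, ~2 TB/s effective HBM bandwidth.
rate = decode_tokens_per_sec(140e9, 1.3e9, 2.0e12)
print(f"~{rate:.0f} tokens/s upper bound for one sequence")
```

This is why batching matters so much in the decode phase: the same weight read is amortised across every sequence in the batch, so aggregate throughput scales with batch size until the KV cache exhausts memory or the workload becomes compute-bound.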

This asymmetry has significant implications for scheduling. A naive serving system interleaves prefill and decode requests in the same GPU batch, causing utilisation oscillation. Production serving frameworks distinguish between phases and schedule them differently — prioritising low-latency decode steps while batching prefill operations efficiently.

2.2 The KV Cache

During decoding, attention requires access to the key and value vectors for every previously generated token. Re-computing these from scratch at each decode step would be prohibitively expensive. The key-value (KV) cache stores computed key and value vectors across decode steps, trading GPU memory for compute. The KV cache size for a single sequence grows linearly with sequence length and scales with number of attention layers, heads, and head dimension.

KV cache (bytes) = 2 × n_layers × n_kv_heads × d_head × seq_len × sizeof(dtype)
For a 70B-class model with full multi-head attention (80 layers, 64 heads, head dim 128) in bfloat16, this is ≈ 2.6 MB/token; Llama-3 70B's grouped-query attention (8 KV heads) reduces it 8× to ≈ 0.33 MB/token, so a 4096-token sequence consumes ≈ 1.3 GB. With model weights consuming ≈ 140 GB across two A100 80GB GPUs, the remaining ≈ 20 GB of headroom must hold every concurrent sequence's cache, making the KV cache the binding constraint on batch size.
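The formula is easy to sanity-check in code. The model shape below (80 layers, head dim 128, 8 KV heads under grouped-query attention) is the commonly cited Llama-3 70B configuration; treat it as an assumption for illustration.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """KV cache size in bytes. The leading factor of 2 covers keys and
    values; dtype_bytes=2 corresponds to bfloat16."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * dtype_bytes

# Assumed Llama-3 70B shape: 80 layers, head dim 128, 64 query heads but
# only 8 KV heads under grouped-query attention (GQA).
mha_per_token = kv_cache_bytes(80, 64, 128, 1)  # full MHA: ~2.6 MB/token
gqa_per_token = kv_cache_bytes(80, 8, 128, 1)   # GQA: ~0.33 MB/token
gib_4k = kv_cache_bytes(80, 8, 128, 4096) / 2**30  # 4096-token sequence, GiB
print(mha_per_token, gqa_per_token, gib_4k)
```

The 8× reduction from GQA is precisely why modern architectures adopt it: it directly multiplies the number of sequences that fit in the post-weights memory headroom.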

2.3 Memory Bottlenecks

The GPU memory budget must accommodate three competing demands: model weights, active KV caches, and activation memory. Memory fragmentation compounds this: if each sequence's KV cache is allocated contiguously, completed sequences cannot easily return their memory. A sequence allocated 4096 tokens of KV cache that generates only 200 tokens wastes the remainder. PagedAttention was specifically designed to eliminate this fragmentation.

§ 3

Static vs. Continuous Batching

The most impactful single optimisation in LLM serving is replacing static batching with continuous batching. In static batching, the server collects a group of requests, pads them to the same length, processes the entire batch until the longest sequence completes, then starts the next batch. The critical inefficiency: shorter sequences must wait for the longest sequence before their GPU slot is freed. As the batch progresses and sequences complete, GPU utilisation collapses.

In continuous batching[1] (iteration-level batching), the server operates at the granularity of individual decode steps. At each step, completed sequences are immediately removed and their slots filled by waiting requests. The batch composition changes dynamically at every forward pass, maintaining near-constant GPU utilisation throughout.
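The iteration-level loop can be sketched in a few lines. `decode_step` below is a stand-in for one forward pass that advances every running sequence by one token; no real framework's API is assumed.

```python
from collections import deque

class Scheduler:
    """Toy iteration-level (continuous) batching scheduler."""

    def __init__(self, max_batch: int):
        self.max_batch = max_batch
        self.waiting: deque = deque()
        self.running: list = []

    def submit(self, seq) -> None:
        self.waiting.append(seq)

    def step(self, decode_step) -> list:
        # Refill free slots before *every* forward pass, not once per batch:
        # this is the difference from static batching.
        while self.waiting and len(self.running) < self.max_batch:
            self.running.append(self.waiting.popleft())
        decode_step(self.running)
        finished = [s for s in self.running if s.done]
        self.running = [s for s in self.running if not s.done]
        return finished

# Demo with dummy sequences that finish after a preset number of tokens.
class Seq:
    def __init__(self, remaining: int):
        self.remaining, self.done = remaining, False

def decode_step(batch):
    for s in batch:
        s.remaining -= 1
        s.done = s.remaining == 0

sched = Scheduler(max_batch=2)
for n in (1, 3, 2):
    sched.submit(Seq(n))

completed = []
while sched.running or sched.waiting:
    completed.extend(sched.step(decode_step))
# The slot freed by the 1-token sequence is reused on the very next step.
```

In the demo, the third request enters the batch as soon as the shortest sequence completes, rather than waiting for the 3-token sequence to finish; that immediate slot reuse is what keeps GPU utilisation near-constant.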

Metric | Static Batching | Continuous Batching | Δ
GPU utilisation (avg) | 34% | 78% | +129%
Throughput (tokens/s) | 2,100 | 8,800 | +4.2×
P50 time-to-first-token | 380 ms | 290 ms | −24%
P95 time-to-first-token | 4,200 ms | 890 ms | −79%
Table 1. Static vs. continuous batching, Llama-3 8B on 2×A100 80GB, mixed-length workload (100–2048 output tokens), 50 req/s sustained load. The P95 tail latency improvement (−79%) is the most significant gain for user-facing applications.
§ 4

PagedAttention and vLLM

PagedAttention[2], the core innovation underlying vLLM, addresses KV cache memory fragmentation by applying virtual memory paging to GPU memory management. Rather than allocating a contiguous block for each sequence's KV cache upfront, PagedAttention divides the KV cache into fixed-size blocks (typically 16 tokens per block) allocated on demand as sequences grow.

The compound benefits: (a) memory waste from over-allocation is eliminated; (b) memory is reclaimed immediately when sequences complete, without fragmentation; (c) sequences can share KV cache blocks — a critical optimisation for parallel sampling and prefix caching. Prefix caching stores KV blocks for common prompt prefixes across requests. On workloads with a fixed 512-token system prompt, prefix caching reduces time-to-first-token by 35–50%.
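A toy block-table allocator illustrates the mechanism. Block size, class names, and the refcounting API are illustrative assumptions in the spirit of PagedAttention, not vLLM's actual interfaces.

```python
BLOCK_TOKENS = 16  # tokens per KV block (a typical choice)

class BlockPool:
    """Fixed-size KV blocks with refcounting, so prefixes can be shared."""

    def __init__(self, n_blocks: int):
        self.free = list(range(n_blocks))
        self.refcount = [0] * n_blocks

    def alloc(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # Prefix caching: another sequence references the same block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            # Reclaimed immediately and reusable by any sequence:
            # no contiguity requirement means no fragmentation.
            self.free.append(block)

class Sequence:
    def __init__(self, pool: BlockPool):
        self.pool, self.block_table, self.n_tokens = pool, [], 0

    def append_token(self) -> None:
        if self.n_tokens % BLOCK_TOKENS == 0:  # current block full: alloc one
            self.block_table.append(self.pool.alloc())
        self.n_tokens += 1

pool = BlockPool(n_blocks=4)
seq = Sequence(pool)
for _ in range(17):          # 17 tokens span two 16-token blocks
    seq.append_token()
```

Memory is allocated one block at a time as the sequence grows, so a sequence that stops at 200 tokens holds ⌈200/16⌉ blocks rather than a pre-reserved 4096-token region, and shared prefix blocks are freed only when their last referencing sequence releases them.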

Orchestration: Request Scheduler (Continuous Batching)
Memory: PagedAttention KV Cache Manager
Compute: FlashAttention · CUDA Graphs
Hardware: Tensor Parallel Workers (N × GPU)
Figure 1. vLLM serving stack. The scheduler implements continuous batching; the KV cache manager implements PagedAttention with prefix caching; FlashAttention and CUDA Graphs reduce per-step overhead; tensor parallel workers distribute model weights.
§ 5

Speculative Decoding

Speculative decoding[3] exploits an asymmetry in LLM inference: verifying a proposed token sequence is cheaper than generating it autoregressively. A small, fast draft model proposes K candidate tokens; the large target model verifies all K tokens in a single parallel forward pass. If all K draft tokens are accepted, the system has generated K tokens in the time of one decode step. If a draft token is rejected, decoding falls back to the first rejection point, where a corrected token is sampled from the target model's adjusted distribution, so the output distribution matches standard decoding exactly.

Expected speedup ≈ (1 − αᴷ⁺¹) / (1 − α) × (T_target / (T_target + K × T_draft))
Where α = draft model token acceptance rate. For α = 0.80, K = 5: expected ~2.8–3.1× throughput improvement at identical output distribution.
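Plugging the expectation into code makes the trade-off explorable. The draft-to-target cost ratio of 0.05 (a draft model roughly 20× faster) is an assumed figure for illustration; it lands in the quoted 2.8–3.1× range.

```python
def spec_decode_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Expected speculative decoding speedup over plain autoregression.

    alpha      -- per-token draft acceptance rate
    k          -- number of draft tokens proposed per round
    draft_cost -- T_draft / T_target (relative cost of one draft step)
    """
    # Expected accepted tokens per verification round (geometric series).
    tokens_per_round = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Each round costs one target pass plus k draft passes (units of T_target).
    round_cost = 1 + k * draft_cost
    return tokens_per_round / round_cost

print(round(spec_decode_speedup(alpha=0.80, k=5, draft_cost=0.05), 2))
```

Sweeping `alpha` shows why workload predictability dominates: the speedup grows sharply with acceptance rate, while a draft model that is too slow (large `draft_cost`) erodes the gain regardless of how often it is right.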

Speculative decoding is most effective when: (a) the output distribution is predictable (factual QA, structured generation, summarisation); (b) the serving system is memory-bandwidth-bound (small batch sizes); (c) a compatible draft model of appropriate size (10–30× smaller than target) is available. It offers limited benefit under high-batch compute-bound conditions.

§ 6

Quantisation: INT8, INT4, AWQ, GPTQ

Quantisation reduces numerical precision of model weights from bfloat16/float16 to lower-bit representations, reducing memory footprint and increasing throughput on hardware with integer arithmetic units.
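The core operation is simple to sketch. The snippet below is minimal symmetric per-channel quantisation, which conveys the idea behind INT8/INT4 formats; real AWQ and GPTQ additionally use activation statistics or Hessian-aware rounding, which this toy version omits.

```python
import numpy as np

def quantize(w: np.ndarray, bits: int = 8):
    """Symmetric per-row quantisation: one float scale per output channel."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
q, scale = quantize(w, bits=8)
err = np.abs(dequantize(q, scale) - w).max()
# Max roundtrip error is bounded by scale/2 per channel.
```

Storage drops from 4 bytes to 1 byte per weight (plus one scale per row), which is the mechanism behind the memory columns in the table below; the accuracy columns reflect how well each format controls the rounding error this sketch leaves unoptimised.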

Format | Bits/param | Memory (70B) | Throughput | Quality Loss (MMLU)
bfloat16 (baseline) | 16 | 140 GB | 1.0× | —
INT8 (LLM.int8) | 8 | 70 GB | 1.5× | ~0.1%
GPTQ INT4 | 4 | 35 GB | 2.8× | 0.5–1.5%
AWQ INT4 | 4 | 35 GB | 3.1× | 0.3–0.8%
GGUF Q4_K_M | ~4.5 | 40 GB | 2.2× | ~0.6%
Table 2. Quantisation format comparison for a 70B model. AWQ consistently achieves the best quality-throughput trade-off at 4-bit. GGUF is preferred for CPU or edge deployment. Use INT8 when quality degradation must be minimal.
§ 7

Tensor and Pipeline Parallelism

Tensor parallelism (Megatron-LM style[4]) partitions individual weight matrices across GPUs. Every decode step requires inter-GPU AllReduce communication, so tensor parallelism demands high-bandwidth NVLink interconnects. It scales well to 8 GPUs within a single node and achieves near-linear throughput scaling up to the NVLink bandwidth limit.

Pipeline parallelism assigns consecutive transformer layers to consecutive GPUs. It requires less inter-GPU communication (only activations at layer boundaries) but introduces pipeline bubbles. It is preferred for multi-node deployments where inter-GPU bandwidth is limited by Ethernet or InfiniBand. In practice, most production deployments use tensor parallelism (TP=4 or TP=8) within a node as the default strategy.
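Megatron-style tensor parallelism can be shown in miniature. Below, NumPy stands in for per-GPU computation: the linear layer's contraction dimension is sharded across workers, each computes a partial product, and a simulated AllReduce (a plain sum) combines them; this is a sketch of the communication pattern, not a distributed implementation.

```python
import numpy as np

def row_parallel_matmul(x: np.ndarray, w: np.ndarray, n_workers: int) -> np.ndarray:
    """Row-parallel linear layer: shard the inner dimension across workers."""
    x_shards = np.split(x, n_workers, axis=-1)   # each "GPU" sees a slice of x
    w_shards = np.split(w, n_workers, axis=0)    # and holds a shard of w
    partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
    return sum(partials)                          # AllReduce: sum of partials

x = np.arange(8.0).reshape(1, 8)
w = np.ones((8, 4))
assert np.allclose(row_parallel_matmul(x, w, n_workers=4), x @ w)
```

The `sum(partials)` step is exactly the per-decode-step AllReduce discussed above: it runs once per sharded layer per token, which is why tensor parallelism is only practical over NVLink-class interconnects.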

§ 8

Serving Frameworks Compared

Framework | Continuous Batching | PagedAttention | Quantisation | Best For
vLLM | ✓ | ✓ | AWQ, GPTQ, INT8 | HuggingFace models, high throughput
TGI | ✓ | ✓ (Flash) | GPTQ, AWQ, INT8 | HuggingFace ecosystem, REST API
Triton + TensorRT-LLM | ✓ | Partial | FP8, INT4 | Max efficiency, NVIDIA-only stack
SGLang | ✓ | ✓ | AWQ, GPTQ | Structured generation, agents
llama.cpp / Ollama | Limited | — | GGUF Q4–Q8 | CPU / edge / development
Table 3. Production serving framework comparison. vLLM and TGI are the most production-proven for cloud GPU deployments. TensorRT-LLM achieves highest raw throughput but requires complex model conversion. SGLang excels for complex agentic workloads requiring structured outputs.
§ 9

Production Architecture Patterns

A production serving system requires more than a well-configured inference server. The complete architecture includes request routing, rate limiting, observability, autoscaling, and cost attribution.

Clients → API Gateway (auth · rate limit) → Semantic Cache → Model Router (cost selector) → vLLM Cluster (GPU pool), with Langfuse providing observability across all stages.
Figure 2. Production LLM serving architecture. The model router selects the appropriate model tier based on request characteristics. On typical enterprise workloads, 60–70% of requests can be served by smaller models with less than 2% quality degradation, yielding substantial cost reduction.
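A cost-aware router of the kind Figure 2 describes can be sketched as a tiered lookup. The tier names, token thresholds, cost figures, and the `needs_reasoning` signal are all hypothetical illustrations, not a reference to any real routing product.

```python
# Hypothetical model tiers, cheapest first. Costs are illustrative $/1M tokens.
TIERS = [
    {"model": "small-8b",  "max_input_tokens": 1024, "cost_per_1m": 0.10},
    {"model": "large-70b", "max_input_tokens": 8192, "cost_per_1m": 0.62},
]

def route(input_tokens: int, needs_reasoning: bool) -> str:
    """Pick the cheapest tier whose limits the request fits within;
    escalate straight to the top tier when the task demands it."""
    if needs_reasoning:
        return TIERS[-1]["model"]
    for tier in TIERS:
        if input_tokens <= tier["max_input_tokens"]:
            return tier["model"]
    return TIERS[-1]["model"]  # fall back to the most capable tier
```

In practice the routing signal comes from a lightweight classifier or request metadata rather than a boolean flag; the point of the sketch is that routing is a pure, auditable function, which makes the 60–70% small-model share measurable per request class.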
§ 10

Empirical Results

Results from a production deployment serving Llama-3 70B on 4×A100 80GB SXM (NVLink), AWQ INT4, mixed enterprise RAG workload (mean: 512 input tokens, 350 output tokens; P95: 2048 input, 1200 output).

Optimisation Stage | Throughput | P50 TTFT | P95 TTFT | GPU Util. | Cost/1M tok
Naive static batching, fp16 | 1,840 t/s | 420 ms | 4,800 ms | 31% | $4.20
+ Continuous batching | 6,200 t/s | 310 ms | 1,100 ms | 67% | $1.24
+ AWQ INT4 quantisation | 9,800 t/s | 220 ms | 780 ms | 74% | $0.78
+ PagedAttention (vLLM) | 12,400 t/s | 190 ms | 610 ms | 82% | $0.62
+ Speculative decoding | 15,100 t/s | 180 ms | 540 ms | 86% | $0.51
Table 4. Cumulative optimisation impact at sustained 80% of peak request rate. The full optimisation stack achieves 8.2× throughput improvement and 88% cost reduction relative to the naive baseline, demonstrating that serving-layer optimisation is the highest-ROI investment in a production LLM stack.
§ 11

Conclusion

The gap between a naive and a well-optimised LLM inference deployment is larger than any other gap in the ML production stack. The techniques described in this paper — continuous batching, PagedAttention, speculative decoding, AWQ quantisation, and tensor parallelism — are all available in open-source frameworks today, primarily through vLLM and TGI.

Priority order for implementation: (1) switch to continuous batching and a framework that implements it — this is the single largest improvement; (2) apply AWQ INT4 if memory is the binding constraint; (3) enable prefix caching for workloads with common prompt prefixes; (4) evaluate speculative decoding for latency-sensitive, low-batch workloads. Each can be implemented incrementally and produces measurable, attributable improvements in throughput and cost.

References
[1] Yu, G., et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022.
[2] Kwon, W., et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention. SOSP 2023. arXiv:2309.06180.
[3] Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023. arXiv:2211.17192.
[4] Shoeybi, M., et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053.
[5] Lin, J., et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978.
[6] Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. arXiv:2205.14135.