Serving large language models at production scale demands a qualitatively different approach to compute resource management than any prior generation of ML inference workload. The autoregressive nature of LLM decoding creates unique bottlenecks around memory, batching efficiency, and latency that conventional serving infrastructure does not address. This paper provides a systematic treatment of the techniques that constitute state-of-the-art LLM inference: continuous batching, PagedAttention-based KV cache management, speculative decoding, quantisation formats, and tensor parallelism. We present empirical throughput and latency measurements from production deployments and analyse trade-offs at each layer. Our results demonstrate that a well-tuned serving stack achieves 4–8× higher throughput than naive static batching at equivalent latency budgets, with corresponding reductions in per-token inference cost of 60–89%.
1. The Inference Cost Problem
The economics of LLM deployment at scale are defined by inference cost. Training a large language model is expensive but amortised across millions of inference calls. Inference is the ongoing operational expense — every query, every day, on production hardware. For organisations serving LLMs at scale, inference cost accounts for 60–80% of total AI infrastructure spend.
The problem is structural. Unlike discriminative models that produce a fixed-size output in a single forward pass, LLMs generate variable-length outputs autoregressively: each output token requires a full forward pass through the model, with attention computed over all prior tokens, prompt and previously generated output alike. A 200-token response requires 200 sequential forward passes. This sequential dependency is the fundamental constraint that makes LLM inference qualitatively harder to optimise than conventional ML inference.
LLM inference optimisation is a memory-compute trade-off problem. GPU memory limits batch size and thus throughput; compute limits per-token latency. Every major technique — continuous batching, KV cache management, quantisation, speculative decoding — addresses one or both dimensions of this tension.
2. Anatomy of LLM Inference
2.1 Prefill and Decode Phases
LLM inference has two distinct phases with different computational characteristics. The prefill phase processes the entire input prompt in a single forward pass, computing attention over all input tokens in parallel — it is compute-bound and GPU utilisation is high. The decode phase generates output tokens one at a time, with each step requiring a forward pass that attends over the growing sequence. This phase is memory-bandwidth-bound: model weights and the KV cache must be read from GPU memory at every step, but only a trivial amount of new computation is performed.
This asymmetry has significant implications for scheduling. A naive serving system interleaves prefill and decode requests in the same GPU batch, causing utilisation oscillation. Production serving frameworks distinguish between phases and schedule them differently — prioritising low-latency decode steps while batching prefill operations efficiently.
2.2 The KV Cache
During decoding, attention requires access to the key and value vectors for every previously generated token. Re-computing these from scratch at each decode step would be prohibitively expensive. The key-value (KV) cache stores computed key and value vectors across decode steps, trading GPU memory for compute. The KV cache size for a single sequence grows linearly with sequence length and scales with the number of attention layers, the number of heads, and the head dimension.
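To make the growth concrete, the footprint can be computed directly from the attention geometry. The figures below assume a Llama-3-70B-style layout (80 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16); these numbers are illustrative assumptions, not measurements:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size for one sequence: 2 tensors (K and V) per layer,
    each of shape [num_kv_heads, seq_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed Llama-3-70B-style geometry: 80 layers, 8 KV heads (GQA),
# head_dim 128, 2 bytes per value (fp16).
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)       # 327,680 B ~ 320 KB/token
full_seq = kv_cache_bytes(80, 8, 128, seq_len=4096)     # 1.25 GiB per sequence
```

At roughly 320 KB per token, a single 4096-token sequence consumes about 1.25 GiB, which is why KV cache capacity, not compute, typically caps the batch size.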
2.3 Memory Bottlenecks
The GPU memory budget must accommodate three competing demands: model weights, active KV caches, and activation memory. Memory fragmentation compounds this: if each sequence's KV cache is allocated contiguously, completed sequences cannot easily return their memory. A sequence allocated 4096 tokens of KV cache that generates only 200 tokens wastes the remainder. PagedAttention was specifically designed to eliminate this fragmentation.
3. Static vs. Continuous Batching
The most impactful single optimisation in LLM serving is replacing static batching with continuous batching. In static batching, the server collects a group of requests, pads them to the same length, processes the entire batch until the longest sequence completes, then starts the next batch. The critical inefficiency: shorter sequences must wait for the longest sequence before their GPU slot is freed. As the batch progresses and sequences complete, GPU utilisation collapses.
In continuous batching[1] (iteration-level batching), the server operates at the granularity of individual decode steps. At each step, completed sequences are immediately removed and their slots filled by waiting requests. The batch composition changes dynamically at every forward pass, maintaining near-constant GPU utilisation throughout.
| Metric | Static Batching | Continuous Batching | Δ |
|---|---|---|---|
| GPU utilisation (avg) | 34% | 78% | +129% |
| Throughput (tokens/s) | 2,100 | 8,800 | +4.2× |
| P50 time-to-first-token | 380ms | 290ms | −24% |
| P95 time-to-first-token | 4,200ms | 890ms | −79% |
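The iteration-level scheduling behind these numbers can be sketched as a toy event loop. The queue and sequence structures here are illustrative, not any framework's API:

```python
from collections import deque

def continuous_batching_step(running, waiting, max_batch, decode_one):
    """One iteration: decode every running sequence, evict finished ones,
    and immediately backfill freed slots from the wait queue."""
    finished = []
    for seq in list(running):
        decode_one(seq)                       # one token for this sequence
        if seq["generated"] >= seq["max_tokens"]:
            running.remove(seq)               # slot freed mid-batch, not at
            finished.append(seq)              # end-of-batch as in static batching
    while waiting and len(running) < max_batch:
        running.append(waiting.popleft())     # backfill at iteration granularity
    return finished

# Toy driver: sequences of very different lengths share the batch
# without padding waste or tail-latency coupling.
def decode_one(seq):
    seq["generated"] += 1

waiting = deque({"id": i, "generated": 0, "max_tokens": n}
                for i, n in enumerate([3, 7, 5, 2, 9]))
running, done = [], []
while waiting or running:
    while waiting and len(running) < 3:
        running.append(waiting.popleft())
    done += continuous_batching_step(running, waiting, max_batch=3,
                                     decode_one=decode_one)
```

The 2-token sequence exits after 2 iterations and its slot is reused immediately; under static batching it would have idled until the 9-token sequence finished.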
4. PagedAttention and vLLM
PagedAttention[2], the core innovation underlying vLLM, addresses KV cache memory fragmentation by applying virtual memory paging to GPU memory management. Rather than allocating a contiguous block for each sequence's KV cache upfront, PagedAttention divides the KV cache into fixed-size blocks (typically 16 tokens per block) allocated on demand as sequences grow.
The compound benefits: (a) memory waste from over-allocation is eliminated; (b) memory is reclaimed immediately when sequences complete, without fragmentation; (c) sequences can share KV cache blocks — a critical optimisation for parallel sampling and prefix caching. Prefix caching stores KV blocks for common prompt prefixes across requests. On workloads with a fixed 512-token system prompt, prefix caching reduces time-to-first-token by 35–50%.
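A minimal sketch of on-demand block allocation, using the 16-token block size mentioned above; `BlockAllocator` is a toy illustration, not vLLM's internal interface:

```python
BLOCK_TOKENS = 16  # tokens per KV block

class BlockAllocator:
    """Toy paged KV allocator: each sequence maps logical blocks to
    arbitrary free physical blocks, so freed memory never fragments."""

    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def ensure_capacity(self, seq_id, seq_len):
        """Allocate blocks on demand as the sequence grows."""
        table = self.block_tables.setdefault(seq_id, [])
        needed = -(-seq_len // BLOCK_TOKENS)   # ceil division
        while len(table) < needed:
            table.append(self.free.pop())      # any free physical block will do

    def release(self, seq_id):
        """On completion, every block is immediately reusable."""
        self.free.extend(self.block_tables.pop(seq_id))

alloc = BlockAllocator(num_physical_blocks=8)
alloc.ensure_capacity("seq_a", seq_len=40)  # 3 blocks, not a 4096-token reservation
alloc.ensure_capacity("seq_b", seq_len=20)  # 2 blocks
alloc.release("seq_a")                      # 3 blocks back in the free pool
```

Per-sequence waste is bounded by one partially filled block (under 16 tokens) instead of the thousands of tokens a contiguous over-allocation can strand; prefix caching extends the same idea by pointing multiple block tables at shared physical blocks.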
5. Speculative Decoding
Speculative decoding[3] exploits an asymmetry in LLM inference: verifying a proposed token sequence is cheaper than generating it autoregressively. A small, fast draft model proposes K candidate tokens; the large target model verifies all K tokens in a single parallel forward pass. If all K draft tokens are accepted, the system has generated K tokens in roughly the time of one decode step. If a draft token is rejected, the remaining draft tokens are discarded and generation resumes from the first rejection point with the target model's own token.
Speculative decoding is most effective when: (a) the output distribution is predictable (factual QA, structured generation, summarisation); (b) the serving system is memory-bandwidth-bound (small batch sizes); (c) a compatible draft model of appropriate size (10–30× smaller than target) is available. It offers limited benefit under high-batch compute-bound conditions.
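The accept/verify logic can be sketched with a greedy variant (production systems use rejection sampling over full token distributions); `draft_fn` and `target_fn` are hypothetical stand-ins for the two models:

```python
def speculative_step(prefix, draft_fn, target_fn, k):
    """Draft k tokens cheaply, then verify them against the target model.
    Greedy variant: accept the longest agreeing prefix, substitute the
    target's token at the first mismatch, or append a bonus token if all agree."""
    ctx = list(prefix)
    draft = []
    for _ in range(k):                   # k cheap sequential draft steps
        t = draft_fn(ctx)
        draft.append(t)
        ctx.append(t)
    # On a GPU, target predictions for all k positions come from ONE
    # parallel forward pass; we simulate that position by position.
    ctx = list(prefix)
    accepted = []
    for t in draft:
        target_t = target_fn(ctx)
        if target_t == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_t)    # correction at the first mismatch
            break
    else:
        accepted.append(target_fn(ctx))  # all k accepted: free bonus token
    return accepted

# Toy models: "next token" = context length; the draft disagrees at odd lengths.
target_fn = lambda ctx: len(ctx)
draft_fn = lambda ctx: len(ctx) if len(ctx) % 2 == 0 else -1

speculative_step([0, 1], draft_fn, target_fn, k=3)   # partial accept: [2, 3]
speculative_step([0, 1], target_fn, target_fn, k=3)  # full accept: [2, 3, 4, 5]
```

Even a partial accept nets at least one token per target pass, so the worst case matches plain decoding while the best case yields K+1 tokens.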
6. Quantisation: INT8, INT4, AWQ, GPTQ
Quantisation reduces the numerical precision of model weights from bfloat16/float16 to lower-bit representations, shrinking the memory footprint and increasing throughput: because decode is memory-bandwidth-bound, reading fewer bytes per weight directly speeds it up, and hardware with integer arithmetic units adds further gains.
| Format | Bits/param | Memory (70B) | Throughput | Quality Loss (MMLU) |
|---|---|---|---|---|
| bfloat16 (baseline) | 16 | 140 GB | 1.0× | — |
| INT8 (LLM.int8) | 8 | 70 GB | 1.5× | ~0.1% |
| GPTQ INT4 | 4 | 35 GB | 2.8× | 0.5–1.5% |
| AWQ INT4 | 4 | 35 GB | 3.1× | 0.3–0.8% |
| GGUF Q4_K_M | ~4.5 | 40 GB | 2.2× | ~0.6% |
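The core mechanics behind these formats can be illustrated with a symmetric per-channel INT4 round trip. This is a sketch of the principle only; AWQ and GPTQ additionally choose scales and rounding to minimise activation-aware error, which is where their quality edge over naive rounding comes from:

```python
def quantize_int4(channel):
    """Symmetric per-channel INT4: map floats onto integers in [-8, 7]
    using a single per-channel scale, scale = max|w| / 7."""
    scale = max(abs(w) for w in channel) / 7 or 1.0  # avoid a zero scale
    q = [max(-8, min(7, round(w / scale))) for w in channel]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

channel = [0.42, -0.91, 0.07, 1.30, -0.55]
q, scale = quantize_int4(channel)
recon = dequantize(q, scale)
# 4 bits per weight instead of 16: a 4x memory reduction, with per-weight
# reconstruction error bounded by scale / 2.
```

The scale is stored alongside each channel's integers, which is why real 4-bit formats average slightly more than 4 bits per parameter (as the GGUF row in the table shows).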
7. Tensor and Pipeline Parallelism
Tensor parallelism (Megatron-LM style[4]) partitions individual weight matrices across GPUs. Every decode step requires inter-GPU AllReduce communication, so tensor parallelism demands high-bandwidth NVLink interconnects. It scales well to 8 GPUs within a single node and achieves near-linear throughput scaling up to the NVLink bandwidth limit.
Pipeline parallelism assigns consecutive transformer layers to consecutive GPUs. It requires less inter-GPU communication (only activations at layer boundaries) but introduces pipeline bubbles. It is preferred for multi-node deployments where inter-GPU bandwidth is limited by Ethernet or InfiniBand. In practice, most production deployments use tensor parallelism (TP=4 or TP=8) within a node as the default strategy.
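The tensor-parallel partitioning can be sketched in pure Python with a row-parallel matrix-vector product; the lists stand in for per-GPU shards, and the final elementwise sum plays the role of the AllReduce that runs once per layer per decode step:

```python
def matvec(W, x):
    """Reference dense product: W is a list of rows, x a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def row_parallel_matvec(W, x, tp):
    """Split the input dimension of W (and x) across `tp` shards; each
    shard computes a partial product, and an AllReduce (here: an
    elementwise sum) combines them into the full output on every rank."""
    chunk = len(x) // tp
    partials = []
    for r in range(tp):
        lo, hi = r * chunk, (r + 1) * chunk
        W_shard = [row[lo:hi] for row in W]   # each rank holds a slice of W
        partials.append(matvec(W_shard, x[lo:hi]))
    return [sum(p[i] for p in partials) for i in range(len(W))]

W = [[1, 2, 3, 4], [5, 6, 7, 8]]
x = [1, 1, 1, 1]
assert row_parallel_matvec(W, x, tp=2) == matvec(W, x)
```

The shards never exchange weights, only the small partial-output vectors, which is exactly the per-step AllReduce traffic that makes NVLink bandwidth the scaling limit.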
8. Serving Frameworks Compared
| Framework | Continuous Batching | PagedAttention | Quantisation | Best For |
|---|---|---|---|---|
| vLLM | ✓ | ✓ | AWQ, GPTQ, INT8 | HuggingFace models, high throughput |
| TGI | ✓ | ✓ (Flash) | GPTQ, AWQ, INT8 | HuggingFace ecosystem, REST API |
| Triton + TensorRT-LLM | Partial | — | FP8, INT4 | Max efficiency, NVIDIA-only stack |
| SGLang | ✓ | ✓ | AWQ, GPTQ | Structured generation, agents |
| llama.cpp / Ollama | Limited | — | GGUF Q4–Q8 | CPU / edge / development |
9. Production Architecture Patterns
A production serving system requires more than a well-configured inference server. The complete architecture includes request routing, rate limiting, observability, autoscaling, and cost attribution.
[Architecture diagram: gateway (auth, rate limiting) → response cache → cost-aware model selector → GPU inference pool, with observability instrumented across all layers.]
10. Empirical Results
Results from a production deployment serving Llama-3 70B on 4×A100 80GB SXM (NVLink), AWQ INT4, mixed enterprise RAG workload (mean: 512 input tokens, 350 output tokens; P95: 2048 input, 1200 output).
| Optimisation Stage | Throughput | P50 TTFT | P95 TTFT | GPU Util. | Cost/1M tok |
|---|---|---|---|---|---|
| Naive static batching, fp16 | 1,840 t/s | 420ms | 4,800ms | 31% | $4.20 |
| + Continuous batching | 6,200 t/s | 310ms | 1,100ms | 67% | $1.24 |
| + AWQ INT4 quantisation | 9,800 t/s | 220ms | 780ms | 74% | $0.78 |
| + PagedAttention (vLLM) | 12,400 t/s | 190ms | 610ms | 82% | $0.62 |
| + Speculative decoding | 15,100 t/s | 180ms | 540ms | 86% | $0.51 |
11. Conclusion
The gap between a naive and a well-optimised LLM inference deployment is larger than any other gap in the ML production stack. The techniques described in this paper — continuous batching, PagedAttention, speculative decoding, AWQ quantisation, and tensor parallelism — are all available in open-source frameworks today, primarily through vLLM and TGI.
Priority order for implementation: (1) switch to continuous batching and a framework that implements it — this is the single largest improvement; (2) apply AWQ INT4 if memory is the binding constraint; (3) enable prefix caching for workloads with common prompt prefixes; (4) evaluate speculative decoding for latency-sensitive, low-batch workloads. Each can be implemented incrementally and produces measurable, attributable improvements in throughput and cost.