Deploying LLMs in production without a systematic evaluation framework is engineering malpractice. Without measurement, teams cannot distinguish improvements from regressions, cannot detect silent quality degradation over time, and cannot make principled decisions about model updates. This paper describes a comprehensive evaluation pipeline architecture for enterprise LLM systems, covering: golden dataset construction and maintenance, automated reference-based and reference-free scoring metrics, LLM-as-judge evaluation patterns, hallucination and faithfulness assessment, human evaluation workflows, and integration with CI/CD pipelines for continuous quality assurance. We draw on operational experience from multiple production deployments and provide concrete implementation guidance, benchmark comparisons, and failure mode analysis. We demonstrate that teams with mature evaluation pipelines ship higher-quality systems faster and with greater confidence than teams that rely on ad hoc testing.
1 Why Evaluation is Hard for LLMs
Evaluating traditional ML systems is relatively straightforward: you hold out a test set, run inference, and compute a well-defined metric (accuracy, AUC, F1, RMSE). The metric is objective and reproducible, and it correlates meaningfully with business outcomes. LLM evaluation has none of these properties by default.
The fundamental challenge is that LLM outputs are open-ended and high-dimensional. Two responses that would receive identical ratings from a human expert — both correct, both clearly expressed, both appropriately cited — may differ in every surface feature: vocabulary, sentence structure, length, formatting. Metrics that compare surface form (BLEU, ROUGE) systematically undervalue paraphrase and overvalue verbatim repetition, making them poor proxies for actual quality.
A second challenge is task heterogeneity. A single production RAG system may handle factual questions, multi-step reasoning queries, summarisation requests, and structured data extraction — each requiring different evaluation criteria and metrics. No single metric captures quality across all task types.
Evaluation quality determines the speed of system improvement. A team with a high-quality evaluation pipeline can safely iterate 5–10× faster than a team without one, because every change can be validated quantitatively before deployment.
2 The Evaluation Stack: A Taxonomy
A complete LLM evaluation stack operates at four levels, each measuring different properties and operating at different cadences.
| Level | What it measures | Method | Cadence |
|---|---|---|---|
| Unit | Individual component quality (retriever, reranker, generator) | Component-level golden sets, automated metrics | Every commit |
| Integration | End-to-end pipeline quality on defined task distribution | RAGAS, LLM-as-judge, reference-based metrics | Pre-deployment, weekly |
| Regression | Quality changes relative to previous version or baseline | A/B comparison on golden set, statistical significance testing | Every release candidate |
| Production | Live quality signals, distribution drift, edge case discovery | Human sampling, implicit feedback, embedding drift detection | Continuous |
3 Golden Dataset Construction
3.1 What Belongs in a Golden Set
A golden dataset is a curated collection of (input, expected output) pairs that represents the task distribution the system will face in production, with sufficient coverage of edge cases, difficult queries, and failure modes that have been observed or anticipated. The minimum viable golden set for a production RAG system has three components:
- Core competency queries (60%): questions that a well-functioning system should answer correctly with high confidence. These establish the baseline.
- Edge cases and adversarial queries (25%): ambiguous questions, out-of-domain queries, questions with no answer in the corpus, queries designed to elicit hallucination. These are the queries that distinguish a robust system from a brittle one.
- Regression tests (15%): specific (query, failure) pairs from production incidents. Every production failure should generate at least one regression test.
3.2 Annotation Guidelines
Annotation quality determines evaluation quality. Vague annotation guidelines produce inconsistent labels that reduce the signal-to-noise ratio of the golden set. For enterprise RAG systems, we recommend annotating each golden pair with: a ground-truth answer, a list of acceptable paraphrases, a list of required facts that must be present in a correct answer, a list of forbidden claims (common hallucinations or near-miss errors), and a relevance label for each candidate passage in the corpus.
The relevance labels for corpus passages are particularly valuable: they enable retrieval evaluation to be conducted independently of generation evaluation, allowing the two subsystems to be improved separately.
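With per-passage relevance labels in place, retrieval can be scored on its own. A minimal sketch of Recall@K over passage IDs (the function name and signature are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labelled-relevant passage IDs that appear in the
    top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = sum(1 for pid in retrieved[:k] if pid in relevant)
    return hits / len(relevant)
```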
3.3 Dataset Versioning and Drift
A golden dataset is a living artefact that must be versioned and curated over time. As the underlying corpus changes, as new query patterns emerge in production, and as the task definition evolves, the golden set must be updated. We maintain golden sets in version-controlled repositories alongside the system code, with every golden set change logged with a justification. Ratchet metrics — where each release must equal or exceed the previous release's score — are enforced as CI/CD gates.
4 Reference-Based Automated Metrics
Reference-based metrics compare model outputs to one or more reference answers. They are fast, cheap, and reproducible — and they have well-documented limitations that must be understood before applying them.
| Metric | What it measures | Strengths | Limitations |
|---|---|---|---|
| ROUGE-L | Longest common subsequence overlap | Fast, interpretable, good for extractive tasks | Penalises paraphrase; poor for abstractive generation |
| BERTScore | Contextual embedding similarity | Handles paraphrase; correlates better with human judgement | Computationally heavier; model-dependent |
| Exact Match | Character-level identity | Unambiguous for structured outputs | Too strict for free-form generation |
| F1 (token-level) | Token overlap between prediction and reference | Standard SQuAD metric; interpretable | Surface-form bias; misses semantic equivalence |
| BLEURT | Learned similarity to human ratings | Best correlation with human judgement | Requires reference; domain-sensitive |
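The token-level F1 from the table is simple to compute. This sketch follows the standard SQuAD formulation but uses plain whitespace tokenisation and lowercasing; production implementations typically also strip punctuation and articles.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

The surface-form bias noted in the table is visible directly: semantically equivalent answers with disjoint vocabulary, such as "yes" versus "affirmative", score zero.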
5 Reference-Free Evaluation
5.1 LLM-as-Judge
Reference-free evaluation using a strong LLM as a judge has become the most practical approach for evaluating open-ended generation in enterprise settings[1]. Rather than comparing outputs to a reference answer, the judge LLM receives the query, the context, and the response and produces a quality rating on a defined rubric.
The key to reliable LLM-as-judge evaluation is a precise, detailed rubric that leaves minimal room for interpretation. The rubric should define exactly what constitutes a score of 1, 2, 3, 4, and 5 on each dimension (factual accuracy, answer completeness, citation quality, clarity) with concrete examples. Without a precise rubric, LLM judges exhibit significant positional bias, verbosity bias, and self-preference bias that corrupt the evaluation signal.
Mitigation strategies for known biases in LLM-as-judge evaluation: (a) use a judge model from a different family than the generator being evaluated; (b) always use chain-of-thought reasoning before the final score; (c) run pairwise comparisons rather than absolute scoring where possible; (d) calibrate against a set of human-labelled examples before deploying the judge.
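Mitigation (c) can be hardened with a position-swap consistency check: run each pairwise comparison twice with the responses in swapped positions and only accept verdicts that survive the swap. A sketch, where `judge_fn` is a placeholder for a call to the judge model that returns "A", "B", or "tie":

```python
def pairwise_verdict(judge_fn, query: str, resp_a: str, resp_b: str) -> str:
    """Run the judge twice with responses in swapped positions to
    neutralise positional bias; only a consistent verdict counts."""
    first = judge_fn(query, resp_a, resp_b)
    second = judge_fn(query, resp_b, resp_a)   # positions swapped
    # Translate the second verdict back into the original ordering
    swapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    if first == swapped:
        return first
    return "tie"  # inconsistent across orderings -> treat as a tie
```

A judge with pure positional bias (always preferring whichever response is shown first) collapses to "tie" under this scheme, while a judge with a genuine, order-independent preference keeps its verdict.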
5.2 RAGAS Framework
RAGAS[2] (Retrieval-Augmented Generation Assessment) provides a reference-free evaluation framework specifically designed for RAG systems. It computes four metrics: Faithfulness (are all claims in the response supported by the retrieved context?), Answer Relevance (how well does the response address the query?), Context Precision (are the retrieved passages relevant to the query?), and Context Recall (do the retrieved passages cover the expected answer?).
RAGAS is particularly valuable for diagnosing where in the pipeline quality is being lost. A system with high Context Precision but low Faithfulness indicates that retrieval is working but generation is hallucinating. A system with low Context Recall indicates that the retriever is not surfacing relevant passages, regardless of generation quality.
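That diagnostic reasoning can be encoded directly as a triage rule. A sketch, with an illustrative 0.7 threshold and hypothetical metric keys; real thresholds should be calibrated against the system's own score distributions.

```python
def diagnose(metrics: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Map RAGAS-style metric combinations to likely pipeline faults,
    following the interpretation rules described in the text."""
    issues = []
    if metrics["context_recall"] < threshold:
        issues.append("retriever not surfacing relevant passages")
    if (metrics["context_precision"] >= threshold
            and metrics["faithfulness"] < threshold):
        issues.append("generation hallucinating despite good retrieval")
    if metrics["answer_relevance"] < threshold:
        issues.append("response not addressing the query")
    return issues
```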
6 Hallucination and Faithfulness Assessment
Hallucination is the production failure mode with the highest business consequence in enterprise RAG deployments. A customer support system that confidently provides incorrect policy information, a legal system that cites non-existent cases, or a financial system that fabricates regulatory requirements — these failures erode trust and create liability in ways that latency or availability failures do not.
We distinguish between two types of hallucination in the RAG setting. Intrinsic hallucination occurs when the model generates content that contradicts the retrieved passages. Extrinsic hallucination occurs when the model generates content that is neither supported nor contradicted by the retrieved passages — content generated from parametric memory rather than the provided context.
The most reliable automated hallucination detection approach is NLI-based faithfulness checking: each sentence in the generated response is evaluated against each retrieved passage using a natural language inference model. Sentences that are classified as contradiction relative to any retrieved passage are flagged as intrinsic hallucinations; sentences classified as neutral relative to all retrieved passages are flagged as potential extrinsic hallucinations.
```python
from transformers import pipeline

nli = pipeline(
    "text-classification",
    model="cross-encoder/nli-deberta-v3-base",
)

def split_sentences(text: str) -> list[str]:
    # Naive period-based splitter; use a proper sentence
    # tokeniser in production.
    return [s.strip() for s in text.split(".") if s.strip()]

def faithfulness_score(response: str, contexts: list[str]) -> dict:
    sentences = split_sentences(response)
    results = {}
    for sent in sentences:
        # Classify the sentence against each retrieved context,
        # passing premise and hypothesis as an explicit text pair
        labels = []
        for ctx in contexts:
            r = nli({"text": ctx, "text_pair": sent})[0]
            labels.append(r["label"].upper())  # model emits lowercase labels
        # Contradiction by ANY context flags an intrinsic hallucination
        if "CONTRADICTION" in labels:
            results[sent] = "hallucinated"
        elif "ENTAILMENT" not in labels:
            results[sent] = "unsupported"
        else:
            results[sent] = "grounded"
    return results
```
7 Human Evaluation Workflows
Automated metrics are necessary but not sufficient. Human evaluation catches systematic failures that automated metrics miss, provides ground truth for calibrating automated metrics, and ensures the system is actually solving the business problem it was designed for. The challenge is that human evaluation is expensive, slow, and can be inconsistent without careful workflow design.
We recommend a tiered human evaluation approach. Expert evaluation by domain specialists is conducted on a small sample (30–50 examples) before each major release, with annotators using a detailed rubric and spending 10–15 minutes per example. Crowd evaluation using non-specialist annotators is conducted on larger samples (200–500 examples) for preference ranking tasks where domain expertise is not required. User evaluation captures implicit signals from real users through thumbs-up/thumbs-down buttons, query reformulation patterns, and session abandonment rates.
Inter-annotator agreement should be measured on every batch of human evaluation. A Cohen's κ below 0.6 indicates the annotation task is underspecified — the rubric is ambiguous and the human evaluation data is unreliable regardless of sample size.
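Cohen's κ is straightforward to compute for two annotators labelling the same items. A minimal sketch:

```python
def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each annotator's label distribution."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```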
8 CI/CD Integration and Regression Gating
The evaluation pipeline must be integrated into the CI/CD workflow as a blocking gate for production deployments. A system that passes unit tests and integration tests but degrades on the LLM evaluation golden set should not be promoted to production — regardless of the absence of traditional engineering failures.
Push → component eval → RAGAS eval → Δ vs baseline → human sample → canary deploy
Regression gating requires defining a ratchet: the minimum acceptable score on each metric, set equal to or above the current production score. We use a 95% confidence interval around the baseline score as the ratchet, computed via bootstrap resampling over the golden set, to account for metric variance on small evaluation sets.
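A sketch of the bootstrap ratchet described above; the resample count and seed are illustrative, and the gate compares the candidate's mean score against the lower edge of the baseline's 95% percentile-bootstrap interval.

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean score on a golden set."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def ratchet_gate(candidate_scores: list[float],
                 baseline_scores: list[float]) -> bool:
    """Block release when the candidate mean falls below the lower edge
    of the baseline's bootstrap confidence interval."""
    lower, _ = bootstrap_ci(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return candidate_mean >= lower
```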
9 Production Monitoring and Drift Detection
Production monitoring for LLM systems is qualitatively different from monitoring for deterministic software. In addition to traditional infrastructure metrics (latency, error rate, throughput), you must monitor output quality metrics in real time.
The most practical approach is sampling-based monitoring: a fraction of production queries (typically 1–5%) is routed through the automated evaluation pipeline and scored in near-real-time. Control charts over the quality metrics detect degradation before it becomes visible to the majority of users. Alert thresholds are set at 2σ below the historical mean for each metric.
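The one-sided control limit can be expressed in a few lines; this sketch uses the sample mean and standard deviation of the monitored metric's history.

```python
import statistics

def quality_alert(history: list[float], current: float,
                  n_sigma: float = 2.0) -> bool:
    """Flag when the current sampled quality score falls more than
    n_sigma below the historical mean (a one-sided control limit)."""
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)
    return current < mean - n_sigma * sigma
```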
Embedding drift detection catches a subtler failure mode: the distribution of production queries shifting away from the distribution on which the system was evaluated. If production query embeddings begin drifting from the evaluation set embeddings, it is a signal that the golden set is becoming unrepresentative and must be updated. We monitor query embedding drift daily using Maximum Mean Discrepancy (MMD) between a rolling window of production queries and the golden set.
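A sketch of squared MMD under an RBF kernel, assuming query embeddings arrive as NumPy arrays; the bandwidth `gamma` is an assumption and would normally be tuned (e.g. via the median heuristic) before setting a drift-alert threshold.

```python
import numpy as np

def mmd_rbf(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Squared Maximum Mean Discrepancy between two embedding samples
    under an RBF kernel; grows as the two distributions diverge."""
    def kernel(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Pairwise squared Euclidean distances via broadcasting
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return float(kernel(x, x).mean() + kernel(y, y).mean()
                 - 2 * kernel(x, y).mean())
```

In the monitoring loop, `x` would be a rolling window of production query embeddings and `y` the golden-set query embeddings; a sustained rise in the statistic signals that the golden set needs refreshing.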
10 Empirical Results
We compare two production RAG deployments — one with a mature evaluation pipeline implemented from day one, one where evaluation was retrofitted six months after launch — on several operational dimensions over a 12-month period.
| Operational Metric | Evaluation-First Team | Evaluation-Retrofit Team |
|---|---|---|
| Production incidents caused by silent quality regression | 1 | 9 |
| Mean time to detect quality degradation | 4 hours | 6.2 days |
| Deployment frequency (releases/month) | 8.2 | 3.1 |
| Time spent on manual QA per release | 2.4 hours | 18 hours |
| Final system quality (RAGAS composite, month 12) | 0.83 | 0.71 |
11 Recommendations
- Build the golden set before writing production code. The discipline of defining what “correct” means before building forces clarity about the task that prevents expensive course corrections later.
- Separate retrieval evaluation from generation evaluation. Track retrieval metrics (Recall@K, NDCG@K) and generation metrics (Faithfulness, Answer Relevance) independently. A regression in retrieval quality masked by a generation improvement is a pipeline that is becoming fragile.
- Calibrate your automated metrics against human labels. Before trusting an LLM-as-judge or an NLI-based faithfulness score, measure its correlation with human judgements on 100+ examples. An uncalibrated automated metric optimising for the wrong thing is worse than no metric.
- Treat every production failure as a golden set addition. Every incident where the system produced a wrong or harmful output should generate at least one new golden set entry. This ensures the evaluation set evolves with the system’s actual failure modes.
- Gate deployments on quality metrics, not just engineering tests. A system that passes unit tests but regresses on RAGAS Faithfulness by more than 2σ should not ship. This requires buy-in from product and engineering leadership and must be established as policy before the first deployment.