Deploying LLMs in production without a systematic evaluation framework is engineering malpractice. Without measurement, teams cannot distinguish improvements from regressions, cannot detect silent quality degradation over time, and cannot make principled decisions about model updates. This paper describes a comprehensive evaluation pipeline architecture for enterprise LLM systems, covering: golden dataset construction and maintenance, automated reference-based and reference-free scoring metrics, LLM-as-judge evaluation patterns, hallucination and faithfulness assessment, human evaluation workflows, and integration with CI/CD pipelines for continuous quality assurance. We draw on operational experience from multiple production deployments and provide concrete implementation guidance, benchmark comparisons, and failure mode analysis. We demonstrate that teams with mature evaluation pipelines ship higher-quality systems faster and with greater confidence than teams that rely on ad hoc testing.
1 Why Evaluation is Hard for LLMs
Evaluating traditional ML systems is relatively straightforward: you hold out a test set, run inference, and compute a well-defined metric (accuracy, AUC, F1, RMSE). The metric is objective and reproducible, and it correlates meaningfully with business outcomes. LLM evaluation has none of these properties by default.
The fundamental challenge is that LLM outputs are open-ended and high-dimensional. Two responses that would receive identical ratings from a human expert — both correct, both clearly expressed, both appropriately cited — may differ in every surface feature: vocabulary, sentence structure, length, formatting. Metrics that compare surface form (BLEU, ROUGE) systematically undervalue paraphrase and overvalue verbatim repetition, making them poor proxies for actual quality.
A second challenge is task heterogeneity. A single production RAG system may handle factual questions, multi-step reasoning queries, summarisation requests, and structured data extraction — each requiring different evaluation criteria and metrics. No single metric captures quality across all task types.
Evaluation quality determines the speed of system improvement. A team with a high-quality evaluation pipeline can safely iterate 5–10× faster than a team without one, because every change can be validated quantitatively before deployment.
2 The Evaluation Stack: A Taxonomy
A complete LLM evaluation stack operates at four levels, each measuring different properties and operating at different cadences.
| Level | What it measures | Method | Cadence |
|---|---|---|---|
| Unit | Individual component quality (retriever, reranker, generator) | Component-level golden sets, automated metrics | Every commit |
| Integration | End-to-end pipeline quality on defined task distribution | RAGAS, LLM-as-judge, reference-based metrics | Pre-deployment, weekly |
| Regression | Quality changes relative to previous version or baseline | A/B comparison on golden set, statistical significance testing | Every release candidate |
| Production | Live quality signals, distribution drift, edge case discovery | Human sampling, implicit feedback, embedding drift detection | Continuous |
3 Golden Dataset Construction
3.1 What Belongs in a Golden Set
A golden dataset is a curated collection of (input, expected output) pairs that represents the task distribution the system will face in production, with sufficient coverage of edge cases, difficult queries, and failure modes that have been observed or anticipated. The minimum viable golden set for a production RAG system has three components:
- Core competency queries (60%): questions that a well-functioning system should answer correctly with high confidence. These establish the baseline.
- Edge cases and adversarial queries (25%): ambiguous questions, out-of-domain queries, questions with no answer in the corpus, queries designed to elicit hallucination. These are the queries that distinguish a robust system from a brittle one.
- Regression tests (15%): specific (query, failure) pairs from production incidents. Every production failure should generate at least one regression test.
3.2 Annotation Guidelines
Annotation quality determines evaluation quality. Vague annotation guidelines produce inconsistent labels that reduce the signal-to-noise ratio of the golden set. For enterprise RAG systems, we recommend annotating each golden pair with: a ground-truth answer, a list of acceptable paraphrases, a list of required facts that must be present in a correct answer, a list of forbidden claims (common hallucinations or near-miss errors), and a relevance label for each candidate passage in the corpus.
The relevance labels for corpus passages are particularly valuable: they enable retrieval evaluation to be conducted independently of generation evaluation, allowing the two subsystems to be improved separately.
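With per-passage relevance labels in place, retrieval can be scored on its own. A minimal sketch of Recall@K over passage IDs (the function name and signature are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labelled-relevant passage IDs that appear in the
    top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = sum(1 for pid in retrieved[:k] if pid in relevant)
    return hits / len(relevant)
```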
3.3 Dataset Versioning and Drift
A golden dataset is a living artefact that must be versioned and curated over time. As the underlying corpus changes, as new query patterns emerge in production, and as the task definition evolves, the golden set must be updated. We maintain golden sets in version-controlled repositories alongside the system code, with every golden set change logged with a justification. Ratchet metrics — where each release must equal or exceed the previous release's score — are enforced as CI/CD gates.
4 Reference-Based Automated Metrics
Reference-based metrics compare model outputs to one or more reference answers. They are fast, cheap, and reproducible — and they have well-documented limitations that must be understood before applying them.
| Metric | What it measures | Strengths | Limitations |
|---|---|---|---|
| ROUGE-L | Longest common subsequence overlap | Fast, interpretable, good for extractive tasks | Penalises paraphrase; poor for abstractive generation |
| BERTScore | Contextual embedding similarity | Handles paraphrase; correlates better with human judgement | Computationally heavier; model-dependent |
| Exact Match | Character-level identity | Unambiguous for structured outputs | Too strict for free-form generation |
| F1 (token-level) | Token overlap between prediction and reference | Standard SQuAD metric; interpretable | Surface-form bias; misses semantic equivalence |
| BLEURT | Learned similarity to human ratings | Best correlation with human judgement | Requires reference; domain-sensitive |
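The token-level F1 from the table is simple to compute. This sketch follows the standard SQuAD formulation but uses plain whitespace tokenisation and lowercasing; production implementations typically also strip punctuation and articles.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

The surface-form bias noted in the table is visible directly: semantically equivalent answers with disjoint vocabulary, such as "yes" versus "affirmative", score zero.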
5 Reference-Free Evaluation
5.1 LLM-as-Judge
Reference-free evaluation using a strong LLM as a judge has become the most practical approach for evaluating open-ended generation in enterprise settings[1]. Rather than comparing outputs to a reference answer, the judge LLM receives the query, the context, and the response and produces a quality rating on a defined rubric.
The key to reliable LLM-as-judge evaluation is a precise, detailed rubric that leaves minimal room for interpretation. The rubric should define exactly what constitutes a score of 1, 2, 3, 4, and 5 on each dimension (factual accuracy, answer completeness, citation quality, clarity) with concrete examples. Without a precise rubric, LLM judges exhibit significant positional bias, verbosity bias, and self-preference bias that corrupt the evaluation signal.
Mitigation strategies for known biases in LLM-as-judge evaluation: (a) use a judge model from a different family than the generator being evaluated; (b) always use chain-of-thought reasoning before the final score; (c) run pairwise comparisons rather than absolute scoring where possible; (d) calibrate against a set of human-labelled examples before deploying the judge.
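Mitigation (c) can be hardened with a position-swap consistency check: run each pairwise comparison twice with the responses in swapped positions and only accept verdicts that survive the swap. A sketch, where `judge_fn` is a placeholder for a call to the judge model that returns "A", "B", or "tie":

```python
def pairwise_verdict(judge_fn, query: str, resp_a: str, resp_b: str) -> str:
    """Run the judge twice with responses in swapped positions to
    neutralise positional bias; only a consistent verdict counts."""
    first = judge_fn(query, resp_a, resp_b)
    second = judge_fn(query, resp_b, resp_a)   # positions swapped
    # Translate the second verdict back into the original ordering
    swapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    if first == swapped:
        return first
    return "tie"  # inconsistent across orderings -> treat as a tie
```

A judge with pure positional bias (always preferring whichever response is shown first) collapses to "tie" under this scheme, while a judge with a genuine, order-independent preference keeps its verdict.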
5.2 RAGAS Framework
RAGAS[2] (Retrieval-Augmented Generation Assessment) provides a reference-free evaluation framework specifically designed for RAG systems. It computes four metrics: Faithfulness (are all claims in the response supported by the retrieved context?), Answer Relevance (how well does the response address the query?), Context Precision (are the retrieved passages relevant to the query?), and Context Recall (do the retrieved passages cover the expected answer?).
RAGAS is particularly valuable for diagnosing where in the pipeline quality is being lost. A system with high Context Precision but low Faithfulness indicates that retrieval is working but generation is hallucinating. A system with low Context Recall indicates that the retriever is not surfacing relevant passages, regardless of generation quality.
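That diagnostic reasoning can be encoded directly as a triage rule. A sketch, with an illustrative 0.7 threshold and hypothetical metric keys; real thresholds should be calibrated against the system's own score distributions.

```python
def diagnose(metrics: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Map RAGAS-style metric combinations to likely pipeline faults,
    following the interpretation rules described in the text."""
    issues = []
    if metrics["context_recall"] < threshold:
        issues.append("retriever not surfacing relevant passages")
    if (metrics["context_precision"] >= threshold
            and metrics["faithfulness"] < threshold):
        issues.append("generation hallucinating despite good retrieval")
    if metrics["answer_relevance"] < threshold:
        issues.append("response not addressing the query")
    return issues
```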
6 Hallucination and Faithfulness Assessment
Hallucination is the production failure mode with the highest business consequence in enterprise RAG deployments. A customer support system that confidently provides incorrect policy information, a legal system that cites non-existent cases, or a financial system that fabricates regulatory requirements — these failures erode trust and create liability in ways that latency or availability failures do not.
We distinguish between two types of hallucination in the RAG setting. Intrinsic hallucination occurs when the model generates content that contradicts the retrieved passages. Extrinsic hallucination occurs when the model generates content that is neither supported nor contradicted by the retrieved passages — content generated from parametric memory rather than the provided context.
The most reliable automated hallucination detection approach is NLI-based faithfulness checking: each sentence in the generated response is evaluated against each retrieved passage using a natural language inference model. Sentences that are classified as contradiction relative to any retrieved passage are flagged as intrinsic hallucinations; sentences classified as neutral relative to all retrieved passages are flagged as potential extrinsic hallucinations.
```python
from transformers import pipeline

nli = pipeline(
    "text-classification",
    model="cross-encoder/nli-deberta-v3-base",
)

def split_sentences(text: str) -> list[str]:
    # Naive period-based splitter; use a proper sentence
    # tokeniser in production.
    return [s.strip() for s in text.split(".") if s.strip()]

def faithfulness_score(response: str, contexts: list[str]) -> dict:
    sentences = split_sentences(response)
    results = {}
    for sent in sentences:
        # Classify the sentence against each retrieved context,
        # passing premise and hypothesis as an explicit text pair
        labels = []
        for ctx in contexts:
            r = nli({"text": ctx, "text_pair": sent})[0]
            labels.append(r["label"].upper())  # model emits lowercase labels
        # Contradiction by ANY context flags an intrinsic hallucination
        if "CONTRADICTION" in labels:
            results[sent] = "hallucinated"
        elif "ENTAILMENT" not in labels:
            results[sent] = "unsupported"
        else:
            results[sent] = "grounded"
    return results
```
7 Human Evaluation Workflows
Automated metrics are necessary but not sufficient. Human evaluation catches systematic failures that automated metrics miss, provides ground truth for calibrating automated metrics, and ensures the system is actually solving the business problem it was designed for. The challenge is that human evaluation is expensive, slow, and can be inconsistent without careful workflow design.
We recommend a tiered human evaluation approach. Expert evaluation by domain specialists is conducted on a small sample (30–50 examples) before each major release, with annotators using a detailed rubric and spending 10–15 minutes per example. Crowd evaluation using non-specialist annotators is conducted on larger samples (200–500 examples) for preference ranking tasks where domain expertise is not required. User evaluation captures implicit signals from real users through thumbs-up/thumbs-down buttons, query reformulation patterns, and session abandonment rates.
Inter-annotator agreement should be measured on every batch of human evaluation. A Cohen's κ below 0.6 indicates the annotation task is underspecified — the rubric is ambiguous and the human evaluation data is unreliable regardless of sample size.
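Cohen's κ is straightforward to compute for two annotators labelling the same items. A minimal sketch:

```python
def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each annotator's label distribution."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```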
8 CI/CD Integration and Regression Gating
The evaluation pipeline must be integrated into the CI/CD workflow as a blocking gate for production deployments. A system that passes unit tests and integration tests but degrades on the LLM evaluation golden set should not be promoted to production — regardless of the absence of traditional engineering failures.
Push → component eval → RAGAS eval → Δ vs baseline → human sample → canary deploy
Regression gating requires defining a ratchet: the minimum acceptable score on each metric, set equal to or above the current production score. We use a 95% confidence interval around the baseline score as the ratchet, computed via bootstrap resampling over the golden set, to account for metric variance on small evaluation sets.
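A sketch of the bootstrap ratchet described above; the resample count and seed are illustrative, and the gate compares the candidate's mean score against the lower edge of the baseline's 95% percentile-bootstrap interval.

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean score on a golden set."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def ratchet_gate(candidate_scores: list[float],
                 baseline_scores: list[float]) -> bool:
    """Block release when the candidate mean falls below the lower edge
    of the baseline's bootstrap confidence interval."""
    lower, _ = bootstrap_ci(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return candidate_mean >= lower
```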
9 Production Monitoring and Drift Detection
Production monitoring for LLM systems is qualitatively different from monitoring for deterministic software. In addition to traditional infrastructure metrics (latency, error rate, throughput), you must monitor output quality metrics in real time.
The most practical approach is sampling-based monitoring: a fraction of production queries (typically 1–5%) is routed through the automated evaluation pipeline and scored in near-real-time. Control charts over the quality metrics detect degradation before it becomes visible to the majority of users. Alert thresholds are set at 2σ below the historical mean for each metric.
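The one-sided control limit can be expressed in a few lines; this sketch uses the sample mean and standard deviation of the monitored metric's history.

```python
import statistics

def quality_alert(history: list[float], current: float,
                  n_sigma: float = 2.0) -> bool:
    """Flag when the current sampled quality score falls more than
    n_sigma below the historical mean (a one-sided control limit)."""
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)
    return current < mean - n_sigma * sigma
```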
Embedding drift detection catches a subtler failure mode: the distribution of production queries shifting away from the distribution on which the system was evaluated. If production query embeddings begin drifting from the evaluation set embeddings, it is a signal that the golden set is becoming unrepresentative and must be updated. We monitor query embedding drift daily using Maximum Mean Discrepancy (MMD) between a rolling window of production queries and the golden set.
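A sketch of squared MMD under an RBF kernel, assuming query embeddings arrive as NumPy arrays; the bandwidth `gamma` is an assumption and would normally be tuned (e.g. via the median heuristic) before setting a drift-alert threshold.

```python
import numpy as np

def mmd_rbf(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Squared Maximum Mean Discrepancy between two embedding samples
    under an RBF kernel; grows as the two distributions diverge."""
    def kernel(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Pairwise squared Euclidean distances via broadcasting
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return float(kernel(x, x).mean() + kernel(y, y).mean()
                 - 2 * kernel(x, y).mean())
```

In the monitoring loop, `x` would be a rolling window of production query embeddings and `y` the golden-set query embeddings; a sustained rise in the statistic signals that the golden set needs refreshing.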
10 Empirical Results
We compare two production RAG deployments — one with a mature evaluation pipeline implemented from day one, one where evaluation was retrofitted six months after launch — on several operational dimensions over a 12-month period.
| Operational Metric | Evaluation-First Team | Evaluation-Retrofit Team |
|---|---|---|
| Production incidents caused by silent quality regression | 1 | 9 |
| Mean time to detect quality degradation | 4 hours | 6.2 days |
| Deployment frequency (releases/month) | 8.2 | 3.1 |
| Time spent on manual QA per release | 2.4 hours | 18 hours |
| Final system quality (RAGAS composite, month 12) | 0.83 | 0.71 |
11 Recommendations
- Build the golden set before writing production code. The discipline of defining what “correct” means before building forces clarity about the task that prevents expensive course corrections later.
- Separate retrieval evaluation from generation evaluation. Track retrieval metrics (Recall@K, NDCG@K) and generation metrics (Faithfulness, Answer Relevance) independently. A regression in retrieval quality masked by a generation improvement is a pipeline that is becoming fragile.
- Calibrate your automated metrics against human labels. Before trusting an LLM-as-judge or an NLI-based faithfulness score, measure its correlation with human judgements on 100+ examples. An uncalibrated automated metric optimising for the wrong thing is worse than no metric.
- Treat every production failure as a golden set addition. Every incident where the system produced a wrong or harmful output should generate at least one new golden set entry. This ensures the evaluation set evolves with the system’s actual failure modes.
- Gate deployments on quality metrics, not just engineering tests. A system that passes unit tests but regresses on RAGAS Faithfulness by more than 2σ should not ship. This requires buy-in from product and engineering leadership and must be established as policy before the first deployment.