VedhaAI Research · Technical Note No. 3

Evaluation Pipelines for Enterprise LLM Systems

VedhaAI Engineering

VedhaAI Inc. · Toronto, Ontario, Canada

Abstract

Deploying LLMs in production without a systematic evaluation framework is engineering malpractice. Without measurement, teams cannot distinguish improvements from regressions, cannot detect silent quality degradation over time, and cannot make principled decisions about model updates. This paper describes a comprehensive evaluation pipeline architecture for enterprise LLM systems, covering: golden dataset construction and maintenance, automated reference-based and reference-free scoring metrics, LLM-as-judge evaluation patterns, hallucination and faithfulness assessment, human evaluation workflows, and integration with CI/CD pipelines for continuous quality assurance. We draw on operational experience from multiple production deployments and provide concrete implementation guidance, benchmark comparisons, and failure mode analysis. We demonstrate that teams with mature evaluation pipelines ship higher-quality systems faster and with greater confidence than teams that rely on ad hoc testing.

§ 1

Why Evaluation is Hard for LLMs

Evaluating traditional ML systems is relatively straightforward: you hold out a test set, run inference, and compute a well-defined metric (accuracy, AUC, F1, RMSE). The metric is objective, reproducible, and correlates meaningfully with business outcomes. LLM evaluation has none of these properties by default.

The fundamental challenge is that LLM outputs are open-ended and high-dimensional. Two responses that would receive identical ratings from a human expert — both correct, both clearly expressed, both appropriately cited — may differ in every surface feature: vocabulary, sentence structure, length, formatting. Metrics that compare surface form (BLEU, ROUGE) systematically undervalue paraphrase and overvalue verbatim repetition, making them poor proxies for actual quality.

A second challenge is task heterogeneity. A single production RAG system may handle factual questions, multi-step reasoning queries, summarisation requests, and structured data extraction — each requiring different evaluation criteria and metrics. No single metric captures quality across all task types.

The Fundamental Evaluation Principle

Evaluation quality determines the speed of system improvement. A team with a high-quality evaluation pipeline can safely iterate 5–10× faster than a team without one, because every change can be validated quantitatively before deployment.

§ 2

The Evaluation Stack: A Taxonomy

A complete LLM evaluation stack operates at four levels, each measuring different properties and operating at different cadences.

Level       | What it measures                                               | Method                                                          | Cadence
Unit        | Individual component quality (retriever, reranker, generator) | Component-level golden sets, automated metrics                  | Every commit
Integration | End-to-end pipeline quality on defined task distribution      | RAGAS, LLM-as-judge, reference-based metrics                    | Pre-deployment, weekly
Regression  | Quality changes relative to previous version or baseline      | A/B comparison on golden set, statistical significance testing  | Every release candidate
Production  | Live quality signals, distribution drift, edge case discovery | Human sampling, implicit feedback, embedding drift detection    | Continuous
Table 1. Four-level LLM evaluation taxonomy. Teams frequently implement production monitoring before unit testing, which is backwards: production monitoring catches failures after they affect users, while unit evaluation prevents them from shipping.
§ 3

Golden Dataset Construction

3.1 What Belongs in a Golden Set

A golden dataset is a curated collection of (input, expected output) pairs that represents the task distribution the system will face in production, with sufficient coverage of edge cases, difficult queries, and failure modes that have been observed or anticipated. The minimum viable golden set for a production RAG system has three components:

  • Core competency queries (60%): questions that a well-functioning system should answer correctly with high confidence. These establish the baseline.
  • Edge cases and adversarial queries (25%): ambiguous questions, out-of-domain queries, questions with no answer in the corpus, queries designed to elicit hallucination. These are the queries that distinguish a robust system from a brittle one.
  • Regression tests (15%): specific (query, failure) pairs from production incidents. Every production failure should generate at least one regression test.
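
The split above can be checked mechanically when the golden set is loaded. A minimal sketch follows, using our own illustrative category tags and the target percentages given above:

Python · Golden Set Composition Check
from collections import Counter

TARGET_MIX = {"core": 0.60, "edge": 0.25, "regression": 0.15}

def check_composition(categories: list[str], tolerance: float = 0.05) -> None:
    # `categories` holds one tag per golden example, e.g. "core", "edge", "regression"
    counts = Counter(categories)
    total = len(categories)
    for category, target in TARGET_MIX.items():
        actual = counts[category] / total
        assert abs(actual - target) <= tolerance, (
            f"{category}: {actual:.0%} of golden set, target {target:.0%}"
        )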

3.2 Annotation Guidelines

Annotation quality determines evaluation quality. Vague annotation guidelines produce inconsistent labels that reduce the signal-to-noise ratio of the golden set. For enterprise RAG systems, we recommend annotating each golden pair with: a ground-truth answer, a list of acceptable paraphrases, a list of required facts that must be present in a correct answer, a list of forbidden claims (common hallucinations or near-miss errors), and a relevance label for each candidate passage in the corpus.
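
For concreteness, a golden-set record carrying these annotations might look like the sketch below. The field names are our own illustration, not a standard schema.

Python · Golden Set Record (Illustrative Schema)
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    query: str
    ground_truth: str                                   # canonical correct answer
    acceptable_paraphrases: list[str] = field(default_factory=list)
    required_facts: list[str] = field(default_factory=list)    # must appear in a correct answer
    forbidden_claims: list[str] = field(default_factory=list)  # known hallucinations and near-misses
    passage_relevance: dict[str, int] = field(default_factory=dict)  # passage_id -> graded relevance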

The relevance labels for corpus passages are particularly valuable: they enable retrieval evaluation to be conducted independently of generation evaluation, allowing the two subsystems to be improved separately.

3.3 Dataset Versioning and Drift

A golden dataset is a living artefact that must be versioned and curated over time. As the underlying corpus changes, as new query patterns emerge in production, and as the task definition evolves, the golden set must be updated. We maintain golden sets in version-controlled repositories alongside the system code, with every golden set change logged with a justification. Ratchet metrics — where each release must equal or exceed the previous release's score — are enforced as CI/CD gates.

§ 4

Reference-Based Automated Metrics

Reference-based metrics compare model outputs to one or more reference answers. They are fast, cheap, and reproducible — and they have well-documented limitations that must be understood before applying them.

Metric           | What it measures                               | Strengths                                                   | Limitations
ROUGE-L          | Longest common subsequence overlap             | Fast, interpretable, good for extractive tasks              | Penalises paraphrase; poor for abstractive generation
BERTScore        | Contextual embedding similarity                | Handles paraphrase; correlates better with human judgement  | Computationally heavier; model-dependent
Exact Match      | Character-level identity                       | Unambiguous for structured outputs                          | Too strict for free-form generation
F1 (token-level) | Token overlap between prediction and reference | Standard SQuAD metric; interpretable                        | Surface-form bias; misses semantic equivalence
BLEURT           | Learned similarity to human ratings            | Best correlation with human judgement                       | Requires reference; domain-sensitive
Table 2. Reference-based metric comparison. For most enterprise LLM evaluation tasks, BERTScore or BLEURT provides better signal than ROUGE-L, despite higher computational cost. Use ROUGE-L only where token overlap is meaningful (extractive summarisation, structured information extraction).
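
As a sketch of how these metrics are computed in practice, the snippet below uses the Hugging Face evaluate library for ROUGE-L and BERTScore; exact return formats vary somewhat between library versions.

Python · Computing ROUGE-L and BERTScore
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["The policy covers water damage caused by burst pipes."]
references = ["Burst-pipe water damage is covered under the policy."]

# ROUGE-L: longest common subsequence overlap (surface form)
print(rouge.compute(predictions=predictions, references=references)["rougeL"])

# BERTScore: contextual embedding similarity (paraphrase-tolerant)
print(bertscore.compute(predictions=predictions, references=references, lang="en")["f1"])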
§ 5

Reference-Free Evaluation

5.1 LLM-as-Judge

Reference-free evaluation using a strong LLM as a judge has become the most practical approach for evaluating open-ended generation in enterprise settings[1]. Rather than comparing outputs to a reference answer, the judge LLM receives the query, the context, and the response and produces a quality rating on a defined rubric.

The key to reliable LLM-as-judge evaluation is a precise, detailed rubric that leaves minimal room for interpretation. The rubric should define exactly what constitutes a score of 1, 2, 3, 4, and 5 on each dimension (factual accuracy, answer completeness, citation quality, clarity) with concrete examples. Without a precise rubric, LLM judges exhibit significant positional bias, verbosity bias, and self-preference bias that corrupt the evaluation signal.

LLM Judge Bias Mitigation

Mitigation strategies for known biases in LLM-as-judge evaluation: (a) use a judge model from a different family than the generator being evaluated; (b) always use chain-of-thought reasoning before the final score; (c) run pairwise comparisons rather than absolute scoring where possible; (d) calibrate against a set of human-labelled examples before deploying the judge.
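
A minimal sketch of mitigations (b) and (c): a chain-of-thought pairwise comparison run in both seat orders to counter positional bias. The judge callable is a hypothetical wrapper around whichever judge model is used; nothing here is a fixed API.

Python · Pairwise LLM-as-Judge Sketch
from typing import Callable

JUDGE_PROMPT = """You are comparing two answers to the same query.

Rubric: the better answer is factually accurate with respect to the provided
context, complete, correctly cited, and clearly written. Length alone is NOT
a merit.

Query: {query}
Context: {context}
Answer A: {answer_a}
Answer B: {answer_b}

Reason step by step about each answer against the rubric. Then output exactly
one final line: "WINNER: A", "WINNER: B", or "WINNER: TIE".
"""

def pairwise_judge(judge: Callable[[str], str], query: str,
                   context: str, a: str, b: str) -> str:
    # `judge` is a hypothetical wrapper that sends the prompt to a judge model
    # (ideally from a different family than the generator) and returns its
    # final "WINNER: X" line
    first = judge(JUDGE_PROMPT.format(query=query, context=context, answer_a=a, answer_b=b))
    second = judge(JUDGE_PROMPT.format(query=query, context=context, answer_a=b, answer_b=a))
    # The same response must win from both seat positions; otherwise score a tie
    if first.strip().endswith("A") and second.strip().endswith("B"):
        return "A"
    if first.strip().endswith("B") and second.strip().endswith("A"):
        return "B"
    return "TIE"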

5.2 RAGAS Framework

RAGAS[2] (Retrieval-Augmented Generation Assessment) provides a reference-free evaluation framework specifically designed for RAG systems. It computes four metrics: Faithfulness (are all claims in the response supported by the retrieved context?), Answer Relevance (how well does the response address the query?), Context Precision (are the retrieved passages relevant to the query?), and Context Recall (do the retrieved passages cover the expected answer?).

RAGAS Score = harmonic_mean(Faithfulness, Answer_Relevance, Context_Precision, Context_Recall)
Individual dimensions are more informative than the composite score for diagnosing specific failure modes.

RAGAS is particularly valuable for diagnosing where in the pipeline quality is being lost. A system with high Context Precision but low Faithfulness indicates that retrieval is working but generation is hallucinating. A system with low Context Recall indicates that the retriever is not surfacing relevant passages, regardless of generation quality.
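
The composite is a plain harmonic mean of the four dimension scores, which the standard library computes directly. The dimension values below are illustrative; note how the harmonic mean is dragged down by the weakest dimension.

Python · RAGAS Composite from Dimension Scores
from statistics import harmonic_mean

# Illustrative per-dimension scores from a RAGAS run
scores = {
    "faithfulness": 0.91,
    "answer_relevance": 0.88,
    "context_precision": 0.84,
    "context_recall": 0.62,  # low recall points at the retriever, not the generator
}

composite = harmonic_mean(scores.values())  # punishes the weakest dimension
print(f"RAGAS composite: {composite:.2f}")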

§ 6

Hallucination and Faithfulness Assessment

Hallucination is the production failure mode with the highest business consequence in enterprise RAG deployments. A customer support system that confidently provides incorrect policy information, a legal system that cites non-existent cases, or a financial system that fabricates regulatory requirements — these failures erode trust and create liability in ways that latency or availability failures do not.

We distinguish between two types of hallucination in the RAG setting. Intrinsic hallucination occurs when the model generates content that contradicts the retrieved passages. Extrinsic hallucination occurs when the model generates content that is neither supported nor contradicted by the retrieved passages — content generated from parametric memory rather than the provided context.

The most reliable automated hallucination detection approach is NLI-based faithfulness checking: each sentence in the generated response is evaluated against each retrieved passage using a natural language inference model. Sentences that are classified as contradiction relative to any retrieved passage are flagged as intrinsic hallucinations; sentences classified as neutral relative to all retrieved passages are flagged as potential extrinsic hallucinations.

Python · Sentence-Level Faithfulness Scoring
import re

from transformers import pipeline

nli = pipeline(
    "text-classification",
    model="cross-encoder/nli-deberta-v3-base"
)

def split_sentences(text: str) -> list[str]:
    # Naive splitter on sentence-final punctuation; swap in a proper
    # sentence segmenter (e.g. spaCy or nltk) for production use
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def faithfulness_score(response: str, contexts: list[str]) -> dict:
    sentences = split_sentences(response)
    results = {}

    for sent in sentences:
        # Check the sentence against each retrieved context:
        # context as NLI premise, generated sentence as hypothesis
        labels = []
        for ctx in contexts:
            r = nli([{"text": ctx, "text_pair": sent}])[0]
            labels.append(r["label"].lower())  # normalise label casing across NLI models

        # Contradiction by ANY context = intrinsic hallucination flag
        if "contradiction" in labels:
            results[sent] = "hallucinated"
        # Entailment by NO context = potential extrinsic hallucination
        elif "entailment" not in labels:
            results[sent] = "unsupported"
        else:
            results[sent] = "grounded"

    return results
§ 7

Human Evaluation Workflows

Automated metrics are necessary but not sufficient. Human evaluation catches systematic failures that automated metrics miss, provides ground truth for calibrating automated metrics, and ensures the system is actually solving the business problem it was designed for. The challenge is that human evaluation is expensive, slow, and can be inconsistent without careful workflow design.

We recommend a tiered human evaluation approach. Expert evaluation by domain specialists is conducted on a small sample (30–50 examples) before each major release, with annotators using a detailed rubric and spending 10–15 minutes per example. Crowd evaluation using non-specialist annotators is conducted on larger samples (200–500 examples) for preference ranking tasks where domain expertise is not required. User evaluation captures implicit signals from real users through thumbs-up/thumbs-down buttons, query reformulation patterns, and session abandonment rates.

Inter-annotator agreement should be measured on every batch of human evaluation. A Cohen's κ below 0.6 indicates the annotation task is underspecified — the rubric is ambiguous and the human evaluation data is unreliable regardless of sample size.
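
Agreement can be computed per batch with scikit-learn; the labels below are illustrative rubric scores from two annotators on the same ten examples.

Python · Inter-Annotator Agreement Check
from sklearn.metrics import cohen_kappa_score

# Rubric scores from two annotators on the same batch (illustrative)
annotator_a = [5, 4, 4, 2, 5, 3, 1, 4, 5, 2]
annotator_b = [5, 4, 3, 2, 4, 3, 1, 4, 5, 3]

kappa = cohen_kappa_score(annotator_a, annotator_b)
if kappa < 0.6:
    print(f"kappa = {kappa:.2f}: rubric is underspecified; revise before trusting the labels")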

§ 8

CI/CD Integration and Regression Gating

The evaluation pipeline must be integrated into the CI/CD workflow as a blocking gate for production deployments. A system that passes unit tests and integration tests but degrades on the LLM evaluation golden set should not be promoted to production — regardless of the absence of traditional engineering failures.

Code Push → Unit Tests (component eval) → Integration (RAGAS eval) → Regression (Δ vs baseline) → Staging (human sample) → Production (canary deploy)
Figure 1. LLM CI/CD pipeline with evaluation gates. Each gate is a hard blocker: a pipeline that degrades on regression testing does not advance to staging, regardless of other test results. Canary deployment routes 5% of traffic to the new version before full rollout, with automatic rollback triggered by production quality metric degradation.

Regression gating requires defining a ratchet: the minimum acceptable score on each metric, set equal to or above the current production score. We use a 95% confidence interval around the baseline score as the ratchet, computed via bootstrap resampling over the golden set, to account for metric variance on small evaluation sets.
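
A sketch of that ratchet under the definitions above: resample per-example baseline scores, take the lower bound of the bootstrap 95% confidence interval as the gate, and block any candidate whose mean falls below it. The function shape is our own; only the bootstrap logic follows the text.

Python · Bootstrap Ratchet Gate
import numpy as np

def ratchet_gate(baseline_scores: np.ndarray, candidate_mean: float,
                 n_boot: int = 10_000, seed: int = 0) -> bool:
    """Return True if the candidate release clears the ratchet."""
    rng = np.random.default_rng(seed)
    n = len(baseline_scores)
    # Bootstrap the baseline mean over the golden set
    boot_means = np.array([
        rng.choice(baseline_scores, size=n, replace=True).mean()
        for _ in range(n_boot)
    ])
    lower = np.percentile(boot_means, 2.5)  # lower edge of the 95% CI
    return candidate_mean >= lower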

§ 9

Production Monitoring and Drift Detection

Production monitoring for LLM systems is qualitatively different from monitoring for deterministic software. In addition to traditional infrastructure metrics (latency, error rate, throughput), you must monitor output quality metrics in real time.

The most practical approach is sampling-based monitoring: a fraction of production queries (typically 1–5%) is routed through the automated evaluation pipeline and scored in near-real-time. Control charts over the quality metrics detect degradation before it becomes visible to the majority of users. Alert thresholds are set at 2σ below the historical mean for each metric.
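
The 2σ alert rule reduces to a one-line check over the history of sampled scores; a sketch, assuming scores are kept in a numpy array:

Python · Quality Degradation Alert
import numpy as np

def quality_alert(historical_scores: np.ndarray, current_score: float) -> bool:
    # Flag when the sampled quality metric drops 2 sigma below its historical mean
    mu, sigma = historical_scores.mean(), historical_scores.std(ddof=1)
    return current_score < mu - 2 * sigma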

Embedding drift detection catches a subtler failure mode: the distribution of production queries shifting away from the distribution on which the system was evaluated. If production query embeddings begin drifting from the evaluation set embeddings, it is a signal that the golden set is becoming unrepresentative and must be updated. We monitor query embedding drift daily using Maximum Mean Discrepancy (MMD) between a rolling window of production queries and the golden set.
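
A minimal MMD sketch with an RBF kernel, assuming query embeddings are available as numpy arrays. A production implementation would add a permutation test to turn the raw statistic into an alert threshold.

Python · Query Embedding Drift via MMD
import numpy as np

def rbf_kernel(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    # Pairwise squared Euclidean distances, then the RBF kernel
    sq = (x ** 2).sum(1)[:, None] + (y ** 2).sum(1)[None, :] - 2 * x @ y.T
    return np.exp(-gamma * sq)

def mmd2(prod_emb: np.ndarray, golden_emb: np.ndarray, gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD between production and golden-set queries."""
    k_pp = rbf_kernel(prod_emb, prod_emb, gamma).mean()
    k_gg = rbf_kernel(golden_emb, golden_emb, gamma).mean()
    k_pg = rbf_kernel(prod_emb, golden_emb, gamma).mean()
    return k_pp + k_gg - 2.0 * k_pg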

§ 10

Empirical Results

We compare two production RAG deployments — one with a mature evaluation pipeline implemented from day one, one where evaluation was retrofitted six months after launch — on several operational dimensions over a 12-month period.

Operational Metric                                       | Evaluation-First Team | Evaluation-Retrofit Team
Production incidents caused by silent quality regression | 1                     | 9
Mean time to detect quality degradation                  | 4 hours               | 6.2 days
Deployment frequency (releases/month)                    | 8.2                   | 3.1
Time spent on manual QA per release                      | 2.4 hours             | 18 hours
Final system quality (RAGAS composite, month 12)         | 0.83                  | 0.71
Table 3. Comparison of evaluation-first vs. evaluation-retrofit team operational outcomes over 12 months. The evaluation-first team shipped faster, caught regressions earlier, spent less time on manual QA, and achieved a higher final system quality score — demonstrating that evaluation infrastructure is not overhead, but an enabler of velocity.
§ 11

Recommendations

  • Build the golden set before writing production code. The discipline of defining what “correct” means before building forces clarity about the task that prevents expensive course corrections later.
  • Separate retrieval evaluation from generation evaluation. Track retrieval metrics (Recall@K, NDCG@K) and generation metrics (Faithfulness, Answer Relevance) independently. A regression in retrieval quality masked by a generation improvement is a pipeline that is becoming fragile.
  • Calibrate your automated metrics against human labels. Before trusting an LLM-as-judge or an NLI-based faithfulness score, measure its correlation with human judgements on 100+ examples. An uncalibrated automated metric optimising for the wrong thing is worse than no metric.
  • Treat every production failure as a golden set addition. Every incident where the system produced a wrong or harmful output should generate at least one new golden set entry. This ensures the evaluation set evolves with the system’s actual failure modes.
  • Gate deployments on quality metrics, not just engineering tests. A system that passes unit tests but regresses on RAGAS Faithfulness by more than 2σ should not ship. This requires buy-in from product and engineering leadership and must be established as policy before the first deployment.
References
[1] Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. arXiv:2306.05685
[2] Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217
[3] Guo, Z., et al. (2023). Evaluating Large Language Models: A Comprehensive Survey. arXiv:2310.19736
[4] Honovich, O., et al. (2022). TRUE: Re-evaluating Factual Consistency Evaluation. NAACL 2022. arXiv:2204.04991
[5] Sellam, T., Das, D., & Parikh, A. (2020). BLEURT: Learning Robust Metrics for Text Generation. ACL 2020. arXiv:2004.04696
[6] Liang, P., et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110