Data Vault 2.0 (DV2) was designed for enterprise data warehousing: scalable, auditable, agile in the face of changing source systems. Machine learning has different requirements — it needs point-in-time consistent features, training-serving skew elimination, and reproducible dataset construction from historical data. At first glance these methodologies appear to conflict. In practice, we have found that a well-implemented Data Vault serves as an exceptionally strong foundation for enterprise ML data infrastructure — but only when the integration is designed deliberately, with ML requirements in mind at the architectural level, not as an afterthought. This paper documents our operational experience applying DV2 to ML data platforms across multiple enterprise deployments, describing the patterns that work, the anti-patterns to avoid, and the specific engineering required to deliver ML-ready features from a Data Vault foundation on both Databricks and Snowflake.
The ML Data Problem in the Enterprise
Enterprise machine learning fails more often at the data layer than at the model layer. This is not a new observation — the "garbage in, garbage out" principle is as old as computing — but the specific ways in which enterprise data infrastructure causes ML projects to fail are not always well understood.
The most common failure mode is training-serving skew: the features used to train a model are computed differently, or from different data sources, than the features served at inference time. The model is evaluated on training data and appears to perform well; in production, the feature distribution is subtly different and performance degrades in ways that are difficult to attribute and fix.
The second common failure mode is temporal leakage: training data inadvertently includes information from the future relative to the label. A credit risk model trained with account balance as of the day the loan was labelled as default — rather than as of the day the loan was originated — will dramatically overestimate the predictive value of account balance and produce a model that fails systematically in production.
The third is reproducibility failure: the inability to recreate a training dataset exactly as it was at a specific historical point in time, which makes debugging, auditing, and regulatory compliance impossible.
Data Vault 2.0, implemented correctly with ML requirements in mind, eliminates all three failure modes: it provides full historical data availability for point-in-time consistent feature construction, a single source of truth that eliminates training-serving skew, and complete data lineage for reproducibility and governance.
Data Vault 2.0: A Brief Primer
2.1 Hubs, Links, and Satellites
Data Vault 2.0 organises a data warehouse into three entity types. Hubs contain the business keys — the natural identifiers that represent core business entities (CustomerID, ProductSKU, AccountNumber). Each Hub contains only the business key, a hash key surrogate, a load date, and a record source. Hubs never change after insertion. Links represent relationships between Hubs — the association between a Customer and an Account, or between an Order and a Product. Like Hubs, Links are insert-only. Satellites contain all descriptive attributes of Hubs or Links — the attributes that change over time, with full history tracked via load date and end date columns. Satellites are where all the interesting data lives and where ML features are derived.
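The three entity types can be made concrete with a minimal sketch. This is an illustrative schema only — table and column names (`hub_customer`, `sat_customer_profile`, etc.) are hypothetical and simplified, and sqlite3 stands in for the actual warehouse. Note that an attribute change is recorded as a new Satellite row, never an update:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: business key + hash surrogate + load metadata; insert-only.
CREATE TABLE hub_customer (
    customer_hash_key TEXT PRIMARY KEY,
    customer_id       TEXT NOT NULL,   -- the business key
    load_dts          TEXT NOT NULL,
    rec_src           TEXT NOT NULL
);
-- Satellite: descriptive attributes with full history keyed by load_dts.
CREATE TABLE sat_customer_profile (
    customer_hash_key TEXT NOT NULL,
    load_dts          TEXT NOT NULL,
    customer_segment  TEXT,
    credit_score      INTEGER,
    rec_src           TEXT NOT NULL,
    PRIMARY KEY (customer_hash_key, load_dts)
);
""")
conn.execute("INSERT INTO hub_customer VALUES ('h1', 'C-1001', '2024-01-01', 'CRM')")
# A changed credit score is a NEW row; the old value is never overwritten.
conn.execute("INSERT INTO sat_customer_profile VALUES ('h1', '2024-01-01', 'retail', 640, 'CRM')")
conn.execute("INSERT INTO sat_customer_profile VALUES ('h1', '2024-03-01', 'retail', 685, 'CRM')")
history = conn.execute(
    "SELECT load_dts, credit_score FROM sat_customer_profile ORDER BY load_dts"
).fetchall()
```

A Link table (omitted here for brevity) would follow the same insert-only pattern, holding only the two Hub hash keys plus load metadata.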
2.2 The Case for DV2 in ML
The properties of Data Vault that make it valuable for BI also make it valuable for ML, but the specific benefits manifest differently. The insert-only, full-history tracking of Satellites is the most critical property: it means that for any business entity, the complete history of every attribute value with its exact load timestamp is preserved. This is the raw material for point-in-time consistent feature construction — something that star schemas and third normal form databases cannot provide without time-travel extensions.
The strict separation of business keys (Hubs) from relationships (Links) from attributes (Satellites) also produces a data model that is easy to audit, document, and trace — properties that are increasingly required by ML governance and regulatory compliance frameworks.
Point-in-Time Consistency for ML
3.1 The Data Leakage Problem
Data leakage occurs when a training dataset contains information that would not have been available at the time a prediction would be made in production. It is the single most common cause of the "model works in training, fails in production" pattern that plagues enterprise ML projects. Leakage can be dramatic and obvious or subtle and insidious.
The correct approach is to construct features using only data that was available at or before a defined as-of date for each training example. In a traditional data warehouse, this is difficult: data is typically stored in its current state, and historical values require either slowly changing dimension Type 2 tables or separate historical snapshots. In Data Vault, the Satellite's load and end date history provides this capability natively.
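The as-of rule can be stated in a few lines. The following sketch (names and values hypothetical) selects, from a Satellite's history, the most recent value loaded on or before the as-of date — so a training example labelled in April can never see the June value:

```python
from datetime import date

# Hypothetical satellite history for one entity: (load_date, value) rows.
credit_score_history = [
    (date(2024, 1, 1), 640),
    (date(2024, 3, 1), 685),
    (date(2024, 6, 1), 710),
]

def value_as_of(history, as_of):
    """Return the most recent value loaded on or before as_of (None if none)."""
    eligible = [(d, v) for d, v in history if d <= as_of]
    return max(eligible)[1] if eligible else None

# A training example labelled on 2024-04-15 gets the March value (685),
# never the later June value -- temporal leakage is excluded by construction.
feature = value_as_of(credit_score_history, date(2024, 4, 15))
```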
3.2 PIT Tables in Data Vault
Point-in-Time (PIT) tables are an auxiliary structure in Data Vault that pre-computes, for a given set of Satellites and a given set of as-of dates, the correct Satellite record for each Hub entity at each date. Rather than performing expensive temporal joins at feature computation time, PIT tables materialise the join keys once and allow feature queries to be simple lookups.
```sql
-- Build PIT table: for each customer × as_of_date,
-- find the most recent Satellite record on or before that date.
-- MAX(CASE WHEN ...) is used rather than the FILTER (WHERE ...) clause
-- so the statement runs on both Databricks and Snowflake.
CREATE TABLE pit_customer_profile AS
SELECT
    dt.customer_hash_key,
    dt.as_of_date,
    -- Profile satellite: last record on or before as_of_date
    MAX(CASE WHEN sp.load_dts <= dt.as_of_date THEN sp.load_dts END)
        AS sat_profile_ldts,
    -- Transaction satellite: last record on or before as_of_date
    MAX(CASE WHEN st.load_dts <= dt.as_of_date THEN st.load_dts END)
        AS sat_transactions_ldts
FROM training_dates dt
LEFT JOIN sat_customer_profile sp
    ON sp.customer_hash_key = dt.customer_hash_key
LEFT JOIN sat_customer_transactions st
    ON st.customer_hash_key = dt.customer_hash_key
GROUP BY dt.customer_hash_key, dt.as_of_date;
```
With a PIT table, feature extraction reduces to a set of simple joins using the pre-computed Satellite record pointers. Complex temporal range joins disappear from feature computation, and temporal consistency is guaranteed by construction.
Feature Engineering from DV2 Satellites
Satellites are the richest source of ML features in a Data Vault. The temporal history in each Satellite enables not just point-in-time lookups of attribute values, but also the computation of temporal features: changes over time, trends, rates, and window aggregates — all computed consistently relative to the as-of date.
| Feature Type | DV2 Source | Computation Pattern | Example |
|---|---|---|---|
| Static attribute | Hub + Satellite | PIT lookup → direct column | customer_segment as of prediction date |
| Point-in-time numeric | Satellite | PIT lookup → numeric column | credit_score as of 30 days prior to event |
| Temporal delta | Satellite (2 snapshots) | PIT at t₀, PIT at t₋ₙ → subtraction | change in balance over 90 days |
| Rolling window aggregate | Link + Satellite | PIT range → GROUP BY + aggregate | sum of transactions in prior 30 days |
| Count of state changes | Satellite history | COUNT DISTINCT load_dts in window | number of address changes in 12 months |
| Relationship feature | Link + multiple Sats | Graph traversal → aggregate | average credit score of household members |
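Two of the table's temporal patterns — the rolling window aggregate and the count of state changes — can be sketched in a few lines. The data and window sizes below are hypothetical; both computations are anchored to the as-of date, never to "now":

```python
from datetime import date, timedelta

# Hypothetical transaction-satellite rows for one customer: (load_date, amount).
transactions = [
    (date(2024, 5, 3), 120.0),
    (date(2024, 5, 20), 80.0),
    (date(2024, 6, 10), 200.0),
    (date(2024, 6, 28), 50.0),
]

def rolling_sum(rows, as_of, window_days):
    """Sum of amounts in the half-open window (as_of - window_days, as_of]."""
    start = as_of - timedelta(days=window_days)
    return sum(amt for d, amt in rows if start < d <= as_of)

def change_count(load_dates, as_of, window_days):
    """Number of satellite loads (state changes) in the trailing window."""
    start = as_of - timedelta(days=window_days)
    return sum(1 for d in load_dates if start < d <= as_of)

# 30-day transaction sum as of 2024-06-30: only the June rows qualify.
prior_30d_sum = rolling_sum(transactions, date(2024, 6, 30), 30)
```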
Training-Serving Skew Elimination
Training-serving skew is eliminated when training features and serving features are computed from the same code on the same data source. Data Vault supports this through two mechanisms. First, the Satellite-based feature computation logic is encapsulated as SQL views or dbt models that can be called from both the training pipeline (with a historical as-of date) and the serving pipeline (with CURRENT_TIMESTAMP as the as-of date). The code is identical; only the temporal parameter changes.
Second, the strict insert-only, append-only nature of Satellites means that the data available at serving time is a superset of the data available at training time — there are no in-place updates that could cause historical records to change after they were used for training. This immutability guarantee is a property of Data Vault that star schemas and most ODS designs cannot provide.
Feature logic should exist exactly once. Training and serving should call the same function or query with different temporal parameters. Any deviation — a copy of the feature logic maintained separately for training, serving, or monitoring — will eventually produce skew. This is an architectural constraint that must be enforced at the platform level, not just as a team convention.
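One way to enforce the single-definition constraint at the platform level is a feature registry: the logic is defined once, and both pipelines resolve it by name, differing only in the temporal argument. This is a minimal sketch — the registry mechanism and all names (`FEATURES`, `balance_latest`) are hypothetical, not a specific feature-store API:

```python
from datetime import datetime, timezone

FEATURES = {}

def feature(name):
    """Decorator registering a feature function under a single canonical name."""
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register

@feature("balance_latest")
def balance_latest(history, as_of):
    """Most recent balance loaded on or before as_of."""
    eligible = [(ts, bal) for ts, bal in history if ts <= as_of]
    return max(eligible)[1] if eligible else None

history = [
    (datetime(2024, 1, 1, tzinfo=timezone.utc), 1000.0),
    (datetime(2024, 6, 1, tzinfo=timezone.utc), 1500.0),
]

# Training pipeline: historical as-of date per training example.
train_value = FEATURES["balance_latest"](history, datetime(2024, 3, 1, tzinfo=timezone.utc))
# Serving pipeline: same code path, as_of = current timestamp.
serve_value = FEATURES["balance_latest"](history, datetime.now(timezone.utc))
```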
Data Vault on the Lakehouse
6.1 Databricks Implementation
Databricks (Delta Lake) is the most natural platform for Data Vault ML implementations. Delta Lake's ACID transactions and time-travel capability complement Data Vault's insert-only model: even though DV2 does not rely on time-travel for correctness, Delta's time-travel provides an additional safety net and simplifies point-in-time query patterns for some feature types.
The recommended storage format is Delta tables with Liquid Clustering on the hash_key and load_dts columns, enabling efficient temporal queries without the overhead of traditional partition pruning. AutoOptimize and AutoCompact should be enabled on all Satellite tables to manage the small-file problem that arises from frequent insert-only loads.
6.2 Snowflake Implementation
Snowflake's Time Travel feature provides an alternative mechanism for point-in-time feature extraction from Satellite tables, and Streams enable incremental Satellite loading patterns. The primary implementation challenge on Snowflake is clustering: Satellite tables queried by temporal range require clustering on (hash_key, load_dts) to avoid full table scans.
For large Satellite tables, we recommend explicit clustering keys and materialised PIT tables refreshed on a schedule aligned with the training pipeline cadence, rather than computing PIT joins dynamically at training time.
Feature Store Integration
The Data Vault is not a feature store. It is the upstream data source from which features are derived. A feature store sits between the Data Vault and the ML training and serving pipeline, providing online low-latency feature retrieval, offline batch feature retrieval for training, feature versioning and metadata management, and training-serving consistency guarantees enforced at the platform level.
Data Quality for ML Reliability
A model is only as reliable as the data it was trained on and the data it receives at serving time. Data quality failures at training time produce biased or underperforming models; data quality failures at serving time produce unreliable predictions with no warning. A comprehensive data quality framework must operate at both stages.
We implement a three-layer data quality framework. Schema validation catches structural failures: missing columns, wrong data types, unexpected nulls in non-nullable columns. Statistical validation catches distributional failures: feature means or standard deviations outside expected ranges, unusual null rates, cardinality anomalies. Temporal validation catches freshness failures: Satellites that have not received new records within the expected interval, which may indicate a broken ingestion pipeline.
For each validation type, we define three severity levels: error, warning, and info. Great Expectations and dbt tests are our preferred implementation tools; the validation results are stored in the Satellite metadata layer for lineage tracking.
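The three layers and severity levels can be sketched as plain check functions. Thresholds and names here are illustrative assumptions, not the production values used with Great Expectations or dbt tests; each check returns a (severity, message) pair:

```python
from statistics import mean

def schema_check(rows, required_cols):
    """Layer 1: structural validation -- missing columns are an error."""
    missing = required_cols - set(rows[0]) if rows else required_cols
    return ("error", f"missing columns: {missing}") if missing else ("info", "schema ok")

def statistical_check(values, expected_mean, tolerance):
    """Layer 2: distributional validation -- mean drift beyond tolerance is a warning."""
    drift = abs(mean(values) - expected_mean) / expected_mean
    if drift > tolerance:
        return ("warning", f"mean drift {drift:.1%} exceeds {tolerance:.0%}")
    return ("info", "distribution ok")

def freshness_check(latest_load_age_hours, max_age_hours):
    """Layer 3: temporal validation -- a stale satellite suggests broken ingestion."""
    if latest_load_age_hours > max_age_hours:
        return ("error", "satellite stale: possible broken ingestion pipeline")
    return ("info", "fresh")
```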
Data Lineage for Model Governance
Regulatory compliance in financial services, healthcare, and other regulated industries increasingly requires ML models to be fully explainable in terms of their training data provenance. "Which data was used to train this model?" is a question that must be answerable with complete specificity: which source systems, which tables, which records, as of which dates.
Data Vault's built-in rec_src and load_dts metadata on every record provides the foundation for complete data lineage from source system to trained model. When combined with MLflow experiment tracking, the lineage chain from raw source data to deployed model weight is fully documented and auditable.
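One concrete way to bind the two ends of that chain — a hypothetical sketch, not the authors' tooling — is to fingerprint the exact set of Satellite records behind a training run, identified by their (rec_src, hash_key, load_dts) metadata, and log the hash alongside the model (e.g. as an MLflow run tag):

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Order-independent SHA-256 over (rec_src, hash_key, load_dts) record identities."""
    canonical = sorted(records)          # order must not affect the fingerprint
    payload = json.dumps(canonical).encode()
    return hashlib.sha256(payload).hexdigest()

# Hypothetical record identities contributing to one training dataset.
records = [
    ("CRM", "h1", "2024-01-01"),
    ("CORE_BANKING", "h1", "2024-03-01"),
]
fp = dataset_fingerprint(records)
```

Because the Satellites are insert-only, re-deriving the same record set later must reproduce the same fingerprint — any mismatch signals that the training dataset cannot be what the lineage record claims.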
Empirical Results and Lessons Learned
We report results from a credit risk ML platform migration from a traditional star schema to a Data Vault 2.0 + Feature Store architecture on Databricks Delta Lake, serving a portfolio of 12 production ML models.
| Metric | Before (Star Schema) | After (DV2 + Feature Store) |
|---|---|---|
| Training-serving skew incidents (per quarter) | 6–9 | 0 |
| Time to reproduce historical training dataset | 3–14 days | 45 minutes |
| Data leakage incidents detected post-deployment | 3 in 18 months | 0 |
| Time to add a new feature to production | 4–6 weeks | 3–5 days |
| Model AUC improvement (avg, 12 models) | — | +0.034 |
| Data pipeline SLA compliance | 81% | 99.2% |
The +0.034 average AUC improvement across 12 models is attributable to the elimination of data leakage and the ability to construct more precise temporal features: features that had previously been approximate are now constructed precisely and consistently from PIT tables.
Anti-Patterns and What Not to Do
- Querying Satellites directly from training code without PIT tables. This is the most common mistake. It appears to work, but it silently produces temporally inconsistent feature sets because the join between Satellites of different update frequencies is not temporally anchored. Always build PIT tables as the temporal join layer.
- Using Delta time-travel as a substitute for Satellite history. Delta time-travel is a convenient feature but it is limited and its semantics differ from DV2's explicit load_dts tracking. Use Delta time-travel as a debugging and safety tool, not as the primary mechanism for point-in-time feature extraction.
- Putting feature engineering logic in the Raw Vault. The Raw Vault should load data exactly as received from source systems with no business transformations. Feature engineering belongs in the Business Vault or the Feature Store layer.
- Skipping the Feature Store and serving directly from the Data Vault. The Data Vault is optimised for historical query and batch workloads. Online inference at low latency requires a separate online store with a different access pattern.
- Failing to track feature computation code alongside data lineage. Knowing which Satellite records contributed to a training dataset is necessary but not sufficient for reproducibility. You must also know which version of the feature computation code transformed those records into model features.