Your client wants a customer churn prediction model. The data exists (CRM records, usage logs, support tickets, billing history), but it is scattered across 7 systems with no unified view. The CRM uses customer IDs. The billing system uses account numbers. The usage logs use session tokens. Connecting these datasets requires 3 months of data engineering before you can even begin model development. The AI project is actually a data engineering project with an AI project attached.
Data warehouse design for AI projects is fundamentally different from traditional analytics warehousing. Analytics warehouses are optimized for reporting: aggregated views, summary tables, and dashboards. AI warehouses must support feature engineering, training data generation, and inference data pipelines, which require row-level detail, historical snapshots, and high-throughput data access patterns that analytics warehouses often do not provide.
Why Data Architecture Matters for AI
Feature Engineering Demands
Machine learning models consume features: numerical, categorical, and temporal attributes derived from raw data. Feature engineering requires access to granular, historical data with complex transformations such as time-windowed aggregations, cross-entity joins, and sequential pattern extraction. A data warehouse designed for analytics (pre-aggregated, summarized) may lack the granularity needed for feature engineering.
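As a concrete sketch of a time-windowed feature, the snippet below computes a 7-day event count and amount sum from raw event rows. The `(entity_id, timestamp, amount)` tuple shape and the `window_features` helper are illustrative assumptions, not a prescribed schema:

```python
from datetime import datetime, timedelta

def window_features(events, entity_id, as_of, days=7):
    """Aggregate one entity's events in the `days` before `as_of`.

    `events` is raw event-level data: (entity_id, timestamp, amount)
    tuples, i.e. the granular grain an analytics-only warehouse often
    discards in favor of daily summaries.
    """
    start = as_of - timedelta(days=days)
    window = [amt for eid, ts, amt in events
              if eid == entity_id and start <= ts < as_of]
    return {"event_count": len(window), "amount_sum": sum(window)}

events = [
    ("c1", datetime(2024, 3, 1), 10.0),
    ("c1", datetime(2024, 3, 5), 25.0),
    ("c1", datetime(2024, 2, 1), 99.0),   # outside the 7-day window
    ("c2", datetime(2024, 3, 4), 7.0),    # different entity
]
feats = window_features(events, "c1", as_of=datetime(2024, 3, 6))
print(feats)  # {'event_count': 2, 'amount_sum': 35.0}
```

Note that the same computation is impossible against a table that only stores daily totals: the window boundary falls mid-history and is relative to each prediction point.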
Training Data Requirements
Model training requires large volumes of labeled historical data with consistent schemas and quality. The data warehouse must support efficient extraction of training datasets (potentially millions of rows spanning months or years of history) with the ability to reproduce exactly the same dataset for experiment reproducibility.
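One lightweight way to verify that a rerun reproduced the same dataset is to fingerprint each extract. The `dataset_fingerprint` helper below is a hypothetical sketch that hashes sorted rows, so two extractions of the same data match regardless of row order:

```python
import hashlib

def dataset_fingerprint(rows):
    """Hash a training extract so a rerun can be verified byte-for-byte.

    `rows` is any iterable of row tuples; sorting first makes the
    fingerprint independent of extraction order.
    """
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()

extract_a = [("c1", 2, 35.0), ("c2", 1, 7.0)]
extract_b = [("c2", 1, 7.0), ("c1", 2, 35.0)]  # same rows, different order
assert dataset_fingerprint(extract_a) == dataset_fingerprint(extract_b)
```

Storing the fingerprint next to the experiment record turns "did we get the same data?" into a one-line comparison.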
Inference Data Pipelines
Production AI systems need data pipelines that deliver fresh features for real-time or batch inference. The warehouse must support efficient extraction of current feature values for production scoring, often with latency requirements measured in seconds or minutes.
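A minimal guard for inference pipelines is a freshness check before scoring. The `fresh_features` helper and the 5-minute threshold below are illustrative assumptions, not a standard interface:

```python
from datetime import datetime, timedelta

def fresh_features(feature_row, now, max_age=timedelta(minutes=5)):
    """Return feature values only if they are recent enough to score with."""
    computed_at, values = feature_row
    if now - computed_at > max_age:
        raise ValueError(f"features stale: computed at {computed_at}")
    return values

row = (datetime(2024, 3, 6, 12, 0), {"event_count": 2})
print(fresh_features(row, now=datetime(2024, 3, 6, 12, 3)))  # {'event_count': 2}
```

Failing loudly on stale features is usually safer than silently scoring on outdated data.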
Design Principles for AI-Ready Data Warehouses
Preserve Granularity
Raw data retention: Retain raw, event-level data alongside aggregated views. Analytics may need only daily summaries, but AI needs individual events: each click, each transaction, each support interaction.
Immutable event logs: Store events as immutable records with timestamps. This supports temporal feature engineering ("What was the customer's behavior in the 7 days before churn?"), which requires reconstructing historical states.
Slowly changing dimensions: Implement slowly changing dimension (SCD) patterns for entity attributes that change over time, such as customer tier, subscription level, and geographic region. AI models need to know what the attribute was at the time of each event, not just the current value.
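The SCD idea can be sketched with type-2-style rows carrying validity intervals. The `attribute_as_of` lookup below is a simplified illustration; a real warehouse would express this as a point-in-time join rather than a Python loop:

```python
from datetime import datetime

# SCD type-2 rows: (valid_from, valid_to, value); valid_to=None means current.
tier_history = [
    (datetime(2023, 1, 1), datetime(2023, 9, 1), "basic"),
    (datetime(2023, 9, 1), None, "premium"),
]

def attribute_as_of(history, as_of):
    """Return the attribute value that was in effect at `as_of`."""
    for valid_from, valid_to, value in history:
        if valid_from <= as_of and (valid_to is None or as_of < valid_to):
            return value
    return None  # entity did not exist yet

print(attribute_as_of(tier_history, datetime(2023, 6, 1)))  # basic
print(attribute_as_of(tier_history, datetime(2024, 1, 1)))  # premium
```

A warehouse that overwrites the tier in place would label every historical event with "premium", leaking future information into training features.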
Entity Resolution
Unified identity: Create a unified entity resolution layer that maps identifiers across systems. Customer ID in CRM, account number in billing, and session token in usage logs should all resolve to a single entity. Without entity resolution, cross-system feature engineering is impossible.
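Once pairwise identifier matches exist, collapsing them into unified entities is essentially a union-find problem. The sketch below assumes the match pairs are already known (producing them is the hard part of entity resolution), and the identifier strings are hypothetical:

```python
def resolve_entities(links):
    """Collapse cross-system identifier links into unified entities.

    `links` pairs identifiers observed to belong to the same customer,
    e.g. a CRM ID matched to a billing account number. Uses union-find
    with path compression.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    # Map every identifier to its canonical representative.
    return {x: find(x) for x in parent}

links = [("crm:1042", "bill:A-77"), ("bill:A-77", "sess:f3c9")]
entity = resolve_entities(links)
assert entity["crm:1042"] == entity["sess:f3c9"]  # one unified entity
```

With this mapping in place, features computed from billing and usage data can be joined onto the same customer row.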
Identity graph: For complex entity relationships (customers who belong to companies that have divisions), build an identity graph that captures the relational structure. AI models that leverage organizational context need this relational information.
Schema Design
Star schema with extensions: Use a star schema (facts and dimensions) as the base design, extended with additional patterns for AI workloads. Facts contain events; dimensions contain entity attributes. Add feature tables that store pre-computed features for efficient access.
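One way to picture the layout is a fact table, a dimension table, and a feature table side by side. The sqlite3 sketch below uses illustrative table and column names, not a recommended production schema:

```python
import sqlite3

# In-memory sketch: an event fact table, a customer dimension, and a
# feature table keyed by entity and computation time.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    tier TEXT,
    region TEXT
);
CREATE TABLE fact_event (
    event_id INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    event_ts TEXT NOT NULL,      -- immutable event timestamp
    event_type TEXT NOT NULL,
    amount REAL
);
CREATE TABLE feature_customer (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    computed_at TEXT NOT NULL,   -- when the features were derived
    event_count_7d INTEGER,
    amount_sum_7d REAL,
    PRIMARY KEY (customer_key, computed_at)
);
""")
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['dim_customer', 'fact_event', 'feature_customer']
```

Keying the feature table on `(customer_key, computed_at)` is what lets training pipelines retrieve the features as they stood at a past point in time.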
Time-series optimization: For time-series data (usage metrics, sensor data, financial data), use time-series-optimized storage and partitioning. Time-series queries are the most common access pattern for feature engineering.
Versioned schemas: Plan for schema evolution. Data sources change, new features are added, and data quality requirements evolve. The warehouse design should accommodate schema changes without breaking existing pipelines.
Data Quality Infrastructure
Quality checks at ingestion: Validate data quality at the point of ingestion: schema validation, null checks, range checks, and referential integrity. Catching quality issues early prevents them from propagating into training data and model predictions.
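Ingestion-time checks can start as simple as a per-record validator. The schema format below (`field -> (type, required, bounds)`) is an illustrative convention, not a standard:

```python
def validate_record(record, schema):
    """Run schema, null, and range checks on one incoming record.

    `schema` maps field name -> (type, required, (min, max) or None).
    Returns a list of issues; an empty list means the record passes.
    """
    issues = []
    for field, (ftype, required, bounds) in schema.items():
        value = record.get(field)
        if value is None:
            if required:
                issues.append(f"{field}: missing required value")
            continue
        if not isinstance(value, ftype):
            issues.append(f"{field}: expected {ftype.__name__}")
        elif bounds is not None:
            lo, hi = bounds
            if not (lo <= value <= hi):
                issues.append(f"{field}: {value} outside [{lo}, {hi}]")
    return issues

schema = {
    "customer_id": (str, True, None),
    "amount": (float, True, (0.0, 1e6)),
}
assert validate_record({"customer_id": "c1", "amount": 35.0}, schema) == []
print(validate_record({"customer_id": "c1", "amount": -5.0}, schema))
```

Rejected records would typically land in a quarantine table for inspection rather than being silently dropped.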
Data quality monitoring: Continuously monitor data quality metrics (completeness, freshness, accuracy, and consistency) and alert when they degrade below thresholds.
Data lineage: Track data lineage from source through transformation to consumption. When a model produces unexpected results, lineage enables you to trace back to the source data and identify where issues were introduced.
Technology Selection
Cloud Data Warehouses
Snowflake: Strong separation of storage and compute, excellent for variable workloads. Good support for semi-structured data. Integration with ML tools through Snowpark.
BigQuery: Google Cloud's serverless warehouse. Strong for large-scale analytics and ML workloads. Native integration with Vertex AI for model training.
Redshift: AWS's data warehouse. Deep integration with SageMaker and the AWS ML ecosystem. Good for teams already invested in AWS.
Databricks Lakehouse: Combines data warehouse and data lake capabilities. Native support for ML workloads, experiment tracking, and feature stores. Strong choice when ML is a primary workload.
Key Selection Criteria for AI
ML integration: How well does the warehouse integrate with ML tools (training frameworks, feature stores, and model serving)? Native integrations reduce pipeline complexity.
Data access patterns: Does the warehouse support the access patterns AI workloads need? These include large sequential reads for training, point lookups for inference, and complex joins for feature engineering.
Cost at scale: AI workloads are data-intensive. Evaluate cost at the data volumes and query patterns your AI projects will generate, not just initial proof-of-concept scale.
Real-time capability: If your AI systems require real-time or near-real-time features, evaluate the warehouse's streaming data ingestion and low-latency query capabilities.
Implementation for AI Projects
Phase 1: Data Integration
Source inventory: Catalog all data sources relevant to the AI initiative (databases, APIs, file exports, streaming sources). Document data format, update frequency, volume, and access method for each source.
ETL/ELT pipelines: Build data ingestion pipelines that extract data from sources, transform it into the warehouse schema, and load it into the warehouse. Use modern ELT approaches (load raw data, then transform in the warehouse) for flexibility.
Incremental loading: Implement incremental data loading rather than full refreshes. AI workloads generate large data volumes; full refreshes become impractical at scale.
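Incremental loading typically tracks a watermark (the highest timestamp loaded so far) and pulls only newer rows on each run. The sketch below uses a plain list as a stand-in warehouse:

```python
def incremental_load(source_rows, warehouse, watermark):
    """Load only rows newer than the last successful watermark.

    `source_rows` is (timestamp, payload) tuples; returns the new
    watermark so the next run picks up where this one stopped.
    """
    new_rows = [r for r in source_rows if r[0] > watermark]
    warehouse.extend(new_rows)
    return max((r[0] for r in new_rows), default=watermark)

warehouse = []
source = [(1, "a"), (2, "b"), (3, "c")]
wm = incremental_load(source, warehouse, watermark=0)   # first run loads all 3
source.append((4, "d"))
wm = incremental_load(source, warehouse, watermark=wm)  # second run loads only (4, "d")
print(len(warehouse), wm)  # 4 4
```

Persisting the watermark only after a load commits is what makes the pipeline safe to rerun after a failure.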
Phase 2: Feature Store Integration
Feature computation: Build transformation pipelines that compute ML features from raw warehouse data. These features power both training and inference.
Feature store: Implement a feature store, a centralized repository of pre-computed features that serves both offline (training) and online (inference) workloads. Feature stores prevent duplicate feature computation and ensure consistency between training and production.
Feature versioning: Version features so that training data can be reproduced exactly. When you retrain a model, you need the exact features that were available at training time, not the current feature definitions.
Phase 3: Training Data Management
Training dataset generation: Build pipelines that generate labeled training datasets from warehouse data. These pipelines should be parameterizable (date ranges, entity filters, feature selections) to support experiment iteration.
Dataset versioning: Version training datasets alongside model versions. When a model performs unexpectedly, the ability to examine the exact training data used is essential for debugging.
Data splits: Implement consistent train/validation/test split logic that prevents data leakage across splits. Time-based splits are often appropriate for AI applications: train on historical data, validate on recent data, and test on the most recent data.
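A time-based split can be as simple as partitioning rows by timestamp cutoffs. The `time_split` helper and its cutoff values below are a minimal illustrative sketch:

```python
def time_split(rows, train_end, valid_end):
    """Split (timestamp, features, label) rows by time to prevent leakage:
    train on history, validate on the next window, test on the newest data."""
    train = [r for r in rows if r[0] < train_end]
    valid = [r for r in rows if train_end <= r[0] < valid_end]
    test = [r for r in rows if r[0] >= valid_end]
    return train, valid, test

rows = [(m, f"x{m}", m % 2) for m in range(1, 13)]  # one row per month
train, valid, test = time_split(rows, train_end=9, valid_end=11)
print(len(train), len(valid), len(test))  # 8 2 2
```

Because every training timestamp precedes every validation timestamp, no future information can leak backward into training, which a random shuffle would not guarantee for temporal data.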
Client Delivery Considerations
Discovery Questions
During project discovery, ask specific data architecture questions.
"Where does your data live today?" "How is data integrated across systems?" "What is the granularity of your historical data?" "How fresh does data need to be for the AI system?" "Who owns the data infrastructure and what are the change processes?"
Setting Expectations
Data engineering timeline: Be transparent that data engineering is often the longest phase of an AI project. Clients expect to start building models immediately; educate them that solid data foundations are a prerequisite.
Ongoing maintenance: Data warehouses require ongoing maintenance (monitoring, optimization, schema evolution, and quality management). Include maintenance considerations in project scoping and post-delivery support agreements.
The data warehouse is the foundation that determines whether your AI project succeeds or fails. Models built on solid data foundations are accurate, maintainable, and reliable. Models built on ad-hoc data pipelines are fragile, unreproducible, and eventually abandoned. Invest the time to design the data architecture properly, and every subsequent AI initiative built on that foundation becomes faster and more reliable.