ETL Architecture
The pipeline that extracts data from source systems, transforms it, and loads it into a data warehouse.
Last updated: 2026-04-12
Overview
ETL (Extract, Transform, Load) — or ELT in modern cloud architectures — is the process that moves data from operational source systems into a data warehouse. The ETL layer is the engineering backbone of a dimensional model: it enforces data quality, resolves keys, applies SCD logic, and populates both dimension and fact tables.
Data Areas / Zones
Large enterprise DWH architectures typically define distinct data areas:
| Layer | Name | Purpose |
|---|---|---|
| LZA | Landing Zone Area | Raw extracted data, unmodified — exact copy of source |
| IDA | Integration / Staging Area | Cleansed, deduplicated, integrated across sources |
| ARA | Access / Reporting Area | Dimensional model (star schemas) — what BI tools query |
| DSA | Data Science Area | Data science / ML feature store, looser structure |
| DMA | Data Mart Area | Subject-specific views or aggregates derived from ARA |
Not all architectures use all layers. Simpler systems may use just staging → DWH → data mart.
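One lightweight way to make the zones concrete is a storage-path convention per layer and load date. The layer codes below mirror the table; the source name, path shape, and function are illustrative assumptions, not a standard.

```python
from datetime import date

# Layer codes from the data-areas table above.
LAYERS = {"LZA", "IDA", "ARA", "DSA", "DMA"}

def zone_path(layer: str, source: str, load_date: date) -> str:
    """Build a storage path like 'lza/crm/2026/04/12' for one daily load.
    Path convention is hypothetical; real lakes/warehouses vary."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer.lower()}/{source}/{load_date:%Y/%m/%d}"

print(zone_path("LZA", "crm", date(2026, 4, 12)))  # lza/crm/2026/04/12
```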
ETL vs ELT
| | ETL | ELT |
|---|---|---|
| Transform location | Dedicated ETL server/tool | Inside the DWH (SQL/dbt) |
| Typical era | Pre-cloud | Cloud (Snowflake, BigQuery, Redshift) |
| Tools | Informatica, SSIS, Talend | dbt, Spark, stored procedures |
Extract
- Pull from operational databases (CDC or full extract), APIs, flat files, streaming events
- Land in LZA unchanged — preserves auditability
- Delta detection: timestamp-based, CDC log-based, or checksum comparison
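The timestamp-based variant of delta detection can be sketched as a watermark comparison: keep the highest modification timestamp seen so far, and extract only rows newer than it. The row shape and `updated_at` column name are assumptions for illustration; a real extract would query the source database.

```python
from datetime import datetime

def extract_delta(rows: list, last_watermark: datetime):
    """Return rows modified after the watermark, plus the new watermark."""
    delta = [r for r in rows if r["updated_at"] > last_watermark]
    # Advance the watermark to the newest row we extracted (or keep it).
    new_watermark = max((r["updated_at"] for r in delta), default=last_watermark)
    return delta, new_watermark

source = [
    {"id": 1, "updated_at": datetime(2026, 4, 11, 8, 0)},
    {"id": 2, "updated_at": datetime(2026, 4, 12, 9, 30)},
]
delta, wm = extract_delta(source, datetime(2026, 4, 12, 0, 0))
print([r["id"] for r in delta], wm)  # [2] 2026-04-12 09:30:00
```

Timestamp watermarks are the simplest strategy but miss hard deletes, which is why log-based CDC is preferred when the source supports it.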
Transform
Key transformations for dimensional modeling:
- Cleansing: Null handling, deduplication, format standardization
- Surrogate key assignment: Lookup existing dimension keys; mint new surrogate keys for new entities
- SCD processing: Detect attribute changes; apply the appropriate SCD type logic (close old row, insert new, etc.)
- Conforming: Map source codes to enterprise-standard values (e.g., “US”, “USA”, “United States” → “United States”)
- Grain validation: Ensure fact records match the declared grain
- Late-arriving data: Handle facts that arrive after their reporting period has closed
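The surrogate-key step above can be sketched as a lookup against the dimension by natural (business) key, minting a new key when the entity is unseen. The in-memory dict and the `customer_nk`/`customer_sk` names are illustrative assumptions; a real pipeline looks up against the dimension table itself.

```python
def assign_surrogate_keys(facts, dim_lookup, next_key):
    """Replace natural keys on incoming fact rows with dimension surrogate
    keys, minting a new key when the entity has not been seen before."""
    out = []
    for fact in facts:
        nk = fact["customer_nk"]
        if nk not in dim_lookup:           # new entity: mint a surrogate key
            dim_lookup[nk] = next_key
            next_key += 1
        out.append({**fact, "customer_sk": dim_lookup[nk]})
    return out, next_key

dim = {"C-100": 1}                          # existing dimension: one known key
facts = [{"customer_nk": "C-100", "amount": 50},
         {"customer_nk": "C-200", "amount": 75}]
keyed, next_key = assign_surrogate_keys(facts, dim, next_key=2)
print([f["customer_sk"] for f in keyed])    # [1, 2]
```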
Load
- Dimension tables first: Fact tables reference dimension surrogate keys, so dimensions must exist before facts
- Upsert patterns: For Type 2 SCD, the load must close the existing current row and insert the new version atomically (one transaction), so readers never see a key with no current row
- Partition switching: Common bulk-load pattern for large fact tables in SQL Server/Synapse
- Aggregate maintenance: Refresh aggregate fact tables after base fact table loads
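The Type 2 upsert pattern can be sketched in miniature: close the current version by setting its end date, then append the new version with an open-ended end date. The list-of-dicts dimension and the `9999-12-31` high-date convention are illustrative assumptions; in a real warehouse both statements run inside one transaction.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # open-ended "valid to" convention

def scd2_upsert(dim_rows, nk, new_attrs, load_date):
    """Close the current row for a natural key and insert a new version."""
    for row in dim_rows:
        if row["nk"] == nk and row["valid_to"] == HIGH_DATE:
            if row["attrs"] == new_attrs:
                return dim_rows            # no change: nothing to do
            row["valid_to"] = load_date    # step 1: close the old version
    dim_rows.append({"nk": nk, "attrs": new_attrs,   # step 2: insert new version
                     "valid_from": load_date, "valid_to": HIGH_DATE})
    return dim_rows

dim = [{"nk": "C-100", "attrs": {"city": "Oslo"},
        "valid_from": date(2025, 1, 1), "valid_to": HIGH_DATE}]
scd2_upsert(dim, "C-100", {"city": "Bergen"}, date(2026, 4, 12))
print(len(dim), dim[0]["valid_to"])  # 2 2026-04-12
```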
Data Quality Dimensions
Quality is measured across multiple axes (ISO 8000 / DAMA):
- Completeness: Are required values present?
- Accuracy: Does data reflect reality?
- Consistency: Is the same entity represented the same way across systems?
- Timeliness: Is data available when needed?
- Validity: Does data conform to defined formats and ranges?
- Uniqueness: Is each real-world entity represented by exactly one record (no duplicates)?
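Two of these axes are easy to turn into automated checks, sketched below under assumed column names (`email`, `id` are illustrative): completeness as the fraction of non-null required values, and uniqueness as the set of natural keys appearing more than once.

```python
from collections import Counter

def completeness(rows, col):
    """Fraction of rows where the required column is present and non-null."""
    return sum(r.get(col) is not None for r in rows) / len(rows)

def uniqueness_violations(rows, key):
    """Natural keys that appear more than once (each breaks uniqueness)."""
    counts = Counter(r[key] for r in rows)
    return [k for k, n in counts.items() if n > 1]

rows = [{"id": 1, "email": "a@x.com"},
        {"id": 2, "email": None},
        {"id": 2, "email": "b@x.com"}]
print(completeness(rows, "email"))        # 2 of 3 rows non-null
print(uniqueness_violations(rows, "id"))  # [2]
```

Checks like these typically run after the transform step and either block the load or log the violations for stewardship.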
Connections
- dimensional-modeling — the target model ETL populates
- slowly-changing-dimensions — the most complex transform logic in ETL
- star-schema — the schema structure ETL loads into
- data-warehouse — the system ETL feeds
Sources
- Dimensional Modeling — Joakim Dalby — added 2026-04-12