ETL Architecture

The pipeline that extracts data from source systems, transforms it, and loads it into a data warehouse.

Last updated: 2026-04-12

Overview

ETL (Extract, Transform, Load), or ELT in modern cloud architectures, is the process that moves data from operational source systems into a data warehouse. The ETL layer is the engineering backbone of a dimensional model: it enforces data quality, resolves keys, applies SCD logic, and populates both dimension and fact tables.

Data Areas / Zones

Large enterprise DWH architectures typically define distinct data areas:

| Layer | Name | Purpose |
|-------|------|---------|
| LZA | Landing Zone Area | Raw extracted data, unmodified; exact copy of source |
| IDA | Integration / Staging Area | Cleansed, deduplicated, integrated across sources |
| ARA | Access / Reporting Area | Dimensional model (star schemas); what BI tools query |
| DSA | Data Science Area | Data science / ML feature store, looser structure |
| DMA | Data Mart Area | Subject-specific views or aggregates derived from ARA |

Not all architectures use all layers. Simpler systems may use just staging → DWH → data mart.

ETL vs ELT

|                    | ETL                       | ELT                                   |
|--------------------|---------------------------|---------------------------------------|
| Transform location | Dedicated ETL server/tool | Inside the DWH (SQL/dbt)              |
| Typical era        | Pre-cloud                 | Cloud (Snowflake, BigQuery, Redshift) |
| Tools              | Informatica, SSIS, Talend | dbt, Spark, stored procedures         |

Extract

  • Pull from operational databases (CDC or full extract), APIs, flat files, streaming events
  • Land in LZA unchanged — preserves auditability
  • Delta detection: timestamp-based, CDC log-based, or checksum comparison
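The timestamp-based variant of delta detection can be sketched as a query builder that resumes from a stored high-water mark. The table and column names here (`orders`, `updated_at`) are illustrative, not from the source:

```python
def build_delta_query(table: str, watermark_col: str, last_watermark: str) -> str:
    """Build an extract query that pulls only rows changed since the last
    successful run (the stored high-water mark). Ordering by the watermark
    column lets the loader record the new maximum as the next watermark.
    String interpolation is fine for a sketch; use bind parameters in production.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_col} > '{last_watermark}' "
        f"ORDER BY {watermark_col}"
    )

# Hypothetical usage: the next run starts where the previous one left off
sql = build_delta_query("orders", "updated_at", "2026-04-11T00:00:00")
```

Timestamp-based detection misses hard deletes and rows whose timestamp is not maintained by the source; CDC log-based detection avoids both problems at the cost of more infrastructure.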

Transform

Key transformations for dimensional modeling:

  • Cleansing: Null handling, deduplication, format standardization
  • Surrogate key assignment: Lookup existing dimension keys; mint new surrogate keys for new entities
  • SCD processing: Detect attribute changes; apply the appropriate SCD type logic (close old row, insert new, etc.)
  • Conforming: Map source codes to enterprise-standard values (e.g., “US”, “USA”, “United States” → “United States”)
  • Grain validation: Ensure fact records match the declared grain
  • Late-arriving data: Handle facts that arrive after the reporting period has closed
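The surrogate-key and SCD steps can be combined into one decision routine per incoming record. A minimal Type 2 sketch, assuming dimension rows are plain dicts with `valid_from`/`valid_to`/`is_current` housekeeping columns (all names here are illustrative):

```python
HIGH_DATE = "9999-12-31"  # conventional "open-ended" end date

def process_scd2_row(current, incoming, tracked, load_date, next_sk):
    """Decide what to do with one incoming dimension record (SCD Type 2).

    current  -- the active dimension row (dict), or None if the entity is new
    incoming -- cleansed source record (natural key + attributes)
    tracked  -- attribute names whose changes trigger a new version
    Returns a list of (action, row) pairs: 'insert' mints a new surrogate
    key; 'close' expires the previously active row.
    """
    if current is None:
        # New entity: mint a surrogate key and insert an open-ended row
        row = dict(incoming, surrogate_key=next_sk,
                   valid_from=load_date, valid_to=HIGH_DATE, is_current=True)
        return [("insert", row)]
    if all(current[a] == incoming[a] for a in tracked):
        return []  # no tracked attribute changed: nothing to do
    # Tracked change: close the old version, open a new one
    closed = dict(current, valid_to=load_date, is_current=False)
    opened = dict(incoming, surrogate_key=next_sk,
                  valid_from=load_date, valid_to=HIGH_DATE, is_current=True)
    return [("close", closed), ("insert", opened)]
```

A real implementation would also hash the tracked attributes for faster comparison and handle Type 1 (overwrite) columns separately.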

Load

  • Dimension tables first: Fact tables reference dimension surrogate keys, so dimensions must exist before facts
  • Upsert patterns: For Type 2 SCD, must close existing row and insert new one atomically
  • Partition switching: Common bulk-load pattern for large fact tables in SQL Server/Synapse
  • Aggregate maintenance: Refresh aggregate fact tables after base fact table loads
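The atomic close-and-insert for a Type 2 upsert can be sketched by wrapping both statements in one transaction. A minimal example using SQLite (table and column names are illustrative; a real DWH would use its own MERGE or equivalent):

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS dim_customer (
    customer_sk INTEGER PRIMARY KEY,  -- surrogate key, auto-assigned here
    customer_id TEXT NOT NULL,        -- natural key from the source system
    city        TEXT,
    valid_from  TEXT NOT NULL,
    valid_to    TEXT NOT NULL,
    is_current  INTEGER NOT NULL
)
"""

def scd2_upsert(conn: sqlite3.Connection, customer_id: str,
                city: str, load_date: str) -> None:
    """Close the currently active row (if any) and insert its replacement.
    The `with conn:` block commits both statements together or rolls both
    back, so readers never see two 'current' rows, or none at all."""
    with conn:
        conn.execute(
            "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (load_date, customer_id),
        )
        conn.execute(
            "INSERT INTO dim_customer "
            "(customer_id, city, valid_from, valid_to, is_current) "
            "VALUES (?, ?, ?, '9999-12-31', 1)",
            (customer_id, city, load_date),
        )
```

Because the UPDATE matches nothing for a brand-new natural key, the same routine handles both new entities and changed ones.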

Data Quality Dimensions

Quality is measured across multiple axes (ISO 8000 / DAMA):

  • Completeness: Are required values present?
  • Accuracy: Does data reflect reality?
  • Consistency: Is the same entity represented the same way across systems?
  • Timeliness: Is data available when needed?
  • Validity: Does data conform to defined formats and ranges?
  • Uniqueness: Are there duplicate records?
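Completeness and uniqueness are the easiest axes to score mechanically. A minimal sketch that profiles a batch of records (field names are illustrative):

```python
def quality_profile(rows, required, key):
    """Score two quality axes on a batch of records (dicts):
    completeness -- fraction of rows where every required field is populated
    uniqueness   -- fraction of distinct values in the key column
    """
    total = len(rows)
    if total == 0:
        return {"completeness": 1.0, "uniqueness": 1.0}
    complete = sum(
        1 for r in rows
        if all(r.get(c) not in (None, "") for c in required)
    )
    distinct_keys = len({r[key] for r in rows})
    return {
        "completeness": complete / total,
        "uniqueness": distinct_keys / total,
    }
```

Accuracy and timeliness, by contrast, need an external reference (ground truth, SLAs) and cannot be computed from the batch alone.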
