ETL Architecture

The pipeline that extracts data from source systems, transforms it, and loads it into a data warehouse.

Last updated: 2026-04-12

Overview

ETL (Extract, Transform, Load), or ELT in modern cloud architectures, is the process that moves data from operational source systems into a data warehouse. The ETL layer is the engineering backbone of a dimensional model: it enforces data quality, resolves keys, applies SCD logic, and populates both dimension and fact tables.

Data Areas / Zones

Large enterprise DWH architectures typically define distinct data areas:

| Layer | Name | Purpose |
|-------|------|---------|
| LZA | Landing Zone Area | Raw extracted data, unmodified; exact copy of source |
| IDA | Integration / Staging Area | Cleansed, deduplicated, integrated across sources |
| ARA | Access / Reporting Area | Dimensional model (star schemas); what BI tools query |
| DSA | Data Science Area | Data science / ML feature store, looser structure |
| DMA | Data Mart Area | Subject-specific views or aggregates derived from ARA |

Not all architectures use all layers. Simpler systems may use just staging → DWH → data mart.

ETL vs ELT

|                    | ETL                       | ELT                                   |
|--------------------|---------------------------|---------------------------------------|
| Transform location | Dedicated ETL server/tool | Inside the DWH (SQL/dbt)              |
| Typical era        | Pre-cloud                 | Cloud (Snowflake, BigQuery, Redshift) |
| Tools              | Informatica, SSIS, Talend | dbt, Spark, stored procedures         |

Extract

  • Pull from operational databases (CDC or full extract), APIs, flat files, streaming events
  • Land in LZA unchanged — preserves auditability
  • Delta detection: timestamp-based, CDC log-based, or checksum comparison
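The timestamp-based variant of delta detection can be sketched as a query builder that resumes from a stored high-water mark. The table and column names here (`orders`, `updated_at`) are illustrative, not from the source:

```python
def build_delta_query(table: str, watermark_col: str, last_watermark: str) -> str:
    """Build an extract query that pulls only rows changed since the last
    successful run (the stored high-water mark). Ordering by the watermark
    column lets the loader record the new maximum as the next watermark.
    String interpolation is fine for a sketch; use bind parameters in production.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_col} > '{last_watermark}' "
        f"ORDER BY {watermark_col}"
    )

# Hypothetical usage: the next run starts where the previous one left off
sql = build_delta_query("orders", "updated_at", "2026-04-11T00:00:00")
```

Timestamp-based detection misses hard deletes and rows whose timestamp is not maintained by the source; CDC log-based detection avoids both problems at the cost of more infrastructure.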

Transform

Key transformations for dimensional modeling:

  • Cleansing: Null handling, deduplication, format standardization
  • Surrogate key assignment: Lookup existing dimension keys; mint new surrogate keys for new entities
  • SCD processing: Detect attribute changes; apply the appropriate SCD type logic (close old row, insert new, etc.)
  • Conforming: Map source codes to enterprise-standard values (e.g., “US”, “USA”, “United States” → “United States”)
  • Grain validation: Ensure fact records match the declared grain
  • Late-arriving data: Handle facts that arrive after the reporting period has closed
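The surrogate-key and SCD steps can be combined into one decision routine per incoming record. A minimal Type 2 sketch, assuming dimension rows are plain dicts with `valid_from`/`valid_to`/`is_current` housekeeping columns (all names here are illustrative):

```python
HIGH_DATE = "9999-12-31"  # conventional "open-ended" end date

def process_scd2_row(current, incoming, tracked, load_date, next_sk):
    """Decide what to do with one incoming dimension record (SCD Type 2).

    current  -- the active dimension row (dict), or None if the entity is new
    incoming -- cleansed source record (natural key + attributes)
    tracked  -- attribute names whose changes trigger a new version
    Returns a list of (action, row) pairs: 'insert' mints a new surrogate
    key; 'close' expires the previously active row.
    """
    if current is None:
        # New entity: mint a surrogate key and insert an open-ended row
        row = dict(incoming, surrogate_key=next_sk,
                   valid_from=load_date, valid_to=HIGH_DATE, is_current=True)
        return [("insert", row)]
    if all(current[a] == incoming[a] for a in tracked):
        return []  # no tracked attribute changed: nothing to do
    # Tracked change: close the old version, open a new one
    closed = dict(current, valid_to=load_date, is_current=False)
    opened = dict(incoming, surrogate_key=next_sk,
                  valid_from=load_date, valid_to=HIGH_DATE, is_current=True)
    return [("close", closed), ("insert", opened)]
```

A real implementation would also hash the tracked attributes for faster comparison and handle Type 1 (overwrite) columns separately.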

Load

  • Dimension tables first: Fact tables reference dimension surrogate keys, so dimensions must exist before facts
  • Upsert patterns: For Type 2 SCD, must close existing row and insert new one atomically
  • Partition switching: Common bulk-load pattern for large fact tables in SQL Server/Synapse
  • Aggregate maintenance: Refresh aggregate fact tables after base fact table loads
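The atomic close-and-insert for a Type 2 upsert can be sketched by wrapping both statements in one transaction. A minimal example using SQLite (table and column names are illustrative; a real DWH would use its own MERGE or equivalent):

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS dim_customer (
    customer_sk INTEGER PRIMARY KEY,  -- surrogate key, auto-assigned here
    customer_id TEXT NOT NULL,        -- natural key from the source system
    city        TEXT,
    valid_from  TEXT NOT NULL,
    valid_to    TEXT NOT NULL,
    is_current  INTEGER NOT NULL
)
"""

def scd2_upsert(conn: sqlite3.Connection, customer_id: str,
                city: str, load_date: str) -> None:
    """Close the currently active row (if any) and insert its replacement.
    The `with conn:` block commits both statements together or rolls both
    back, so readers never see two 'current' rows, or none at all."""
    with conn:
        conn.execute(
            "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (load_date, customer_id),
        )
        conn.execute(
            "INSERT INTO dim_customer "
            "(customer_id, city, valid_from, valid_to, is_current) "
            "VALUES (?, ?, ?, '9999-12-31', 1)",
            (customer_id, city, load_date),
        )
```

Because the UPDATE matches nothing for a brand-new natural key, the same routine handles both new entities and changed ones.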

Data Quality Dimensions

Quality is measured across multiple axes (ISO 8000 / DAMA):

  • Completeness: Are required values present?
  • Accuracy: Does data reflect reality?
  • Consistency: Is the same entity represented the same way across systems?
  • Timeliness: Is data available when needed?
  • Validity: Does data conform to defined formats and ranges?
  • Uniqueness: Are there duplicate records?
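Completeness and uniqueness are the easiest axes to score mechanically. A minimal sketch that profiles a batch of records (field names are illustrative):

```python
def quality_profile(rows, required, key):
    """Score two quality axes on a batch of records (dicts):
    completeness -- fraction of rows where every required field is populated
    uniqueness   -- fraction of distinct values in the key column
    """
    total = len(rows)
    if total == 0:
        return {"completeness": 1.0, "uniqueness": 1.0}
    complete = sum(
        1 for r in rows
        if all(r.get(c) not in (None, "") for c in required)
    )
    distinct_keys = len({r[key] for r in rows})
    return {
        "completeness": complete / total,
        "uniqueness": distinct_keys / total,
    }
```

Accuracy and timeliness, by contrast, need an external reference (ground truth, SLAs) and cannot be computed from the batch alone.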
