SQL Normalization and Star Schema: 1NF, 2NF, 3NF, Dimensional Modeling, and Designing Databases Like a Data Engineer
There are two ways to design a database. Normalization (1NF, 2NF, 3NF) eliminates redundancy — used in transactional systems where data is written frequently. Dimensional modeling (star schema, snowflake schema) optimizes for reading — used in data warehouses where analysts run complex queries.
As a data engineer, you work with BOTH. You read from normalized source databases (OLTP) and write to denormalized star schemas (OLAP/data warehouse). Understanding both designs — and knowing when to apply each — is fundamental.
Think of normalization like organizing a library with strict rules: every book appears exactly once, every author’s info is stored in one place, and changing an author’s name requires updating only one record. A star schema is like a pre-built reading room: the librarian has already gathered everything you might need for “19th century French literature” into one spot — even if some author info is duplicated across reading rooms. The library (normalized) is efficient for adding books. The reading room (star schema) is efficient for research.
Table of Contents
- OLTP vs OLAP — Two Different Worlds
- Normalization — Eliminating Redundancy
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Denormalization — When to Break the Rules
- Dimensional Modeling — Star Schema
- Fact Tables — What Happened
- Dimension Tables — The Context
- Star Schema Design
- Snowflake Schema
- Star vs Snowflake
- Slowly Changing Dimensions in Star Schema
- Building a Star Schema: Complete Example
- Surrogate Keys vs Natural Keys
- Junk Dimensions, Degenerate Dimensions, Role-Playing Dimensions
- How This Maps to Our Blog
- Common Mistakes
- Interview Questions
- Wrapping Up
OLTP vs OLAP — Two Different Worlds
| Feature | OLTP (Transactional) | OLAP (Analytical) |
|---|---|---|
| Purpose | Record transactions | Analyze trends |
| Users | Applications, cashiers, websites | Analysts, dashboards, ML models |
| Queries | Simple (INSERT, UPDATE one row) | Complex (JOIN 5 tables, GROUP BY, SUM) |
| Design | Normalized (3NF) | Denormalized (Star Schema) |
| Redundancy | Eliminated (each fact stored once) | Intentional (faster reads) |
| Example | Banking app, e-commerce checkout | Revenue dashboard, churn analysis |
| Write speed | Optimized (fast inserts) | Not optimized (batch loads) |
| Read speed | Not optimized (many joins) | Optimized (fewer joins) |
As a data engineer, your job is to move data FROM OLTP (normalized) TO OLAP (star schema). That is what our Bronze → Silver → Gold pipelines do.
Normalization — Eliminating Redundancy
The Problem: Unnormalized Data
Orders Table (one big table — no normalization):
| order_id | customer_name | customer_email | customer_city | product_name | product_price | quantity | order_date |
|----------|-------------- |-------------------|---------------|-------------|---------------|----------|------------|
| 1 | Naveen | naveen@email.com | Toronto | Laptop | 999.99 | 1 | 2026-01-15 |
| 2 | Naveen | naveen@email.com | Toronto | Mouse | 29.99 | 2 | 2026-01-15 |
| 3 | Shrey | shrey@email.com | Mumbai | Laptop | 999.99 | 1 | 2026-01-16 |
| 4 | Naveen | naveen.new@email | Toronto | Keyboard | 79.99 | 1 | 2026-02-01 |
Problems: – Naveen’s info is duplicated across 3 rows – Row 4 has a different email for Naveen — which is correct? – Laptop price stored 2 times — update one, forget the other = inconsistency – Deleting order 3 loses all info about Shrey if it was his only order
First Normal Form (1NF) — Atomic Values
Rule: Every cell contains ONE value (not a list or repeating group).
❌ Violates 1NF (repeating group):
| order_id | customer | products |
|----------|----------|-----------------------|
| 1 | Naveen | Laptop, Mouse, Keyboard|
✅ Satisfies 1NF (one value per cell):
| order_id | customer | product |
|----------|----------|----------|
| 1 | Naveen | Laptop |
| 1 | Naveen | Mouse |
| 1 | Naveen | Keyboard |
Real-life analogy: 1NF is like a rule that every shelf in a bookshelf holds ONE book. No stacking books on top of each other. If you have 3 books, you need 3 shelves.
Second Normal Form (2NF) — No Partial Dependencies
Rule: Already in 1NF + every non-key column depends on the ENTIRE primary key (not just part of it).
❌ Violates 2NF (partial dependency):
Primary Key: (order_id, product_id)
| order_id | product_id | product_name | product_price | quantity |
|----------|-----------|-------------|---------------|----------|
| 1 | 101 | Laptop | 999.99 | 1 |
| 1 | 102 | Mouse | 29.99 | 2 |
product_name and product_price depend ONLY on product_id, not on the full key (order_id, product_id).
✅ Satisfies 2NF (separate tables):
Order_Items: (order_id, product_id, quantity)
Products: (product_id, product_name, product_price)
Real-life analogy: 2NF is like saying “the recipe should not contain the oven’s instruction manual.” The recipe (order_items) should only contain what is specific to THIS recipe. General oven instructions (product details) belong in a separate document (products table).
Third Normal Form (3NF) — No Transitive Dependencies
Rule: Already in 2NF + no non-key column depends on another non-key column.
❌ Violates 3NF (transitive dependency):
| emp_id | emp_name | dept_id | dept_name | dept_manager |
|--------|----------|---------|-------------|-------------|
| 1001 | Naveen | 10 | Engineering | Alice |
| 1002 | Shrey | 10 | Engineering | Alice |
dept_name and dept_manager depend on dept_id, not on emp_id.
Change manager? Must update EVERY employee in that department.
✅ Satisfies 3NF (separate tables):
Employees: (emp_id, emp_name, dept_id)
Departments: (dept_id, dept_name, dept_manager)
Real-life analogy: 3NF is like keeping your home address only on your driver’s license, not also on your credit card, gym membership, and library card. If you move, you update ONE place (driver’s license / departments table), not 5 places.
Summary: 1NF → 2NF → 3NF
| Normal Form | Rule | Eliminates |
|---|---|---|
| 1NF | Every cell has one atomic value | Repeating groups, multi-value cells |
| 2NF | 1NF + no partial dependencies on composite keys | Partial dependencies |
| 3NF | 2NF + no transitive dependencies (non-key → non-key) | Redundant data stored in multiple places |
Denormalization — When to Break the Rules
Normalization is great for WRITING (no duplicates, easy updates). But terrible for READING:
-- Normalized (3NF): 5 JOINs to answer one question
SELECT o.order_id, c.customer_name, p.product_name, p.price, oi.quantity,
s.store_name, r.region_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN stores s ON o.store_id = s.store_id
JOIN regions r ON s.region_id = r.region_id;
-- Denormalized (Star Schema): 2 JOINs for the same answer
SELECT f.order_id, d_cust.customer_name, d_prod.product_name,
f.amount, f.quantity, d_store.store_name, d_store.region_name
FROM fact_sales f
JOIN dim_customer d_cust ON f.customer_key = d_cust.customer_key
JOIN dim_product d_prod ON f.product_key = d_prod.product_key
JOIN dim_store d_store ON f.store_key = d_store.store_key;
The trade-off: Normalization saves storage and prevents anomalies. Denormalization wastes some storage but makes queries 5-10x faster. Data warehouses choose speed.
Dimensional Modeling — Star Schema
Fact Tables — What Happened
A fact table records business events (transactions, measurements). Each row is one event.
fact_sales:
| sale_key | customer_key | product_key | store_key | date_key | quantity | amount | discount |
|----------|-------------|-------------|-----------|-----------|----------|---------|----------|
| 1 | 101 | 201 | 301 | 20260115 | 1 | 999.99 | 0.00 |
| 2 | 101 | 202 | 301 | 20260115 | 2 | 59.98 | 5.00 |
| 3 | 102 | 201 | 302 | 20260116 | 1 | 999.99 | 50.00 |
Fact table characteristics: – Contains foreign keys to dimension tables (customer_key, product_key) – Contains measures (numeric values you aggregate: quantity, amount, discount) – Grain: one row per transaction line item (the finest level of detail) – Typically the LARGEST table (millions/billions of rows)
Dimension Tables — The Context
Dimension tables provide the descriptive context for facts (who, what, where, when).
dim_customer:
| customer_key | customer_id | name | email | city | country | tier |
|-------------|-------------|--------|-------------------|---------|---------|---------|
| 101 | C-1001 | Naveen | naveen@email.com | Toronto | Canada | Gold |
| 102 | C-1002 | Shrey | shrey@email.com | Mumbai | India | Silver |
dim_product:
| product_key | product_id | name | category | brand | price |
|------------|-----------|----------|-------------|---------|--------|
| 201 | P-5001 | Laptop | Electronics | Dell | 999.99 |
| 202 | P-5002 | Mouse | Accessories | Logitech| 29.99 |
dim_date:
| date_key | full_date | year | quarter | month | month_name | day_of_week | is_weekend |
|----------|-----------|------|---------|-------|-----------|-------------|-----------|
| 20260115 | 2026-01-15| 2026 | Q1 | 1 | January | Thursday | 0 |
dim_store:
| store_key | store_id | store_name | city | region | country |
|----------|---------|------------|---------|----------|---------|
| 301 | S-01 | Downtown | Toronto | Ontario | Canada |
| 302 | S-02 | Mall Store | Mumbai | Maharashtra| India |
Star Schema Design
dim_customer
|
| customer_key
|
dim_product ── product_key ── FACT_SALES ── date_key ── dim_date
|
| store_key
|
dim_store
The fact table sits in the CENTER. Dimension tables surround it like points of a STAR. Each dimension connects to the fact with a single key. Hence “star” schema.
Real-life analogy: The fact table is a newspaper article (WHAT happened: “100 laptops sold for $99,999”). The dimensions are the journalist’s context questions: WHO bought them (dim_customer), WHAT was sold (dim_product), WHERE (dim_store), WHEN (dim_date). The fact tells you the event. The dimensions tell you the story.
Snowflake Schema
A snowflake schema further normalizes dimensions:
Star Schema:
dim_store: store_key, store_name, city, region, country
(all in one table)
Snowflake Schema:
dim_store: store_key, store_name, city_key
dim_city: city_key, city_name, region_key
dim_region: region_key, region_name, country_key
dim_country: country_key, country_name
(normalized into 4 tables)
Star vs Snowflake
| Feature | Star Schema | Snowflake Schema |
|---|---|---|
| Dimension structure | Denormalized (one table) | Normalized (multiple tables) |
| Joins | Fewer (fact → dimension) | More (fact → dim → dim → dim) |
| Query performance | Faster (fewer joins) | Slower (more joins) |
| Storage | More (duplicated data in dimensions) | Less (no duplication) |
| Complexity | Simpler | More complex |
| Industry preference | Preferred (90% of data warehouses) | Rare (used when storage is critical) |
The reality: Almost everyone uses star schema. The extra storage cost is negligible compared to the query performance benefit.
Slowly Changing Dimensions in Star Schema
This is where our SCD posts connect:
dim_customer (SCD Type 2):
| customer_key | customer_id | name | city | is_active | start_date | end_date |
|-------------|-------------|--------|-----------|-----------|-----------|-----------|
| 101 | C-1001 | Naveen | Mississauga| 0 | 2022-01-15 | 2024-06-30 |
| 155 | C-1001 | Naveen | Toronto | 1 | 2024-07-01 | 9999-12-31 |
Customer_key 101 (Mississauga) and 155 (Toronto) are BOTH Naveen. The fact table references the correct version: – Sales before July 2024 → customer_key = 101 (Mississauga) – Sales after July 2024 → customer_key = 155 (Toronto)
This is exactly what our SCD Type 2 pipelines build.
Surrogate Keys vs Natural Keys
| Feature | Natural Key | Surrogate Key |
|---|---|---|
| What | Business identifier (SSN, email, product_code) | System-generated integer (auto-increment) |
| Example | customer_id = ‘C-1001’ | customer_key = 101 |
| Stable? | Can change (email changes, codes merge) | Never changes |
| Size | Varies (VARCHAR, composite) | Fixed (INT, 4 bytes) |
| Performance | Slower joins (string/composite) | Faster joins (integer) |
| SCD support | Difficult (same key, different versions) | Easy (new key per version) |
Best practice: Use surrogate keys (auto-increment integers) as primary keys in dimension tables. Keep natural keys as business identifiers for lookups.
Junk, Degenerate, and Role-Playing Dimensions
Junk Dimension
Groups low-cardinality flags into one dimension instead of cluttering the fact table:
Instead of fact_sales having: is_online, is_gift_wrapped, is_express_shipping (3 BIT columns)
Create dim_order_flags:
| flag_key | is_online | is_gift_wrapped | is_express |
|----------|----------|----------------|-----------|
| 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 1 | 0 |
| 4 | 1 | 1 | 1 |
fact_sales references: flag_key = 3 (online, gift-wrapped, standard shipping)
Degenerate Dimension
A dimension that has no attributes other than the key — stored directly in the fact table:
fact_sales:
| sale_key | order_number | customer_key | amount |
↑
order_number is the degenerate dimension
(no separate dim_order table needed — it IS the identifier)
Role-Playing Dimension
The same dimension used multiple times with different meanings:
fact_flight:
| flight_key | departure_date_key | arrival_date_key | departure_airport_key | arrival_airport_key |
↓ ↓ ↓ ↓
dim_date dim_date dim_airport dim_airport
dim_date is used twice (departure and arrival). dim_airport is used twice. Same table, different roles.
Building a Star Schema: Complete Example
-- Dimension: Date
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- 20260115
full_date DATE NOT NULL,
year INT, quarter INT, month INT,
month_name VARCHAR(20), day_of_week VARCHAR(20),
is_weekend BIT, is_holiday BIT
);
-- Dimension: Customer (SCD Type 2)
CREATE TABLE dim_customer (
customer_key INT IDENTITY(1,1) PRIMARY KEY, -- Surrogate
customer_id VARCHAR(20) NOT NULL, -- Natural key
name VARCHAR(100), email VARCHAR(100),
city VARCHAR(50), country VARCHAR(50), tier VARCHAR(20),
is_active BIT DEFAULT 1,
start_date DATE, end_date DATE DEFAULT '9999-12-31'
);
-- Dimension: Product
CREATE TABLE dim_product (
product_key INT IDENTITY(1,1) PRIMARY KEY,
product_id VARCHAR(20) NOT NULL,
name VARCHAR(100), category VARCHAR(50),
brand VARCHAR(50), price DECIMAL(10,2)
);
-- Fact: Sales
CREATE TABLE fact_sales (
sale_key BIGINT IDENTITY(1,1) PRIMARY KEY,
date_key INT NOT NULL REFERENCES dim_date(date_key),
customer_key INT NOT NULL REFERENCES dim_customer(customer_key),
product_key INT NOT NULL REFERENCES dim_product(product_key),
order_number VARCHAR(20), -- Degenerate dimension
quantity INT NOT NULL,
unit_price DECIMAL(10,2) NOT NULL,
discount DECIMAL(10,2) DEFAULT 0,
total_amount AS (quantity * unit_price - discount) PERSISTED
);
How This Maps to Our Blog
| Blog Post | Schema Concept |
|---|---|
| SCD Types (0,1,2,3,6) | How dimension tables handle changes |
| SCD Type 2 Pipeline | Building dim_customer with history |
| Medallion Architecture | Bronze (raw) → Silver (3NF-like) → Gold (star schema) |
| DB vs Data Warehouse | OLTP (normalized) vs OLAP (star schema) |
| Delta Lake MERGE | Loading fact and dimension tables |
Common Mistakes
-
Normalizing a data warehouse — 3NF is for OLTP. Star schema is for OLAP. Normalizing your warehouse makes queries slow (too many joins).
-
Denormalizing a transactional database — star schema in OLTP leads to update anomalies and data inconsistency.
-
No surrogate keys — using natural keys (email, product code) in fact tables makes SCD impossible and joins slower.
-
Fact table too wide — fact tables should have keys + measures only. Descriptive text belongs in dimensions.
-
No date dimension — storing raw dates in the fact table means every query must calculate year, month, quarter, is_weekend. A dim_date pre-computes these once.
-
Skipping SCD — overwriting dimension changes (SCD Type 1) loses history. Use SCD Type 2 for dimensions where history matters (customers, products).
Interview Questions
Q: What is normalization and why is it used? A: Normalization eliminates data redundancy by organizing data into related tables following rules (1NF, 2NF, 3NF). It prevents update anomalies (changing data in one place, forgetting another), reduces storage, and ensures data integrity. Used in transactional (OLTP) databases.
Q: What is the difference between 2NF and 3NF? A: 2NF eliminates partial dependencies — every non-key column must depend on the ENTIRE primary key (relevant for composite keys). 3NF eliminates transitive dependencies — no non-key column should depend on another non-key column. 3NF = 2NF + no non-key → non-key dependencies.
Q: What is a star schema and why is it used in data warehouses? A: A star schema has a central fact table (business events with measures) surrounded by dimension tables (descriptive context — who, what, where, when). It is denormalized for fast read performance — queries need fewer joins than normalized schemas. Used in data warehouses (OLAP) where read speed matters more than write efficiency.
Q: What is the difference between a fact table and a dimension table? A: Fact tables store business events (transactions, measurements) with numeric measures (amount, quantity) and foreign keys to dimensions. They are typically large (billions of rows). Dimension tables store descriptive context (customer name, product category, store location) and are typically small (thousands to millions of rows).
Q: What is a surrogate key and why is it important? A: A surrogate key is a system-generated integer (IDENTITY/auto-increment) used as the primary key in dimension tables. Unlike natural keys (email, product code), surrogate keys never change, are compact (INT), and support SCD Type 2 (multiple versions of the same entity with different surrogate keys). Every dimension table should use surrogate keys.
Wrapping Up
Normalization and dimensional modeling are two sides of the same coin. Normalization builds clean, consistent source systems. Star schema builds fast, queryable data warehouses. As a data engineer, you bridge both worlds — reading from normalized sources and writing to star schema destinations.
The path is clear: source databases (3NF) → Bronze (raw) → Silver (cleaned, 3NF-ish) → Gold (star schema). Every pipeline we built on this blog follows this path.
Related posts: – SCD Types (0,1,2,3,6) – Medallion Architecture – DB vs Data Warehouse – SQL DDL, DML, Constraints
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.