SQL Normalization and Star Schema: 1NF, 2NF, 3NF, Dimensional Modeling, and Designing Databases Like a Data Engineer

SQL Normalization and Star Schema: 1NF, 2NF, 3NF, Dimensional Modeling, and Designing Databases Like a Data Engineer

There are two ways to design a database. Normalization (1NF, 2NF, 3NF) eliminates redundancy — used in transactional systems where data is written frequently. Dimensional modeling (star schema, snowflake schema) optimizes for reading — used in data warehouses where analysts run complex queries.

As a data engineer, you work with BOTH. You read from normalized source databases (OLTP) and write to denormalized star schemas (OLAP/data warehouse). Understanding both designs — and knowing when to apply each — is fundamental.

Think of normalization like organizing a library with strict rules: every book appears exactly once, every author’s info is stored in one place, and changing an author’s name requires updating only one record. A star schema is like a pre-built reading room: the librarian has already gathered everything you might need for “19th century French literature” into one spot — even if some author info is duplicated across reading rooms. The library (normalized) is efficient for adding books. The reading room (star schema) is efficient for research.

Table of Contents

  • OLTP vs OLAP — Two Different Worlds
  • Normalization — Eliminating Redundancy
  • First Normal Form (1NF)
  • Second Normal Form (2NF)
  • Third Normal Form (3NF)
  • Denormalization — When to Break the Rules
  • Dimensional Modeling — Star Schema
  • Fact Tables — What Happened
  • Dimension Tables — The Context
  • Star Schema Design
  • Snowflake Schema
  • Star vs Snowflake
  • Slowly Changing Dimensions in Star Schema
  • Building a Star Schema: Complete Example
  • Surrogate Keys vs Natural Keys
  • Junk Dimensions, Degenerate Dimensions, Role-Playing Dimensions
  • How This Maps to Our Blog
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

OLTP vs OLAP — Two Different Worlds

Feature OLTP (Transactional) OLAP (Analytical)
Purpose Record transactions Analyze trends
Users Applications, cashiers, websites Analysts, dashboards, ML models
Queries Simple (INSERT, UPDATE one row) Complex (JOIN 5 tables, GROUP BY, SUM)
Design Normalized (3NF) Denormalized (Star Schema)
Redundancy Eliminated (each fact stored once) Intentional (faster reads)
Example Banking app, e-commerce checkout Revenue dashboard, churn analysis
Write speed Optimized (fast inserts) Not optimized (batch loads)
Read speed Not optimized (many joins) Optimized (fewer joins)

As a data engineer, your job is to move data FROM OLTP (normalized) TO OLAP (star schema). That is what our Bronze → Silver → Gold pipelines do.

Normalization — Eliminating Redundancy

The Problem: Unnormalized Data

Orders Table (one big table — no normalization):
| order_id | customer_name | customer_email    | customer_city | product_name | product_price | quantity | order_date |
|----------|-------------- |-------------------|---------------|-------------|---------------|----------|------------|
| 1        | Naveen        | naveen@email.com  | Toronto       | Laptop       | 999.99        | 1        | 2026-01-15 |
| 2        | Naveen        | naveen@email.com  | Toronto       | Mouse        | 29.99         | 2        | 2026-01-15 |
| 3        | Shrey         | shrey@email.com   | Mumbai        | Laptop       | 999.99        | 1        | 2026-01-16 |
| 4        | Naveen        | naveen.new@email  | Toronto       | Keyboard     | 79.99         | 1        | 2026-02-01 |

Problems: – Naveen’s info is duplicated across 3 rows – Row 4 has a different email for Naveen — which is correct? – Laptop price stored 2 times — update one, forget the other = inconsistency – Deleting order 3 loses all info about Shrey if it was his only order

First Normal Form (1NF) — Atomic Values

Rule: Every cell contains ONE value (not a list or repeating group).

❌ Violates 1NF (repeating group):
| order_id | customer | products              |
|----------|----------|-----------------------|
| 1        | Naveen   | Laptop, Mouse, Keyboard|

✅ Satisfies 1NF (one value per cell):
| order_id | customer | product  |
|----------|----------|----------|
| 1        | Naveen   | Laptop   |
| 1        | Naveen   | Mouse    |
| 1        | Naveen   | Keyboard |

Real-life analogy: 1NF is like a rule that every shelf in a bookshelf holds ONE book. No stacking books on top of each other. If you have 3 books, you need 3 shelves.

Second Normal Form (2NF) — No Partial Dependencies

Rule: Already in 1NF + every non-key column depends on the ENTIRE primary key (not just part of it).

❌ Violates 2NF (partial dependency):
Primary Key: (order_id, product_id)

| order_id | product_id | product_name | product_price | quantity |
|----------|-----------|-------------|---------------|----------|
| 1        | 101       | Laptop       | 999.99        | 1        |
| 1        | 102       | Mouse        | 29.99         | 2        |

product_name and product_price depend ONLY on product_id, not on the full key (order_id, product_id).

✅ Satisfies 2NF (separate tables):
Order_Items: (order_id, product_id, quantity)
Products: (product_id, product_name, product_price)

Real-life analogy: 2NF is like saying “the recipe should not contain the oven’s instruction manual.” The recipe (order_items) should only contain what is specific to THIS recipe. General oven instructions (product details) belong in a separate document (products table).

Third Normal Form (3NF) — No Transitive Dependencies

Rule: Already in 2NF + no non-key column depends on another non-key column.

❌ Violates 3NF (transitive dependency):
| emp_id | emp_name | dept_id | dept_name    | dept_manager |
|--------|----------|---------|-------------|-------------|
| 1001   | Naveen   | 10      | Engineering  | Alice        |
| 1002   | Shrey    | 10      | Engineering  | Alice        |

dept_name and dept_manager depend on dept_id, not on emp_id.
Change manager? Must update EVERY employee in that department.

✅ Satisfies 3NF (separate tables):
Employees: (emp_id, emp_name, dept_id)
Departments: (dept_id, dept_name, dept_manager)

Real-life analogy: 3NF is like keeping your home address only on your driver’s license, not also on your credit card, gym membership, and library card. If you move, you update ONE place (driver’s license / departments table), not 5 places.

Summary: 1NF → 2NF → 3NF

Normal Form Rule Eliminates
1NF Every cell has one atomic value Repeating groups, multi-value cells
2NF 1NF + no partial dependencies on composite keys Partial dependencies
3NF 2NF + no transitive dependencies (non-key → non-key) Redundant data stored in multiple places

Denormalization — When to Break the Rules

Normalization is great for WRITING (no duplicates, easy updates). But terrible for READING:

-- Normalized (3NF): 5 JOINs to answer one question
SELECT o.order_id, c.customer_name, p.product_name, p.price, oi.quantity,
       s.store_name, r.region_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN stores s ON o.store_id = s.store_id
JOIN regions r ON s.region_id = r.region_id;

-- Denormalized (Star Schema): 2 JOINs for the same answer
SELECT f.order_id, d_cust.customer_name, d_prod.product_name,
       f.amount, f.quantity, d_store.store_name, d_store.region_name
FROM fact_sales f
JOIN dim_customer d_cust ON f.customer_key = d_cust.customer_key
JOIN dim_product d_prod ON f.product_key = d_prod.product_key
JOIN dim_store d_store ON f.store_key = d_store.store_key;

The trade-off: Normalization saves storage and prevents anomalies. Denormalization wastes some storage but makes queries 5-10x faster. Data warehouses choose speed.

Dimensional Modeling — Star Schema

Fact Tables — What Happened

A fact table records business events (transactions, measurements). Each row is one event.

fact_sales:
| sale_key | customer_key | product_key | store_key | date_key  | quantity | amount  | discount |
|----------|-------------|-------------|-----------|-----------|----------|---------|----------|
| 1        | 101         | 201         | 301       | 20260115  | 1        | 999.99  | 0.00     |
| 2        | 101         | 202         | 301       | 20260115  | 2        | 59.98   | 5.00     |
| 3        | 102         | 201         | 302       | 20260116  | 1        | 999.99  | 50.00    |

Fact table characteristics: – Contains foreign keys to dimension tables (customer_key, product_key) – Contains measures (numeric values you aggregate: quantity, amount, discount) – Grain: one row per transaction line item (the finest level of detail) – Typically the LARGEST table (millions/billions of rows)

Dimension Tables — The Context

Dimension tables provide the descriptive context for facts (who, what, where, when).

dim_customer:
| customer_key | customer_id | name   | email             | city    | country | tier    |
|-------------|-------------|--------|-------------------|---------|---------|---------|
| 101         | C-1001      | Naveen | naveen@email.com  | Toronto | Canada  | Gold    |
| 102         | C-1002      | Shrey  | shrey@email.com   | Mumbai  | India   | Silver  |

dim_product:
| product_key | product_id | name     | category    | brand   | price  |
|------------|-----------|----------|-------------|---------|--------|
| 201        | P-5001    | Laptop   | Electronics | Dell    | 999.99 |
| 202        | P-5002    | Mouse    | Accessories | Logitech| 29.99  |

dim_date:
| date_key | full_date  | year | quarter | month | month_name | day_of_week | is_weekend |
|----------|-----------|------|---------|-------|-----------|-------------|-----------|
| 20260115 | 2026-01-15| 2026 | Q1      | 1     | January   | Thursday    | 0         |

dim_store:
| store_key | store_id | store_name  | city    | region    | country |
|----------|---------|------------|---------|----------|---------|
| 301      | S-01    | Downtown   | Toronto | Ontario  | Canada  |
| 302      | S-02    | Mall Store | Mumbai  | Maharashtra| India  |

Star Schema Design

dim_customer
                         |
                         |  customer_key
                         |
dim_product ── product_key ── FACT_SALES ── date_key ── dim_date
                         |
                         |  store_key
                         |
                    dim_store

The fact table sits in the CENTER. Dimension tables surround it like points of a STAR. Each dimension connects to the fact with a single key. Hence “star” schema.

Real-life analogy: The fact table is a newspaper article (WHAT happened: “100 laptops sold for $99,999”). The dimensions are the journalist’s context questions: WHO bought them (dim_customer), WHAT was sold (dim_product), WHERE (dim_store), WHEN (dim_date). The fact tells you the event. The dimensions tell you the story.

Snowflake Schema

A snowflake schema further normalizes dimensions:

Star Schema:
  dim_store: store_key, store_name, city, region, country
  (all in one table)

Snowflake Schema:
  dim_store: store_key, store_name, city_key
  dim_city: city_key, city_name, region_key
  dim_region: region_key, region_name, country_key
  dim_country: country_key, country_name
  (normalized into 4 tables)

Star vs Snowflake

Feature Star Schema Snowflake Schema
Dimension structure Denormalized (one table) Normalized (multiple tables)
Joins Fewer (fact → dimension) More (fact → dim → dim → dim)
Query performance Faster (fewer joins) Slower (more joins)
Storage More (duplicated data in dimensions) Less (no duplication)
Complexity Simpler More complex
Industry preference Preferred (90% of data warehouses) Rare (used when storage is critical)

The reality: Almost everyone uses star schema. The extra storage cost is negligible compared to the query performance benefit.

Slowly Changing Dimensions in Star Schema

This is where our SCD posts connect:

dim_customer (SCD Type 2):
| customer_key | customer_id | name   | city       | is_active | start_date | end_date   |
|-------------|-------------|--------|-----------|-----------|-----------|-----------|
| 101         | C-1001      | Naveen | Mississauga| 0         | 2022-01-15 | 2024-06-30 |
| 155         | C-1001      | Naveen | Toronto    | 1         | 2024-07-01 | 9999-12-31 |

Customer_key 101 (Mississauga) and 155 (Toronto) are BOTH Naveen. The fact table references the correct version: – Sales before July 2024 → customer_key = 101 (Mississauga) – Sales after July 2024 → customer_key = 155 (Toronto)

This is exactly what our SCD Type 2 pipelines build.

Surrogate Keys vs Natural Keys

Feature Natural Key Surrogate Key
What Business identifier (SSN, email, product_code) System-generated integer (auto-increment)
Example customer_id = ‘C-1001’ customer_key = 101
Stable? Can change (email changes, codes merge) Never changes
Size Varies (VARCHAR, composite) Fixed (INT, 4 bytes)
Performance Slower joins (string/composite) Faster joins (integer)
SCD support Difficult (same key, different versions) Easy (new key per version)

Best practice: Use surrogate keys (auto-increment integers) as primary keys in dimension tables. Keep natural keys as business identifiers for lookups.

Junk, Degenerate, and Role-Playing Dimensions

Junk Dimension

Groups low-cardinality flags into one dimension instead of cluttering the fact table:

Instead of fact_sales having: is_online, is_gift_wrapped, is_express_shipping (3 BIT columns)

Create dim_order_flags:
| flag_key | is_online | is_gift_wrapped | is_express |
|----------|----------|----------------|-----------|
| 1        | 0        | 0              | 0         |
| 2        | 1        | 0              | 0         |
| 3        | 1        | 1              | 0         |
| 4        | 1        | 1              | 1         |

fact_sales references: flag_key = 3 (online, gift-wrapped, standard shipping)

Degenerate Dimension

A dimension that has no attributes other than the key — stored directly in the fact table:

fact_sales:
| sale_key | order_number | customer_key | amount |
                ↑
  order_number is the degenerate dimension
  (no separate dim_order table needed — it IS the identifier)

Role-Playing Dimension

The same dimension used multiple times with different meanings:

fact_flight:
| flight_key | departure_date_key | arrival_date_key | departure_airport_key | arrival_airport_key |
                      ↓                    ↓                    ↓                       ↓
                  dim_date              dim_date           dim_airport             dim_airport

dim_date is used twice (departure and arrival). dim_airport is used twice. Same table, different roles.

Building a Star Schema: Complete Example

-- Dimension: Date
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,                    -- 20260115
    full_date DATE NOT NULL,
    year INT, quarter INT, month INT,
    month_name VARCHAR(20), day_of_week VARCHAR(20),
    is_weekend BIT, is_holiday BIT
);

-- Dimension: Customer (SCD Type 2)
CREATE TABLE dim_customer (
    customer_key INT IDENTITY(1,1) PRIMARY KEY,  -- Surrogate
    customer_id VARCHAR(20) NOT NULL,             -- Natural key
    name VARCHAR(100), email VARCHAR(100),
    city VARCHAR(50), country VARCHAR(50), tier VARCHAR(20),
    is_active BIT DEFAULT 1,
    start_date DATE, end_date DATE DEFAULT '9999-12-31'
);

-- Dimension: Product
CREATE TABLE dim_product (
    product_key INT IDENTITY(1,1) PRIMARY KEY,
    product_id VARCHAR(20) NOT NULL,
    name VARCHAR(100), category VARCHAR(50),
    brand VARCHAR(50), price DECIMAL(10,2)
);

-- Fact: Sales
CREATE TABLE fact_sales (
    sale_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL REFERENCES dim_date(date_key),
    customer_key INT NOT NULL REFERENCES dim_customer(customer_key),
    product_key INT NOT NULL REFERENCES dim_product(product_key),
    order_number VARCHAR(20),                     -- Degenerate dimension
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    discount DECIMAL(10,2) DEFAULT 0,
    total_amount AS (quantity * unit_price - discount) PERSISTED
);

How This Maps to Our Blog

Blog Post Schema Concept
SCD Types (0,1,2,3,6) How dimension tables handle changes
SCD Type 2 Pipeline Building dim_customer with history
Medallion Architecture Bronze (raw) → Silver (3NF-like) → Gold (star schema)
DB vs Data Warehouse OLTP (normalized) vs OLAP (star schema)
Delta Lake MERGE Loading fact and dimension tables

Common Mistakes

  1. Normalizing a data warehouse — 3NF is for OLTP. Star schema is for OLAP. Normalizing your warehouse makes queries slow (too many joins).

  2. Denormalizing a transactional database — star schema in OLTP leads to update anomalies and data inconsistency.

  3. No surrogate keys — using natural keys (email, product code) in fact tables makes SCD impossible and joins slower.

  4. Fact table too wide — fact tables should have keys + measures only. Descriptive text belongs in dimensions.

  5. No date dimension — storing raw dates in the fact table means every query must calculate year, month, quarter, is_weekend. A dim_date pre-computes these once.

  6. Skipping SCD — overwriting dimension changes (SCD Type 1) loses history. Use SCD Type 2 for dimensions where history matters (customers, products).

Interview Questions

Q: What is normalization and why is it used? A: Normalization eliminates data redundancy by organizing data into related tables following rules (1NF, 2NF, 3NF). It prevents update anomalies (changing data in one place, forgetting another), reduces storage, and ensures data integrity. Used in transactional (OLTP) databases.

Q: What is the difference between 2NF and 3NF? A: 2NF eliminates partial dependencies — every non-key column must depend on the ENTIRE primary key (relevant for composite keys). 3NF eliminates transitive dependencies — no non-key column should depend on another non-key column. 3NF = 2NF + no non-key → non-key dependencies.

Q: What is a star schema and why is it used in data warehouses? A: A star schema has a central fact table (business events with measures) surrounded by dimension tables (descriptive context — who, what, where, when). It is denormalized for fast read performance — queries need fewer joins than normalized schemas. Used in data warehouses (OLAP) where read speed matters more than write efficiency.

Q: What is the difference between a fact table and a dimension table? A: Fact tables store business events (transactions, measurements) with numeric measures (amount, quantity) and foreign keys to dimensions. They are typically large (billions of rows). Dimension tables store descriptive context (customer name, product category, store location) and are typically small (thousands to millions of rows).

Q: What is a surrogate key and why is it important? A: A surrogate key is a system-generated integer (IDENTITY/auto-increment) used as the primary key in dimension tables. Unlike natural keys (email, product code), surrogate keys never change, are compact (INT), and support SCD Type 2 (multiple versions of the same entity with different surrogate keys). Every dimension table should use surrogate keys.

Wrapping Up

Normalization and dimensional modeling are two sides of the same coin. Normalization builds clean, consistent source systems. Star schema builds fast, queryable data warehouses. As a data engineer, you bridge both worlds — reading from normalized sources and writing to star schema destinations.

The path is clear: source databases (3NF) → Bronze (raw) → Silver (cleaned, 3NF-ish) → Gold (star schema). Every pipeline we built on this blog follows this path.

Related posts:SCD Types (0,1,2,3,6)Medallion ArchitectureDB vs Data WarehouseSQL DDL, DML, Constraints


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link