AWS Glue Data Catalog and Crawlers Explained: Metadata Management for Your Data Lake
If you have data sitting in Amazon S3 — CSV files, Parquet files, JSON files — how do you query it with SQL? You cannot just point Athena at a folder and say “SELECT *”. Something needs to understand the structure of those files first: what columns exist, what data types they are, where the files are located, and how they are partitioned.
That something is the AWS Glue Data Catalog. And the tool that automatically discovers and registers your data is the Glue Crawler.
Together, they turn a lake of raw files into a queryable, organized, metadata-rich catalog that Athena, Redshift Spectrum, EMR, and Spark can all use.
This post covers both in detail — what they are, how they work, how to set them up, and how they fit into a production data lake architecture.
Table of Contents
- What Is the AWS Glue Data Catalog?
- What Is a Glue Crawler?
- How They Work Together
- The Data Catalog Hierarchy
- Step 1: Create a Glue Database
- Step 2: Create and Run a Crawler
- Step 3: Explore the Catalog Tables
- Step 4: Query with Athena
- Crawler Configuration Options
- How Crawlers Detect Schema
- Handling Schema Changes
- Partitions and How Crawlers Discover Them
- Scheduling Crawlers
- Manual Table Definitions (Without Crawlers)
- Data Catalog vs Hive Metastore
- Glue Data Catalog in the AWS Ecosystem
- IAM Permissions for Glue
- Cost and Pricing
- Best Practices
- Common Errors and Fixes
- Comparison: Glue Catalog vs Azure Equivalent
- Interview Questions
- Wrapping Up
What Is the AWS Glue Data Catalog?
The AWS Glue Data Catalog is a centralized metadata repository for all your data assets. It stores information about your data — where it lives, what format it is in, what columns it has, and how it is partitioned — but it does NOT store the actual data.
Think of it as a library catalog. The catalog tells you “there is a book called ‘Customer Data’ on shelf S3, section bronze, written in Parquet format, with chapters (columns) for id, name, email, and city.” But the actual book (data) remains on the shelf (S3).
What It Stores
| Metadata | Example |
|---|---|
| Database | datalake_bronze |
| Table name | customers |
| Location | s3://my-datalake/bronze/customers/ |
| Format | Parquet |
| Columns | id (int), name (string), email (string), city (string) |
| Partitions | year=2026/month=04/day=07 |
| Row count (estimated) | 1,000,000 |
| Table properties | classification=parquet, compressionType=snappy |
Who Uses the Catalog
| Service | How It Uses the Catalog |
|---|---|
| Amazon Athena | Reads table definitions to query S3 data with SQL |
| Amazon Redshift Spectrum | Uses catalog tables for external table queries |
| AWS Glue ETL | Reads/writes catalog tables in Spark jobs |
| Amazon EMR | Uses catalog as the Hive metastore replacement |
| AWS Lake Formation | Builds access control on top of catalog tables |
| Amazon QuickSight | Discovers datasets through the catalog |
The Data Catalog is the single source of truth for metadata across the entire AWS analytics ecosystem.
What Is a Glue Crawler?
A Glue Crawler is an automated metadata discovery tool. You point it at an S3 location (or a database), and it:
- Scans the files in that location
- Reads sample data to infer the schema (columns and data types)
- Detects the file format (Parquet, CSV, JSON, Avro, ORC)
- Discovers partitions (year=2026/month=04/ folder structure)
- Creates or updates table definitions in the Data Catalog
Without a crawler, you would have to manually define every table — specify every column name, every data type, every partition. For a data lake with 200 tables, that is unsustainable.
With a crawler, you point it at your S3 prefix, run it, and it creates all 200 table definitions automatically.
How They Work Together
S3 Data Lake Glue Crawler Data Catalog
+------------------+ +----------+ +------------------+
| bronze/ | | | | Database: |
| customers/ | -------> | Scans | ---------> | datalake_bronze|
| data.parquet | | files, | | |
| orders/ | | infers | | Tables: |
| data.parquet | | schema | | customers |
| products/ | | | | orders |
| data.parquet | +----------+ | products |
+------------------+ +------------------+
|
v
Amazon Athena
"SELECT * FROM
datalake_bronze.customers"
The Data Catalog Hierarchy
AWS Account
|-- Glue Data Catalog (one per region)
|-- Database: datalake_bronze
| |-- Table: customers (s3://bucket/bronze/customers/)
| |-- Table: orders (s3://bucket/bronze/orders/)
| |-- Table: products (s3://bucket/bronze/products/)
|
|-- Database: datalake_silver
| |-- Table: customers_cleaned
| |-- Table: orders_enriched
|
|-- Database: datalake_gold
|-- Table: dim_customer
|-- Table: fact_sales
Database = a logical grouping of tables (like a schema in SQL Server). Does not store data.
Table = metadata about one dataset. Points to an S3 location. Defines columns, types, format, and partitions.
Step 1: Create a Glue Database
Using AWS Console
- Go to AWS Glue Console > Databases (under Data Catalog in the left sidebar)
- Click Add database
- Name: datalake_bronze
- Description: “Raw data ingested from source systems”
- Click Create database
Using AWS CLI
aws glue create-database --database-input '{
"Name": "datalake_bronze",
"Description": "Raw data ingested from source systems"
}'
Using Python (boto3)
import boto3
glue = boto3.client('glue')
glue.create_database(
DatabaseInput={
'Name': 'datalake_bronze',
'Description': 'Raw data ingested from source systems'
}
)
Step 2: Create and Run a Crawler
Prerequisites
Make sure you have data in S3:
s3://my-datalake/bronze/customers/part-00000.snappy.parquet
s3://my-datalake/bronze/orders/part-00000.snappy.parquet
s3://my-datalake/bronze/products/part-00000.snappy.parquet
Create Crawler in Console
- Go to AWS Glue Console > Crawlers (under Data Catalog)
- Click Create crawler
- Name: bronze-crawler
- Click Next
- Data sources:
  - Click Add a data source
  - Data source: S3
  - S3 path: s3://my-datalake/bronze/
  - Crawl all sub-folders: Yes
  - Click Add an S3 data source
- Click Next
- IAM role:
  - Create new IAM role: AWSGlueServiceRole-bronze-crawler
  - This role needs the AmazonS3ReadOnlyAccess and AWSGlueServiceRole policies
- Click Next
- Target database: datalake_bronze
- Table name prefix: leave blank (or add a prefix like raw_)
- Click Next
- Schedule: Run on demand (or set a schedule)
- Click Create crawler
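The same crawler can be created programmatically. A minimal boto3 sketch mirroring the console choices above — it assumes the IAM role and database from the earlier steps already exist, and the client is passed in so you can construct it with `boto3.client('glue')`:

```python
# Mirrors the console choices above; role and database names come from
# the earlier steps in this post.
CRAWLER_CONFIG = {
    'Name': 'bronze-crawler',
    'Role': 'AWSGlueServiceRole-bronze-crawler',
    'DatabaseName': 'datalake_bronze',
    'Targets': {'S3Targets': [{'Path': 's3://my-datalake/bronze/'}]},
}

def create_bronze_crawler(glue):
    """Register the crawler in Glue (does not start it)."""
    glue.create_crawler(**CRAWLER_CONFIG)

# Usage:
# import boto3
# create_bronze_crawler(boto3.client('glue'))
```

Passing the client in (rather than creating it inside the function) keeps the sketch easy to test and lets you reuse a session configured for a specific region or profile.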
Run the Crawler
- Select your crawler > click Run crawler
- Wait 1-3 minutes
- Status changes to Succeeded
- Check Tables added: should show 3 (customers, orders, products)
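If you trigger runs from a script or pipeline rather than the console, you can poll the crawler state until it returns to READY. A sketch, assuming the crawler name from above; the client is injected so it can be any boto3 Glue client:

```python
import time

def run_and_wait(glue, crawler_name, poll_seconds=30):
    """Start a crawler, block until it finishes, return the last crawl status."""
    glue.start_crawler(Name=crawler_name)
    # The crawler cycles RUNNING -> STOPPING -> READY; READY means the run is done.
    while glue.get_crawler(Name=crawler_name)['Crawler']['State'] != 'READY':
        time.sleep(poll_seconds)
    return glue.get_crawler(Name=crawler_name)['Crawler']['LastCrawl']['Status']

# Usage:
# import boto3
# status = run_and_wait(boto3.client('glue'), 'bronze-crawler')
```

Note that a crawler that was already READY before `start_crawler` takes a moment to transition to RUNNING, so very aggressive polling could exit early; a 30-second interval avoids that in practice.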
What the Crawler Created
Go to Databases > datalake_bronze > Tables. You will see:
| Table | Location | Format | Columns |
|---|---|---|---|
| customers | s3://my-datalake/bronze/customers/ | parquet | id (int), name (string), email (string), city (string) |
| orders | s3://my-datalake/bronze/orders/ | parquet | order_id (int), customer_id (int), amount (double), order_date (string) |
| products | s3://my-datalake/bronze/products/ | parquet | product_id (int), name (string), category (string), price (double) |
The crawler automatically detected the Parquet schema, including column names and data types.
Step 3: Explore the Catalog Tables
View Table Details
- Click on a table name (e.g., customers)
- You see:
  - Schema: all columns with data types
  - Location: S3 path
  - Input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
  - SerDe: Parquet serialization library
  - Table properties: classification, compressionType, etc.
  - Partitions: if any were detected
View via AWS CLI
aws glue get-table --database-name datalake_bronze --name customers
View via Python
import boto3

glue = boto3.client('glue')
response = glue.get_table(
DatabaseName='datalake_bronze',
Name='customers'
)
for col in response['Table']['StorageDescriptor']['Columns']:
print(f"{col['Name']}: {col['Type']}")
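To see everything a crawler registered in one pass, page through `get_tables` rather than fetching tables one at a time. A sketch with the client injected:

```python
def list_catalog_tables(glue, database):
    """Yield (table_name, s3_location) for every table in a Glue database."""
    paginator = glue.get_paginator('get_tables')
    for page in paginator.paginate(DatabaseName=database):
        for table in page['TableList']:
            yield table['Name'], table['StorageDescriptor']['Location']

# Usage:
# import boto3
# for name, location in list_catalog_tables(boto3.client('glue'), 'datalake_bronze'):
#     print(name, location)
```

The paginator matters once a database holds more than the API's per-page table limit; a bare `get_tables` call would silently return only the first page.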
Step 4: Query with Athena
Once tables are in the Data Catalog, Athena can query them immediately:
-- Select the datalake_bronze database in Athena
-- No data loading needed -- Athena reads directly from S3
SELECT * FROM datalake_bronze.customers
LIMIT 10;
SELECT city, COUNT(*) as customer_count
FROM datalake_bronze.customers
GROUP BY city
ORDER BY customer_count DESC;
-- Join across catalog tables
SELECT c.name, o.amount, o.order_date
FROM datalake_bronze.customers c
JOIN datalake_bronze.orders o ON c.id = o.customer_id
WHERE o.amount > 500;
This is the power of the Data Catalog: you write standard SQL against S3 data without any ETL, loading, or database provisioning.
Crawler Configuration Options
Crawler Behavior on Subsequent Runs
When a crawler runs again on the same data:
| Scenario | Crawler Action |
|---|---|
| New files with same schema | Updates row count, no schema change |
| New files with new columns | Adds new columns to the table |
| New partition folders | Adds new partitions |
| Files deleted | Table remains (does not auto-delete) |
| Completely new data format | Creates a new table |
Configuration Settings
| Setting | Options | Recommendation |
|---|---|---|
| Recrawl policy | Crawl all folders / Crawl new folders only | Crawl new folders only (faster for large lakes) |
| Schema change policy | Update table / Add new columns only / Log changes | Add new columns only (safest) |
| Object deletion policy | Delete from catalog / Mark as deprecated / Log | Mark as deprecated (never auto-delete) |
| Table grouping | Create single table / Create per-folder table | Create per-folder (standard for data lakes) |
| Sample size | Number of files to sample for schema | Default is fine for most cases |
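These settings can also be applied through the API after the crawler exists. A hedged sketch: the mapping of the console label “Add new columns only” to the crawler Configuration JSON (`AddOrUpdateBehavior: MergeNewColumns`) is my reading of the Glue API and worth verifying against the current docs:

```python
import json

# Crawler-output configuration: merge new columns into existing tables
# instead of rewriting their schemas.
SAFE_CONFIGURATION = json.dumps({
    "Version": 1.0,
    "CrawlerOutput": {"Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}},
})

def apply_safe_policies(glue, crawler_name):
    """Apply the recommended settings from the table above via update_crawler."""
    glue.update_crawler(
        Name=crawler_name,
        Configuration=SAFE_CONFIGURATION,
        SchemaChangePolicy={
            'UpdateBehavior': 'UPDATE_IN_DATABASE',
            'DeleteBehavior': 'DEPRECATE_IN_DATABASE',  # mark, never auto-delete
        },
        RecrawlPolicy={'RecrawlBehavior': 'CRAWL_NEW_FOLDERS_ONLY'},
    )
```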
How Crawlers Detect Schema
For Parquet/ORC/Avro
Schema is embedded in the file. The crawler reads the file metadata — no sampling needed. Column names and types are exact.
For CSV
No embedded schema. The crawler:
1. Reads the first row (checks whether it looks like a header)
2. Samples several rows to infer data types
3. Makes best guesses: “123” could be int or string
CSV issues: The crawler might guess wrong. “2026-04-07” could be detected as string instead of date. You may need to manually update the table schema after crawling CSV files.
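One way to make that correction programmatically: fetch the table, patch the column, and write it back. A sketch — note that `get_table` returns read-only fields (CreateTime, CreatedBy, and so on) that `update_table`'s TableInput rejects, so only a writable subset is sent back; the `WRITABLE_KEYS` set below is my assumption of that subset and should be checked against the boto3 docs:

```python
# Fields from get_table that update_table's TableInput will accept
# (assumed subset -- verify against the Glue API reference).
WRITABLE_KEYS = {'Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
                 'PartitionKeys', 'TableType', 'Parameters'}

def to_table_input(table):
    """Strip read-only fields so the dict is valid as a TableInput."""
    return {k: v for k, v in table.items() if k in WRITABLE_KEYS}

def fix_column_type(glue, database, table_name, column, new_type):
    """Overwrite one column's type in the catalog (e.g. string -> date)."""
    table = glue.get_table(DatabaseName=database, Name=table_name)['Table']
    for col in table['StorageDescriptor']['Columns']:
        if col['Name'] == column:
            col['Type'] = new_type
    glue.update_table(DatabaseName=database, TableInput=to_table_input(table))

# Usage:
# import boto3
# fix_column_type(boto3.client('glue'), 'datalake_bronze', 'orders',
#                 'order_date', 'date')
```

If the crawler's schema change policy is set to update the table, a later run may overwrite this fix — another reason to prefer “Add new columns only”.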
For JSON
Semi-structured. The crawler reads multiple records and infers a schema from the union of all fields. Nested JSON creates struct types.
Handling Schema Changes
New Columns Added to Source
When new files have extra columns:
- Crawler detects new columns
- Based on policy, either adds them to the table or logs a warning
- Existing queries continue to work (new columns are NULL for old data)
Columns Removed from Source
- Crawler does not remove columns from the catalog (they stay with NULL values)
- Old queries referencing removed columns still work against old files
Data Type Changes
- Crawler detects type mismatch (column was int, now it is string)
- Based on policy, either updates the type or logs a warning
- Risk: Changing types can break existing queries
Best practice: Set schema change policy to Add new columns only and handle type changes manually.
Partitions and How Crawlers Discover Them
Hive-Style Partitions
If your S3 data is organized like:
s3://my-datalake/bronze/orders/year=2026/month=04/day=07/data.parquet
s3://my-datalake/bronze/orders/year=2026/month=04/day=08/data.parquet
s3://my-datalake/bronze/orders/year=2026/month=03/day=15/data.parquet
The crawler automatically:
1. Detects year, month, day as partition keys
2. Registers each unique combination as a partition in the catalog
3. Athena can then use partition pruning:
-- Only scans the April 7 partition (not all data)
SELECT * FROM orders WHERE year='2026' AND month='04' AND day='07';
Adding New Partitions
When new daily data arrives, you have three options:
Option A: Rerun the crawler — it discovers new partitions automatically
Option B: Use MSCK REPAIR TABLE in Athena (faster):
MSCK REPAIR TABLE datalake_bronze.orders;
This scans S3 for new partition folders and adds them to the catalog without running the full crawler.
Option C: Use Glue API (most efficient):
glue.batch_create_partition(
DatabaseName='datalake_bronze',
TableName='orders',
PartitionInputList=[{
'Values': ['2026', '04', '09'],
'StorageDescriptor': {
'Location': 's3://my-datalake/bronze/orders/year=2026/month=04/day=09/',
# ... same storage descriptor as the table
}
}]
)
Scheduling Crawlers
On-Demand (Manual)
Run whenever you want from the console or CLI:
aws glue start-crawler --name bronze-crawler
Scheduled (Cron)
Set the crawler to run on a schedule:
Daily at 3 AM UTC: cron(0 3 * * ? *)
Every 6 hours: cron(0 */6 * * ? *)
Monday at midnight: cron(0 0 ? * MON *)
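A schedule can also be attached (or changed) after the crawler exists. A minimal sketch via `update_crawler`; Glue expects the six-field `cron(...)` form shown above:

```python
def schedule_crawler(glue, name, cron_expression):
    """Attach or change a crawler's cron schedule."""
    glue.update_crawler(Name=name, Schedule=cron_expression)

# Usage:
# import boto3
# schedule_crawler(boto3.client('glue'), 'bronze-crawler',
#                  'cron(0 3 * * ? *)')  # daily at 3 AM UTC
```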
Event-Driven
Trigger the crawler when new data arrives:
1. An S3 event notification triggers a Lambda function
2. The Lambda function starts the crawler
# Lambda function triggered by S3 event
import boto3

def handler(event, context):
    glue = boto3.client('glue')
    try:
        glue.start_crawler(Name='bronze-crawler')
    except glue.exceptions.CrawlerRunningException:
        pass  # already running; the in-flight run will pick up the new data
Manual Table Definitions (Without Crawlers)
You do not always need a crawler. You can define tables manually:
Using Athena DDL
CREATE EXTERNAL TABLE datalake_bronze.customers (
id INT,
name STRING,
email STRING,
city STRING
)
STORED AS PARQUET
LOCATION 's3://my-datalake/bronze/customers/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');
With Partitions
CREATE EXTERNAL TABLE datalake_bronze.orders (
order_id INT,
customer_id INT,
amount DOUBLE,
order_date STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS PARQUET
LOCATION 's3://my-datalake/bronze/orders/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');
-- Load partitions
MSCK REPAIR TABLE datalake_bronze.orders;
When to Use Manual vs Crawler
| Scenario | Use Crawler | Use Manual DDL |
|---|---|---|
| Many tables, unknown schemas | Yes | No |
| Schemas change frequently | Yes | No |
| You know the exact schema | No | Yes (more control) |
| CSV with messy headers | No | Yes (you define types correctly) |
| Partitioned data | Either works | Yes (more explicit) |
Data Catalog vs Hive Metastore
The Glue Data Catalog serves the same purpose as the Apache Hive Metastore but is fully managed:
| Feature | Hive Metastore | Glue Data Catalog |
|---|---|---|
| Management | You manage (MySQL/PostgreSQL backend) | Fully managed by AWS |
| Availability | Depends on your setup | Built-in HA |
| Integration | Hive, Spark | Athena, Redshift, Glue, EMR, Lake Formation |
| Cost | You pay for infra | Pay per API call |
| Crawlers | No built-in discovery | Built-in crawlers |
| Access control | Basic | Lake Formation fine-grained access |
On AWS, the Glue Data Catalog has largely replaced self-managed Hive Metastores; EMR and Spark can be configured to use it directly as their metastore.
Glue Data Catalog in the AWS Ecosystem
S3 (Data Lake)
|
|-- Glue Crawler --> Glue Data Catalog (metadata)
| |
| |-- Athena (SQL queries)
| |-- Redshift Spectrum (external tables)
| |-- EMR/Spark (table references)
| |-- Glue ETL (source/sink)
| |-- Lake Formation (access control)
| |-- QuickSight (data discovery)
|
|-- Glue ETL Jobs (transform data)
|-- Glue Workflows (orchestrate ETL)
IAM Permissions for Glue
Crawler IAM Role
The crawler needs an IAM role with:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-datalake",
"arn:aws:s3:::my-datalake/*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:*"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "arn:aws:logs:*:*:*"
}
]
}
Also attach the managed policy: AWSGlueServiceRole.
User Permissions
For a data engineer to use the catalog:
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetTable",
"glue:GetTables",
"glue:GetPartitions",
"glue:CreateTable",
"glue:UpdateTable",
"glue:BatchCreatePartition"
],
"Resource": "*"
}
Cost and Pricing
| Component | Cost |
|---|---|
| Storing metadata | First 1 million objects free, then $1 per 100,000 objects/month |
| API requests | First 1 million free, then $1 per 1 million requests |
| Crawler runtime | $0.44 per DPU-hour (usually runs for a few minutes) |
For most data lakes, the Data Catalog cost is negligible — often under $5/month.
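As a back-of-envelope check, the pricing above can be plugged into a small calculator. The figures are the ones listed in the table (verify against the current Glue pricing page), and `dpus=2` is an illustrative assumption for crawler capacity:

```python
FREE_OBJECTS = 1_000_000       # free tier: first 1M stored objects
FREE_REQUEST_MILLIONS = 1      # free tier: first 1M API requests
PER_100K_OBJECTS = 1.00        # USD per 100,000 objects per month beyond free tier
PER_MILLION_REQUESTS = 1.00    # USD per 1M requests beyond free tier
DPU_HOUR = 0.44                # USD per crawler DPU-hour

def monthly_catalog_cost(objects, request_millions, crawler_minutes, dpus=2):
    """Estimate monthly Data Catalog + crawler cost in USD."""
    storage = max(0, objects - FREE_OBJECTS) / 100_000 * PER_100K_OBJECTS
    requests = max(0, request_millions - FREE_REQUEST_MILLIONS) * PER_MILLION_REQUESTS
    crawling = crawler_minutes / 60 * dpus * DPU_HOUR
    return storage + requests + crawling

# 200 tables well under both free tiers, a 5-minute crawl every day on 2 DPUs:
# monthly_catalog_cost(50_000, 0.5, 5 * 30)  -> 2.20, all of it crawler runtime
```

In this sketch the metadata itself costs nothing at small scale; the crawler runtime dominates, which is why event-driven or scheduled crawls beat constant re-crawling.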
Best Practices
- One database per data layer: datalake_bronze, datalake_silver, datalake_gold
- Use Parquet instead of CSV: the schema is embedded in the file, so crawlers read it exactly
- Use Hive-style partitions (year=2026/month=04/) for automatic partition discovery
- Set crawler schema policy to “Add new columns only” — prevents breaking changes
- Schedule crawlers after pipeline runs — crawl new data promptly
- Use MSCK REPAIR TABLE for adding partitions in between crawler runs
- Name tables consistently — snake_case, match the source system names
- Add table descriptions — help other team members discover and understand data
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| “Database not found” | Database does not exist in the catalog | Create the database first |
| “Access denied” on S3 | Crawler IAM role cannot read S3 | Add s3:GetObject and s3:ListBucket to the role |
| Crawler creates too many tables | Files in different formats in the same folder | Separate files by format or use classifiers |
| Wrong data types for CSV | Crawler guessed types incorrectly | Manually update the table schema in the catalog |
| Missing partitions | Crawler has not run since new data arrived | Run crawler or MSCK REPAIR TABLE |
| Table shows 0 rows in Athena | Partitions not registered | Run MSCK REPAIR TABLE |
Comparison: Glue Catalog vs Azure Equivalent
| Feature | AWS Glue Data Catalog | Azure Equivalent |
|---|---|---|
| Metadata store | Glue Data Catalog | Azure Purview / Synapse Serverless external tables |
| Auto-discovery | Glue Crawler | Purview scanning / manual DDL |
| Query engine | Athena | Synapse Serverless SQL |
| Access control | Lake Formation | Synapse workspace RBAC + ADLS ACLs |
| Cost | Per API call (very cheap) | Per query data scanned (Synapse) |
Interview Questions
Q: What is the AWS Glue Data Catalog? A: A centralized metadata repository that stores information about your data assets — table definitions, column schemas, data locations, formats, and partitions. It does not store actual data. It is used by Athena, Redshift Spectrum, Glue ETL, and EMR to understand the structure of data in S3.
Q: What is a Glue Crawler and why do you need it? A: A crawler is an automated tool that scans data in S3 (or other sources), infers the schema, and creates or updates table definitions in the Data Catalog. Without it, you would have to manually define every table and column.
Q: How does a crawler handle schema changes? A: Based on the configured policy. It can add new columns, update existing types, or just log changes. The safest setting is “Add new columns only” which prevents breaking existing queries.
Q: How do Athena queries use the Data Catalog? A: Athena reads table definitions (location, format, schema, partitions) from the Data Catalog. When you run a query, Athena uses this metadata to find the S3 files, apply the correct deserializer, and return structured results. No data is loaded into Athena — it queries S3 directly.
Q: What is the difference between running a crawler and MSCK REPAIR TABLE? A: A crawler scans files and infers schema from scratch (can add columns, detect format changes). MSCK REPAIR TABLE only discovers new partitions in an existing table without checking schema. MSCK is faster but limited to partition updates.
Q: How is the Glue Data Catalog different from a Hive Metastore? A: Same concept (metadata store for table definitions), but the Glue Data Catalog is fully managed, has built-in crawlers for auto-discovery, integrates natively with Athena/Redshift/EMR, and includes Lake Formation for fine-grained access control. Hive Metastore requires you to manage the backend database yourself.
Wrapping Up
The Glue Data Catalog and Crawlers are the metadata backbone of any AWS data lake. Without them, your S3 data is just files in folders. With them, your data lake becomes a queryable, discoverable, organized catalog that every analytics tool in AWS can use.
The pattern is simple: store data in S3, crawl it with Glue, query it with Athena. Master this, and you have the foundation for any AWS data platform.
Related posts:
- AWS S3 for Data Engineers
- Parquet vs CSV vs JSON
- Schema-on-Write vs Schema-on-Read
- Building a REST API with FastAPI on AWS Lambda
- Python for Data Engineers
If this guide helped you understand the Glue Data Catalog, share it with your team. Questions? Drop a comment below.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.