AWS Glue Data Catalog and Crawlers Explained: Metadata Management for Your Data Lake
If you have data sitting in Amazon S3 — CSV files, Parquet files, JSON files — how do you query it with SQL? You cannot just point Athena at a folder and say “SELECT *”. Something needs to understand the structure of those files first: what columns exist, what data types they are, where the files are located, and how they are partitioned.
That something is the AWS Glue Data Catalog. And the tool that automatically discovers and registers your data is the Glue Crawler.
Together, they turn a lake of raw files into a queryable, organized, metadata-rich catalog that Athena, Redshift Spectrum, EMR, and Spark can all use.
This post covers both in detail — what they are, how they work, how to set them up, and how they fit into a production data lake architecture.
Table of Contents
- What Is the AWS Glue Data Catalog?
- What Is a Glue Crawler?
- How They Work Together
- The Data Catalog Hierarchy
- Step 1: Create a Glue Database
- Step 2: Create and Run a Crawler
- Step 3: Explore the Catalog Tables
- Step 4: Query with Athena
- Crawler Configuration Options
- How Crawlers Detect Schema
- Handling Schema Changes
- Partitions and How Crawlers Discover Them
- Scheduling Crawlers
- Manual Table Definitions (Without Crawlers)
- Data Catalog vs Hive Metastore
- Glue Data Catalog in the AWS Ecosystem
- IAM Permissions for Glue
- Cost and Pricing
- Best Practices
- Common Errors and Fixes
- Comparison: Glue Catalog vs Azure Equivalent
- Interview Questions
- Wrapping Up
What Is the AWS Glue Data Catalog?
The AWS Glue Data Catalog is a centralized metadata repository for all your data assets. It stores information about your data — where it lives, what format it is in, what columns it has, and how it is partitioned — but it does NOT store the actual data.
Think of it as a library catalog. The catalog tells you “there is a book called ‘Customer Data’ on shelf S3, section bronze, written in Parquet format, with chapters (columns) for id, name, email, and city.” But the actual book (data) remains on the shelf (S3).
What It Stores
| Metadata | Example |
|---|---|
| Database | datalake_bronze |
| Table name | customers |
| Location | s3://my-datalake/bronze/customers/ |
| Format | Parquet |
| Columns | id (int), name (string), email (string), city (string) |
| Partitions | year=2026/month=04/day=07 |
| Row count (estimated) | 1,000,000 |
| Table properties | classification=parquet, compressionType=snappy |
Who Uses the Catalog
| Service | How It Uses the Catalog |
|---|---|
| Amazon Athena | Reads table definitions to query S3 data with SQL |
| Amazon Redshift Spectrum | Uses catalog tables for external table queries |
| AWS Glue ETL | Reads/writes catalog tables in Spark jobs |
| Amazon EMR | Uses catalog as the Hive metastore replacement |
| AWS Lake Formation | Builds access control on top of catalog tables |
| Amazon QuickSight | Discovers datasets through the catalog |
The Data Catalog is the single source of truth for metadata across the entire AWS analytics ecosystem.
What Is a Glue Crawler?
A Glue Crawler is an automated metadata discovery tool. You point it at an S3 location (or a database), and it:
- Scans the files in that location
- Reads sample data to infer the schema (columns and data types)
- Detects the file format (Parquet, CSV, JSON, Avro, ORC)
- Discovers partitions (year=2026/month=04/ folder structure)
- Creates or updates table definitions in the Data Catalog
Without a crawler, you would have to manually define every table — specify every column name, every data type, every partition. For a data lake with 200 tables, that is unsustainable.
With a crawler, you point it at your S3 prefix, run it, and it creates all 200 table definitions automatically.
How They Work Together
S3 Data Lake Glue Crawler Data Catalog
+------------------+ +----------+ +------------------+
| bronze/ | | | | Database: |
| customers/ | -------> | Scans | ---------> | datalake_bronze|
| data.parquet | | files, | | |
| orders/ | | infers | | Tables: |
| data.parquet | | schema | | customers |
| products/ | | | | orders |
| data.parquet | +----------+ | products |
+------------------+ +------------------+
|
v
Amazon Athena
"SELECT * FROM
datalake_bronze.customers"
The Data Catalog Hierarchy
AWS Account
|-- Glue Data Catalog (one per region)
|-- Database: datalake_bronze
| |-- Table: customers (s3://bucket/bronze/customers/)
| |-- Table: orders (s3://bucket/bronze/orders/)
| |-- Table: products (s3://bucket/bronze/products/)
|
|-- Database: datalake_silver
| |-- Table: customers_cleaned
| |-- Table: orders_enriched
|
|-- Database: datalake_gold
|-- Table: dim_customer
|-- Table: fact_sales
Database = a logical grouping of tables (like a schema in SQL Server). Does not store data.
Table = metadata about one dataset. Points to an S3 location. Defines columns, types, format, and partitions.
Step 1: Create a Glue Database
Using AWS Console
- Go to AWS Glue Console > Databases (under Data Catalog in the left sidebar)
- Click Add database
- Name: datalake_bronze
- Description: “Raw data ingested from source systems”
- Click Create database
Using AWS CLI
aws glue create-database --database-input '{
"Name": "datalake_bronze",
"Description": "Raw data ingested from source systems"
}'
Using Python (boto3)
import boto3
glue = boto3.client('glue')
glue.create_database(
DatabaseInput={
'Name': 'datalake_bronze',
'Description': 'Raw data ingested from source systems'
}
)
Step 2: Create and Run a Crawler
Prerequisites
Make sure you have data in S3:
s3://my-datalake/bronze/customers/part-00000.snappy.parquet
s3://my-datalake/bronze/orders/part-00000.snappy.parquet
s3://my-datalake/bronze/products/part-00000.snappy.parquet
Create Crawler in Console
- Go to AWS Glue Console > Crawlers (under Data Catalog)
- Click Create crawler
- Name: bronze-crawler
- Click Next
- Data sources:
  - Click Add a data source
  - Data source: S3
  - S3 path: s3://my-datalake/bronze/
  - Crawl all sub-folders: Yes
  - Click Add an S3 data source
- Click Next
- IAM role:
  - Create new IAM role: AWSGlueServiceRole-bronze-crawler
  - This role needs the AmazonS3ReadOnlyAccess and AWSGlueServiceRole policies
- Click Next
- Target database: datalake_bronze
- Table name prefix: leave blank (or add a prefix like raw_)
- Click Next
- Schedule: Run on demand (or set a schedule)
- Click Create crawler
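The same crawler can be created programmatically. A minimal boto3 sketch mirroring the console choices above — it assumes the IAM role and database from the earlier steps already exist, and the client is passed in so you can construct it with `boto3.client('glue')`:

```python
# Mirrors the console choices above; role and database names come from
# the earlier steps in this post.
CRAWLER_CONFIG = {
    'Name': 'bronze-crawler',
    'Role': 'AWSGlueServiceRole-bronze-crawler',
    'DatabaseName': 'datalake_bronze',
    'Targets': {'S3Targets': [{'Path': 's3://my-datalake/bronze/'}]},
}

def create_bronze_crawler(glue):
    """Register the crawler in Glue (does not start it)."""
    glue.create_crawler(**CRAWLER_CONFIG)

# Usage:
# import boto3
# create_bronze_crawler(boto3.client('glue'))
```

Passing the client in (rather than creating it inside the function) keeps the sketch easy to test and lets you reuse a session configured for a specific region or profile.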
Run the Crawler
- Select your crawler > click Run crawler
- Wait 1-3 minutes
- Status changes to Succeeded
- Check Tables added: should show 3 (customers, orders, products)
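If you trigger runs from a script or pipeline rather than the console, you can poll the crawler state until it returns to READY. A sketch, assuming the crawler name from above; the client is injected so it can be any boto3 Glue client:

```python
import time

def run_and_wait(glue, crawler_name, poll_seconds=30):
    """Start a crawler, block until it finishes, return the last crawl status."""
    glue.start_crawler(Name=crawler_name)
    # The crawler cycles RUNNING -> STOPPING -> READY; READY means the run is done.
    while glue.get_crawler(Name=crawler_name)['Crawler']['State'] != 'READY':
        time.sleep(poll_seconds)
    return glue.get_crawler(Name=crawler_name)['Crawler']['LastCrawl']['Status']

# Usage:
# import boto3
# status = run_and_wait(boto3.client('glue'), 'bronze-crawler')
```

Note that a crawler that was already READY before `start_crawler` takes a moment to transition to RUNNING, so very aggressive polling could exit early; a 30-second interval avoids that in practice.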
What the Crawler Created
Go to Databases > datalake_bronze > Tables. You will see:
| Table | Location | Format | Columns |
|---|---|---|---|
| customers | s3://my-datalake/bronze/customers/ | parquet | id (int), name (string), email (string), city (string) |
| orders | s3://my-datalake/bronze/orders/ | parquet | order_id (int), customer_id (int), amount (double), order_date (string) |
| products | s3://my-datalake/bronze/products/ | parquet | product_id (int), name (string), category (string), price (double) |
The crawler automatically detected the Parquet schema, including column names and data types.
Step 3: Explore the Catalog Tables
View Table Details
- Click on a table name (e.g., customers)
- You see:
  - Schema: all columns with data types
  - Location: S3 path
  - Input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
  - SerDe: Parquet serialization library
  - Table properties: classification, compressionType, etc.
  - Partitions: if any were detected
View via AWS CLI
aws glue get-table --database-name datalake_bronze --name customers
View via Python
import boto3

glue = boto3.client('glue')
response = glue.get_table(
DatabaseName='datalake_bronze',
Name='customers'
)
for col in response['Table']['StorageDescriptor']['Columns']:
print(f"{col['Name']}: {col['Type']}")
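To see everything a crawler registered in one pass, page through `get_tables` rather than fetching tables one at a time. A sketch with the client injected:

```python
def list_catalog_tables(glue, database):
    """Yield (table_name, s3_location) for every table in a Glue database."""
    paginator = glue.get_paginator('get_tables')
    for page in paginator.paginate(DatabaseName=database):
        for table in page['TableList']:
            yield table['Name'], table['StorageDescriptor']['Location']

# Usage:
# import boto3
# for name, location in list_catalog_tables(boto3.client('glue'), 'datalake_bronze'):
#     print(name, location)
```

The paginator matters once a database holds more than the API's per-page table limit; a bare `get_tables` call would silently return only the first page.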
Step 4: Query with Athena
Once tables are in the Data Catalog, Athena can query them immediately:
-- Select the datalake_bronze database in Athena
-- No data loading needed -- Athena reads directly from S3
SELECT * FROM datalake_bronze.customers
LIMIT 10;
SELECT city, COUNT(*) as customer_count
FROM datalake_bronze.customers
GROUP BY city
ORDER BY customer_count DESC;
-- Join across catalog tables
SELECT c.name, o.amount, o.order_date
FROM datalake_bronze.customers c
JOIN datalake_bronze.orders o ON c.id = o.customer_id
WHERE o.amount > 500;
This is the power of the Data Catalog: you write standard SQL against S3 data without any ETL, loading, or database provisioning.
Crawler Configuration Options
Crawler Behavior on Subsequent Runs
When a crawler runs again on the same data:
| Scenario | Crawler Action |
|---|---|
| New files with same schema | Updates row count, no schema change |
| New files with new columns | Adds new columns to the table |
| New partition folders | Adds new partitions |
| Files deleted | Table remains (does not auto-delete) |
| Completely new data format | Creates a new table |
Configuration Settings
| Setting | Options | Recommendation |
|---|---|---|
| Recrawl policy | Crawl all folders / Crawl new folders only | Crawl new folders only (faster for large lakes) |
| Schema change policy | Update table / Add new columns only / Log changes | Add new columns only (safest) |
| Object deletion policy | Delete from catalog / Mark as deprecated / Log | Mark as deprecated (never auto-delete) |
| Table grouping | Create single table / Create per-folder table | Create per-folder (standard for data lakes) |
| Sample size | Number of files to sample for schema | Default is fine for most cases |
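These settings can also be applied through the API after the crawler exists. A hedged sketch: the mapping of the console label “Add new columns only” to the crawler Configuration JSON (`AddOrUpdateBehavior: MergeNewColumns`) is my reading of the Glue API and worth verifying against the current docs:

```python
import json

# Crawler-output configuration: merge new columns into existing tables
# instead of rewriting their schemas.
SAFE_CONFIGURATION = json.dumps({
    "Version": 1.0,
    "CrawlerOutput": {"Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}},
})

def apply_safe_policies(glue, crawler_name):
    """Apply the recommended settings from the table above via update_crawler."""
    glue.update_crawler(
        Name=crawler_name,
        Configuration=SAFE_CONFIGURATION,
        SchemaChangePolicy={
            'UpdateBehavior': 'UPDATE_IN_DATABASE',
            'DeleteBehavior': 'DEPRECATE_IN_DATABASE',  # mark, never auto-delete
        },
        RecrawlPolicy={'RecrawlBehavior': 'CRAWL_NEW_FOLDERS_ONLY'},
    )
```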
How Crawlers Detect Schema
For Parquet/ORC/Avro
Schema is embedded in the file. The crawler reads the file metadata — no sampling needed. Column names and types are exact.
For CSV
No embedded schema. The crawler:
1. Reads the first row (checks whether it looks like a header)
2. Samples several rows to infer data types
3. Makes best guesses: “123” could be int or string
CSV issues: The crawler might guess wrong. “2026-04-07” could be detected as string instead of date. You may need to manually update the table schema after crawling CSV files.
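One way to make that correction programmatically: fetch the table, patch the column, and write it back. A sketch — note that `get_table` returns read-only fields (CreateTime, CreatedBy, and so on) that `update_table`'s TableInput rejects, so only a writable subset is sent back; the `WRITABLE_KEYS` set below is my assumption of that subset and should be checked against the boto3 docs:

```python
# Fields from get_table that update_table's TableInput will accept
# (assumed subset -- verify against the Glue API reference).
WRITABLE_KEYS = {'Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
                 'PartitionKeys', 'TableType', 'Parameters'}

def to_table_input(table):
    """Strip read-only fields so the dict is valid as a TableInput."""
    return {k: v for k, v in table.items() if k in WRITABLE_KEYS}

def fix_column_type(glue, database, table_name, column, new_type):
    """Overwrite one column's type in the catalog (e.g. string -> date)."""
    table = glue.get_table(DatabaseName=database, Name=table_name)['Table']
    for col in table['StorageDescriptor']['Columns']:
        if col['Name'] == column:
            col['Type'] = new_type
    glue.update_table(DatabaseName=database, TableInput=to_table_input(table))

# Usage:
# import boto3
# fix_column_type(boto3.client('glue'), 'datalake_bronze', 'orders',
#                 'order_date', 'date')
```

If the crawler's schema change policy is set to update the table, a later run may overwrite this fix — another reason to prefer “Add new columns only”.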
For JSON
Semi-structured. The crawler reads multiple records and infers a schema from the union of all fields. Nested JSON creates struct types.
Handling Schema Changes
New Columns Added to Source
When new files have extra columns:
- Crawler detects new columns
- Based on policy, either adds them to the table or logs a warning
- Existing queries continue to work (new columns are NULL for old data)
Columns Removed from Source
- Crawler does not remove columns from the catalog (they stay with NULL values)
- Old queries referencing removed columns still work against old files
Data Type Changes
- Crawler detects type mismatch (column was int, now it is string)
- Based on policy, either updates the type or logs a warning
- Risk: Changing types can break existing queries
Best practice: Set schema change policy to Add new columns only and handle type changes manually.
Partitions and How Crawlers Discover Them
Hive-Style Partitions
If your S3 data is organized like:
s3://my-datalake/bronze/orders/year=2026/month=04/day=07/data.parquet
s3://my-datalake/bronze/orders/year=2026/month=04/day=08/data.parquet
s3://my-datalake/bronze/orders/year=2026/month=03/day=15/data.parquet
The crawler automatically:
1. Detects year, month, day as partition keys
2. Registers each unique combination as a partition in the catalog
3. Athena can then use partition pruning:
-- Only scans the April 7 partition (not all data)
SELECT * FROM orders WHERE year='2026' AND month='04' AND day='07';
Adding New Partitions
When new daily data arrives, you have three options:
Option A: Rerun the crawler — it discovers new partitions automatically
Option B: Use MSCK REPAIR TABLE in Athena (faster):
MSCK REPAIR TABLE datalake_bronze.orders;
This scans S3 for new partition folders and adds them to the catalog without running the full crawler.
Option C: Use Glue API (most efficient):
glue.batch_create_partition(
DatabaseName='datalake_bronze',
TableName='orders',
PartitionInputList=[{
'Values': ['2026', '04', '09'],
'StorageDescriptor': {
'Location': 's3://my-datalake/bronze/orders/year=2026/month=04/day=09/',
# ... same storage descriptor as the table
}
}]
)
Scheduling Crawlers
On-Demand (Manual)
Run whenever you want from the console or CLI:
aws glue start-crawler --name bronze-crawler
Scheduled (Cron)
Set the crawler to run on a schedule:
Daily at 3 AM UTC: cron(0 3 * * ? *)
Every 6 hours: cron(0 */6 * * ? *)
Monday at midnight: cron(0 0 ? * MON *)
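A schedule can also be attached (or changed) after the crawler exists. A minimal sketch via `update_crawler`; Glue expects the six-field `cron(...)` form shown above:

```python
def schedule_crawler(glue, name, cron_expression):
    """Attach or change a crawler's cron schedule."""
    glue.update_crawler(Name=name, Schedule=cron_expression)

# Usage:
# import boto3
# schedule_crawler(boto3.client('glue'), 'bronze-crawler',
#                  'cron(0 3 * * ? *)')  # daily at 3 AM UTC
```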
Event-Driven
Trigger the crawler when new data arrives:
1. An S3 event notification triggers a Lambda function
2. The Lambda function starts the crawler
# Lambda function triggered by S3 event
import boto3

def handler(event, context):
    glue = boto3.client('glue')
    try:
        glue.start_crawler(Name='bronze-crawler')
    except glue.exceptions.CrawlerRunningException:
        pass  # already running; the in-flight run will pick up the new data
Manual Table Definitions (Without Crawlers)
You do not always need a crawler. You can define tables manually:
Using Athena DDL
CREATE EXTERNAL TABLE datalake_bronze.customers (
id INT,
name STRING,
email STRING,
city STRING
)
STORED AS PARQUET
LOCATION 's3://my-datalake/bronze/customers/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');
With Partitions
CREATE EXTERNAL TABLE datalake_bronze.orders (
order_id INT,
customer_id INT,
amount DOUBLE,
order_date STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS PARQUET
LOCATION 's3://my-datalake/bronze/orders/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');
-- Load partitions
MSCK REPAIR TABLE datalake_bronze.orders;
When to Use Manual vs Crawler
| Scenario | Use Crawler | Use Manual DDL |
|---|---|---|
| Many tables, unknown schemas | Yes | No |
| Schemas change frequently | Yes | No |
| You know the exact schema | No | Yes (more control) |
| CSV with messy headers | No | Yes (you define types correctly) |
| Partitioned data | Either works | Yes (more explicit) |
Data Catalog vs Hive Metastore
The Glue Data Catalog serves the same purpose as the Apache Hive Metastore but is fully managed:
| Feature | Hive Metastore | Glue Data Catalog |
|---|---|---|
| Management | You manage (MySQL/PostgreSQL backend) | Fully managed by AWS |
| Availability | Depends on your setup | Built-in HA |
| Integration | Hive, Spark | Athena, Redshift, Glue, EMR, Lake Formation |
| Cost | You pay for infra | Pay per API call |
| Crawlers | No built-in discovery | Built-in crawlers |
| Access control | Basic | Lake Formation fine-grained access |
On AWS, the Glue Data Catalog has largely replaced self-managed Hive Metastores; EMR and Spark can be configured to use it directly as their metastore.
Glue Data Catalog in the AWS Ecosystem
S3 (Data Lake)
|
|-- Glue Crawler --> Glue Data Catalog (metadata)
| |
| |-- Athena (SQL queries)
| |-- Redshift Spectrum (external tables)
| |-- EMR/Spark (table references)
| |-- Glue ETL (source/sink)
| |-- Lake Formation (access control)
| |-- QuickSight (data discovery)
|
|-- Glue ETL Jobs (transform data)
|-- Glue Workflows (orchestrate ETL)
IAM Permissions for Glue
Crawler IAM Role
The crawler needs an IAM role with:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-datalake",
"arn:aws:s3:::my-datalake/*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:*"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "arn:aws:logs:*:*:*"
}
]
}
Also attach the managed policy: AWSGlueServiceRole.
User Permissions
For a data engineer to use the catalog:
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetTable",
"glue:GetTables",
"glue:GetPartitions",
"glue:CreateTable",
"glue:UpdateTable",
"glue:BatchCreatePartition"
],
"Resource": "*"
}
Cost and Pricing
| Component | Cost |
|---|---|
| Storing metadata | First 1 million objects free, then $1 per 100,000 objects/month |
| API requests | First 1 million free, then $1 per 1 million requests |
| Crawler runtime | $0.44 per DPU-hour (usually runs for a few minutes) |
For most data lakes, the Data Catalog cost is negligible — often under $5/month.
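As a back-of-envelope check, the pricing above can be plugged into a small calculator. The figures are the ones listed in the table (verify against the current Glue pricing page), and `dpus=2` is an illustrative assumption for crawler capacity:

```python
FREE_OBJECTS = 1_000_000       # free tier: first 1M stored objects
FREE_REQUEST_MILLIONS = 1      # free tier: first 1M API requests
PER_100K_OBJECTS = 1.00        # USD per 100,000 objects per month beyond free tier
PER_MILLION_REQUESTS = 1.00    # USD per 1M requests beyond free tier
DPU_HOUR = 0.44                # USD per crawler DPU-hour

def monthly_catalog_cost(objects, request_millions, crawler_minutes, dpus=2):
    """Estimate monthly Data Catalog + crawler cost in USD."""
    storage = max(0, objects - FREE_OBJECTS) / 100_000 * PER_100K_OBJECTS
    requests = max(0, request_millions - FREE_REQUEST_MILLIONS) * PER_MILLION_REQUESTS
    crawling = crawler_minutes / 60 * dpus * DPU_HOUR
    return storage + requests + crawling

# 200 tables well under both free tiers, a 5-minute crawl every day on 2 DPUs:
# monthly_catalog_cost(50_000, 0.5, 5 * 30)  -> 2.20, all of it crawler runtime
```

In this sketch the metadata itself costs nothing at small scale; the crawler runtime dominates, which is why event-driven or scheduled crawls beat constant re-crawling.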
Best Practices
- One database per data layer: datalake_bronze, datalake_silver, datalake_gold
- Use Parquet instead of CSV: the schema is embedded in the file, so crawlers read it exactly
- Use Hive-style partitions (year=2026/month=04/) for automatic partition discovery
- Set crawler schema policy to “Add new columns only” — prevents breaking changes
- Schedule crawlers after pipeline runs — crawl new data promptly
- Use MSCK REPAIR TABLE for adding partitions in between crawler runs
- Name tables consistently — snake_case, match the source system names
- Add table descriptions — help other team members discover and understand data
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| “Database not found” | Database does not exist in the catalog | Create the database first |
| “Access denied” on S3 | Crawler IAM role cannot read S3 | Add s3:GetObject and s3:ListBucket to the role |
| Crawler creates too many tables | Files in different formats in the same folder | Separate files by format or use classifiers |
| Wrong data types for CSV | Crawler guessed types incorrectly | Manually update the table schema in the catalog |
| Missing partitions | Crawler has not run since new data arrived | Run crawler or MSCK REPAIR TABLE |
| Table shows 0 rows in Athena | Partitions not registered | Run MSCK REPAIR TABLE |
Comparison: Glue Catalog vs Azure Equivalent
| Feature | AWS Glue Data Catalog | Azure Equivalent |
|---|---|---|
| Metadata store | Glue Data Catalog | Azure Purview / Synapse Serverless external tables |
| Auto-discovery | Glue Crawler | Purview scanning / manual DDL |
| Query engine | Athena | Synapse Serverless SQL |
| Access control | Lake Formation | Synapse workspace RBAC + ADLS ACLs |
| Cost | Per API call (very cheap) | Per query data scanned (Synapse) |
Interview Questions
Q: What is the AWS Glue Data Catalog? A: A centralized metadata repository that stores information about your data assets — table definitions, column schemas, data locations, formats, and partitions. It does not store actual data. It is used by Athena, Redshift Spectrum, Glue ETL, and EMR to understand the structure of data in S3.
Q: What is a Glue Crawler and why do you need it? A: A crawler is an automated tool that scans data in S3 (or other sources), infers the schema, and creates or updates table definitions in the Data Catalog. Without it, you would have to manually define every table and column.
Q: How does a crawler handle schema changes? A: Based on the configured policy. It can add new columns, update existing types, or just log changes. The safest setting is “Add new columns only” which prevents breaking existing queries.
Q: How do Athena queries use the Data Catalog? A: Athena reads table definitions (location, format, schema, partitions) from the Data Catalog. When you run a query, Athena uses this metadata to find the S3 files, apply the correct deserializer, and return structured results. No data is loaded into Athena — it queries S3 directly.
Q: What is the difference between running a crawler and MSCK REPAIR TABLE? A: A crawler scans files and infers schema from scratch (can add columns, detect format changes). MSCK REPAIR TABLE only discovers new partitions in an existing table without checking schema. MSCK is faster but limited to partition updates.
Q: How is the Glue Data Catalog different from a Hive Metastore? A: Same concept (metadata store for table definitions), but the Glue Data Catalog is fully managed, has built-in crawlers for auto-discovery, integrates natively with Athena/Redshift/EMR, and includes Lake Formation for fine-grained access control. Hive Metastore requires you to manage the backend database yourself.
Wrapping Up
The Glue Data Catalog and Crawlers are the metadata backbone of any AWS data lake. Without them, your S3 data is just files in folders. With them, your data lake becomes a queryable, discoverable, organized catalog that every analytics tool in AWS can use.
The pattern is simple: store data in S3, crawl it with Glue, query it with Athena. Master this, and you have the foundation for any AWS data platform.
Related posts:
- AWS S3 for Data Engineers
- Parquet vs CSV vs JSON
- Schema-on-Write vs Schema-on-Read
- Building a REST API with FastAPI on AWS Lambda
- Python for Data Engineers
If this guide helped you understand the Glue Data Catalog, share it with your team. Questions? Drop a comment below.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.