Building an On-Premises to Azure Data Pipeline with Self-Hosted Integration Runtime
In every previous pipeline on this blog, both source and destination were cloud resources — Azure SQL Database and ADLS Gen2. The default Azure IR handled everything. But in the real world, most enterprise data lives on-premises — behind corporate firewalls, inside private networks, with no public endpoint.
This is where the Self-Hosted Integration Runtime (SHIR) comes in. It is a lightweight Windows agent you install on a machine inside your private network that acts as a bridge between your on-premises data and Azure.
In this guide, I will walk you through building a complete on-premises to Azure data pipeline — from creating a simulated on-premises environment (an Azure VM with SQL Server) to installing SHIR, building the metadata-driven pipeline, and copying data to ADLS Gen2. This is the exact pattern enterprises use for hybrid cloud data migration.
Table of Contents
- Why Self-Hosted IR?
- Architecture: How SHIR Works
- What We Are Building
- Part 1: Create the Simulated On-Premises Environment
- Part 2: Set Up SQL Server and Create Sample Data
- Part 3: Install and Register Self-Hosted IR
- Part 4: Create the Linked Service via SHIR
- Part 5: Create Datasets
- Part 6: Build the Pipeline
- Part 7: Run, Debug, and Verify
- SHIR Under the Hood
- SHIR vs Azure IR: Key Differences
- High Availability with Multiple Nodes
- Network Requirements
- Monitoring and Troubleshooting SHIR
- Cost Considerations
- Production Best Practices
- Interview Questions
- Wrapping Up
Why Self-Hosted IR?
Azure IR can only reach public endpoints — resources accessible over the internet. It cannot access:
- On-premises SQL Server behind a corporate firewall
- Oracle databases in a private data center
- File shares on a company network
- VMs in an Azure VNet without public endpoints
- Any resource requiring VPN or ExpressRoute access
SHIR solves this by sitting inside your private network and creating outbound-only HTTPS connections to Azure. No inbound firewall rules needed — this is a major security advantage.
Architecture: How SHIR Works
On-Premises Network Azure Cloud
+-----------------------------------+ +---------------------------+
| | | |
| SQL Server (source data) | | Synapse Studio |
| | | | (orchestration) |
| v | | | |
| Self-Hosted IR (Windows service) |------> | |
| | | | v |
| +------------------------------------> ADLS Gen2 (sink) |
| | | |
+-----------------------------------+ +---------------------------+
Outbound HTTPS (443)
No inbound ports needed
Here is what happens during a pipeline run:
- You trigger the pipeline in Synapse Studio
- Synapse sends job instructions to the SHIR through Azure Service Bus
- SHIR receives the instructions: “Read from SQL Server, table dbo.Employee”
- SHIR connects to SQL Server locally (same network, no firewall issues)
- SHIR reads the data, compresses and encrypts it
- SHIR pushes the encrypted data directly to ADLS Gen2 (outbound HTTPS)
- SHIR reports back to Synapse: “Done, 1000 rows copied”
The data never passes through the Synapse service itself — it goes directly from SHIR to ADLS Gen2. Synapse only orchestrates.
What We Are Building
A metadata-driven pipeline that copies multiple tables from an on-premises SQL Server to ADLS Gen2 in Parquet format, using the same Lookup, ForEach, Copy pattern from our previous posts — but with a Self-Hosted IR for the source connection.
Pipeline: PL_OnPrem_To_Azure
|
|-- Lookup_Config (reads ConfigTable from on-prem SQL via SHIR)
|-- ForEach_Table
|-- Copy_TableData
|-- Source: On-prem SQL Server (via SHIR)
|-- Sink: ADLS Gen2 Parquet (via Azure IR)
Part 1: Create the Simulated On-Premises Environment
Since you may not have an actual on-premises server, we simulate one using an Azure VM with SQL Server. From the SHIR perspective, it behaves identically to a physical server in a data center.
Create the VM
- Azure Portal > Create a resource > Virtual Machine
- Configure:
- Name:
naveen-sql-vm - Region: Canada Central (same as your Synapse workspace)
- Image: SQL Server 2022 Developer on Windows Server 2022 (search in marketplace)
- Size: Standard_B2s (2 vCPU, 4 GB RAM — cheapest option, approximately $1-2/day)
- Username:
sqladmin - Password: set a strong password
- Public inbound ports: Allow RDP (3389)
- On the SQL Server settings tab:
- SQL connectivity: Public (for initial setup)
- SQL authentication: Enable
- SQL login:
sqladminwith your password - Click Review + Create > Create
Open SQL Server Port
- Go to the VM > Networking > Network settings
- Create port rule > Inbound port rule
- Destination port:
1433, Protocol: TCP, Action: Allow - Name:
Allow_SQL_1433 - Click Add
Connect via RDP
- On the VM overview page, click Connect > RDP
- Download the RDP file and open it
- Login with
sqladminand your password
Part 2: Set Up SQL Server and Create Sample Data
RDP into the VM, open SQL Server Management Studio (SSMS), and run:
Create the Database
CREATE DATABASE OnPremDB;
GO
USE OnPremDB;
Create Config Table
CREATE TABLE ConfigTable (
Id INT IDENTITY(1,1),
TableName VARCHAR(100),
SchemaName VARCHAR(50),
FolderName VARCHAR(200)
);
INSERT INTO ConfigTable (TableName, SchemaName, FolderName) VALUES
('Employees', 'dbo', 'onprem/Employees'),
('Departments', 'dbo', 'onprem/Departments'),
('Sales', 'dbo', 'onprem/Sales');
Create Source Tables with Data
CREATE TABLE Employees (
EmpId INT PRIMARY KEY,
EmpName VARCHAR(100),
Department VARCHAR(50),
Salary INT,
HireDate DATE
);
INSERT INTO Employees VALUES
(1, 'Naveen', 'Engineering', 95000, '2022-03-15'),
(2, 'Priya', 'Sales', 72000, '2023-01-10'),
(3, 'Ravi', 'Engineering', 88000, '2021-08-22'),
(4, 'Anita', 'Marketing', 68000, '2024-02-01'),
(5, 'Vikram', 'Engineering', 91000, '2022-11-30');
CREATE TABLE Departments (
DeptId INT PRIMARY KEY,
DeptName VARCHAR(50),
Location VARCHAR(50),
Manager VARCHAR(100)
);
INSERT INTO Departments VALUES
(1, 'Engineering', 'Toronto', 'Naveen'),
(2, 'Sales', 'Mumbai', 'Priya'),
(3, 'Marketing', 'London', 'Anita'),
(4, 'HR', 'Toronto', 'Sunita');
CREATE TABLE Sales (
SaleId INT PRIMARY KEY,
ProductName VARCHAR(100),
Amount DECIMAL(10,2),
SaleDate DATE,
Region VARCHAR(50)
);
INSERT INTO Sales VALUES
(1, 'Laptop Pro', 1299.99, '2026-01-15', 'North America'),
(2, 'Cloud License', 499.00, '2026-02-10', 'Europe'),
(3, 'Data Platform', 2500.00, '2026-03-05', 'Asia Pacific'),
(4, 'Analytics Suite',899.99, '2026-03-20', 'North America'),
(5, 'Storage Plan', 199.00, '2026-04-01', 'Europe');
Verify
SELECT * FROM ConfigTable; -- 3 rows
SELECT * FROM Employees; -- 5 rows
SELECT * FROM Departments; -- 4 rows
SELECT * FROM Sales; -- 5 rows
Part 3: Install and Register Self-Hosted IR
Create SHIR in Synapse Studio
- Open Synapse Studio > Manage > Integration runtimes
- Click + New > Self-Hosted > Continue
- Name:
SHIR-OnPrem-SQL - Click Create
- Copy Key 1 (authentication key)
Install SHIR on the VM
- RDP into the VM
- Open Edge browser and go to:
https://www.microsoft.com/en-us/download/details.aspx?id=39717 - Download the latest Integration Runtime installer (MSI file)
- Run the installer — click Next through the wizard with defaults
- The Integration Runtime Configuration Manager opens automatically
- Paste the Key 1 from Synapse Studio
- Click Register > Finish
- Wait 1-2 minutes — status should change to Running
Verify in Synapse Studio
Go to Manage > Integration runtimes. Your SHIR-OnPrem-SQL should show status Running with 1 node.
The SHIR runs as a Windows service called Microsoft Integration Runtime Service. It starts automatically when the VM boots.
Part 4: Create the Linked Service via SHIR
LS_OnPrem_SqlServer
- Synapse Studio > Manage > Linked services > + New
- Search for SQL Server (not Azure SQL Database — plain SQL Server)
- Name:
LS_OnPrem_SqlServer - Connect via integration runtime: select
SHIR-OnPrem-SQL - Server name: the VM public IP address (e.g.,
20.xx.xx.xx) orlocalhostif SHIR is on the same machine - Database name:
OnPremDB - Authentication: SQL Authentication
- Username:
sqladmin - Password: your SQL auth password
- Test connection > should succeed > Create
Key Difference from Azure SQL Linked Service
| Property | Azure SQL DB (cloud) | SQL Server via SHIR (on-prem) |
|---|---|---|
| Type | Azure SQL Database | SQL Server |
| Connect via | Azure IR (AutoResolve) | Self-Hosted IR (SHIR-OnPrem-SQL) |
| Server name | server.database.windows.net | IP address or hostname |
| Authentication | SQL Auth or Managed Identity | SQL Auth or Windows Auth |
| Network | Public internet | Private network via SHIR |
Part 5: Create Datasets
A. DS_OnPrem_ConfigTable
- Data > + > Integration dataset > SQL Server
- Name:
DS_OnPrem_ConfigTable - Linked service:
LS_OnPrem_SqlServer - Table:
dbo.ConfigTable
B. DS_OnPrem_SourceTable (Parameterized)
- Data > + > Integration dataset > SQL Server
- Name:
DS_OnPrem_SourceTable - Linked service:
LS_OnPrem_SqlServer - Leave table blank
- Parameters:
SchemaName(String),TableName(String) - Connection tab > Enter manually:
- Schema:
@dataset().SchemaName - Table:
@dataset().TableName
C. DS_ADLS_Sink_Parquet (Reuse or Create)
If you already have this from previous tutorials, reuse it. Otherwise:
- Data > + > Integration dataset > ADLS Gen2 > Parquet
- Name:
DS_ADLS_Sink_Parquet - Linked service: default workspace storage
- Parameters:
FolderName(String) - Connection tab:
- Container:
synapse-workspace(or your container) - Directory:
@dataset().FolderName
Part 6: Build the Pipeline
Create pipeline: PL_OnPrem_To_Azure
A. Lookup_Config
- Drag Lookup activity
- Name:
Lookup_Config - Settings:
- Dataset:
DS_OnPrem_ConfigTable - First row only: UNCHECKED
B. ForEach_Table
- Drag ForEach activity
- Name:
ForEach_Table - Connect: green arrow from Lookup_Config
- Settings:
- Items:
@activity('Lookup_Config').output.value - Sequential: unchecked
C. Inside ForEach — Copy_TableData
- Click pencil icon on ForEach to open inner canvas
- Drag Copy data activity
- Name:
Copy_TableData
Source tab:
– Dataset: DS_OnPrem_SourceTable
– SchemaName: @item().SchemaName
– TableName: @item().TableName
Sink tab:
– Dataset: DS_ADLS_Sink_Parquet
– FolderName: @item().FolderName
Part 7: Run, Debug, and Verify
- Click Validate — fix any errors
- Click Publish all
- Click Debug
Expected Output
Lookup_Config > Succeeded (3 rows)
ForEach_Table > Succeeded
Copy (Employees) > Succeeded (5 rows, via SHIR-OnPrem-SQL)
Copy (Departments) > Succeeded (4 rows, via SHIR-OnPrem-SQL)
Copy (Sales) > Succeeded (5 rows, via SHIR-OnPrem-SQL)
Verify in ADLS Gen2
synapse-workspace/
onprem/
Employees/
part-00000.snappy.parquet
Departments/
part-00000.snappy.parquet
Sales/
part-00000.snappy.parquet
Check the Integration Runtime Used
In the Monitor output, click on a Copy activity. The output shows:
{
"effectiveIntegrationRuntime": "SHIR-OnPrem-SQL",
"executionDetails": [{
"source": {"type": "SqlServer"},
"sink": {"type": "AzureBlobFS"}
}]
}
Notice effectiveIntegrationRuntime shows SHIR-OnPrem-SQL — confirming data was read through the Self-Hosted IR, not the default Azure IR.
SHIR Under the Hood
How Data Flows
1. Synapse sends job instructions to SHIR via Azure Service Bus (outbound poll)
2. SHIR connects to SQL Server locally (same network)
3. SHIR executes: SELECT * FROM dbo.Employees
4. SHIR reads result set into memory
5. SHIR compresses the data (Snappy for Parquet)
6. SHIR encrypts and pushes data to ADLS Gen2 (outbound HTTPS)
7. SHIR reports completion to Synapse
The data goes directly from SHIR to ADLS Gen2. It never passes through the Synapse service. Synapse only orchestrates.
The SHIR Windows Service
After installation, SHIR runs as Microsoft Integration Runtime Service in Windows Services. It starts automatically on boot and runs continuously, polling Azure Service Bus for work.
You can monitor it through the Integration Runtime Configuration Manager desktop app on the VM.
SHIR vs Azure IR: Key Differences
| Feature | Azure IR | Self-Hosted IR |
|---|---|---|
| Managed by | Microsoft | You (install and maintain) |
| Runs on | Azure managed compute | Your Windows machine/VM |
| Reaches | Public endpoints only | Private networks and on-premises |
| Data Flows | Supported | Not supported |
| Cost | Per DIU-hour | Free (but you pay for the machine) |
| Setup | Instant (default exists) | 15-30 minutes (download, install, register) |
| High availability | Built-in | Multi-node (up to 4 machines) |
| Maintenance | Automatic updates | Manual or auto-update (configurable) |
High Availability with Multiple Nodes
For production, install SHIR on multiple machines:
- Install SHIR on Machine 1 using Key 1
- Install SHIR on Machine 2 using the same key
- Both nodes register under the same logical IR name
- If Machine 1 goes down, Machine 2 handles all traffic
Up to 4 nodes per SHIR. The nodes automatically load-balance concurrent copy activities.
Network Requirements
| Direction | Port | Protocol | Purpose |
|---|---|---|---|
| Outbound | 443 | HTTPS | Communication with Azure Service Bus and ADLS Gen2 |
| Outbound | 443 | HTTPS | Azure Resource Manager (registration) |
| Inbound | None | — | No inbound rules required |
| Local | 1433 | TCP | SQL Server (or whatever port your database uses) |
The SHIR only makes outbound connections. This is why it works in heavily restricted corporate networks — outbound HTTPS on port 443 is almost always allowed.
Monitoring and Troubleshooting SHIR
Check SHIR Status
- Synapse Studio: Manage > Integration runtimes > click on SHIR name
- On the VM: Open Integration Runtime Configuration Manager
- Diagnostics: The Configuration Manager has a Diagnostics tab that tests connectivity
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| SHIR shows Offline | VM is stopped or IR service crashed | Start the VM; restart the IR service |
| “Unable to connect to data source” | SQL Server not accessible from SHIR machine | Verify SQL port 1433 is open; test with telnet localhost 1433 |
| “Registration failed” | Wrong authentication key | Copy the key again from Synapse Studio; ensure no extra spaces |
| Slow data transfer | Undersized VM or network bottleneck | Upgrade VM size; check network bandwidth |
| SHIR auto-update failed | No internet during update window | Manually update by downloading latest installer |
Check SHIR Logs
On the VM, SHIR logs are located at:
C:\Program Files\Microsoft Integration Runtime.0\Shared Logs IntegrationRuntime.log
IntegrationRuntimeService.log
Cost Considerations
| Component | Cost | How to Minimize |
|---|---|---|
| Azure VM (B2s) | ~$1-2/day when running | Stop (deallocate) when not using it — only ~$5/month for disk |
| SQL Server Developer | Free | Use Developer edition for learning |
| SHIR software | Free | No charge for the IR itself |
| ADLS Gen2 storage | ~$0.02/GB/month | Minimal for learning data volumes |
Critical tip: Always Stop (Deallocate) the VM from Azure Portal when you are done. A running VM costs money 24/7. A stopped VM only costs for disk storage.
Production Best Practices
-
Install SHIR on a dedicated machine — do not share with application servers. SHIR can be CPU and memory intensive during large data transfers.
-
Use multiple nodes for high availability — minimum 2 nodes in production.
-
Use Windows Authentication instead of SQL Authentication when possible — integrates with Active Directory, more secure.
-
Keep SHIR updated — enable auto-update or schedule regular manual updates.
-
Monitor the SHIR machine — set up alerts for CPU, memory, and disk usage. A maxed-out SHIR machine means slow or failed copies.
-
Use ExpressRoute or VPN for production — instead of public IP, connect the SHIR machine to Azure via ExpressRoute or Site-to-Site VPN for better security and performance.
-
Separate SHIR per environment — Dev SHIR connects to dev sources. Prod SHIR connects to prod sources. Never share.
Interview Questions
Q: What is a Self-Hosted Integration Runtime? A: A Windows agent installed on a machine inside your private network that acts as a bridge between on-premises data sources and Azure Data Factory/Synapse. It makes outbound-only HTTPS connections to Azure, requiring no inbound firewall rules.
Q: When do you need a Self-Hosted IR? A: When your data source is behind a firewall, in a private network, or on-premises. Azure IR can only reach public endpoints. SHIR enables ADF/Synapse to access private data stores.
Q: How does SHIR transfer data? A: SHIR reads data from the local source, compresses and encrypts it, then pushes it directly to the Azure destination (ADLS Gen2, Azure SQL, etc.) via outbound HTTPS. The data never passes through the ADF/Synapse service itself.
Q: Does SHIR support Data Flows? A: No. Data Flows require a Spark cluster, which only Azure IR provides. To transform on-premises data, first copy it to ADLS Gen2 using SHIR, then run a Data Flow using Azure IR.
Q: How do you make SHIR highly available? A: Install SHIR on up to 4 machines using the same authentication key. All nodes register under one logical IR. If one node fails, others continue processing automatically.
Q: What are the network requirements? A: Only outbound HTTPS on port 443. No inbound ports needed. The SHIR machine needs local network access to the source database and outbound internet to Azure.
Wrapping Up
The Self-Hosted Integration Runtime bridges the gap between your private data and the Azure cloud. The pipeline pattern is identical to what you already know — Lookup, ForEach, Copy — the only difference is the linked service uses SHIR instead of Azure IR.
In enterprise projects, SHIR is one of the first things set up because almost every organization has on-premises data that needs to flow to the cloud. Master this pattern, and you can handle any hybrid data integration scenario.
Related posts: – Integration Runtime Types Guide – Metadata-Driven Pipeline in ADF – Synapse Pipeline with Audit Logging – Azure Fundamentals (IAM, Subscriptions) – What is Azure Data Factory?
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.