Building an On-Premises to Azure Data Pipeline with Self-Hosted Integration Runtime

In every previous pipeline on this blog, both source and destination were cloud resources — Azure SQL Database and ADLS Gen2. The default Azure IR handled everything. But in the real world, most enterprise data lives on-premises — behind corporate firewalls, inside private networks, with no public endpoint.

This is where the Self-Hosted Integration Runtime (SHIR) comes in. It is a lightweight Windows agent you install on a machine inside your private network that acts as a bridge between your on-premises data and Azure.

In this guide, I will walk you through building a complete on-premises to Azure data pipeline — from creating a simulated on-premises environment (an Azure VM with SQL Server) to installing SHIR, building the metadata-driven pipeline, and copying data to ADLS Gen2. This is the exact pattern enterprises use for hybrid cloud data migration.

Why Self-Hosted IR?
Architecture: How SHIR Works
What We Are Building
Part 1: Create the Simulated On-Premises Environment
Part 2: Set Up SQL Server and Create Sample Data
Part 3: Install and Register Self-Hosted IR
Part 4: Create the Linked Service via SHIR
Part 5: Create Datasets
Part 6: Build the Pipeline
Part 7: Run, Debug, and Verify
SHIR Under the Hood
SHIR vs Azure IR: Key Differences
High Availability with Multiple Nodes
Network Requirements
Monitoring and Troubleshooting SHIR
Cost Considerations
Production Best Practices
Interview Questions
Wrapping Up

Why Self-Hosted IR?

Azure IR can only reach public endpoints — resources accessible over the internet. It cannot access:

On-premises SQL Server behind a corporate firewall
Oracle databases in a private data center
File shares on a company network
VMs in an Azure VNet without public endpoints
Any resource requiring VPN or ExpressRoute access

SHIR solves this by sitting inside your private network and creating outbound-only HTTPS connections to Azure. No inbound firewall rules needed — this is a major security advantage.

Architecture: How SHIR Works

On-Premises Network                    Azure Cloud
+-----------------------------------+  +---------------------------+
|                                   |  |                           |
|  SQL Server (source data)         |  |  Synapse Studio           |
|       |                           |  |  (orchestration)          |
|       v                           |  |       |                   |
|  Self-Hosted IR (Windows service) |------>   |                   |
|       |                           |  |       v                   |
|       +------------------------------------> ADLS Gen2 (sink)    |
|                                   |  |                           |
+-----------------------------------+  +---------------------------+
         Outbound HTTPS (443)
         No inbound ports needed

Here is what happens during a pipeline run:

You trigger the pipeline in Synapse Studio
Synapse sends job instructions to the SHIR through Azure Service Bus
SHIR receives the instructions: “Read from SQL Server, table dbo.Employee”
SHIR connects to SQL Server locally (same network, no firewall issues)
SHIR reads the data, compresses and encrypts it
SHIR pushes the encrypted data directly to ADLS Gen2 (outbound HTTPS)
SHIR reports back to Synapse: “Done, 1000 rows copied”

The data never passes through the Synapse service itself — it goes directly from SHIR to ADLS Gen2. Synapse only orchestrates.

What We Are Building

A metadata-driven pipeline that copies multiple tables from an on-premises SQL Server to ADLS Gen2 in Parquet format, using the same Lookup, ForEach, Copy pattern from our previous posts — but with a Self-Hosted IR for the source connection.

Pipeline: PL_OnPrem_To_Azure
|
|-- Lookup_Config (reads ConfigTable from on-prem SQL via SHIR)
|-- ForEach_Table
    |-- Copy_TableData
        |-- Source: On-prem SQL Server (via SHIR)
        |-- Sink: ADLS Gen2 Parquet (via Azure IR)

Part 1: Create the Simulated On-Premises Environment

Since you may not have an actual on-premises server, we simulate one using an Azure VM with SQL Server. From the SHIR perspective, it behaves identically to a physical server in a data center.

Create the VM

Azure Portal > Create a resource > Virtual Machine
Configure:
Name: naveen-sql-vm
Region: Canada Central (same as your Synapse workspace)
Image: SQL Server 2022 Developer on Windows Server 2022 (search in marketplace)
Size: Standard_B2s (2 vCPU, 4 GB RAM — cheapest option, approximately $1-2/day)
Username: sqladmin
Password: set a strong password
Public inbound ports: Allow RDP (3389)
On the SQL Server settings tab:
SQL connectivity: Public (for initial setup)
SQL authentication: Enable
SQL login: sqladmin with your password
Click Review + Create > Create

Open SQL Server Port

Go to the VM > Networking > Network settings
Create port rule > Inbound port rule
Destination port: 1433, Protocol: TCP, Action: Allow
Name: Allow_SQL_1433
Click Add

Connect via RDP

On the VM overview page, click Connect > RDP
Download the RDP file and open it
Login with sqladmin and your password

Part 2: Set Up SQL Server and Create Sample Data

RDP into the VM, open SQL Server Management Studio (SSMS), and run:

Create the Database

CREATE DATABASE OnPremDB;
GO
USE OnPremDB;

Create Config Table

CREATE TABLE ConfigTable (
    Id              INT IDENTITY(1,1),
    TableName       VARCHAR(100),
    SchemaName      VARCHAR(50),
    FolderName      VARCHAR(200)
);

INSERT INTO ConfigTable (TableName, SchemaName, FolderName) VALUES
('Employees',   'dbo', 'onprem/Employees'),
('Departments', 'dbo', 'onprem/Departments'),
('Sales',       'dbo', 'onprem/Sales');

Create Source Tables with Data

CREATE TABLE Employees (
    EmpId       INT PRIMARY KEY,
    EmpName     VARCHAR(100),
    Department  VARCHAR(50),
    Salary      INT,
    HireDate    DATE
);

INSERT INTO Employees VALUES
(1, 'Naveen',  'Engineering', 95000, '2022-03-15'),
(2, 'Priya',   'Sales',       72000, '2023-01-10'),
(3, 'Ravi',    'Engineering', 88000, '2021-08-22'),
(4, 'Anita',   'Marketing',   68000, '2024-02-01'),
(5, 'Vikram',  'Engineering', 91000, '2022-11-30');

CREATE TABLE Departments (
    DeptId      INT PRIMARY KEY,
    DeptName    VARCHAR(50),
    Location    VARCHAR(50),
    Manager     VARCHAR(100)
);

INSERT INTO Departments VALUES
(1, 'Engineering', 'Toronto',  'Naveen'),
(2, 'Sales',       'Mumbai',   'Priya'),
(3, 'Marketing',   'London',   'Anita'),
(4, 'HR',          'Toronto',  'Sunita');

CREATE TABLE Sales (
    SaleId      INT PRIMARY KEY,
    ProductName VARCHAR(100),
    Amount      DECIMAL(10,2),
    SaleDate    DATE,
    Region      VARCHAR(50)
);

INSERT INTO Sales VALUES
(1, 'Laptop Pro',    1299.99, '2026-01-15', 'North America'),
(2, 'Cloud License', 499.00,  '2026-02-10', 'Europe'),
(3, 'Data Platform', 2500.00, '2026-03-05', 'Asia Pacific'),
(4, 'Analytics Suite',899.99, '2026-03-20', 'North America'),
(5, 'Storage Plan',   199.00, '2026-04-01', 'Europe');

Verify

SELECT * FROM ConfigTable;    -- 3 rows
SELECT * FROM Employees;      -- 5 rows
SELECT * FROM Departments;    -- 4 rows
SELECT * FROM Sales;          -- 5 rows

Part 3: Install and Register Self-Hosted IR

Create SHIR in Synapse Studio

Open Synapse Studio > Manage > Integration runtimes
Click + New > Self-Hosted > Continue
Name: SHIR-OnPrem-SQL
Click Create
Copy Key 1 (authentication key)

Install SHIR on the VM

RDP into the VM
Open Edge browser and go to: https://www.microsoft.com/en-us/download/details.aspx?id=39717
Download the latest Integration Runtime installer (MSI file)
Run the installer — click Next through the wizard with defaults
The Integration Runtime Configuration Manager opens automatically
Paste the Key 1 from Synapse Studio
Click Register > Finish
Wait 1-2 minutes — status should change to Running

Verify in Synapse Studio

Go to Manage > Integration runtimes. Your SHIR-OnPrem-SQL should show status Running with 1 node.

The SHIR runs as a Windows service called Microsoft Integration Runtime Service. It starts automatically when the VM boots.

Part 4: Create the Linked Service via SHIR

LS_OnPrem_SqlServer

Synapse Studio > Manage > Linked services > + New
Search for SQL Server (not Azure SQL Database — plain SQL Server)
Name: LS_OnPrem_SqlServer
Connect via integration runtime: select SHIR-OnPrem-SQL
Server name: the VM public IP address (e.g., 20.xx.xx.xx) or localhost if SHIR is on the same machine
Database name: OnPremDB
Authentication: SQL Authentication
Username: sqladmin
Password: your SQL auth password
Test connection > should succeed > Create

Key Difference from Azure SQL Linked Service

Property	Azure SQL DB (cloud)	SQL Server via SHIR (on-prem)
Type	Azure SQL Database	SQL Server
Connect via	Azure IR (AutoResolve)	Self-Hosted IR (SHIR-OnPrem-SQL)
Server name	server.database.windows.net	IP address or hostname
Authentication	SQL Auth or Managed Identity	SQL Auth or Windows Auth
Network	Public internet	Private network via SHIR

Part 5: Create Datasets

A. DS_OnPrem_ConfigTable

Data > + > Integration dataset > SQL Server
Name: DS_OnPrem_ConfigTable
Linked service: LS_OnPrem_SqlServer
Table: dbo.ConfigTable

B. DS_OnPrem_SourceTable (Parameterized)

Data > + > Integration dataset > SQL Server
Name: DS_OnPrem_SourceTable
Linked service: LS_OnPrem_SqlServer
Leave table blank
Parameters: SchemaName (String), TableName (String)
Connection tab > Enter manually:
Schema: @dataset().SchemaName
Table: @dataset().TableName

C. DS_ADLS_Sink_Parquet (Reuse or Create)

If you already have this from previous tutorials, reuse it. Otherwise:

Data > + > Integration dataset > ADLS Gen2 > Parquet
Name: DS_ADLS_Sink_Parquet
Linked service: default workspace storage
Parameters: FolderName (String)
Connection tab:
Container: synapse-workspace (or your container)
Directory: @dataset().FolderName

Part 6: Build the Pipeline

Create pipeline: PL_OnPrem_To_Azure

A. Lookup_Config

Drag Lookup activity
Name: Lookup_Config
Settings:
Dataset: DS_OnPrem_ConfigTable
First row only: UNCHECKED

B. ForEach_Table

Drag ForEach activity
Name: ForEach_Table
Connect: green arrow from Lookup_Config
Settings:
Items: @activity('Lookup_Config').output.value
Sequential: unchecked

C. Inside ForEach — Copy_TableData

Click pencil icon on ForEach to open inner canvas
Drag Copy data activity
Name: Copy_TableData

Source tab: – Dataset: DS_OnPrem_SourceTable – SchemaName: @item().SchemaName – TableName: @item().TableName

Sink tab: – Dataset: DS_ADLS_Sink_Parquet – FolderName: @item().FolderName

Part 7: Run, Debug, and Verify

Click Validate — fix any errors
Click Publish all
Click Debug

Expected Output

Lookup_Config        > Succeeded (3 rows)
ForEach_Table        > Succeeded
  Copy (Employees)   > Succeeded (5 rows, via SHIR-OnPrem-SQL)
  Copy (Departments) > Succeeded (4 rows, via SHIR-OnPrem-SQL)
  Copy (Sales)       > Succeeded (5 rows, via SHIR-OnPrem-SQL)

Verify in ADLS Gen2

synapse-workspace/
  onprem/
    Employees/
      part-00000.snappy.parquet
    Departments/
      part-00000.snappy.parquet
    Sales/
      part-00000.snappy.parquet

Check the Integration Runtime Used

In the Monitor output, click on a Copy activity. The output shows:

{
    "effectiveIntegrationRuntime": "SHIR-OnPrem-SQL",
    "executionDetails": [{
        "source": {"type": "SqlServer"},
        "sink": {"type": "AzureBlobFS"}
    }]
}

Notice effectiveIntegrationRuntime shows SHIR-OnPrem-SQL — confirming data was read through the Self-Hosted IR, not the default Azure IR.

SHIR Under the Hood

How Data Flows

1. Synapse sends job instructions to SHIR via Azure Service Bus (outbound poll)
2. SHIR connects to SQL Server locally (same network)
3. SHIR executes: SELECT * FROM dbo.Employees
4. SHIR reads result set into memory
5. SHIR compresses the data (Snappy for Parquet)
6. SHIR encrypts and pushes data to ADLS Gen2 (outbound HTTPS)
7. SHIR reports completion to Synapse

The data goes directly from SHIR to ADLS Gen2. It never passes through the Synapse service. Synapse only orchestrates.

The SHIR Windows Service

After installation, SHIR runs as Microsoft Integration Runtime Service in Windows Services. It starts automatically on boot and runs continuously, polling Azure Service Bus for work.

You can monitor it through the Integration Runtime Configuration Manager desktop app on the VM.

SHIR vs Azure IR: Key Differences

Feature	Azure IR	Self-Hosted IR
Managed by	Microsoft	You (install and maintain)
Runs on	Azure managed compute	Your Windows machine/VM
Reaches	Public endpoints only	Private networks and on-premises
Data Flows	Supported	Not supported
Cost	Per DIU-hour	Free (but you pay for the machine)
Setup	Instant (default exists)	15-30 minutes (download, install, register)
High availability	Built-in	Multi-node (up to 4 machines)
Maintenance	Automatic updates	Manual or auto-update (configurable)

High Availability with Multiple Nodes

For production, install SHIR on multiple machines:

Install SHIR on Machine 1 using Key 1
Install SHIR on Machine 2 using the same key
Both nodes register under the same logical IR name
If Machine 1 goes down, Machine 2 handles all traffic

Up to 4 nodes per SHIR. The nodes automatically load-balance concurrent copy activities.

Network Requirements

Direction	Port	Protocol	Purpose
Outbound	443	HTTPS	Communication with Azure Service Bus and ADLS Gen2
Outbound	443	HTTPS	Azure Resource Manager (registration)
Inbound	None	—	No inbound rules required
Local	1433	TCP	SQL Server (or whatever port your database uses)

The SHIR only makes outbound connections. This is why it works in heavily restricted corporate networks — outbound HTTPS on port 443 is almost always allowed.

Monitoring and Troubleshooting SHIR

Check SHIR Status

Synapse Studio: Manage > Integration runtimes > click on SHIR name
On the VM: Open Integration Runtime Configuration Manager
Diagnostics: The Configuration Manager has a Diagnostics tab that tests connectivity

Common Issues

Issue	Cause	Fix
SHIR shows Offline	VM is stopped or IR service crashed	Start the VM; restart the IR service
“Unable to connect to data source”	SQL Server not accessible from SHIR machine	Verify SQL port 1433 is open; test with `telnet localhost 1433`
“Registration failed”	Wrong authentication key	Copy the key again from Synapse Studio; ensure no extra spaces
Slow data transfer	Undersized VM or network bottleneck	Upgrade VM size; check network bandwidth
SHIR auto-update failed	No internet during update window	Manually update by downloading latest installer

Check SHIR Logs

On the VM, SHIR logs are located at:

C:\Program Files\Microsoft Integration Runtime.0\Shared  Logs    IntegrationRuntime.log
    IntegrationRuntimeService.log

Cost Considerations

Component	Cost	How to Minimize
Azure VM (B2s)	~$1-2/day when running	Stop (deallocate) when not using it — only ~$5/month for disk
SQL Server Developer	Free	Use Developer edition for learning
SHIR software	Free	No charge for the IR itself
ADLS Gen2 storage	~$0.02/GB/month	Minimal for learning data volumes

Critical tip: Always Stop (Deallocate) the VM from Azure Portal when you are done. A running VM costs money 24/7. A stopped VM only costs for disk storage.

Production Best Practices

Install SHIR on a dedicated machine — do not share with application servers. SHIR can be CPU and memory intensive during large data transfers.
Use multiple nodes for high availability — minimum 2 nodes in production.
Use Windows Authentication instead of SQL Authentication when possible — integrates with Active Directory, more secure.
Keep SHIR updated — enable auto-update or schedule regular manual updates.
Monitor the SHIR machine — set up alerts for CPU, memory, and disk usage. A maxed-out SHIR machine means slow or failed copies.
Use ExpressRoute or VPN for production — instead of public IP, connect the SHIR machine to Azure via ExpressRoute or Site-to-Site VPN for better security and performance.
Separate SHIR per environment — Dev SHIR connects to dev sources. Prod SHIR connects to prod sources. Never share.

Interview Questions

Q: What is a Self-Hosted Integration Runtime? A: A Windows agent installed on a machine inside your private network that acts as a bridge between on-premises data sources and Azure Data Factory/Synapse. It makes outbound-only HTTPS connections to Azure, requiring no inbound firewall rules.

Q: When do you need a Self-Hosted IR? A: When your data source is behind a firewall, in a private network, or on-premises. Azure IR can only reach public endpoints. SHIR enables ADF/Synapse to access private data stores.

Q: How does SHIR transfer data? A: SHIR reads data from the local source, compresses and encrypts it, then pushes it directly to the Azure destination (ADLS Gen2, Azure SQL, etc.) via outbound HTTPS. The data never passes through the ADF/Synapse service itself.

Q: Does SHIR support Data Flows? A: No. Data Flows require a Spark cluster, which only Azure IR provides. To transform on-premises data, first copy it to ADLS Gen2 using SHIR, then run a Data Flow using Azure IR.

Q: How do you make SHIR highly available? A: Install SHIR on up to 4 machines using the same authentication key. All nodes register under one logical IR. If one node fails, others continue processing automatically.

Q: What are the network requirements? A: Only outbound HTTPS on port 443. No inbound ports needed. The SHIR machine needs local network access to the source database and outbound internet to Azure.

Wrapping Up

The Self-Hosted Integration Runtime bridges the gap between your private data and the Azure cloud. The pipeline pattern is identical to what you already know — Lookup, ForEach, Copy — the only difference is the linked service uses SHIR instead of Azure IR.

In enterprise projects, SHIR is one of the first things set up because almost every organization has on-premises data that needs to flow to the cloud. Master this pattern, and you can handle any hybrid data integration scenario.

Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.