Building an On-Premises to Azure Data Pipeline with Self-Hosted Integration Runtime

Building an On-Premises to Azure Data Pipeline with Self-Hosted Integration Runtime

In every previous pipeline on this blog, both source and destination were cloud resources — Azure SQL Database and ADLS Gen2. The default Azure IR handled everything. But in the real world, most enterprise data lives on-premises — behind corporate firewalls, inside private networks, with no public endpoint.

This is where the Self-Hosted Integration Runtime (SHIR) comes in. It is a lightweight Windows agent you install on a machine inside your private network that acts as a bridge between your on-premises data and Azure.

In this guide, I will walk you through building a complete on-premises to Azure data pipeline — from creating a simulated on-premises environment (an Azure VM with SQL Server) to installing SHIR, building the metadata-driven pipeline, and copying data to ADLS Gen2. This is the exact pattern enterprises use for hybrid cloud data migration.

Table of Contents

  • Why Self-Hosted IR?
  • Architecture: How SHIR Works
  • What We Are Building
  • Part 1: Create the Simulated On-Premises Environment
  • Part 2: Set Up SQL Server and Create Sample Data
  • Part 3: Install and Register Self-Hosted IR
  • Part 4: Create the Linked Service via SHIR
  • Part 5: Create Datasets
  • Part 6: Build the Pipeline
  • Part 7: Run, Debug, and Verify
  • SHIR Under the Hood
  • SHIR vs Azure IR: Key Differences
  • High Availability with Multiple Nodes
  • Network Requirements
  • Monitoring and Troubleshooting SHIR
  • Cost Considerations
  • Production Best Practices
  • Interview Questions
  • Wrapping Up

Why Self-Hosted IR?

Azure IR can only reach public endpoints — resources accessible over the internet. It cannot access:

  • On-premises SQL Server behind a corporate firewall
  • Oracle databases in a private data center
  • File shares on a company network
  • VMs in an Azure VNet without public endpoints
  • Any resource requiring VPN or ExpressRoute access

SHIR solves this by sitting inside your private network and creating outbound-only HTTPS connections to Azure. No inbound firewall rules needed — this is a major security advantage.

Architecture: How SHIR Works

On-Premises Network                    Azure Cloud
+-----------------------------------+  +---------------------------+
|                                   |  |                           |
|  SQL Server (source data)         |  |  Synapse Studio           |
|       |                           |  |  (orchestration)          |
|       v                           |  |       |                   |
|  Self-Hosted IR (Windows service) |------>   |                   |
|       |                           |  |       v                   |
|       +------------------------------------> ADLS Gen2 (sink)    |
|                                   |  |                           |
+-----------------------------------+  +---------------------------+
         Outbound HTTPS (443)
         No inbound ports needed

Here is what happens during a pipeline run:

  1. You trigger the pipeline in Synapse Studio
  2. Synapse sends job instructions to the SHIR through Azure Service Bus
  3. SHIR receives the instructions: “Read from SQL Server, table dbo.Employee”
  4. SHIR connects to SQL Server locally (same network, no firewall issues)
  5. SHIR reads the data, compresses and encrypts it
  6. SHIR pushes the encrypted data directly to ADLS Gen2 (outbound HTTPS)
  7. SHIR reports back to Synapse: “Done, 1000 rows copied”

The data never passes through the Synapse service itself — it goes directly from SHIR to ADLS Gen2. Synapse only orchestrates.

What We Are Building

A metadata-driven pipeline that copies multiple tables from an on-premises SQL Server to ADLS Gen2 in Parquet format, using the same Lookup, ForEach, Copy pattern from our previous posts — but with a Self-Hosted IR for the source connection.

Pipeline: PL_OnPrem_To_Azure
|
|-- Lookup_Config (reads ConfigTable from on-prem SQL via SHIR)
|-- ForEach_Table
    |-- Copy_TableData
        |-- Source: On-prem SQL Server (via SHIR)
        |-- Sink: ADLS Gen2 Parquet (via Azure IR)

Part 1: Create the Simulated On-Premises Environment

Since you may not have an actual on-premises server, we simulate one using an Azure VM with SQL Server. From the SHIR perspective, it behaves identically to a physical server in a data center.

Create the VM

  1. Azure Portal > Create a resource > Virtual Machine
  2. Configure:
  3. Name: naveen-sql-vm
  4. Region: Canada Central (same as your Synapse workspace)
  5. Image: SQL Server 2022 Developer on Windows Server 2022 (search in marketplace)
  6. Size: Standard_B2s (2 vCPU, 4 GB RAM — cheapest option, approximately $1-2/day)
  7. Username: sqladmin
  8. Password: set a strong password
  9. Public inbound ports: Allow RDP (3389)
  10. On the SQL Server settings tab:
  11. SQL connectivity: Public (for initial setup)
  12. SQL authentication: Enable
  13. SQL login: sqladmin with your password
  14. Click Review + Create > Create

Open SQL Server Port

  1. Go to the VM > Networking > Network settings
  2. Create port rule > Inbound port rule
  3. Destination port: 1433, Protocol: TCP, Action: Allow
  4. Name: Allow_SQL_1433
  5. Click Add

Connect via RDP

  1. On the VM overview page, click Connect > RDP
  2. Download the RDP file and open it
  3. Login with sqladmin and your password

Part 2: Set Up SQL Server and Create Sample Data

RDP into the VM, open SQL Server Management Studio (SSMS), and run:

Create the Database

CREATE DATABASE OnPremDB;
GO
USE OnPremDB;

Create Config Table

CREATE TABLE ConfigTable (
    Id              INT IDENTITY(1,1),
    TableName       VARCHAR(100),
    SchemaName      VARCHAR(50),
    FolderName      VARCHAR(200)
);

INSERT INTO ConfigTable (TableName, SchemaName, FolderName) VALUES
('Employees',   'dbo', 'onprem/Employees'),
('Departments', 'dbo', 'onprem/Departments'),
('Sales',       'dbo', 'onprem/Sales');

Create Source Tables with Data

CREATE TABLE Employees (
    EmpId       INT PRIMARY KEY,
    EmpName     VARCHAR(100),
    Department  VARCHAR(50),
    Salary      INT,
    HireDate    DATE
);

INSERT INTO Employees VALUES
(1, 'Naveen',  'Engineering', 95000, '2022-03-15'),
(2, 'Priya',   'Sales',       72000, '2023-01-10'),
(3, 'Ravi',    'Engineering', 88000, '2021-08-22'),
(4, 'Anita',   'Marketing',   68000, '2024-02-01'),
(5, 'Vikram',  'Engineering', 91000, '2022-11-30');

CREATE TABLE Departments (
    DeptId      INT PRIMARY KEY,
    DeptName    VARCHAR(50),
    Location    VARCHAR(50),
    Manager     VARCHAR(100)
);

INSERT INTO Departments VALUES
(1, 'Engineering', 'Toronto',  'Naveen'),
(2, 'Sales',       'Mumbai',   'Priya'),
(3, 'Marketing',   'London',   'Anita'),
(4, 'HR',          'Toronto',  'Sunita');

CREATE TABLE Sales (
    SaleId      INT PRIMARY KEY,
    ProductName VARCHAR(100),
    Amount      DECIMAL(10,2),
    SaleDate    DATE,
    Region      VARCHAR(50)
);

INSERT INTO Sales VALUES
(1, 'Laptop Pro',    1299.99, '2026-01-15', 'North America'),
(2, 'Cloud License', 499.00,  '2026-02-10', 'Europe'),
(3, 'Data Platform', 2500.00, '2026-03-05', 'Asia Pacific'),
(4, 'Analytics Suite',899.99, '2026-03-20', 'North America'),
(5, 'Storage Plan',   199.00, '2026-04-01', 'Europe');

Verify

SELECT * FROM ConfigTable;    -- 3 rows
SELECT * FROM Employees;      -- 5 rows
SELECT * FROM Departments;    -- 4 rows
SELECT * FROM Sales;          -- 5 rows

Part 3: Install and Register Self-Hosted IR

Create SHIR in Synapse Studio

  1. Open Synapse Studio > Manage > Integration runtimes
  2. Click + New > Self-Hosted > Continue
  3. Name: SHIR-OnPrem-SQL
  4. Click Create
  5. Copy Key 1 (authentication key)

Install SHIR on the VM

  1. RDP into the VM
  2. Open Edge browser and go to: https://www.microsoft.com/en-us/download/details.aspx?id=39717
  3. Download the latest Integration Runtime installer (MSI file)
  4. Run the installer — click Next through the wizard with defaults
  5. The Integration Runtime Configuration Manager opens automatically
  6. Paste the Key 1 from Synapse Studio
  7. Click Register > Finish
  8. Wait 1-2 minutes — status should change to Running

Verify in Synapse Studio

Go to Manage > Integration runtimes. Your SHIR-OnPrem-SQL should show status Running with 1 node.

The SHIR runs as a Windows service called Microsoft Integration Runtime Service. It starts automatically when the VM boots.

Part 4: Create the Linked Service via SHIR

LS_OnPrem_SqlServer

  1. Synapse Studio > Manage > Linked services > + New
  2. Search for SQL Server (not Azure SQL Database — plain SQL Server)
  3. Name: LS_OnPrem_SqlServer
  4. Connect via integration runtime: select SHIR-OnPrem-SQL
  5. Server name: the VM public IP address (e.g., 20.xx.xx.xx) or localhost if SHIR is on the same machine
  6. Database name: OnPremDB
  7. Authentication: SQL Authentication
  8. Username: sqladmin
  9. Password: your SQL auth password
  10. Test connection > should succeed > Create

Key Difference from Azure SQL Linked Service

Property Azure SQL DB (cloud) SQL Server via SHIR (on-prem)
Type Azure SQL Database SQL Server
Connect via Azure IR (AutoResolve) Self-Hosted IR (SHIR-OnPrem-SQL)
Server name server.database.windows.net IP address or hostname
Authentication SQL Auth or Managed Identity SQL Auth or Windows Auth
Network Public internet Private network via SHIR

Part 5: Create Datasets

A. DS_OnPrem_ConfigTable

  1. Data > + > Integration dataset > SQL Server
  2. Name: DS_OnPrem_ConfigTable
  3. Linked service: LS_OnPrem_SqlServer
  4. Table: dbo.ConfigTable

B. DS_OnPrem_SourceTable (Parameterized)

  1. Data > + > Integration dataset > SQL Server
  2. Name: DS_OnPrem_SourceTable
  3. Linked service: LS_OnPrem_SqlServer
  4. Leave table blank
  5. Parameters: SchemaName (String), TableName (String)
  6. Connection tab > Enter manually:
  7. Schema: @dataset().SchemaName
  8. Table: @dataset().TableName

C. DS_ADLS_Sink_Parquet (Reuse or Create)

If you already have this from previous tutorials, reuse it. Otherwise:

  1. Data > + > Integration dataset > ADLS Gen2 > Parquet
  2. Name: DS_ADLS_Sink_Parquet
  3. Linked service: default workspace storage
  4. Parameters: FolderName (String)
  5. Connection tab:
  6. Container: synapse-workspace (or your container)
  7. Directory: @dataset().FolderName

Part 6: Build the Pipeline

Create pipeline: PL_OnPrem_To_Azure

A. Lookup_Config

  1. Drag Lookup activity
  2. Name: Lookup_Config
  3. Settings:
  4. Dataset: DS_OnPrem_ConfigTable
  5. First row only: UNCHECKED

B. ForEach_Table

  1. Drag ForEach activity
  2. Name: ForEach_Table
  3. Connect: green arrow from Lookup_Config
  4. Settings:
  5. Items: @activity('Lookup_Config').output.value
  6. Sequential: unchecked

C. Inside ForEach — Copy_TableData

  1. Click pencil icon on ForEach to open inner canvas
  2. Drag Copy data activity
  3. Name: Copy_TableData

Source tab: – Dataset: DS_OnPrem_SourceTable – SchemaName: @item().SchemaName – TableName: @item().TableName

Sink tab: – Dataset: DS_ADLS_Sink_Parquet – FolderName: @item().FolderName

Part 7: Run, Debug, and Verify

  1. Click Validate — fix any errors
  2. Click Publish all
  3. Click Debug

Expected Output

Lookup_Config        > Succeeded (3 rows)
ForEach_Table        > Succeeded
  Copy (Employees)   > Succeeded (5 rows, via SHIR-OnPrem-SQL)
  Copy (Departments) > Succeeded (4 rows, via SHIR-OnPrem-SQL)
  Copy (Sales)       > Succeeded (5 rows, via SHIR-OnPrem-SQL)

Verify in ADLS Gen2

synapse-workspace/
  onprem/
    Employees/
      part-00000.snappy.parquet
    Departments/
      part-00000.snappy.parquet
    Sales/
      part-00000.snappy.parquet

Check the Integration Runtime Used

In the Monitor output, click on a Copy activity. The output shows:

{
    "effectiveIntegrationRuntime": "SHIR-OnPrem-SQL",
    "executionDetails": [{
        "source": {"type": "SqlServer"},
        "sink": {"type": "AzureBlobFS"}
    }]
}

Notice effectiveIntegrationRuntime shows SHIR-OnPrem-SQL — confirming data was read through the Self-Hosted IR, not the default Azure IR.

SHIR Under the Hood

How Data Flows

1. Synapse sends job instructions to SHIR via Azure Service Bus (outbound poll)
2. SHIR connects to SQL Server locally (same network)
3. SHIR executes: SELECT * FROM dbo.Employees
4. SHIR reads result set into memory
5. SHIR compresses the data (Snappy for Parquet)
6. SHIR encrypts and pushes data to ADLS Gen2 (outbound HTTPS)
7. SHIR reports completion to Synapse

The data goes directly from SHIR to ADLS Gen2. It never passes through the Synapse service. Synapse only orchestrates.

The SHIR Windows Service

After installation, SHIR runs as Microsoft Integration Runtime Service in Windows Services. It starts automatically on boot and runs continuously, polling Azure Service Bus for work.

You can monitor it through the Integration Runtime Configuration Manager desktop app on the VM.

SHIR vs Azure IR: Key Differences

Feature Azure IR Self-Hosted IR
Managed by Microsoft You (install and maintain)
Runs on Azure managed compute Your Windows machine/VM
Reaches Public endpoints only Private networks and on-premises
Data Flows Supported Not supported
Cost Per DIU-hour Free (but you pay for the machine)
Setup Instant (default exists) 15-30 minutes (download, install, register)
High availability Built-in Multi-node (up to 4 machines)
Maintenance Automatic updates Manual or auto-update (configurable)

High Availability with Multiple Nodes

For production, install SHIR on multiple machines:

  1. Install SHIR on Machine 1 using Key 1
  2. Install SHIR on Machine 2 using the same key
  3. Both nodes register under the same logical IR name
  4. If Machine 1 goes down, Machine 2 handles all traffic

Up to 4 nodes per SHIR. The nodes automatically load-balance concurrent copy activities.

Network Requirements

Direction Port Protocol Purpose
Outbound 443 HTTPS Communication with Azure Service Bus and ADLS Gen2
Outbound 443 HTTPS Azure Resource Manager (registration)
Inbound None No inbound rules required
Local 1433 TCP SQL Server (or whatever port your database uses)

The SHIR only makes outbound connections. This is why it works in heavily restricted corporate networks — outbound HTTPS on port 443 is almost always allowed.

Monitoring and Troubleshooting SHIR

Check SHIR Status

  • Synapse Studio: Manage > Integration runtimes > click on SHIR name
  • On the VM: Open Integration Runtime Configuration Manager
  • Diagnostics: The Configuration Manager has a Diagnostics tab that tests connectivity

Common Issues

Issue Cause Fix
SHIR shows Offline VM is stopped or IR service crashed Start the VM; restart the IR service
“Unable to connect to data source” SQL Server not accessible from SHIR machine Verify SQL port 1433 is open; test with telnet localhost 1433
“Registration failed” Wrong authentication key Copy the key again from Synapse Studio; ensure no extra spaces
Slow data transfer Undersized VM or network bottleneck Upgrade VM size; check network bandwidth
SHIR auto-update failed No internet during update window Manually update by downloading latest installer

Check SHIR Logs

On the VM, SHIR logs are located at:

C:\Program Files\Microsoft Integration Runtime.0\Shared  Logs    IntegrationRuntime.log
    IntegrationRuntimeService.log

Cost Considerations

Component Cost How to Minimize
Azure VM (B2s) ~$1-2/day when running Stop (deallocate) when not using it — only ~$5/month for disk
SQL Server Developer Free Use Developer edition for learning
SHIR software Free No charge for the IR itself
ADLS Gen2 storage ~$0.02/GB/month Minimal for learning data volumes

Critical tip: Always Stop (Deallocate) the VM from Azure Portal when you are done. A running VM costs money 24/7. A stopped VM only costs for disk storage.

Production Best Practices

  1. Install SHIR on a dedicated machine — do not share with application servers. SHIR can be CPU and memory intensive during large data transfers.

  2. Use multiple nodes for high availability — minimum 2 nodes in production.

  3. Use Windows Authentication instead of SQL Authentication when possible — integrates with Active Directory, more secure.

  4. Keep SHIR updated — enable auto-update or schedule regular manual updates.

  5. Monitor the SHIR machine — set up alerts for CPU, memory, and disk usage. A maxed-out SHIR machine means slow or failed copies.

  6. Use ExpressRoute or VPN for production — instead of public IP, connect the SHIR machine to Azure via ExpressRoute or Site-to-Site VPN for better security and performance.

  7. Separate SHIR per environment — Dev SHIR connects to dev sources. Prod SHIR connects to prod sources. Never share.

Interview Questions

Q: What is a Self-Hosted Integration Runtime? A: A Windows agent installed on a machine inside your private network that acts as a bridge between on-premises data sources and Azure Data Factory/Synapse. It makes outbound-only HTTPS connections to Azure, requiring no inbound firewall rules.

Q: When do you need a Self-Hosted IR? A: When your data source is behind a firewall, in a private network, or on-premises. Azure IR can only reach public endpoints. SHIR enables ADF/Synapse to access private data stores.

Q: How does SHIR transfer data? A: SHIR reads data from the local source, compresses and encrypts it, then pushes it directly to the Azure destination (ADLS Gen2, Azure SQL, etc.) via outbound HTTPS. The data never passes through the ADF/Synapse service itself.

Q: Does SHIR support Data Flows? A: No. Data Flows require a Spark cluster, which only Azure IR provides. To transform on-premises data, first copy it to ADLS Gen2 using SHIR, then run a Data Flow using Azure IR.

Q: How do you make SHIR highly available? A: Install SHIR on up to 4 machines using the same authentication key. All nodes register under one logical IR. If one node fails, others continue processing automatically.

Q: What are the network requirements? A: Only outbound HTTPS on port 443. No inbound ports needed. The SHIR machine needs local network access to the source database and outbound internet to Azure.

Wrapping Up

The Self-Hosted Integration Runtime bridges the gap between your private data and the Azure cloud. The pipeline pattern is identical to what you already know — Lookup, ForEach, Copy — the only difference is the linked service uses SHIR instead of Azure IR.

In enterprise projects, SHIR is one of the first things set up because almost every organization has on-premises data that needs to flow to the cloud. Master this pattern, and you can handle any hybrid data integration scenario.

Related posts:Integration Runtime Types GuideMetadata-Driven Pipeline in ADFSynapse Pipeline with Audit LoggingAzure Fundamentals (IAM, Subscriptions)What is Azure Data Factory?


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link