Powerful ETL Technologies in the Microsoft Data Platform


Microsoft has dabbled in the ETL (extract-transform-load) marketplace for a long time, in fact almost two decades. Way back in the day, SQL Server shipped with a command-line tool known as the Bulk Copy Program (BCP), inherited from SQL Server's Sybase code base. This hairy little beast met the bare minimum of ETL needs: it could extract and load data fairly well, but it forced almost all data transformations to happen inside SQL Server using T-SQL scripts. (The BCP utility still ships with SQL Server today.) Later, Microsoft provided a competent little data movement utility known as Data Transformation Services (DTS) with the SQL Server 7.0 and 2000 releases. DTS was good at extracting and loading data and could handle relatively lightweight transformations. Plus, its elegant, easy-to-use UI made it a popular ETL tool at the time.

However, Microsoft first truly disrupted the ETL marketplace with the introduction of SQL Server Integration Services (SSIS) back with the release of SQL Server 2005. Companies that specialized in ETL products, such as Informatica, began to take notice. SSIS is a truly powerful ETL product. It supports extremely sophisticated ETL operations and, with the use of BIML (Business Intelligence Markup Language), a strong scripting approach for automating complex and repeatable tasks. SSIS ships with every on-premises license of SQL Server and is now so popular that many IT shops use SSIS without necessarily using the SQL Server relational database itself.

Enter Azure Data Factory

Microsoft has upped the ante yet again by bringing powerful ETL features to the cloud via Azure Data Factory (http://azure.microsoft.com/en-us/documentation/services/data-factory/). Data Factory enables IT shops to integrate a multitude of data stores (the connections to those stores are defined as “linked services,” while the data they hold is represented as “datasets”), both on-premises and in the cloud, via a workflow (called a “pipeline”) that can utilize Hive, Pig, and customized C# programs. Data Factory is fully compatible with the era of big data, with wide support for semi-structured (IoT telemetry, log files, etc.), unstructured (Azure HDInsight, Cloudera, DataStax, etc.), and structured data (SQL Server, Oracle, MySQL, etc.) from on-premises and cloud sources.
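To make that vocabulary concrete, here is a minimal sketch of the JSON-shaped definitions involved, written as Python dictionaries. This is not an official sample; the linked service name, connection string, dataset name, and table name are all hypothetical placeholders.

    # A linked service holds the connection information for a data store.
    # Every value below is a hypothetical placeholder.
    azure_sql_linked_service = {
        "name": "AzureSqlLinkedService",
        "properties": {
            "type": "AzureSqlDatabase",
            "typeProperties": {
                "connectionString": "Server=tcp:<server>.database.windows.net;"
                                    "Database=<db>;User ID=<user>;Password=<pw>;"
            }
        }
    }

    # A dataset names a piece of data inside a linked service; pipelines
    # consume datasets as inputs and produce them as outputs.
    output_dataset = {
        "name": "OutputDataset",
        "properties": {
            "type": "AzureSqlTable",
            "linkedServiceName": "AzureSqlLinkedService",
            "typeProperties": {"tableName": "SalesFacts"},  # hypothetical table
            # How often a fresh slice of this dataset becomes available.
            "availability": {"frequency": "Hour", "interval": 1}
        }
    }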

When composing a new pipeline, you might connect local, on-premises databases through a data management gateway to, for example, an Azure HDInsight cluster or an Azure SQL Database in the cloud; clean the data; perform complex transformations upon it; analyze it using Azure Machine Learning and/or Azure Stream Analytics; and then mask various bits of the data before loading it onto its target data platform. When authoring the pipeline, you can view a visual layout of all of the data inputs and outputs, along with all of their relationships and dependencies, in the Azure Preview Portal. Note that you have options other than the Azure Portal for authoring a pipeline, such as PowerShell (using Azure Resource Manager templates), Visual Studio (with the Azure .NET SDK 2.7 or later), and the REST API found within the Azure Data Factory SDK; the REST route is sketched below.
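As an illustration of the REST route, the following Python sketch provisions a simple copy pipeline. It assumes the v1-era Microsoft.DataFactory resource provider endpoint; the subscription, resource group, factory name, dataset names, and bearer token are placeholders you would supply yourself.

    # A hedged sketch, not an official sample: create a Data Factory
    # pipeline by PUTting its JSON definition to the management API.
    import requests

    subscription_id = "<subscription-id>"   # hypothetical placeholder
    resource_group = "<resource-group>"     # hypothetical placeholder
    factory_name = "<data-factory-name>"    # hypothetical placeholder
    token = "<aad-bearer-token>"            # obtained separately from Azure AD

    pipeline = {
        "name": "CopyOnPremToAzureSql",     # hypothetical pipeline name
        "properties": {
            "activities": [{
                "name": "CopyBlobToSql",
                "type": "Copy",
                "inputs": [{"name": "InputDataset"}],    # hypothetical datasets
                "outputs": [{"name": "OutputDataset"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"}
                }
            }],
            # The active period doubles as the execution schedule: Data
            # Factory produces the output dataset's slices in this window.
            "start": "2016-01-01T00:00:00Z",
            "end": "2016-01-31T00:00:00Z"
        }
    }

    url = (
        "https://management.azure.com/subscriptions/{}/resourcegroups/{}"
        "/providers/Microsoft.DataFactory/datafactories/{}/datapipelines/{}"
        "?api-version=2015-10-01"
    ).format(subscription_id, resource_group, factory_name, pipeline["name"])

    response = requests.put(url, json=pipeline,
                            headers={"Authorization": "Bearer " + token})
    response.raise_for_status()
    print("Provisioning returned HTTP", response.status_code)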

Once you’re satisfied with your pipeline, you can set an execution schedule for Data Factory to process at your preferred time. Data Factory tracks a lot of performance metrics, providing you with a historical account of job execution, data production throughput, and system health for all systems involved, all from a single dashboard in the Azure Preview Portal.
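If you prefer to check that execution history programmatically, the v1-era REST API also exposed the status of individual dataset slices. The sketch below assumes that endpoint shape; the resource names, time window, and token are placeholders, and the exact response fields are an assumption based on the v1 slice model.

    # A hedged sketch: list the slices of a dataset and print their status
    # (for example Ready, InProgress, or Failed). Endpoint shape assumed.
    import requests

    url = (
        "https://management.azure.com/subscriptions/<subscription-id>"
        "/resourcegroups/<resource-group>/providers/Microsoft.DataFactory"
        "/datafactories/<factory-name>/datasets/OutputDataset/slices"
        "?start=2016-01-01T00:00:00Z&end=2016-01-02T00:00:00Z"
        "&api-version=2015-10-01"
    )
    response = requests.get(
        url, headers={"Authorization": "Bearer <aad-bearer-token>"})
    response.raise_for_status()
    for data_slice in response.json().get("value", []):
        print(data_slice["start"], data_slice["status"])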

Find Out More about Data Factory

Check out the overview of Azure Data Factory at https://azure.microsoft.com/en-us/services/data-factory/. Pricing is discussed at https://azure.microsoft.com/en-us/pricing/details/data-factory/. There's a quick and useful overview video at http://go.microsoft.com/fwlink/?linkid=516976&clcid=0x409. And there's lots of great documentation at https://azure.microsoft.com/en-us/documentation/services/data-factory/.

