Azure Arc and the Rising Tide of Kubernetes

At first glance, a new product billed as part of the Azure Management platform might not seem interesting or relevant to data management professionals. But as you will see, if your organization is using or thinking about using Microsoft Azure, then this is important news for you.

A Foundation of Containers Expands With Kubernetes

I first wrote about Docker containers and Kubernetes in the February/March 2019 issue of DBTA. To summarize, containers are somewhat akin to virtual machines (VMs), but they are smaller, faster, and less resource-intensive than a VM running the same workload with similarly provisioned resources. Containers make it very easy to bundle an entire application along with its tools, libraries, and configuration files. For example, you can investigate a wide variety of fully configured Azure Machine Learning container images posted by Microsoft at

Since it is so much easier to spin up and tear down containers, many enterprises are anticipating a lot of growth in this area. But this growth is not all butterflies and daisies. Managing a rapidly growing multitude of containers, microservices, and applications is a very difficult proposition. Enter Kubernetes (k8s), orchestration software that enables you to manage processes such as load balancing, automated patching, and upgrades, and to elastically scale services up and down as needed across hybrid and multi-cloud computing environments. All of this now runs through Azure Arc. (The new Microsoft landing page is located at https://azure.mic
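To make "elastically scale up and down" a little more concrete, here is a minimal sketch of the replica calculation that Kubernetes documents for its Horizontal Pod Autoscaler: the desired replica count is the ceiling of the current count multiplied by the ratio of the observed metric to the target metric. The function name, parameter names, and the default bounds are my own illustrative choices, and the sketch deliberately leaves out the tolerance band and stabilization windows that a real cluster applies.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Toy version of the documented autoscaling rule.

    desired = ceil(current_replicas * current_metric / target_metric),
    then clamped to the [min_replicas, max_replicas] range.
    The names and default bounds are illustrative assumptions.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Never scale below the floor or above the ceiling set by the operator.
    return max(min_replicas, min(max_replicas, raw))
```

With four replicas averaging 90% CPU against a 60% target, the rule asks for more replicas; when the same four replicas average 30%, it asks for fewer.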

The Secret Ingredient in the Azure Arc Recipe

What’s so special about Azure Arc? The private preview of Azure Arc currently includes support for Azure SQL Database and Azure Database for PostgreSQL Hyperscale. But it’s not hard to imagine a scenario where you might have a single management system that covers a disparate selection of computing resources such as on-premises physical servers, Hyper-V and VMware virtual machines, a few old Oracle databases running on Linux, some open source software built on PostgreSQL, a selection of external clusters running on AWS and Google Cloud Platform, and a new installation of Azure Stack.

SQL Server Big Data Clusters, Built With Kubernetes in Mind

SQL Server 2019 introduced the new and powerful variant known as SQL Server Big Data Clusters (BDC). Details are at

BDC, as the name implies, enables you to deploy highly scalable multi-node clusters of SQL Server, Spark, and Hadoop HDFS using Kubernetes running at truly high-end big data scale. Key use cases include AI and machine learning, multi-node data lakes, and data marts scaled out across dozens or hundreds of nodes. In addition, a feature called PolyBase lets BDC “virtualize” data, that is, query external data sources without moving or copying the external data.

PolyBase enables you to connect to external database systems such as SQL Server, Azure Data Lake Store (or other HDFS-based data storage systems), Azure Blob Storage, Teradata, MongoDB, Oracle, and any database system with ODBC connectivity. PolyBase enables you to write standard T-SQL code that joins across the various data sources without any special considerations. PolyBase features additional performance capabilities, such as using statistics on external tables to make a cost-based decision to push computation via MapReduce jobs to Hadoop, saving SQL Server the work. PolyBase also provides linear scale-out of processing power using a feature known as scale-out groups, basically allowing you to shard your SQL Server workload across multiple instances in a parallel processing architecture. Details for PolyBase can be found at
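The idea of joining across data sources without copying the data can be illustrated with a small analogy. The sketch below is not PolyBase and does not use T-SQL external tables; it uses SQLite's ATTACH DATABASE from Python's standard library as a stand-in, so that one ordinary SQL statement joins a "local" table to a table that lives in a separate database file. The table names, column names, and sample rows are all made up for illustration.

```python
import os
import sqlite3
import tempfile


def cross_source_join():
    """Join a local table to a table stored in a separate database file."""
    with tempfile.TemporaryDirectory() as workdir:
        external_path = os.path.join(workdir, "external.db")

        # Stand-in for the external system: its own database, its own data.
        ext = sqlite3.connect(external_path)
        ext.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
        ext.executemany(
            "INSERT INTO orders VALUES (?, ?)",
            [(1, 20.0), (1, 5.0), (2, 12.5)],
        )
        ext.commit()
        ext.close()

        # Stand-in for the local database engine that issues the query.
        local = sqlite3.connect(":memory:")
        local.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
        local.executemany(
            "INSERT INTO customers VALUES (?, ?)",
            [(1, "Ada"), (2, "Grace")],
        )

        # Make the external data addressable by name; no rows are copied.
        local.execute("ATTACH DATABASE ? AS ext", (external_path,))

        # One standard SQL statement spans both sources.
        rows = local.execute(
            """
            SELECT c.name, SUM(o.amount) AS total
            FROM customers AS c
            JOIN ext.orders AS o ON o.customer_id = c.id
            GROUP BY c.name
            ORDER BY c.name
            """
        ).fetchall()
        local.close()
        return rows
```

The point of the analogy is the shape of the query: the join is written as if both tables were local, and the engine is responsible for reaching into the second source.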

BDC also features a powerful application deployment architecture for rolling out a scalable and high-performance application running on a big data cluster, consisting of a controller and app runtime handlers. Again, the entire process is orchestrated and managed using Kubernetes.

OK—But How?

On the administrative side, Microsoft provides a layer called the Azure Resource Manager (ARM), which handles all of the services that provide resources, such as Azure SQL Database, VMs, Azure Kubernetes Service, or any other Kubernetes cluster service. You define the exact configuration of the resources provided via ARM templates, which describe the desired state of the resource. Azure Arc takes ARM provisioning a step further by allowing you to configure resources running outside of Azure. The part that is amazing and innovative is that Azure Arc can manage resources inside and outside of Azure and inside and outside of your data center, all through one methodology.
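The "desired state" idea is worth a quick illustration. The following is a toy model, not ARM's actual implementation or template format: it compares a dictionary describing what you want against a dictionary describing what currently exists and lists the actions needed to close the gap. The function name, the resource names, and the `remove_unlisted` flag are hypothetical, and whether a real deployment removes resources that the template does not mention depends on how that deployment is configured.

```python
def plan_changes(desired, actual, remove_unlisted=False):
    """Return (action, resource_name) pairs that move `actual` toward `desired`.

    Both arguments map a resource name to a configuration dictionary.
    This is an illustrative sketch of declarative provisioning in general.
    """
    actions = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            actions.append(("update", name))
        # Resources that already match the desired state need no action.
    if remove_unlisted:
        for name in sorted(actual):
            if name not in desired:
                actions.append(("delete", name))
    return actions


# Hypothetical inventory: one resource drifted, one missing, one unlisted.
desired_state = {
    "k8s-cluster": {"nodes": 3},
    "sql-database": {"tier": "GeneralPurpose"},
}
actual_state = {
    "k8s-cluster": {"nodes": 2},
    "old-vm": {"size": "small"},
}
```

Calling `plan_changes(desired_state, actual_state)` yields an update for the cluster and a create for the database. Because the description is declarative, running the same plan against an environment that already matches produces no actions at all, which is what makes a single methodology workable across very different environments.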