Azure Arc and the Rising Tide of Kubernetes

Feb 10, 2020

By Kevin Kline

At first glance, a new product billed as part of the Azure Management platform might not seem interesting or relevant to data management professionals. But as you will see, if your organization is using or thinking about using Microsoft Azure, then this is important news for you.

A Foundation of Containers Expands With Kubernetes

I first wrote about Docker containers and Kubernetes in the February/March 2019 issue of DBTA. To summarize, containers are somewhat akin to virtual machines (VMs), but they are smaller, faster, and less resource-intensive than a VM running the same workload with similarly provisioned resources. Containers make it very easy to bundle an entire application together with its tools, libraries, and configuration files. For example, you can investigate a wide variety of fully configured Azure Machine Learning container images posted by Microsoft at https://gallery.azure.ai.

Since it is so much easier to spin up and tear down containers, many enterprises are anticipating a lot of growth in this area. But this growth is not all butterflies and daisies. Managing a rapidly growing multitude of containers, microservices, and applications is a very difficult proposition. Enter Kubernetes (k8s), orchestration software that enables you to manage processes such as load balancing, automated patching, and upgrades, and to elastically scale services up and down as needed across hybrid and multi-cloud computing environments. This is where Azure Arc comes in. (The new Microsoft landing page is located at https://azure.microsoft.com/en-us/services/azure-arc.)
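
To make the idea of elastic scaling concrete, here is a minimal Python sketch of the scaling rule described in the Kubernetes Horizontal Pod Autoscaler documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The function name, the parameter names, and the default replica bounds are illustrative assumptions, and the sketch leaves out details of the real autoscaler such as its tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,    # assumed bound, for illustration
                     max_replicas: int = 10) -> int:  # assumed bound, for illustration
    """Return how many replicas a service should run.

    Scale the current replica count by (observed metric / target metric),
    round up, then clamp the result to the configured bounds.
    """
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Four replicas averaging 90% CPU against a 60% target: scale out.
    print(desired_replicas(4, 90.0, 60.0))
    # Four replicas averaging 30% CPU against a 60% target: scale in.
    print(desired_replicas(4, 30.0, 60.0))
```

The clamp at the end is what keeps a sudden spike from requesting an unbounded number of containers.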

The Secret Ingredient in the Azure Arc Recipe

What’s so special about Azure Arc? The private preview of Azure Arc currently includes support for Azure SQL Database and Azure Database for PostgreSQL Hyperscale. But it’s not hard to imagine a scenario where you might have a single management system that covers a disparate selection of computing resources such as on-premises physical servers, Hyper-V and VMware virtual machines, a few old Oracle databases running on Linux, some open source software built on PostgreSQL, a selection of external clusters running on AWS and Google Cloud Platform, and a new installation of Azure Stack.

SQL Server Big Data Clusters, Built With Kubernetes in Mind

SQL Server 2019 introduced a powerful new variant known as SQL Server Big Data Clusters (BDC). Details are at https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-server-ver15.

BDC, as the name implies, enables you to deploy highly scalable multi-node clusters of SQL Server, Spark, and Hadoop HDFS using Kubernetes running at truly high-end big data scale. Key use cases include AI and machine learning, multi-node data lakes, and data marts scaled out across dozens or hundreds of nodes. In addition, BDC allows you to “virtualize” data, that is, query external data sources without moving or copying the external data via a feature called PolyBase.

PolyBase enables you to connect to external database systems such as SQL Server, Azure Data Lake Store (or other HDFS-based data storage systems), Azure Blob Storage, Teradata, MongoDB, Oracle, and any database system with ODBC connectivity. PolyBase enables you to write standard T-SQL code to join across the various data sources without any special considerations. PolyBase features additional performance capabilities, such as using statistics on external tables to make a cost-based decision to push computation via MapReduce jobs to Hadoop, saving SQL Server the work. PolyBase also provides linear scale-out of processing power using a feature known as scale-out groups, basically allowing you to shard your SQL Server workload across multiple instances in a parallel processing architecture. Details for PolyBase can be found at https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-guide?view=sql-server-ver15.
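
The cost-based pushdown decision mentioned above can be illustrated with a toy model in Python. This is not PolyBase's actual optimizer: the class, the function, the cost constants, and the very idea of a single "selectivity" number are all assumptions made for illustration. The sketch only shows the general trade-off, which is that running the filter remotely pays a fixed job overhead but moves far fewer rows.

```python
from dataclasses import dataclass


@dataclass
class ExternalTableStats:
    """Hypothetical statistics kept for an external table."""
    row_count: int      # estimated rows in the external table
    selectivity: float  # estimated fraction of rows that pass the filter


def should_push_down(stats: ExternalTableStats,
                     transfer_cost_per_row: float = 1.0,       # assumed unit cost
                     remote_job_overhead: float = 50_000.0) -> bool:  # assumed fixed cost
    """Return True when filtering remotely looks cheaper than filtering locally."""
    # Local plan: pull every row across the network, then filter.
    local_cost = stats.row_count * transfer_cost_per_row
    # Remote plan: pay the job start-up overhead, then pull only matching rows.
    remote_cost = (remote_job_overhead
                   + stats.row_count * stats.selectivity * transfer_cost_per_row)
    return remote_cost < local_cost


if __name__ == "__main__":
    big = ExternalTableStats(row_count=10_000_000, selectivity=0.01)
    small = ExternalTableStats(row_count=1_000, selectivity=0.01)
    print(should_push_down(big))    # large table, selective filter
    print(should_push_down(small))  # tiny table, overhead dominates
```

Under these assumed costs, a large table with a selective filter favors pushdown, while a tiny table does not justify the remote job overhead.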

BDC also features a powerful application deployment architecture for rolling out a scalable and high-performance application running on a big data cluster, consisting of a controller and app runtime handlers. Again, the entire process is orchestrated and managed using Kubernetes.

OK—But How?

On the administrative side, Microsoft provides a layer called the Azure Resource Manager (ARM), which handles all of the services that provide resources, such as Azure SQL Database, VMs, Azure Kubernetes Service, or any other Kubernetes cluster service. You define the exact configuration of the resources provided via ARM templates, which describe the desired state of the resource. Azure Arc takes ARM provisioning a step further by allowing you to configure resources running outside of Azure. The part that is amazing and innovative is that Azure Arc can manage resources inside and outside of Azure and inside and outside of your data center—all through one methodology.
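
The "desired state" idea behind templates can be sketched in a few lines of Python. This is a generic illustration, not ARM's or Azure Arc's implementation: the function name, the resource names, and the dictionary shapes are hypothetical. The sketch compares what a template asks for with what currently exists and reports the differences. Whether a real tool acts on each difference, deletions in particular, is a policy choice the sketch does not make.

```python
from typing import Dict, List


def plan_changes(desired: Dict[str, dict], actual: Dict[str, dict]) -> Dict[str, List[str]]:
    """Compare desired and actual resource configurations by name.

    Returns the names of resources to create, to update, and those that
    exist but are absent from the desired state.
    """
    return {
        "create": sorted(n for n in desired if n not in actual),
        "update": sorted(n for n in desired if n in actual and desired[n] != actual[n]),
        "delete": sorted(n for n in actual if n not in desired),
    }


if __name__ == "__main__":
    # Hypothetical resources, some in Azure and some on-premises.
    desired = {
        "sql-db-sales": {"location": "azure", "tier": "standard"},
        "k8s-onprem": {"location": "datacenter", "nodes": 5},
    }
    actual = {
        "k8s-onprem": {"location": "datacenter", "nodes": 3},
        "vm-legacy": {"location": "datacenter", "cores": 2},
    }
    print(plan_changes(desired, actual))
```

Because the comparison works purely on declared configuration, the same loop applies whether a resource lives in Azure, in another cloud, or in your own data center.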