Breaking Down Data Silos with Data Orchestration

Just as containers need Kubernetes for orchestration, contends data orchestration provider Alluxio, data needs orchestration too: a tier that brings data locality, accessibility, and elasticity to compute across data silos, zones, regions, and clouds.

According to Alluxio, the rise of compute-intensive workloads and the adoption of the cloud have driven organizations toward a decoupled architecture for modern workloads, one in which compute scales independently of storage. While this enables elastic scaling, it also introduces new data engineering problems.

To address these issues, Alluxio has announced Alluxio 2.0 with innovations for data engineers managing and deploying analytical and AI workloads in the cloud, particularly for hybrid and multi-cloud environments.

Alluxio was founded in 2015, with the 1.0 release following in 2016. It originated as the Tachyon project at UC Berkeley's AMPLab, created by then-PhD student and now Alluxio CTO Haoyuan Li. The open source technology is available in community and enterprise editions.

Key enhancements in Alluxio 2.0

Commenting on the Alluxio 2.0 release in a pre-launch interview, Dipti Borkar, VP, product management and marketing at Alluxio, said, “This is a major enhancement in terms of the number of contributors, lines of code, etc. And, most importantly, it furthers the ability for enterprises to truly adopt the cloud for data applications by giving them an option that keeps them cloud-agnostic, storage-agnostic, and compute-agnostic so that they don’t feel that they are locked in to a certain stack or a certain technology.”

There are a number of major pillars in the new release, said Borkar.

A key area of enhancement is policy-driven data management. Alluxio 2.0 includes a new capability that lets data engineers automate data movement across storage systems based on predefined policies. As data is created and ages from hot to warm to cold, Alluxio can automate its tiering across any number of storage systems, both on premises and across clouds. Data platform teams can also reduce storage costs by keeping only the most important data in expensive storage systems and automatically moving the rest to cheaper alternatives.
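
The tiering idea behind such a policy can be sketched as a simple age-based table: each file's age since last access determines the cheapest tier it should live in. The thresholds, tier names, and function names below are illustrative assumptions for this sketch, not Alluxio's actual policy syntax.

```python
# Hypothetical tiering policy: (age threshold in seconds, target tier).
# Thresholds and tier names are illustrative, not Alluxio's real config.
POLICY = [
    (0,     "memory"),     # hot: recently accessed data stays in memory
    (3600,  "ssd"),        # warm: idle for an hour, demote to SSD
    (86400, "s3://cold"),  # cold: idle for a day, demote to object storage
]

def tier_for(age_seconds):
    """Pick the last (cheapest) tier whose age threshold has been crossed."""
    chosen = POLICY[0][1]
    for threshold, tier in POLICY:
        if age_seconds >= threshold:
            chosen = tier
    return chosen

def plan_moves(files, now):
    """Map each file path to a target tier from its last-access timestamp."""
    return {path: tier_for(now - last_access) for path, last_access in files.items()}
```

A policy engine running on a schedule would then compare each file's current location against this plan and issue the corresponding moves.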

“The second big area is making data access more compute-optimized, and for that we have added capability where you can partition an Alluxio cluster so that you can dedicate certain nodes or workers for a specific framework,” said Borkar. Users can now partition a single Alluxio cluster along any dimension, so that the datasets for one framework or workload don't contaminate another's. The most common usage is partitioning the cluster by framework, such as Spark, Presto, and others. This also reduces data transfer costs by constraining data to stay within a specific zone or region. “This is in the context of making data more compute-centric.”
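
A minimal sketch of that partitioning idea, assuming hypothetical worker and framework names rather than Alluxio's real configuration model: each framework gets a dedicated slice of workers, and cached blocks are placed only within that slice, so one framework's working set cannot evict another's.

```python
# Illustrative partition map: framework -> dedicated workers.
# Names are assumptions for this sketch, not Alluxio configuration.
PARTITIONS = {
    "presto": ["worker-0", "worker-1"],
    "spark":  ["worker-2", "worker-3", "worker-4"],
}

def workers_for(framework):
    """Return the workers this framework's reads and cache fills are pinned to."""
    return PARTITIONS[framework]

def pick_worker(framework, block_id):
    """Deterministically place a block within the framework's own partition."""
    workers = workers_for(framework)
    return workers[block_id % len(workers)]
```

Because placement never crosses partition boundaries, the same mechanism can pin a partition's workers to one zone or region to keep transfer costs down.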

Another major advancement is the ability to access, incorporate, and aggregate data through RESTful endpoints. Many external data sources, such as Kaggle, provide data access via REST, but today, when a data scientist tries to combine this data with other enterprise data, it is quite a challenge, said Borkar. Users can now bring in data even from web-based data sources and aggregate it in Alluxio to perform their analytics. Alluxio can be pointed at any web location with files and pull them in as needed, based on the query or model being run. “This allows access to the freshest data on demand as needed,” Borkar said.
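
The pull-through pattern described here can be sketched as follows: a web location is mounted, a file is fetched over HTTP only on first read, and subsequent reads are served from a local cache. The `WebSource` class, its methods, and the example URL are hypothetical, not Alluxio's API; the fetch function is injectable so the sketch can run without network access.

```python
from urllib.request import urlopen

class WebSource:
    """Sketch of mounting a web location and lazily caching its files.
    Class and method names are illustrative, not Alluxio's actual API."""

    def __init__(self, base_url, fetch=None):
        self.base_url = base_url.rstrip("/")
        self.cache = {}  # path -> bytes, served locally after first read
        # fetch is injectable for testing; defaults to a real HTTP GET
        self.fetch = fetch or (lambda url: urlopen(url).read())

    def read(self, path):
        if path not in self.cache:  # cold read: pull from the web source
            self.cache[path] = self.fetch(f"{self.base_url}/{path}")
        return self.cache[path]     # warm read: served from local cache
```

This is why the data is "freshest on demand": nothing is copied ahead of time, and only what a query or model actually touches gets pulled in.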

In addition, said Borkar, “We are seeing a lot more movement in the cloud, particularly data services within the mega cloud. AWS, for example, has a service called EMR, Elastic MapReduce service, and it essentially creates an analytics stack with a click of a few buttons or a single command. We have integrated with that service to make it very easy for Alluxio to be installed as a part of the service when it comes up so that, if you create a 500-node cluster on EMR, every node has Alluxio to accelerate your workload and help improve the performance of S3 and the object storage underneath it.”

Another big category of innovation in the new release is the enhancement of foundational elements, which have been re-architected using open source technologies with a vision of hyperscale. “We have made some critical changes to two areas,” said Borkar. “The first is metadata management. By default, metadata management was done in memory, and that is very fast, but it also limits the number of files that you can support. Now we have added tiered metadata management using an open source technology called RocksDB, and we use an application of RocksDB for metadata management so that we can scale out to billions of files.”
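
The tiered-metadata idea can be sketched as a bounded in-memory tier in front of an on-disk key-value store: hot inodes stay in memory for speed, while the full namespace lives durably on disk, so capacity is no longer limited by RAM. In Alluxio 2.0 the disk tier is RocksDB; a plain dict stands in for it below, and all names are illustrative, not Alluxio internals.

```python
import json
from collections import OrderedDict

class TieredInodeStore:
    """Sketch of tiered metadata: an LRU-bounded memory tier backed by a
    durable key-value disk tier (RocksDB in Alluxio; a dict stands in here)."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.cache = OrderedDict()  # hot tier: recently used inodes
        self.disk = {}              # stand-in for the RocksDB tier

    def put(self, path, inode):
        self.disk[path] = json.dumps(inode)  # every inode is durable on "disk"
        self._cache_put(path, inode)

    def get(self, path):
        if path in self.cache:               # fast path: still in memory
            self.cache.move_to_end(path)
            return self.cache[path]
        inode = json.loads(self.disk[path])  # slow path: load from disk tier
        self._cache_put(path, inode)
        return inode

    def _cache_put(self, path, inode):
        self.cache[path] = inode
        self.cache.move_to_end(path)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)   # evict least-recently-used inode
```

The memory tier bounds the working set, while the disk tier holds the entire namespace, which is how the file count can grow to billions without exhausting memory.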

In addition, Borkar said, Alluxio has also improved its transport layer using gRPC, Google's open source RPC framework, to make communication within the distributed system efficient and capable of scaling out to thousands of nodes. gRPC is now the core transport protocol used for communication within the cluster as well as between the Alluxio client and master.

“With technologies like Alluxio that can orchestrate across systems, and even on an ongoing basis, enterprises can leverage the cloud to burst out their compute and then move data on demand as needed,” said Borkar. “That is where the innovation is. Other organizations are solving smaller parts of the problem, but in the area of true orchestration, it is the first of its kind.”

Both the Alluxio 2.0 Community and Enterprise Editions are now generally available for download as a tarball, a Docker image, via Homebrew, and more.

For more information, go to the Alluxio website.