The Rise of Data Orchestration: Q&A with Alluxio’s Dipti Borkar

By Joyce Wells

Jul 30, 2019

Alluxio was founded in 2015, with its 1.0 release following in 2016. It originated as the Tachyon project at UC Berkeley’s AMPLab, created by then-Ph.D. student and now Alluxio CTO Haoyuan Li.

As the 2.0 release was rolled out in July, Dipti Borkar, VP, product management and marketing at Alluxio, reflected on the data engineering problems that have emerged as a result of the increasingly decoupled architecture for modern workloads. Just as compute and containers need Kubernetes for container orchestration, Alluxio contends, data also needs orchestration—a tier that brings data locality, accessibility, and elasticity to compute across data silos, zones, regions, and clouds.

BDQ: What is going on in the data management and analytics space in general now?

Dipti Borkar: Historically, databases, data management systems, and data warehouses were all tightly coupled and vertically integrated to work in one location, one server, or one virtual machine. What we are seeing now is much more separation: the processing is done with different techniques, and the storage may live in a completely different location.

BDQ: Why?

DB: In the past, everything was on-premise, but now with cloud getting a lot more adoption for data applications, users have started to move data to the cloud and are getting more comfortable with that notion. In addition, the compute framework is often not in the same location as the storage system, which might be Hadoop HDFS or something else on-premise. This idea of a single data lake, which was supposed to be HDFS, is becoming harder and harder to achieve. Every business unit has its own data and there are external data sources. And so I think that a lot of enterprises are coming to the conclusion that data is going to be siloed. Those are a few of the things we are seeing from a data management and analytics perspective.

BDQ: How does Alluxio address this?

DB: Alluxio is a technology built to embrace data silos no matter where they live, as opposed to relying on a single data lake where everything sits in one place. There are a lot more disaggregated stack scenarios and a lot more adoption of cloud. Right now, it seems like it is a little more hybrid, but over time it will become more of a multi-cloud environment. Enterprises want to burst and they want to get the flexibility, but it is hard to move the data. That is where Alluxio and data orchestration come in.

BDQ: What does it enable?

DB: Because of this separation and disaggregation of compute and storage, increasingly, you will need a layer that moves the data that is needed closer to the compute—the data that is needed, the data that is actively being used—rather than all the data which is going to be impossible to move around. That is where data orchestration comes in. It helps bring data closer to compute and makes it more accessible to the compute. The idea is to cache the most active data so you get data locality as well as make the same data accessible to many different APIs.

The file might be the same, it might be a Parquet file or an ORC [Optimized Row Columnar] file, but different frameworks may want to access it in different ways. Maybe they want to use the Hadoop API, the HDFS API, or an S3 API, or a file system API on the same data. And that is the other aspect of data orchestration: making the data more accessible to the compute frameworks on top.
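
As a rough illustration of the two ideas in this answer (caching the active working set close to compute, and exposing the same data through more than one access style), here is a minimal Python sketch. It is not Alluxio's implementation or API. The class name WorkingSetCache, its methods, the in-memory dictionary standing in for remote storage, and the least-recently-used eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict


class WorkingSetCache:
    """Hypothetical read-through LRU cache in front of a slower backing store."""

    def __init__(self, backing_store, capacity):
        self.backing_store = backing_store  # dict standing in for remote storage
        self.capacity = capacity            # maximum number of cached objects
        self.cache = OrderedDict()          # key -> bytes, least recently used first
        self.remote_reads = 0               # number of fetches from the backing store

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # cache hit: mark as most recently used
            return self.cache[key]
        data = self.backing_store[key]      # cache miss: fetch from the backing store
        self.remote_reads += 1
        self.cache[key] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used object
        return data

    # Two access styles that resolve to the same cached bytes.
    def read_path(self, path):
        """File-system-style access, e.g. '/sales/2019.parquet'."""
        return self.read(path.lstrip("/"))

    def read_object(self, bucket, object_key):
        """Object-store-style access, e.g. bucket 'sales', key '2019.parquet'."""
        return self.read(f"{bucket}/{object_key}")


# Made-up data standing in for files in a remote store.
remote = {
    "sales/2019.parquet": b"recent rows",
    "sales/2018.parquet": b"older rows",
    "sales/2017.parquet": b"oldest rows",
}
layer = WorkingSetCache(remote, capacity=2)

first = layer.read_path("/sales/2019.parquet")       # miss: fetched from the store
second = layer.read_object("sales", "2019.parquet")  # hit: served from the cache
```

In this sketch, only objects that are actually read get pulled into the cache, and the second read of the same file is served locally even though it arrives through a different access style.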

BDQ: What else is changing?

DB: The other thing that we are seeing is the rise of the object store. AWS S3 has become incredibly popular and is driving more object storage usage. However, these stores were not built for analytics and interactive applications, so again, you need a layer on top to accelerate them and bring that data to wherever that compute is. This enables you to get that data locality and speed up metadata operations as well as make the access strongly consistent—and that is something Alluxio helps with, as well.

BDQ: What does this enable?

DB: Eventually all of this is in the context of making the data self-service. Typically, we see that there are platform teams that are responsible for serving this data, as well as the frameworks, and they create a service within the enterprise itself for the different business units.

Technologies that are storage-agnostic and cloud-agnostic make self-service data access a lot easier and more efficient because they are not managing multiple copies of different datasets. The data is being synced back with wherever it lives and is being pulled out as needed, on an on-demand basis. This will be the future. The most valuable data could be the most recent for one company, or for a retailer, it could be data about specific stores. Wherever it is, the query will pull that data. It is a compute-driven approach, in which, based on the compute, you bring the data closer to the frameworks. That is how Alluxio fits in overall: it is how we see these different issues and how we are solving the newer problems that are emerging.

BDQ: This sounds similar in some ways to containerization.

DB: There are two ways to think of it. First, as a piece of software, it works with containers. We have a Docker container and we work within Kubernetes as a DaemonSet. You can scale the cluster up and down and, most importantly, co-locate the Alluxio workloads, the Alluxio containers, with the compute that has the data within Alluxio.

BDQ: And the second way?

DB: The second aspect is more of an analogy. Kubernetes and container orchestration make compute more elastic and enable the deployment of containers anywhere they are needed; in the same way, data itself needs orchestration in this disaggregated data world. When the data is needed, the working set for that compute, that framework, must be moved closer, and that requires orchestration. That is what we are calling data orchestration. Just as Kubernetes is to compute and containers, data orchestration is to data and to active working sets of data.

BDQ: Is data orchestration spreading?

DB: Alluxio itself is an open source project, and because it is open source we have a very large community. We have 1,000-plus contributors to the project but more importantly we are seeing worldwide usage. There are large communities in Asia and the U.S.

BDQ: And the use cases?

DB: We live in the big data ecosystem and that is where Alluxio fits. We are seeing different use cases that are popular today. The first is accelerating cloud analytics and the second is hybrid analytics—hybrid cloud analytics using on-premise data. We are seeing most users start with a simple use case and then it grows out from that point. We are also seeing different projects starting to emerge to solve this data orchestration type of problem on the NAS [network-attached storage] side of things. There are also other projects coming up in other spaces to solve similar problems. It is a new problem that is emerging, so whether it is our community itself or other projects and products, we are seeing a lot more movement in the market.

Interview conducted, edited, and condensed by Joyce Wells.