Alluxio Streamlines Data Pre-Processing and Loading Phases in Data Orchestration 2.6

Alluxio, the developer of open source data orchestration software for large-scale workloads, is releasing version 2.6 of its Data Orchestration Platform, featuring an enhanced system architecture that enables AI/ML platform teams using GPUs to accelerate their data pipelines for business intelligence, applied machine learning, and model training.

“Enterprises seeking competitive advantage are making greater use of machine learning and AI to derive insights from massive datasets,” said Haoyuan Li, founder and CEO, Alluxio. “These datasets are often distributed across hybrid cloud environments, making more consistent and efficient data access critical to realizing the value from their AI/ML initiatives.”

In the latest release, Alluxio has improved its system architecture to better support AI/ML applications that use the POSIX interface.
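For readers unfamiliar with what the POSIX interface means in practice, the following is a minimal sketch rather than Alluxio's own code. It assumes the data is exposed as an ordinary directory tree through a FUSE mount, so a training job reads it with plain file calls. The function name `read_small_files` and the mount path `/mnt/alluxio-fuse` mentioned in the comments are hypothetical, and a temporary directory stands in for the mount so the sketch runs anywhere.

```python
import os
import tempfile
from pathlib import Path


def read_small_files(root):
    """Read every regular file under root; return (file_count, total_bytes)."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Ordinary POSIX-style open/read. With a FUSE mount, these same
            # calls would be served by the mounted file system.
            with open(os.path.join(dirpath, name), "rb") as f:
                total_bytes += len(f.read())
            file_count += 1
    return file_count, total_bytes


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for a hypothetical
    # mount point such as /mnt/alluxio-fuse.
    with tempfile.TemporaryDirectory() as root:
        for i in range(100):
            Path(root, f"sample_{i}.bin").write_bytes(b"x" * 1024)
        print(read_small_files(root))
```

Because the application only issues standard file operations, nothing in the reading code is specific to the storage layer behind the mount. Workloads with many small files issue many such calls, which is why per-call overhead matters.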

According to the vendor, removing inter-process latency overhead maximizes system performance, and that is critical for fully utilizing compute resources.

Beyond I/O performance, Alluxio’s data management capabilities support the end-to-end workflow of data preprocessing, loading, training, and result writing.

The Alluxio 2.6 Community and Enterprise Editions feature new capabilities, including:

  • Faster Data Access with a Large Number of Small Files: Alluxio 2.6 unifies the Alluxio worker and FUSE process. Coupling the two reduces inter-process communication, which yields significant performance improvements.
  • Simplified Data Management and Operability: Alluxio 2.6 enhances the mechanism to load data into Alluxio managed storage and introduces more traceability and metrics for easier operability. This distributed load operation is a key portion of the AI/ML workflow, and adjustments to the internal mechanisms have been made to optimize for the common case of loading prepared data for model training.
  • Improved System Visibility and Control: Alluxio 2.6 adds a large set of metrics and traceability features enabling users to drill into the system’s operating state. These range from aggregated throughput of the system to summarized metadata latency when serving client requests.


For more information about these updates, visit