Databricks Sets New World Record for CloudSort Benchmark Using Apache Spark

Nov 16, 2016

Databricks has announced that, in collaboration with industry partners, it has broken the world record in the CloudSort Benchmark, a third-party industry benchmarking competition for processing large datasets. Databricks was founded by the team that created the Apache Spark project.

Using Apache Spark and working with Nanjing University and Alibaba Group as team NADSort, Databricks architected an efficient cloud platform for data processing. The platform sorted 100TB of data using $144 worth of cloud computing resources, or $1.44 per TB, in both the Daytona and Indy CloudSort competitions. This beat the previous record of $4.51 per TB.
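The per-terabyte figure is simple division, and the size of the improvement over the previous record can be checked the same way. The following is a minimal sketch using only the numbers reported above; the function and variable names are illustrative and not part of the benchmark.

```python
def cost_per_tb(total_cost_usd: float, data_tb: float) -> float:
    """Return the cost in US dollars to sort one terabyte."""
    return total_cost_usd / data_tb


# Figures reported in the article.
new_record = cost_per_tb(144.0, 100.0)   # $144 to sort 100TB
previous_record = 4.51                   # previous record, in $ per TB

# How many times cheaper the new record is, and the fractional cost reduction.
improvement_factor = previous_record / new_record
cost_reduction = 1 - new_record / previous_record

print(f"New record: ${new_record:.2f} per TB")
print(f"Improvement: {improvement_factor:.2f}x cheaper "
      f"({cost_reduction:.0%} lower cost)")
```

By this arithmetic, the new record is roughly three times cheaper per terabyte than the previous one.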

According to Databricks, the purpose of the CloudSort Benchmark is to measure the lowest public cloud cost per terabyte sorted, with the aim of reducing the total cost of ownership of the cloud architecture (a combination of software stack, hardware stack, and tuning) and encouraging organizations to adopt and deploy big data applications on the public cloud.

In 2014, Databricks set the record for the Gray Sort Benchmark, sorting 100TB of data, or 1 trillion records, in 23 minutes, which was 30 times more efficient per node than the previous record. The CloudSort sorting program, based on Databricks’ 2014 record-setting entry and updated for better efficiency in the cloud, ran on 394 ECS.n1.large nodes on the Alibaba Cloud, each equipped with an Intel Haswell E5-2680 v3 processor, 8GB of memory, and 4x135GB SSD Cloud Disk.

Three factors made this CloudSort cost efficiency possible, according to Reynold Xin, Databricks chief architect and leader of the CloudSort Benchmark project. First, increased competition among major cloud providers has lowered the cost of resources, making it economically feasible and scalable to deploy applications in the cloud. Second, innovations in Apache Spark, such as Project Tungsten, Catalyst, and whole-stage code generation, have improved all aspects of the Spark stack. Finally, in-house Spark expertise, together with experience gained in operating and tuning cloud-native data architecture for customers, has led to incremental efficiency gains, resulting in what Databricks describes as the most efficient cloud architecture for data processing.

To learn more, read Xin’s blog post at https://databricks.com/blog/2016/11/14/setting-new-world-record-apache-spark.html