First created as part of a research project at UC Berkeley AMPLab, Spark is an open source project in the big data space, built for sophisticated analytics, speed, and ease of use. It unifies critical data analytics capabilities such as SQL, advanced analytics, and streaming in a single framework. Databricks is a company that was founded by the team that created and continues to lead both the development and training around Apache Spark.
Here, Kavitha Mariappan, vice president of marketing at Databricks, discusses Spark’s rapid rise in the Hadoop ecosystem, and what’s ahead.
What is Spark?
Spark is a data processing framework that supports a variety of workloads, including batch, streaming, interactive, and iterative processing and querying, through a powerful set of built-in libraries: Spark Streaming, Spark SQL, MLlib, GraphX, and now SparkR. All of these libraries use the same execution engine and the same storage abstraction. This makes it trivial to combine the capabilities these libraries provide within a single application. For instance, one can easily invoke machine learning algorithms from Spark Streaming or from batch Spark jobs, or use Spark SQL to query live streaming data. This tight integration enables new enterprise IT applications that were not possible before, such as online fraud detection and real-time large-scale optimization.
What is Spark’s relationship to the Hadoop ecosystem?
Hadoop has developed an important ecosystem, and many Spark deployments are an integral part of a Hadoop environment. That said, we are seeing a lot of applications in non-Hadoop environments, either connecting directly to traditional database systems or in the public cloud without Hadoop. The goal of the Spark project is not to replace Hadoop, but rather to integrate and interoperate well with a variety of systems to make it easier to build more powerful applications.
How is Spark complementary to Hadoop? Can it be used without Hadoop?
Spark is compatible with the existing Hadoop stack. Hadoop as a whole generally means an entire ecosystem of software. The Hadoop stack consists of three layers: storage layer (HDFS), resource management layer (YARN), and execution layer (Hadoop MR). Spark is situated at the execution layer, runs on top of YARN, and can consume data from HDFS. As a result, Spark can seamlessly interoperate with Hadoop.
While there are many Spark deployments in the Hadoop environment, we are seeing a lot of applications in non-Hadoop environments. We are seeing more and more standalone Spark deployments (48% of Spark Survey 2015 respondents), as well as a rapid growth of Spark’s usage in the public cloud (51% of Spark Survey 2015 respondents). We expect these trends to continue.
What are some of the key capabilities of Spark versus Hadoop MapReduce?
Spark enables enterprises to process large amounts of data faster than ever, and at the same time allows them to dramatically simplify their infrastructures by obviating the need to integrate a disparate set of complex tools.
These benefits follow directly from Spark’s ability to provide three highly desirable properties. First, Spark goes beyond batch computation and provides a unified platform that supports streaming, interactive analytics, and sophisticated data processing, such as machine learning and graph algorithms, with its built-in libraries. Second, it is fast, as it has been built from the ground up to process data in memory; Spark’s optimizations, however, extend beyond memory. Third, it makes it much easier to write big data applications by exposing a rich and expressive API in a variety of languages, including Python, Java, Scala, and R. In particular, it exposes hundreds of operators, with “map” and “reduce” being just two of them.
Compared with Hadoop MapReduce, Spark is up to 100 times faster, requires between two and five times fewer lines of code to write big data applications, and, functionality-wise, can replace not only MapReduce but also other systems in the Hadoop ecosystem, such as Storm, Mahout, and Giraph.
Are there proof points for this?
Spark has already made a significant impact in many production enterprise deployments across a wide variety of use cases, at organizations such as NBC, Goldman Sachs, Toyota, SK Telecom, Netflix, and Airbnb, with more expected in the coming year.
What are some of the use cases that Spark excels in?
Spark solves many problems in three broad categories: data warehousing, advanced analytics such as building machine learning models, and processing real-time events. But Spark particularly excels in heterogeneous data environments, where teams must combine these workloads to solve the really hard data problems. It allows data practitioners to leverage their existing skill sets to tackle more complex data problems outside of their traditional boundaries: a data warehousing analyst can apply their SQL skills to streaming; a data scientist familiar with Python can now handle workloads beyond a single machine to solve complex big data problems; and so on.
What is Databricks’ relationship to the Apache Spark project?
Spark was created by Matei Zaharia, Databricks’ CTO and co-founder, as part of his research project at UC Berkeley a little more than five years ago, prior to the company being founded. All six of our co-founders worked on the project at UC Berkeley; Databricks, the company, was only founded in July 2013. In 2015, we continued to be the largest contributor to the Apache Spark project, with 10 times more code contributions than any other company. Since its general availability in June 2015, the Databricks platform has been adopted by more than 200 paying customers, giving it the largest customer base of any enterprise Spark vendor. Databricks trained more than 20,000 Spark developers in 2015, again more than any other company. And students have spent hundreds of hours with the high-quality Spark content Databricks has developed for MOOCs and tutorials.
Some people have said that Spark may actually surpass Hadoop in terms of adoption.
In the Spark Survey that we released in September 2015, we highlighted that Spark use is growing beyond Hadoop. To our surprise, only 40% of Spark deployments use Hadoop YARN, suggesting a growing number of Spark deployments outside of Hadoop. While some users run Spark on on-premise Hadoop clusters, they are no longer a majority of its users. Spark also integrates with many storage systems (e.g., Cassandra, HBase, and S3) and is pluggable, with dozens of third-party libraries and storage integrations.
In the recently released Stack Overflow Developer Survey, the results also indicate that the demand for and the desire to gain Spark skills is surging past Hadoop.
What is contributing to the strong uptake of Spark now? Is it the ability to use some of the more popular programming languages that is making it more accessible?
Spark fulfills the pent-up demand for flexibility and simplicity in big data programming frameworks. It is flexible enough to unify real-time streaming, advanced analytics, and traditional SQL in the most popular languages: Python, R, Scala, and Java. It is also simple to learn and use, often requiring a fraction of the code needed by previous systems such as Hadoop. Most importantly, Spark has embraced the community of developers who wanted to contribute: more than 1,000 developers had contributed to Spark as of December 2015.
What are some of the additional capabilities that Spark has added recently?
Some of the recent major developments that have gone into the project include new APIs for fast and simple data processing, such as DataFrames and Datasets; support for the R language; more powerful machine learning capabilities; platform APIs; Project Tungsten and its performance optimizations; faster Spark Streaming; and new graph processing capabilities.
What’s ahead in terms of Spark development that users are most anticipating?
Matei Zaharia previewed the roadmap in his talk at Spark Summit East. Some specifics include: optimization improvements as part of Project Tungsten, leading to further performance gains; improvements to Spark’s real-time streaming engine through “structured streaming,” a high-level streaming API built on the Spark SQL engine and designed to take advantage of the Tungsten optimizations as well as other built-in optimizations; and unification of the Dataset and DataFrame structured-data APIs into a single API.
If you had one piece of advice for someone embarking on a big data project, what would that be?
The two frequently overlooked levers in maximizing big data project ROI are time-to-value and the long tail of hidden costs. Make sure that the technology platform can, on the one hand, immediately deploy production use cases that have a material impact on the business, and, on the other, is low-maintenance enough to minimize dependence on scarce technical talent.