Spark is Hot at Strata + Hadoop World

One of the noticeable changes this year at Strata + Hadoop World 2015 was the rise of Apache Spark, an engine for large-scale data processing. In recent months, many companies have extended support to Spark, which can complement Hadoop but can also be deployed without it.

While Spark gained attention last year, "it was basically an idea, with a handful of practical implementations," observed Joe Caserta, president and CEO of Caserta Concepts, a big data, data warehouse, and business intelligence consulting organization. In the last two years, the focus has shifted from purely relational databases to Hadoop, with some NoSQL, and now that focus is shifting again, this time to Spark.

Spark sits on Hadoop, but Spark is essentially its own thing, Caserta said. “Personally, every time we use Spark we are using HDFS, which is the Hadoop distributed file system, for storage, but you don’t need to, and we are hearing that other people are not. Hadoop put the relational databases to bed and now Spark may be putting Hadoop to bed.” It will be interesting to see if Hadoop is really what people are talking about next year, Caserta added.

“A lot of people are talking about doing Spark without Hadoop at all. We did some internal measurements and we measured that about 30% of the installed base of Spark was without Hadoop,” agreed Ryan Peterson, chief solutions strategist at EMC. The majority of that is probably through Databricks, the curator of Spark, he said, while observing that the level of Spark use without Hadoop may change “as Hadoop picks up Spark and does more with it.” There is also the question of how many people will use Apache Mesos with Spark, or use Apache YARN with Spark as Hadoop does a better job of integrating Spark into YARN, he added.

“We are seeing customers get the benefit of the core Spark libraries particularly for machine learning as a starting point, especially when you are doing iterative passes over the data because the data is being held in memory. That is one of the sweet spots for that technology,” said Tim Hall, vice president of product management at Hortonworks. “We started supporting Spark earlier this year and we continue to try to provide the latest and greatest releases for that technology as we believe it is stable for consumption by customers - and that means we have to do our own testing, certification, and validation of it.”

Right now, Spark is one of the hottest technologies, said Hall, who joked that Hadoop is moving at warp factor 7 and Spark is moving at warp factor 9.

Spark may actually surpass Hadoop in adoption, said Caserta. “It is very possible. It is so much easier than MapReduce,” he added. The big difference between this year and last, Caserta noted, is that last year Spark did not support the Python programming language and now it does, and Python is a language most people know. “They don’t know MapReduce, and Python is just an easier language to learn,” he said. When developers are faced with the prospect of having to learn Pig, Hive, and MapReduce, it presents a challenge to Hadoop adoption, he explained. “But not with Spark. You can use Python, you can use SQL, and you can use Scala which is very similar to Java, and so if you are a Java programmer, you can learn Scala pretty easily.”


