Hadoop, which is marking its 10th anniversary this year, has expanded well beyond its early days as a platform for batch processing of large datasets on commodity hardware. Although the name has become synonymous with big data technology, Hadoop now represents a vast system of more than 100 interrelated open source projects.
In the wide world of Hadoop today, seven technology areas have garnered a particularly high level of interest. These key areas show that Hadoop is not just a big data tool; it is a robust ecosystem in which new projects are assured of exposure and interoperability because of the strength of the environment.
Here, we take a look at the key technologies that are shaping the Hadoop ecosystem and why they are important for the future.
7 Key Technologies Shaping the Hadoop Ecosystem
1. Web Notebooks
Web notebooks are a way to write code within the web browser and have it run against a cluster of servers. Most web notebooks support languages such as Scala and Python, as well as markup such as HTML and Markdown, which makes it easier to present the notebook itself. Integration of SQL into web notebooks has also become a popular feature, although capabilities vary greatly from one notebook to another.
Quite possibly the most popular web notebook currently in use is Jupyter, which was initially called IPython. Driven by the growing need for a simple way to write and execute code, Jupyter evolved quickly. It features a pluggable kernel architecture that allows additional languages to be integrated into the platform, and it now supports more than 50 of them through an easy-to-use interface. While extremely popular, Jupyter is limited to a single language per notebook. Most recently, it has gained the ability to run Spark code from within the notebook, which makes it a viable candidate in the Hadoop ecosystem: it opens the door to Spark users and can take advantage of Spark's ability to run at scale.
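As a rough illustration, here is the kind of PySpark snippet you might run in a Jupyter cell once a Spark-enabled kernel is configured; the application name and file path below are hypothetical, not part of any particular setup.

```python
# Minimal sketch: run Spark code from a notebook cell (assumes PySpark is on the kernel's path).
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("notebook-demo")  # hypothetical application name
sc = SparkContext(conf=conf)

# Count error lines in a (hypothetical) log file stored on the cluster
log = sc.textFile("/data/app/server.log")
errors = log.filter(lambda line: "ERROR" in line)
print("error lines:", errors.count())
```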
There are a number of other web notebooks, but my personal favorite is Apache Zeppelin (incubating). From the perspective of getting started, no web notebook is easier than Zeppelin. While Zeppelin's user interface is simple and powerful, Jupyter provides a more intuitive interface for quickly running code and visualizing data. In terms of feature set, however, Zeppelin is quite likely the best out there. One powerful feature is its ability to run multiple languages within the same notebook, so a single notebook can pull information together in a variety of ways. That flexibility is the way forward: running Apache Flink code in one paragraph, Apache Spark in the next, and a SQL query via Apache Drill in the one after that is no small feat.
Another powerful feature is Zeppelin’s ability to expose a block of code as a URL that can be embedded in other pages. Just imagine running code against your dataset (regardless of size) and exposing the result in raw, HTML, or graph form on another webpage or dashboard. That capability alone makes this software an ideal candidate for a number of use cases.
The main limitation of these notebooks today lies in the realm of security. There is no real security model built into them yet, but putting a web server in front of them can achieve some level of protection.
2. Algorithms for Machine Learning
The application of machine learning algorithms is a hot topic, and there are a number of important reasons for this. The first is that most people can see the potential of leveraging machine learning algorithms to gain more insight into the data they have. Whether the goal is creating a recommendation engine, personalizing a website, identifying anomalies, or detecting fraud, interest in this area is strong.
The better-known libraries in this space that can be “easily” leveraged include Apache Mahout, Spark MLlib, FlinkML, Apache SAMOA, H2O, and TensorFlow. Not all of these are interoperable, but Mahout can run on Spark and Flink, SAMOA runs on Flink, and H2O runs on Spark. Cross-platform compatibility is becoming an important topic in this space, so look for more cross-pollination.
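To make the "how" a little more concrete, here is a minimal sketch of a recommendation model built with Spark MLlib's ALS implementation; the ratings file, its comma-separated format, and the user id are assumptions for illustration only.

```python
# Minimal collaborative-filtering sketch using Spark MLlib's ALS (RDD-based API).
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="recommender-sketch")

# Hypothetical input: lines of "userId,itemId,rating"
raw = sc.textFile("/data/ratings.csv")
ratings = raw.map(lambda line: line.split(",")) \
             .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))

# Train a model with 10 latent factors over 10 iterations
model = ALS.train(ratings, rank=10, iterations=10)

# Recommend 5 items for a (hypothetical) user with id 42
for rec in model.recommendProducts(42, 5):
    print(rec)
```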
The second and possibly more dominant reason for this area being a hot topic is that there is considerable confusion in this space. While many people understand the problems to which machine learning can be applied, the “how” is a big point of confusion. I’ve heard people say, “Where can I get those models? Are there models available to the public for general purposes?” It isn’t quite as simple as “downloading” a model to use, but it doesn’t have to be overly complicated either.
The best way to gain a better understanding of machine learning algorithms is to read the free books by Ted Dunning and Ellen Friedman, which cover these topics in a succinct, easy-to-consume way. Practical Machine Learning: A New Look at Anomaly Detection and Practical Machine Learning: Innovations in Recommendation can each be read within a few hours.
3. SQL on Hadoop
Apache Hive is the SQL-on-Hadoop technology that has been around the longest, and is probably the most widely used. The Hive Metastore can be leveraged by other technologies such as Apache Drill. The benefit in this case is that Drill can read the metadata from Hive and then run the queries itself, instead of depending upon the Hive MapReduce runtime. This approach is significantly faster and is one of the preferred ways of using Hive.
Hive itself has new features that support performing inserts and updates on tables. While this is a capability that a lot of people are interested in, it is personally an area of concern for me. The problem stems from the wrong motivations for such a feature, as misuse of technology happens all too often. Within the big data space, historical data is often stored in files, and performing updates to historical data is a cause for concern, since historical data should not change. If the data is not historical, it likely lends itself more to transactional processing, and persistence for transactional workloads is not well served by files. Transactional use cases are more logically implemented on top of relational or NoSQL database solutions.
Now that you understand the background of SQL on Hadoop, let’s take a look at the two technologies gaining the most traction in this space: Apache Drill and SparkSQL. If you are writing a program in Spark and can live with a subset of SQL, SparkSQL should be just fine. If you need to plug into a real BI tool, you are better served by Drill and its support for the ANSI SQL:2003 standard. Drill was built to be a SQL query engine for pretty much anything, and it runs at scale. Spark is a general-purpose compute engine that also runs at scale. They have drastically different use cases and share minimal overlap in functionality.
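For a sense of what the SparkSQL side looks like in practice, here is a minimal sketch; the JSON file of orders and its fields are hypothetical.

```python
# Minimal SparkSQL sketch: register a DataFrame as a table and query it with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# Hypothetical JSON file with one order per line: {"customer": "...", "amount": 12.5}
orders = spark.read.json("/data/orders.json")
orders.createOrReplaceTempView("orders")

top = spark.sql("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    LIMIT 10
""")
top.show()
```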
4. Databases
Databases in the big data space are typically referred to as NoSQL databases. The term is suboptimal, because what is usually being discussed are non-relational databases, and many NoSQL databases can actually be queried with SQL via tools such as Apache Drill. To be clear, there is nothing inherently wrong with a relational database; it’s just that people have been using them to store non-relational data for quite some time, and the newer technologies have greatly simplified the storage and access of non-relational data.
In this space, one can easily leverage the HBase API and make use of HBase tables, MapR-DB tables, or even Google Bigtable. This model is extremely fast, linearly scalable, and delivers consistent data access patterns. It was used at Facebook for its messaging platform and at Salesforce, where it is leveraged for a number of use cases in which strong consistency is required. Salesforce also released an open source project, Apache Phoenix, that delivers a relational database model on top of HBase.
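As a rough sketch of what working against the HBase API can look like from Python, here is an example using the third-party happybase client over Thrift; the host, table, and column family names are hypothetical.

```python
# Sketch of basic puts and gets against an HBase table via the happybase client.
# Assumes an HBase Thrift server is reachable at the (hypothetical) host below.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("user_profiles")  # hypothetical table with column family "info"

# Write a row keyed by user id
table.put(b"user42", {b"info:name": b"Alice", b"info:city": b"Oslo"})

# Read it back
row = table.row(b"user42")
print(row[b"info:name"])
```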
If that isn’t enough, there has also been a major increase in activity around OJAI, the Open JSON Application Interface, which defines an API for operating on documents. MapR-DB, for example, implements the OJAI interface to expose a document database. The ability to store self-contained records in JSON format in a document database is a very compelling option for application engineers: one line of code is all it takes to (de)serialize a data structure to or from JSON, which greatly simplifies the application development lifecycle. It also enables analytics in place, without transformation, and lets applications leverage data at scale by removing the need to worry about how to scale the database behind the application.
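The one-line (de)serialization claim is easy to illustrate with Python's standard library; the record shown is hypothetical.

```python
import json

record = {"_id": "user42", "name": "Alice", "purchases": [17.5, 42.0]}

doc = json.dumps(record)     # serialize the data structure to a JSON document
restored = json.loads(doc)   # deserialize it back into a native structure
```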
5. Stream Processing Technologies
It seems these days that everyone wants their stream processing framework to be “the” framework used. There are so many projects (free and paid) in this space that it can make your head spin: Apache Flink, Spark Streaming, Apache Apex (incubating), Apache Samza, Apache Storm, and Akka Streams, as well as StreamSets—and this isn’t even an exhaustive list.
Apache Storm was once considered the leader in this technology area. While its use is declining, the Storm API will likely live on for a long time. It has been adopted by private code bases such as Twitter’s Heron, and it is also supported by Apache Flink, which means projects written against the Storm API are code-compatible with Flink. That is great news, as Flink is very fast and delivers exactly-once semantics at high throughput. Flink also offers a complex event processing (CEP) API.
Apache Spark is a popular choice for streaming applications; however, be aware that it is not real-time or event-based. It falls into the micro-batch category, which introduces inherent latency. That doesn’t make Spark Streaming bad; rather, think of Spark as a better fit for machine learning-based streaming use cases. For an overview of Apache Spark, you can read my free ebook, Getting Started with Apache Spark: From Inception to Production.
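A minimal Spark Streaming sketch shows the micro-batch model in action: records are collected into small batches (here, two seconds) before each computation runs. The socket host and port are placeholders.

```python
# Micro-batch word count with Spark Streaming; batches are formed every 2 seconds.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=2)  # 2-second micro-batches

# Hypothetical text source: a socket emitting lines
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```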
Apache Beam (incubating) is a rising star when it comes to frameworks for both batch and streaming data-parallel processing pipelines. It runs on both Flink and Spark and is worth keeping an eye on.
6. Messaging Platforms
While stream processing engines are hot, messaging platforms are probably hotter. They can be used to create scalable architectures and are taking off like crazy across many organizations.
Companies such as LinkedIn have helped make messaging platforms cool again. The project LinkedIn contributed to the Apache Software Foundation, Apache Kafka, has created a solid and simple-to-use API, and that API has become something of a de facto standard. MapR, for example, recently adopted it with MapR Streams, and other offerings will likely adopt it in the not-too-distant future. The top reason the messaging platform model is so important is that it can support huge volumes of events. Less than 10 years ago, people would get excited about handling 50,000 to 100,000 message events per second on a server. While that was good at the time, it didn’t instill confidence in using such a platform as a true hub for scalable business applications: the concept was solid, but the hardware-to-message throughput ratio was too low to be cost-effective. MapR Streams with the Kafka 0.9 API has shown that it can handle 18 million events per second on a five-node cluster, with each message being 200 bytes in size. Leveraging this capability in a scalable platform with decoupled communications is amazing, and the cost to scale is low enough that a properly built application can grow without re-architecting the entire platform.
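Because the Kafka API has become such a common denominator, a small producer sketch conveys the basic model. This example uses the third-party kafka-python client; the broker address and topic name are hypothetical.

```python
# Sketch of publishing events to a Kafka (or Kafka-API-compatible) topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker1.example.com:9092",            # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a small event; at scale, millions of these per second is the goal
producer.send("clickstream", {"user": "user42", "action": "page_view"})
producer.flush()
```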
Consider going one step further: perform analytics in place in the messaging platform, without moving data, and enable development and quality assurance teams to test with production payloads. The value is tremendous.
7. Global Resource Management
Resource management relates to the ability to constrain the resources (CPU and memory) of an application. Apache Mesos was created to be a general-purpose resource manager for everything in the data center, or even across multiple data centers. Apache YARN was created to be a Hadoop resource manager.
There is some political contention within the open source community in this space, but another project, Apache Myriad (incubating), helps put the politics aside. Myriad enables Mesos to dynamically spin up YARN clusters within a Mesos cluster. The beauty of this approach is that it actually makes YARN more flexible and allows multiple versions of YARN to run on the same cluster, effectively creating walled gardens in which YARN operates. For those who have applications built to run on YARN, this is a great option.
All of the other, non-Hadoop applications in the data center can run side by side, which makes it possible to direct data center resources to whatever is most important at the moment. Businesses gain a great deal of agility when they can apply their assets to whatever is most critical at any given point in time.
Converged Architectural Approach
As you can see, there are a lot of technology areas to keep an eye on. Be thoughtful about how you leverage these new technologies. They bring with them the ability to think differently by simplifying business processes, which can enable a business to directly integrate analytics into core business functions. Many of the technologies in the Hadoop ecosystem are considered big data technologies, but they could also be thought of as scalable data technologies that can be used to deliver a converged architectural approach.
This article first appeared in the Summer issue of Big Data Quarterly Magazine.