Hadoop Fundamentals and Key Technologies in the Evolving Hadoop Ecosystem

With all the talk about “big data” in the last few years, the conversation is now turning to: What can be built on this platform?

It isn’t just about the analytics—many people talk about data lakes, but in reality, organizations are looking beyond the data lake. They’re looking for a solution that has a flexible infrastructure that quickly enables finding and linking the right information; gives end users self-service access to data without needing to become experts in SQL and complex database schemas; and universally and consistently enforces fine-grained privacy and security. These types of platforms can scale easily and are very cost-effective, so it’s time that organizations start building business applications on top of this big data platform.

Data Integration Is Key

One key hindrance has been data integration. Today’s data-rich organizations require their applications and data integration to work closely together. By a commonly cited estimate, data scientists spend only about 20% of their time on data analysis and the remaining 80% on data integration, which is an expensive use of their time.

In order to gain a competitive advantage, organizations need to rely on timely, relevant, and trustworthy data for their top business initiatives. The success of any big data project really depends on having a data platform that supports the widest variety of data processing, analytics, and applications. If a company can provide fast and secure data to decision makers, it will be able to shorten its data-to-action cycle, stay ahead of the competition, and drive business results.


For more insight into big data technologies, read the new Big Data Sourcebook


While Hadoop can now be considered reasonably pervasive throughout the industry, it doesn’t deliver on industry standards for data access. Moving data into and out of Hadoop requires extra work to create adapters. This can be a barrier for many users trying to work with Hadoop, and it eats into the time needed to bring new ideas to fruition.

Reliance on open standards such as NFS and POSIX is the best way to leverage data integration into a big data platform, but it’s the ability to converge the data on one single platform that makes all the difference. In the past, databases typically ran in one place, while the data warehouse was housed in another location, and the business applications were—you guessed it—scattered all over the place. By implementing a converged data platform, you have real-time data transport capability embedded in the data platform. It helps you avoid a costly patchwork of data silos by having a single, global, always-on data system. Once a converged data platform is in place, suddenly the doors are opened in terms of considering how to implement new business applications.
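As a rough illustration of why standard file interfaces reduce integration work: if the platform is exposed as an NFS mount, any program can read and write cluster data with ordinary file calls and no custom adapter. The sketch below uses a temporary directory as a stand-in for such a mount; the mount point, file name, and record fields are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["user", "action"]  # illustrative schema, not from the article


def append_events(path, rows):
    """Append rows to a CSV file using ordinary file I/O."""
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # write the header only once
        writer.writerows(rows)


def read_events(path):
    """Read every row back through the same standard file interface."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


# Stand-in for a cluster mounted over NFS (hypothetical; a real deployment
# would use whatever mount point its administrator configured).
mount_point = Path(tempfile.mkdtemp())
events_file = mount_point / "events.csv"

append_events(events_file, [{"user": "alice", "action": "login"}])
append_events(events_file, [{"user": "bob", "action": "purchase"}])
events = read_events(events_file)
```

The point of the sketch is that nothing in it is specific to a big data system: the same code works against a local disk or a mounted cluster.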

Databases

Databases in the big data space are typically referred to as NoSQL databases. Many of the NoSQL databases may actually be queried with SQL via tools such as Apache Drill. In this space, one can easily leverage the HBase API and make use of HBase tables, MapR-DB tables, or even Google Bigtable. This model is extremely fast, linearly scalable, and delivers consistent data access patterns.

There has also been a lot of activity around OJAI, the Open JSON Application Interface. This defines an API for operating on documents stored in JSON format. MapR-DB, for example, implements the OJAI interface to expose a document database. The ability to store autonomous records in JSON format in a document database is a very compelling option for application engineers. One line of code is all it takes to (de)serialize a data structure in JSON format, and this greatly simplifies the application development lifecycle. Additionally, this enables performing analytics in place without transformation and leveraging data at scale by removing the need to be concerned about how to scale the database behind your application. Testing times are reduced, and applications can be promoted into a production environment in a more timely manner.
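The "one line of code" claim can be seen with the JSON module in Python's standard library. This is a minimal sketch; the record and its field names are hypothetical.

```python
import json

# Hypothetical application record; the field names are illustrative only.
order = {
    "order_id": "A-1001",
    "customer": {"name": "Ada", "tier": "gold"},
    "items": [{"sku": "X1", "qty": 2}, {"sku": "Y7", "qty": 1}],
}

serialized = json.dumps(order)     # one line to serialize
restored = json.loads(serialized)  # one line to deserialize
```

Because the nested structure survives the round trip unchanged, the application can hand the same record to a document database without first flattening it into tables.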

When considering building an application on top of a converged data platform, the topic of an RDBMS will come up. It is becoming more common for people to build applications on this platform that are architected as microservices. RDBMSs have been a hindrance to the core model of microservices, as most microservices depend on databases of some sort. A shared database schema causes fragility and is problematic for microservices because they are meant to be decoupled and deployed independently. There is also a significant burden associated with deploying and managing the number of RDBMS instances needed to support microservices.

Many of the applications being built every day still require a database for persistence. It is important to be aware of the needs of these database applications. Applications that do not require transaction support are great candidates for a document database. A big point of contention in this space is that most data is mapped to multiple tables in an RDBMS, which requires a transaction to safely update a single record. That is not a core requirement in a document database, as the entire record is written as a single entity. The one core requirement that developers will look for in any database is support for CRUD (creating, reading, updating, and deleting). Due to the simple OJAI API, these actions are accomplished in far fewer lines of code than with a traditional database.
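To make the CRUD point concrete, here is a toy in-memory document store. It is a sketch only: the `DocumentStore` class and its method names are hypothetical and are not the OJAI API. It shows that each record, including its nested fields, is written as one entity, so no multi-table transaction is involved.

```python
import copy


class DocumentStore:
    """Toy in-memory document store keyed by document id (hypothetical)."""

    def __init__(self):
        self._docs = {}

    def create(self, doc_id, doc):
        if doc_id in self._docs:
            raise KeyError(f"document {doc_id!r} already exists")
        # The whole record, nested fields included, is stored as one entity.
        self._docs[doc_id] = copy.deepcopy(doc)

    def read(self, doc_id):
        doc = self._docs.get(doc_id)
        return copy.deepcopy(doc) if doc is not None else None

    def update(self, doc_id, doc):
        if doc_id not in self._docs:
            raise KeyError(f"document {doc_id!r} does not exist")
        self._docs[doc_id] = copy.deepcopy(doc)  # single write replaces the record

    def delete(self, doc_id):
        return self._docs.pop(doc_id, None) is not None


store = DocumentStore()
store.create("u1", {"name": "Ada", "addresses": [{"city": "London"}]})
store.update("u1", {"name": "Ada",
                    "addresses": [{"city": "London"}, {"city": "Paris"}]})
updated = store.read("u1")
deleted = store.delete("u1")
```

In a relational design the addresses would typically live in a second table, and the update above would need a transaction spanning both.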

Many NoSQL databases require a record to be pulled out of the database, updated, then pushed back into the database. Document databases such as those based on the OJAI API support updating documents directly in the database. The most prominent benefit of this approach is that the developer no longer needs to create an object-relational mapping (ORM) of in-memory data structures to some number of tables. ORM is error-prone and requires a significant amount of time and quality assurance effort to ensure the persistence layer of the application is working properly.
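The difference between the two update styles can be sketched as follows. The `Table` class, `set_field` method, and dotted-path convention are assumptions made for illustration; they are not the signatures of OJAI or of any particular database client.

```python
def apply_mutation(doc, path, value):
    """Set the field at a dotted path inside doc, creating parents as needed."""
    keys = path.split(".")
    node = doc
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


class Table:
    """Toy table that applies field-level mutations on the storage side."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, doc):
        self._docs[doc_id] = doc

    def get(self, doc_id):
        return self._docs[doc_id]

    def set_field(self, doc_id, path, value):
        # The caller sends only the change, never the whole document.
        apply_mutation(self._docs[doc_id], path, value)


table = Table()
table.insert("u1", {"name": "Ada", "address": {"city": "London"}})

# Instead of get -> modify -> put, send one targeted mutation.
table.set_field("u1", "address.city", "Paris")
table.set_field("u1", "preferences.newsletter", True)
result = table.get("u1")
```

The application code never maps the document onto tables, which is where an ORM layer would otherwise be required.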

What’s Old Is New Again

Service-oriented architecture (SOA) and the enterprise service bus (ESB) model are terms from the past. Some people had good experiences with these technologies; others, not so much. Even if you remember the failed entry of SOA about 10 years ago, it is important to remember that trends of the past tend to crop up again, and they often inspire even greater innovations. Nowadays, new technologies such as Kafka and MapR Streams, which are publish/subscribe messaging platforms, are taking the big data world by storm. Some may compare them to ESBs from the past. While they may look similar on the surface, it’s what is under the covers that makes the difference:

  • In the past, organizations would take a batch-oriented approach: they would load all of their data into the system on a schedule (every hour, every day) and only then do the work of processing and analyzing it. With a stream-based architecture, you can do that in real time.
  • With the addition of the Kafka API and MapR Streams, any type of distributed workload can be handled, ranging from the bulk processing of Hadoop applications (MapReduce, Hive, HBase, etc.) to real-time stream processing using Spark, Storm, or any other data streaming capability. It’s really becoming an operating system for data and a global system of record. With a stream-based approach to designing architecture for big data systems, you gain greater control over who uses data and how you can build new parts of your system as you go forward.
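The publish/subscribe model described in the points above can be sketched with a toy append-only topic. This is an in-memory illustration under stated assumptions, not the Kafka or MapR Streams API; the `TopicLog` class and the consumer names are hypothetical. The key idea is that messages are retained and each consumer tracks its own position, so a real-time consumer and a late-joining bulk consumer can read the same stream independently.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only topic: messages are retained, consumers track offsets."""

    def __init__(self):
        self._messages = []
        self._offsets = defaultdict(int)  # consumer name -> next offset to read

    def publish(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the newly appended message

    def poll(self, consumer, max_messages=10):
        start = self._offsets[consumer]
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


topic = TopicLog()
for reading in [21.5, 22.0, 23.1]:
    topic.publish({"sensor": "s1", "temp": reading})

alerts_seen = topic.poll("alerting")    # real-time consumer reads what exists
topic.publish({"sensor": "s1", "temp": 24.0})
alerts_next = topic.poll("alerting")    # receives only the new message
archive_seen = topic.poll("archiver")   # late subscriber replays everything
```

A traditional queue would have removed each message once delivered; here the archiver can still read the full history after the alerting consumer has moved on.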

A Return to a Service-Oriented Architecture

What was once service-oriented architecture has morphed into microservices environments. Microservices are essentially applications that are broken down into small pieces. Using microservices, businesses can prevent large-scale failure by isolating problems, and save on computing resources. The idea behind microservices began as something of a backlash against the complexity of SOA and ESB practices. Over the last decade, there has been a strong movement toward a more flexible style of building large systems. The microservices approach has become more the rule than the exception, at least partly because companies that adopt this style of architecture move faster and compete better. The idea itself is straightforward: Larger systems can be built by decomposing their functions into relatively simple, single-purpose services that communicate via lightweight techniques. These services can be built and maintained by small, very efficient teams.

When you start thinking about all of the technologies that go into being able to scale systems and manage your persistence stores, and when you start decoupling these with the messaging platform, it actually simplifies your architecture. Message-driven architectures and service-oriented architectures are not brand new. They have been around for a long time and will continue to flourish. What’s happening is that these technologies and architectures are becoming more inter-related as they mature.

Message-driven architectures sitting on a converged data platform open up the opportunities to simplify application development and deployment at scale. Data movement is built into the platform by supporting the Kafka API to publish messages, and consumers can be scaled out with relative ease to handle any size workload. The converged data platform also has the database waiting to be used at scale. This enables analytics to be performed in the same place, thus closing the loop on necessary capabilities to drive the business.
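One common way consumers are scaled out is by splitting a topic into partitions and dividing those partitions among the consumers in a group. The sketch below illustrates that idea only; the function names are hypothetical, and the CRC32 hash and round-robin assignment are simple stand-ins rather than the exact strategies any real client uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Spread partitions round-robin across the consumers in a group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Messages with the same key always land in the same partition,
# so per-key ordering is preserved however many consumers exist.
p_first = partition_for("user-42", 6)
p_again = partition_for("user-42", 6)

# Adding a consumer reduces the share of partitions each one handles.
two_consumers = assign_partitions(6, ["c1", "c2"])
three_consumers = assign_partitions(6, ["c1", "c2", "c3"])
```

With two consumers each handles three partitions; with three, each handles two. Growing the group is how the workload is spread without changing the publishers.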

What’s Ahead

Everybody has been really comfortable with batch for a long time. These days, however, it’s about streaming events that are constantly occurring and having to deal with them one event at a time. The capability to handle events as they happen is critical. It’s not really much of a surprise that streaming technologies are as big as they are at this point. Being able to move from batch to real time is not terribly complicated, depending on your use case. And, as organizations get more comfortable with these technologies, they are starting to realize that they can do more and more.
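For a simple use case, the move from batch to one event at a time can be as small as replacing a whole-dataset computation with an incremental one. The following sketch compares a batch average with a running average that is updated per event; the class name and sample values are illustrative assumptions.

```python
def batch_average(values):
    """Batch style: wait for all the data, then compute once."""
    return sum(values) / len(values)


class RunningAverage:
    """Streaming style: update the result as each event arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental mean: shift the old mean toward the new value.
        self.mean += (value - self.mean) / self.count
        return self.mean


events = [10.0, 20.0, 30.0, 40.0]

stream = RunningAverage()
running = [stream.update(value) for value in events]  # answer after every event
final_batch = batch_average(events)                   # answer only at the end
```

Both approaches agree once all events have been seen, but the streaming version has a usable answer after every event instead of only after the batch completes.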

Today, organizations need features and functions that can augment their existing enterprise data warehouse strategies. Organizations are turning to the idea of a converged data platform that is capable of managing and processing internal and external data of diverse types in a variety of formats, in combination with data from traditional, internal sources. A converged data platform will be the key to powering the next generation of data-driven applications. For a broad set of applications and uses, these organizations need a platform that integrates Hadoop and Spark, real-time database capabilities, and global event streaming with web-scale storage, so that they can develop, deploy, manage, and secure data applications in a single cluster.

When we’re looking at all of this data, we need to have the scale-out that we require: a decoupled, message-driven, service-oriented architecture and a messaging platform that can actually support the volumes that we need. With the message queues of yesteryear, you had to worry about whether the queue could keep up with those volumes. Nowadays, you can take advantage of a message-driven, service-oriented architecture that supports messaging, files, and tables and that makes asynchronous microservices much easier to build and maintain.

From a cost perspective, this model is a big money-saver. It delivers better utilization of computing resources, and it takes less time to build applications on this stack, which scale out of the box. All of this comes at a time when IT spending is flat. These capabilities directly enable businesses to continue innovating without the need to worry about higher costs.

It’s clear that there are a lot of technologies to keep an eye on. Be thoughtful about how you leverage these new technologies. They bring with them the ability to think differently by simplifying business processes, which can enable a business to directly integrate analytics into core business functions.

