
Hadoop Fundamentals and Key Technologies in the Evolving Hadoop Ecosystem

By Jim Scott

Feb 3, 2017

Many of the applications being built every day still require a database for persistence. It is important to be aware of the needs of these database applications. Applications that do not require transaction support are great candidates for a document database. A big point of contention in this space is that most data is mapped to multiple tables in an RDBMS, which requires a transaction to safely update a single record. That is not a core requirement in a document database, as the entire record is written as a single entity. The one core requirement that developers will look for in any database is support for CRUD (creating, reading, updating, and deleting). Due to the simple OJAI API, these actions are accomplished in far fewer lines of code than with a traditional database.
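To make the "single entity" point concrete, here is a minimal sketch of CRUD against a document store. It is an in-memory toy, not the OJAI API: the `DocumentStore` class, its method names, and the sample order are all hypothetical and exist only for illustration.

```python
import copy
import uuid


class DocumentStore:
    """Hypothetical in-memory document store (not the OJAI API)."""

    def __init__(self):
        self._docs = {}

    def create(self, doc):
        # The whole nested record is stored in one write, under one key.
        doc_id = str(uuid.uuid4())
        self._docs[doc_id] = copy.deepcopy(doc)
        return doc_id

    def read(self, doc_id):
        # Return a copy so callers cannot mutate stored state by accident.
        doc = self._docs.get(doc_id)
        return copy.deepcopy(doc) if doc is not None else None

    def update(self, doc_id, doc):
        # Replace the entire document in a single step.
        if doc_id not in self._docs:
            raise KeyError(doc_id)
        self._docs[doc_id] = copy.deepcopy(doc)

    def delete(self, doc_id):
        # Returns True if a document was removed.
        return self._docs.pop(doc_id, None) is not None


# Made-up sample record: customer, line items, and status live together.
store = DocumentStore()
order = {"customer": "Ada", "items": [{"sku": "A1", "qty": 2}], "status": "new"}
order_id = store.create(order)

saved = store.read(order_id)
saved["status"] = "shipped"
store.update(order_id, saved)
```

In a relational schema the order and its line items would typically sit in separate tables, so the same change would need a transaction to keep them consistent. Here there is only one record to write.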

Many NoSQL databases require a record to be pulled out of the database, updated, then pushed back into the database. Document databases such as those based on the OJAI API support updating documents directly in the database. The most prominent benefit of this approach is no longer requiring the developer to create an object-relational mapping (ORM) of in-memory data structures to some number of tables. ORM is error-prone and requires a significant amount of time and quality assurance effort to ensure the persistence layer of the application is working properly.
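The difference between the two update styles can be sketched as follows. This is an assumption-laden illustration, not a real client library: `Table`, `mutate`, `set_field`, and the dotted-path convention are hypothetical names chosen for the example.

```python
import copy


def set_field(doc, path, value):
    """Set a nested field addressed by a dotted path, creating parents as needed."""
    keys = path.split(".")
    node = doc
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


class Table:
    """Hypothetical stand-in for a document table; not a real client library."""

    def __init__(self):
        self.docs = {}
        self.reads = 0  # counts documents shipped back to the client

    def get(self, doc_id):
        self.reads += 1
        return copy.deepcopy(self.docs[doc_id])

    def put(self, doc_id, doc):
        self.docs[doc_id] = copy.deepcopy(doc)

    def mutate(self, doc_id, changes):
        # Applied where the data lives: only the changes travel, not the document.
        for path, value in changes.items():
            set_field(self.docs[doc_id], path, value)


table = Table()
table.put("u1", {"name": "Ada", "address": {"city": "London"}})

# Style 1: pull the record out, change it, push it back.
doc = table.get("u1")
doc["address"]["city"] = "Paris"
table.put("u1", doc)

# Style 2: describe the change and let the table apply it in place.
table.mutate("u1", {"address.zip": "75001", "active": True})
```

Neither style needs a mapping layer between objects and tables, because the document already has the shape of the in-memory structure.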

What’s Old Is New Again!

Service-oriented architecture (SOA) and the enterprise service bus (ESB) model are terms from the past. Some people had good experiences with these technologies; others, not so much. Even if you remember the failed entry of SOA about 10 years ago, it is important to remember that trends of the past tend to crop up again, and they often inspire even greater innovations. Nowadays, new technologies such as Kafka and MapR Streams, which are publish/subscribe messaging platforms, are taking the big data world by storm. Some may compare them to ESBs from the past. While they may look similar on the surface, it’s what is under the covers that makes up the differences:

In the past, organizations would take a batch-oriented approach, where they take all of their data—every hour, every day—into the system, and then they do the work of processing and analyzing the data. With a stream-based architecture, you can do that in real time.
With the addition of the Kafka API and MapR Streams, any type of distributed workload can be handled, ranging from the bulk processing of Hadoop applications (MapReduce, Hive, HBase, etc.) to real-time stream processing using Spark, Storm, or any other data streaming capability. It’s really becoming an operating system for data and a global system of record. With a stream-based approach to designing architecture for big data systems, you gain greater control over who uses data and how you can build new parts of your system as you go forward.
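One way to picture how a single stream can serve both real-time and batch-style work is an append-only log in which every subscriber keeps its own read position. The sketch below is an in-memory toy under that assumption. It is not the Kafka API, and the `Topic` class, the subscriber names, and the click events are hypothetical.

```python
class Topic:
    """Hypothetical in-memory append-only log; not the Kafka API."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # subscriber name -> next offset to read

    def publish(self, message):
        self._log.append(message)
        return len(self._log) - 1  # offset of the new message

    def poll(self, subscriber, max_messages=None):
        # Each subscriber advances its own offset; messages are never removed.
        start = self._offsets.get(subscriber, 0)
        end = len(self._log)
        if max_messages is not None:
            end = min(end, start + max_messages)
        self._offsets[subscriber] = end
        return self._log[start:end]


clicks = Topic()
alerts = []
for page in ["home", "cart", "checkout"]:
    clicks.publish({"page": page})
    # Real-time consumer: reacts as soon as each event arrives.
    for event in clicks.poll("alerting"):
        if event["page"] == "checkout":
            alerts.append(event)

# Batch-style consumer: reads the same events later, in one pass.
report = clicks.poll("hourly-report")
```

Because reading does not remove messages, a new consumer can be added later without changing the publisher or the existing consumers.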

A Return to a Service-Oriented Architecture

What was once service-oriented architecture has morphed into microservices environments. Microservices are essentially applications that are broken down into small pieces. Using microservices, businesses can prevent large-scale failure by isolating problems, and save on computing resources. The evolution of the idea behind microservices started as a bit of a backlash against the complexity of SOA and ESB practices. Over the last decade, there has been a strong movement toward a more flexible style of building large systems. The microservices approach has become more the rule than the exception, at least partly because companies that adopt this style of architecture move faster and compete better. The idea behind microservices is straightforward: Larger systems can be built by decomposing their functions into relatively simple, single-purpose services that communicate via lightweight techniques. These services can be built and maintained by small, very efficient teams.

When you start thinking about all of the technologies that go into being able to scale systems and manage your persistence stores, and when you start decoupling these with the messaging platform, it actually simplifies your architecture. Message-driven architectures and service-oriented architectures are not brand new. They have been around for a long time and will continue to flourish. What’s happening is that these technologies and architectures are becoming more inter-related as they mature.

Message-driven architectures sitting on a converged data platform open up the opportunities to simplify application development and deployment at scale. Data movement is built into the platform by supporting the Kafka API to publish messages, and consumers can be scaled out with relative ease to handle any size workload. The converged data platform also has the database waiting to be used at scale. This enables analytics to be performed in the same place, thus closing the loop on necessary capabilities to drive the business.
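Scaling out consumers usually rests on splitting a stream into partitions and spreading those partitions across the consumers in a group. The following is a simplified sketch of that idea. The CRC32 hash and the round-robin assignment are assumptions chosen for clarity, and the function names are hypothetical; Kafka's own partitioner and group rebalancing work differently in detail.

```python
import zlib


def partition_for(key, num_partitions):
    # Deterministic hash so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    # Round-robin: partition p goes to consumer p mod len(consumers).
    assignment = {name: [] for name in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# Adding a third consumer shrinks each consumer's share of the six partitions.
two = assign_partitions(6, ["c1", "c2"])
three = assign_partitions(6, ["c1", "c2", "c3"])
```

In this scheme the number of partitions caps the useful number of consumers in a group, which is why partition counts are usually chosen with future growth in mind.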

What’s Ahead

Everybody has been really comfortable with batch for a long time. These days, however, it’s about streaming events that are constantly occurring, and having to deal with them one event at a time. The capability to handle things coming through as they happen is critical. It’s not really much of a surprise that streaming technologies are as big as they are at this point. Being able to move from batch to real time is not terribly complicated, depending on your use case. And, as organizations get more comfortable with these technologies, they are starting to realize that they can do more and more.
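For a simple use case, the move from batch to one-event-at-a-time can be as small as swapping a recomputation for an incremental update. The sketch below shows this for an average. The `RunningMean` class, the `batch_mean` function, and the latency values are made up for illustration.

```python
class RunningMean:
    """Keeps an average up to date one event at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: no need to store or rescan earlier events.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def batch_mean(values):
    # Batch style: wait for the full set, then compute once.
    return sum(values) / len(values)


latencies_ms = [120, 80, 100, 140]  # made-up sample values
stream = RunningMean()
history = [stream.update(v) for v in latencies_ms]
```

Both approaches arrive at the same final figure, but the streaming version has a usable answer after every event rather than only at the end of the batch window. Not every computation converts this easily, which is why the effort depends on the use case.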

Today, organizations need features and functions that can augment their existing enterprise data warehouse strategies. Organizations are turning to the idea of a converged data platform that is capable of managing and processing internal and external data of diverse types in a variety of formats, in combination with data from traditional, internal sources. A converged data platform will be the key to powering the next generation of data-driven applications. For a broad set of applications and uses, these organizations need a platform that integrates Hadoop and Spark, real-time database capabilities, and global event streaming with web-scale storage, so that they can develop, deploy, manage, and secure data applications in a single cluster.

When we’re looking at all of this data, we need to have the scale-out that we require: a decoupled, message-driven, service-oriented architecture and a messaging platform that can actually support the volumes that we need. With the message queues of yesteryear, you would have to worry about whether they could keep up with those volumes. Nowadays, you can take advantage of a message-driven, service-oriented architecture that supports messaging, files, and tables and that makes asynchronous microservices much easier to build and maintain.

From a cost perspective, this model is a big money-saver. It delivers better utilization of computing resources, and it takes less time to build applications on this stack, which scale out of the box. All of this comes at a time when IT spending is flat. These capabilities directly enable businesses to continue innovating without the need to worry about higher costs.

It’s clear that there are a lot of technologies to keep an eye on. Be thoughtful about how you leverage these new technologies. They bring with them the ability to think differently by simplifying business processes, which can enable a business to directly integrate analytics into core business functions.