Page 1 of 2 next >>

Hadoop Fundamentals and Key Technologies in the Evolving Hadoop Ecosystem

With all the talk about “big data” in the last few years, the conversation is now turning to: What can be built on this platform?

It isn’t just about the analytics. Many people talk about data lakes, but in reality, organizations are looking beyond the data lake. They’re looking for a solution with a flexible infrastructure that makes it quick to find and link the right information; gives end users self-service access to data without needing to become experts in SQL and complex database schemas; and universally and consistently enforces fine-grained privacy and security. These types of platforms can scale easily and are very cost-effective, so it’s time that organizations start building business applications on top of this big data platform.

Data Integration Is Key

One key hindrance has been data integration. Today’s data-rich organizations require their applications and data integration to work closely together. By a commonly cited estimate, most data scientists spend only 20% of their time on data analysis. The rest is spent on data integration, and as you can imagine, it’s quite expensive to have data scientists spending 80% of their time on that work.

In order to gain a competitive advantage, organizations need to rely on timely, relevant, and trustworthy data for their top business initiatives. The success of any big data project really depends on having a data platform that supports the widest variety of data processing, analytics, and applications. If a company can provide fast and secure data to decision makers, it will be able to shorten its data-to-action cycle, stay ahead of the competition, and drive business results.

While Hadoop can now be considered reasonably pervasive throughout the industry, it doesn’t natively support industry-standard interfaces. Getting data into and out of Hadoop requires extra work to create adapters. This can be a barrier for many users trying to work with Hadoop, and it can eat into the time it takes to bring new ideas to fruition.


Relying on open standards such as NFS and POSIX is the best way to integrate data into a big data platform, but it’s the ability to converge the data on a single platform that makes all the difference. In the past, databases typically ran in one place, the data warehouse was housed in another, and the business applications were, you guessed it, scattered all over the place. A converged data platform has real-time data transport capability embedded in it. It helps you avoid a costly patchwork of data silos by providing a single, global, always-on data system. Once a converged data platform is in place, the doors open to new ways of implementing business applications.
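
To make the point about NFS and POSIX concrete, here is a minimal Python sketch, assuming the platform is exposed as an ordinary mounted directory. A temporary directory stands in for the mount point, and the helper names `append_event` and `read_events` and the file name `events.csv` are hypothetical. The idea is simply that standard file calls are enough, with no custom adapter in between.

```python
import csv
import tempfile
from pathlib import Path


def append_event(mount_point: Path, name: str, row: list[str]) -> Path:
    """Append one CSV row using ordinary file I/O; no Hadoop client is involved."""
    target = mount_point / name
    with target.open("a", newline="") as handle:
        csv.writer(handle).writerow(row)
    return target


def read_events(path: Path) -> list[list[str]]:
    """Read the rows back with the same standard-library calls."""
    with path.open(newline="") as handle:
        return list(csv.reader(handle))


# Assumption: a temporary directory stands in for the NFS mount point so the
# sketch runs anywhere; on a real cluster this would be the mounted path.
with tempfile.TemporaryDirectory() as tmp:
    path = append_event(Path(tmp), "events.csv", ["evt-1", "login", "alice"])
    append_event(Path(tmp), "events.csv", ["evt-2", "logout", "alice"])
    rows = read_events(path)
```

Any existing tool that reads and writes files could work the same way against such a mount, which is the appeal of standards-based access.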


Databases in the big data space are typically referred to as NoSQL databases. Many of the NoSQL databases may actually be queried with SQL via tools such as Apache Drill. In this space, one can easily leverage the HBase API and make use of HBase tables, MapR-DB tables, or even Google Bigtable. This model is extremely fast, linearly scalable, and delivers consistent data access patterns.
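
As a rough illustration of querying a NoSQL table with SQL, the sketch below packages a statement for Apache Drill's REST query endpoint. It assumes a Drill instance listening on its default web port on localhost; the `customers` table, the `hbase` storage plugin name, and the helper `build_drill_request` are hypothetical. The request is only built, not sent, so the example runs without a cluster.

```python
import json
import urllib.request

# Assumption: a local Drill instance with its REST interface on the default web port.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql: str, url: str = DRILL_URL) -> urllib.request.Request:
    """Package a SQL statement as a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical storage plugin and table names, for illustration only.
request = build_drill_request("SELECT * FROM hbase.`customers` LIMIT 10")

# Sending is left commented out so the sketch runs without a cluster:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```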

There has also been a lot of activity around OJAI, the Open JSON Application Interface. This defines an API for operating on documents stored in JSON format. MapR-DB, for example, implements the OJAI interface to expose a document database. The ability to store autonomous records in JSON format in a document database is a very compelling option for application engineers. One line of code is all it takes to (de)serialize a data structure in JSON format, and this greatly simplifies the application development lifecycle. Additionally, this enables performing analytics in place without transformation and leveraging data at scale by removing the need to be concerned about how to scale the database behind your application. Testing times are reduced, and applications can be promoted into a production environment in a more timely manner.
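
The "one line of code" claim is easy to see with Python's standard `json` module. This sketch does not use the OJAI API itself; the record and its field names are invented for illustration.

```python
import json

# Hypothetical application record; the field names are illustrative only.
order = {
    "_id": "order-1001",
    "customer": {"name": "Ada", "tier": "gold"},
    "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
}

# One line serializes the nested structure into a JSON document...
document = json.dumps(order)

# ...and one line turns the document back into a data structure.
restored = json.loads(document)
```

Because the nested record round-trips without a mapping layer, the shape the application works with can be the same shape a document database stores.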

When considering building an application on top of a converged data platform, the topic of an RDBMS will come up. It is becoming more common for people to build applications on this platform that are architected as microservices. RDBMSs have been a hindrance to the core model of microservices, as most microservices depend on a database of some sort. A shared database schema causes fragility and is problematic because microservices are meant to be decoupled and deployed independently. There is also a significant burden associated with deploying and managing the number of RDBMS instances needed to support microservices.
