Integrating Hadoop and NoSQL for a Real-Time Big Data Architecture

While Hadoop, a platform for big data analytics, and NoSQL, a database technology for enterprise, web, mobile, and IoT applications, may seem different, they do complement each other.

In a recent DBTA webcast, Shane Johnson, senior product marketing manager, Couchbase, discussed the relationship between NoSQL and Hadoop, detailing the multiple ways to integrate NoSQL databases with Hadoop, and explaining reference architectures implemented by enterprises that are successfully leveraging big data in the real world.  

“It’s not Hadoop or Couchbase Server. It’s Hadoop and Couchbase Server,” said Johnson. Instead of viewing it as an “either/or” proposition, users should view the combination as an opportunity: Hadoop and NoSQL can complement one another. One of the first methods of integration came from the Couchbase community, said Johnson, who explained that some of Couchbase’s customers were quicker than Couchbase itself when it came to integration. Their efforts focused on Storm and Flume, and once the ecosystem began to mature, Couchbase also began to focus on Spark, Kafka, and Sqoop, he said.

When integrating with Sqoop, data enters Couchbase Server, passes through Sqoop, and is exported to Hadoop. This type of environment is circular: once the data reaches Hadoop, the user can take the analysis results and feed them back into Couchbase Server, and the whole process starts over.
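The circular batch flow above can be sketched in a few lines of Python. This is purely illustrative: every function and bucket name here is hypothetical, and a real pipeline would invoke the Couchbase Sqoop connector and an actual Hadoop job rather than these in-memory stand-ins.

```python
# Illustrative sketch of the circular Sqoop-style batch flow.
# All names are hypothetical stand-ins, not real connector APIs.

def export_to_hadoop(bucket):
    """Batch export: Sqoop reads documents out of Couchbase into Hadoop."""
    return list(bucket.values())

def analyze_in_hadoop(records):
    """Stand-in for a Hadoop job, e.g. counting events per user."""
    counts = {}
    for record in records:
        counts[record["user"]] = counts.get(record["user"], 0) + 1
    return counts

def import_results(bucket, counts):
    """Write the results back into Couchbase for the application,
    closing the loop so the next batch cycle can start over."""
    for user, n in counts.items():
        bucket[f"count::{user}"] = {"user": user, "events": n}

bucket = {"e1": {"user": "alice"}, "e2": {"user": "bob"}, "e3": {"user": "alice"}}
import_results(bucket, analyze_in_hadoop(export_to_hadoop(bucket)))
print(bucket["count::alice"])  # {'user': 'alice', 'events': 2}
```

The point of the sketch is the shape of the loop, not the mechanics: export, analyze, re-import, repeat on the next batch.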

The next integration starts the same way, with data entering Couchbase Server, but the data is then streamed into Kafka and exported to Hadoop. “Where Sqoop was more batch-oriented depending on the frequency, Kafka is continuous, so as data is going into the Couchbase Server, it is also going into Hadoop,” said Johnson.
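The contrast with the batch approach can be sketched as a dual write: each mutation lands in Couchbase and is mirrored to a Kafka topic as it happens. Again, the names are hypothetical stand-ins; a real deployment would use the Couchbase Kafka connector on one side and a Kafka consumer writing into Hadoop on the other.

```python
# Illustrative sketch of the continuous Kafka flow: every write that
# lands in Couchbase is also published to a Kafka topic feeding Hadoop.
# In-memory stand-ins only; not real connector APIs.

couchbase = {}     # stand-in for the Couchbase bucket
kafka_topic = []   # stand-in for the Kafka topic that Hadoop consumes

def write(doc_id, doc):
    couchbase[doc_id] = doc            # the application write
    kafka_topic.append((doc_id, doc))  # streamed out continuously, not in batches

for i, user in enumerate(["alice", "bob"]):
    write(f"e{i}", {"user": user})

print(len(kafka_topic))  # 2: every write is mirrored to the stream
```

Unlike the Sqoop cycle, there is no export step to schedule: the stream side stays current with the operational side by construction.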

Johnson also covered the database change protocol and how it allows Kafka to work with Hadoop. While the first two approaches were circular and linear, the third is essentially a fork. Data begins in Kafka and flows through Storm, a stream processor, which selects whether the data moves toward Hadoop or Couchbase Server. For example, the raw data may go to Hadoop, where it can be stored in a data lake, while the processed data is placed in Couchbase Server for applications.
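The fork can be sketched as a single routing step that sends each event down both branches in its appropriate form. This is a minimal sketch under stated assumptions: the processing step, document keys, and sinks are all hypothetical, and a real topology would implement this as a Storm bolt rather than a plain function.

```python
# Illustrative sketch of the fork: a Storm-style step consumes events
# from Kafka, keeps a raw copy for the Hadoop data lake, and sends
# processed results to Couchbase for applications. All names hypothetical.

data_lake = []   # stand-in for HDFS / the data lake
couchbase = {}   # stand-in for the Couchbase bucket

def process(event):
    """Stand-in stream-processing step, e.g. enrichment or scoring."""
    return {"user": event["user"], "score": len(event["user"])}

def fork_bolt(event):
    data_lake.append(event)                         # raw copy for batch analytics
    result = process(event)
    couchbase[f"score::{result['user']}"] = result  # serving copy for apps

for event in [{"user": "alice"}, {"user": "bob"}]:
    fork_bolt(event)

print(len(data_lake), couchbase["score::bob"]["score"])  # 2 3
```

The design point is that the two branches serve different consumers: the data lake keeps everything for later analysis, while Couchbase holds only what applications need to read quickly.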

To view an on-demand replay of this webinar, go here.