Typically, issues of data storage capacity and data processing power can be solved with a huge checkbook and a traditional relational database; but today’s biggest database innovations are coming from smaller players building modern data architectures on platforms like Hadoop, which are designed to cost-effectively handle petabytes of data by scaling out on commodity hardware.
Scaling Out the Internet of Things
The Internet of Things application examples illustrate the value that multiple streams of up-to-the-second data can have when paired with historical data in an application. This requires a database with solid scalability, performance, and integration. Several technologies can power a database that meets these requirements: traditional RDBMSs, Hadoop, and now real-time Hadoop.
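The pairing described above can be illustrated with a toy sketch in plain Python. Everything here is hypothetical (the sensor name, the readings, and the threshold are invented for illustration); a real IoT application would do this comparison inside the database, but the idea is the same: a live reading is only meaningful against the device's history.

```python
# Toy illustration (hypothetical data): pair an up-to-the-second sensor
# reading with historical data to flag an anomaly in real time.

historical_readings = {"sensor-42": [68.0, 70.0, 69.5, 71.0]}  # past data

def is_anomalous(sensor_id, live_reading, threshold=5.0):
    """Flag a live reading that deviates from the historical mean."""
    history = historical_readings[sensor_id]
    mean = sum(history) / len(history)
    return abs(live_reading - mean) > threshold

print(is_anomalous("sensor-42", 82.3))  # far above the mean -> True
print(is_anomalous("sensor-42", 70.2))  # near the mean -> False
```

Neither the stream nor the history is useful alone; the value comes from joining them at the moment the reading arrives, which is exactly what puts pressure on the database's scalability and performance.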
Traditional RDBMS: Fully capable, but at great cost
At one end of the cost continuum are traditional relational databases such as Oracle or IBM DB2, which run on expensive, specialized hardware and can be scaled up with more compute power to handle just about anything an application can throw at them. Due to their cost, systems like these are out of reach for many small and mid-size enterprises, let alone startups building applications or devices that need to handle potentially large volumes of data. Even large enterprises are chafing under the cost of scaling up their trusted legacy databases to handle data volumes that could hardly be conceived of twenty years ago.
Hadoop: Affordable and scalable, but in over its head
At the other end of the spectrum are solutions built on Apache Hadoop, an open source framework for the distributed storage and processing of data. Hadoop is great at scaling out on commodity hardware, giving users the ability to grow system capacity in an affordable manner. However, off-the-shelf Hadoop falls short when it comes to real-time processing of data, as it was built as a batch processing analytics system. Users can only tap into data post-processing, meaning that they’re dealing with data that can be minutes, hours, or days old.
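The staleness problem described above can be sketched in a few lines of plain Python (this is a conceptual illustration, not the real Hadoop API): a batch job processes the whole accumulated input at once, so until the next scheduled run completes, readers see only the previous run's output.

```python
# Toy sketch of batch processing: results become visible only after the
# whole job runs, so queries in between see stale (or empty) output.

events = ["click", "click", "view", "click", "view"]

def run_batch_job(inputs):
    """Process the entire dataset at once, MapReduce-style."""
    counts = {}
    for event in inputs:
        counts[event] = counts.get(event, 0) + 1
    return counts

# Before the scheduled job runs, the freshest available result is the
# previous run's output -- potentially minutes, hours, or days old.
latest_result = {}
print(latest_result.get("click", 0))  # 0: new events not yet visible

latest_result = run_batch_job(events)  # the periodic batch job completes
print(latest_result["click"])          # 3: now, and only now, up to date
```

The gap between when data arrives and when the next batch run finishes is exactly the latency that makes off-the-shelf Hadoop a poor fit for real-time applications.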
Real-Time Hadoop: Powering up Hadoop for new challenges
In the previous use cases, as in other solutions built on the Internet of Things, working with real-time data is a critical piece of the equation. Because of Hadoop’s popularity as a big data platform, there is a strong desire to get more from the framework and the data it contains. This has driven the creation of tools designed to go beyond batch processing and enable real-time data updates and access with the proven scalability of Hadoop. Some examples include:
- Apache projects like Apache Storm (real-time data processing) and Apache Spark (in-memory cluster computing), which enhance the Hadoop stack with new streaming capabilities.
- Transactional RDBMSs on Hadoop that finally allow real-time, concurrent applications to be hosted affordably at petabyte scale. These architectures let IoT applications scale elastically as the cluster grows, without taking down the application.
- Leading Hadoop distribution vendors such as MapR, Cloudera and Hortonworks, which are also working to push Hadoop beyond its traditional batch analytics boundaries to power operational applications that can harness the Internet of Things.
- Leading-edge, data-driven companies like Google, Facebook and Salesforce.com, which have been improving and extending Hadoop for use inside their infrastructures, running some of the most demanding real-time applications in use today.
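What these tools have in common is incremental, per-event processing rather than periodic batch runs. A conceptual sketch in plain Python (real systems would use Apache Storm, Spark Streaming, or a transactional engine on Hadoop; the event names here are invented) shows the contrast:

```python
# Conceptual sketch of streaming-style processing: state is updated as
# each record arrives, so results are current immediately -- there is no
# batch job to wait for.

counts = {}

def on_event(event):
    """Update running counts incrementally, one record at a time."""
    counts[event] = counts.get(event, 0) + 1
    return counts[event]

for e in ["click", "view", "click"]:
    on_event(e)

print(counts["click"])  # 2 -- readable the instant the events arrive
```

Scaling this per-event model out across a commodity-hardware cluster, while keeping Hadoop's storage economics, is the core promise of real-time Hadoop.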
Real-time Hadoop can handle the Big Data demands of the Internet of Things, with the performance needed to create valuable applications that leverage the data. As Hadoop continues to grow beyond its batch processing roots, it is well positioned to be the modern platform of choice for powering data-intensive operational and analytic applications that can tap into the 30 billion connected devices of the near future. Perhaps the machines will take over someday, but for now, don’t be surprised to see a little yellow elephant named Hadoop in command.
About the author:
Monte Zweben is co-founder and CEO of Splice Machine.