
Grasping Hadoop’s Full Potential

While most major applications have yet to migrate to Hadoop or NoSQL data stores, this trend is expected to grow as organizations realize the benefits of a unified data repository that enables data reuse without duplication or latency. Early adoption has already begun: applications are migrating historical data to Hadoop while the business retains access to it (a low-risk efficiency win), and data warehouses are extracting operational source databases into Hadoop, using it as a staging area for data integration and as a source of the raw, atomic data that data scientists require.

It is important to visualize analytic applications alongside transactional applications on a horizontal data operating system and Hadoop data platform. The decades-old pattern of waterfalling, moving, and duplicating data could give way to data being created and analyzed in Hadoop, with operational, transactional, and analytic YARN applications all accessing it in place. Business intelligence and analytics will still require analytic workloads and business context, but latency caused by data migration will increasingly become a thing of the past as "store once and use many" becomes the new paradigm. Extracting, moving, or duplicating big data volumes already imposes economic and performance constraints, and is often not a realistic option at all. Over time, "store once and use many" will become the default for new application development, since new applications will require analytics and integration with data from the rest of the enterprise for data discovery and insights.

Next-generation transaction event data from behavioral data capture and the Internet of Things (IoT) arrives in volumes orders of magnitude larger, leaving no practical option for storing and processing the data other than a platform such as Hadoop. In addition to the advanced analytics and data mining opportunities that all the new behavioral data creates, relating it to the attribute-rich data found within operational systems (e.g., CRM or MDM) can surface significant insights otherwise missed in the high-volume event stream. Therefore, we expect to see more of these applications live in Hadoop to meet the economics and data management principles of big data.


Just as Hadoop's first and second generations were driven by business demands, the third generation of Hadoop will offer improvements that fuel the adoption of data lake and data operating system strategies within the enterprise data architecture. Likewise, YARN applications will rapidly evolve, mature, and innovate to meet application needs. The Hadoop ecosystem enables innovation in Hadoop engines and YARN applications, allowing some to increase capabilities and performance (e.g., Apache Hive) while other engines continue to mature and develop (such as Apache Drill).

As different YARN applications for SQL access continue to improve, a critical factor will be those optimized for transaction processing, supporting both record read and write operations with transaction integrity and atomicity, consistency, isolation, and durability (ACID) capability. This will enable current SQL-based applications to migrate to the Hadoop platform. And as these migrations begin to happen, Hadoop will deliver a further benefit: the data will already reside in the data operating system for other applications to leverage, rather than each one deciding whether moving the data is cost-effective and timely.

Ultimately, the industry will see the arrival of new Hadoop engines and YARN applications to leverage, and just as the past 5 years brought options few could have predicted, the next 5 years could bring options we cannot even fathom today. Enterprises should be excited about what can be achieved with a Hadoop data lake or data operating system strategy in their strategic IT and data architecture road maps.

Operationalizing Hadoop

Hadoop has already been running mission-critical production applications at companies, especially those in internet search and ecommerce. However, transactional workloads were previously not the primary focus of the Hadoop community. (The exception here would be HBase, which drives applications requiring high insert and retrieval rates, or puts and gets.)
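The put-and-get access pattern that HBase serves can be sketched as a minimal in-memory store addressed by row key and column qualifier. This is only an illustration of the access model; the class and the sample keys are hypothetical and use none of HBase's actual client API.

```python
from collections import defaultdict

class MiniRowStore:
    """Toy illustration of the put/get pattern: high-rate writes and
    reads addressed by a row key and a column qualifier."""

    def __init__(self):
        # row key -> {column qualifier -> value}
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        # Return one cell, or the whole row when no column is given.
        row = self._rows.get(row_key, {})
        return row if column is None else row.get(column)

store = MiniRowStore()
store.put("user:42", "profile:name", "Ada")
store.put("user:42", "profile:email", "ada@example.com")
print(store.get("user:42", "profile:name"))  # Ada
```

Real HBase adds versioning, column families, and horizontal partitioning on top of this basic model, but the workload shape is the same: many small keyed inserts and retrievals rather than large analytic scans.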

Relational transaction systems rely on the ACID properties of a record or transaction to maintain integrity. In a high-user, multi-tenant environment where many systems are updating data, the ability to maintain data consistency correctly is critical. Future Hadoop data engines and applications will require that transaction integrity be maintained in order to replace the current generation of RDBMS-based applications.
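The transaction integrity described above can be demonstrated with a small sketch using SQLite, an ACID-compliant RDBMS bundled with Python's standard library. This is an analogy for the guarantees Hadoop engines must match, not Hadoop code; the table and account names are hypothetical.

```python
import sqlite3

# In-memory database as a stand-in for any ACID-compliant RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: either both updates land, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            cur = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")  # forces rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
    except ValueError:
        pass  # transaction rolled back; both balances are unchanged

transfer(conn, "alice", "bob", 30)   # succeeds: alice 70, bob 80
transfer(conn, "alice", "bob", 500)  # fails and rolls back atomically
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {'alice': 70, 'bob': 80}
```

The failed transfer leaves no partial debit behind, which is exactly the atomicity and consistency behavior that SQL-on-Hadoop engines must provide before RDBMS-based transactional applications can migrate.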

Transactional systems also bring requirements for backup, recovery, fault tolerance, and disaster recovery that Hadoop clusters, which typically run as isolated systems in the data center, must satisfy. These requirements will be met in third-generation Hadoop, and compatibility with the existing IT tools that perform these functions for incumbent RDBMS systems will be a factor in accelerating the adoption of next-generation data operating systems.


