Shedding Light on Your Mainframe Data with Hadoop

Jun 4, 2014

By Jorge Lopez

During the early 1990s, as the proliferation of personal computers made computing resources cheaper, analysts predicted the death of the mainframe. More than 20 years later, mainframes are still going strong.

Over 70% of Fortune 500 organizations, including the top 25 banks, nine of the world’s top insurers and 23 of the top 25 U.S. retailers, still rely on mainframes for their most critical applications, processing upwards of 30 billion business transactions per day.

Some of these organizations process up to 80% of all their corporate data with mainframes. In many cases, this information starts as transactional records processed by an online system, for example CICS with DB2. The content is usually much more critical than many of the new “sources of data.” It includes credit card records, ATM transactions, package-tracking information, detailed records of wireless calls, healthcare records and more.

Needless to say, these organizations know they cannot ignore this data, but at the same time, mainframe storage and processing resources are extremely expensive. That’s why many of these businesses are looking for ways to rationalize their mainframe capacity in order to contain MIPS growth and defer costly upgrades. Unfortunately, some of these stopgap measures involve costly trade-offs, such as keeping only the most recent data online or even archiving critical data to tape. In the end, this means lots of valuable information goes untapped.

When Big Iron Meets Big Data

The good news is that Hadoop offers a highly scalable and cost-effective approach that can shed some light on this previously locked data. Interestingly enough, Hadoop and mainframes have some important similarities. Both provide a highly distributed framework to process massive volumes of data, both are very good at processing batch workloads, and both rely strongly on sort. Up to 80% of all batch processing on mainframes is highly dependent on sort. Similarly, in Hadoop, both the Map and Reduce phases involve a sort step.
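To make the role of sort concrete, here is a minimal single-process sketch in Python of a map, sort and reduce pipeline. It is only an illustration and does not use any Hadoop API: the function names and the (account, amount) record layout are assumptions made for this example, and the sort is shown as one step between map and reduce for simplicity.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records):
    """Emit (key, value) pairs; here the key is a hypothetical account ID."""
    for account, amount in records:
        yield account, amount


def shuffle_sort(pairs):
    """Sort the pairs by key so that equal keys become adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Sum the values for each key; relies on the input being sorted by key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)


def run_job(records):
    """Chain the three steps: map, sort, reduce."""
    return list(reduce_phase(shuffle_sort(map_phase(records))))


if __name__ == "__main__":
    transactions = [("B", 5), ("A", 10), ("B", 7), ("A", 1)]
    print(run_job(transactions))  # [('A', 11), ('B', 12)]
```

The reduce step only works because the sort has already placed all values for a given key next to each other, which is why sort performance matters so much for this style of batch processing.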

However, in order to be successful with Hadoop and mainframes, it’s important to understand the gaps between these platforms. First, there are integration gaps: Hadoop offers no native support for mainframes. Moreover, the two platforms use different data formats, such as EBCDIC and packed decimal on mainframes versus ASCII in Hadoop. Next, there is a huge skills gap: both mainframe and Hadoop skills are in high demand but difficult to find. If finding a COBOL developer is difficult, finding a Pig or Java developer who also understands mainframes is like finding a needle in a haystack. Finally, there are security gaps: mainframes manage some of the most sensitive information, while Hadoop manages a wide range of data sources, anything from harmless tweets to confidential online information.
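The data format gap can be illustrated with a short Python sketch that decodes an EBCDIC text field and a packed decimal (COMP-3) field. This is a simplified example under stated assumptions, not a production converter: it assumes code page 037 (real datasets may use a different EBCDIC code page), and the function names and sample bytes are made up for illustration.

```python
from decimal import Decimal


def decode_ebcdic(raw: bytes, codepage: str = "cp037") -> str:
    """Decode EBCDIC bytes to text. cp037 is an assumption; sites vary."""
    return raw.decode(codepage)


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed decimal field.

    Each byte holds two 4-bit digits, except the last byte, whose low
    nibble is the sign (0xD or 0xB negative, otherwise treated as positive).
    `scale` is the number of implied decimal places.
    """
    if not raw:
        raise ValueError("empty packed decimal field")
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    if any(d > 9 for d in digits):
        raise ValueError("invalid digit nibble in packed decimal field")
    value = int("".join(str(d) for d in digits))
    if (raw[-1] & 0x0F) in (0x0B, 0x0D):
        value = -value
    return Decimal(value).scaleb(-scale)


if __name__ == "__main__":
    print(decode_ebcdic(b"\xC8\xC5\xD3\xD3\xD6"))   # HELLO
    print(unpack_comp3(b"\x12\x34\x5C", scale=2))   # 123.45
```

Reading these same bytes as ASCII would produce unreadable text and meaningless numbers, which is why a conversion step is needed before mainframe data can be analyzed in Hadoop.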

Four practical recommendations to address these differences include:

Identify batch workloads suitable for offload: Mainframe applications typically include a combination of batch and transactional processing (OLTP). Hadoop applications are mostly batch oriented, but more analytical in purpose. The most important difference is that mainframe applications are mission-critical and must therefore operate accurately and reliably at all times. Analyzing SMF records will provide valuable information about expensive batch workloads that are possible targets for offloading to Hadoop.
Identify critical data suitable for offload: Big data and Hadoop initiatives may center on capturing and processing unstructured and semi-structured data coming from web logs, social media and other sources. These are the types of data that influence or lead to a transaction. Mainframes then process and capture these transactions, generating critical data that provides reference and valuable context to big data.
Address security concerns: Mainframes are some of the most secure platforms, which is why they manage such sensitive data. It also explains why mainframe developers care so much about the security and integrity of every single transaction. While Hadoop continues to make progress in this area, it’s still important to be wary when dealing with mainframe data. Mainframe administrators are reluctant to allow access to, or even install, third-party software without a “mainframe pedigree.” Therefore, any viable approach to offloading mainframe data and workloads to Hadoop must include an infrastructure that guarantees secure data access and storage.
Minimize data movement overhead: Before mainframe data can be analyzed in Hadoop, it needs to be moved and transformed. Depending on the amount of data and the required frequency of data loads, moving it alone can be costly. Leveraging tools that provide direct access to the mainframe with minimum overhead is essential. Another good practice is to offload some of the data preparation to zIIP engines in the mainframe in an effort to reduce MIPS while still complying with established SLAs.
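As a rough illustration of the first recommendation, the Python sketch below ranks batch jobs by total CPU time to surface offload candidates. It assumes the SMF data has already been extracted into simple (job name, CPU seconds) rows; that layout, the job names and the function name are hypothetical and do not reflect the actual SMF record format.

```python
from collections import defaultdict


def rank_jobs_by_cpu(rows, top_n=3):
    """Sum CPU seconds per job and return the top_n most expensive jobs.

    `rows` is an iterable of (job_name, cpu_seconds) tuples, a hypothetical
    layout for data already extracted from SMF records.
    """
    totals = defaultdict(float)
    for job_name, cpu_seconds in rows:
        totals[job_name] += cpu_seconds
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    sample = [
        ("SORTJOB", 120.0),
        ("BILLING", 300.0),
        ("SORTJOB", 250.0),
        ("REPORT", 40.0),
    ]
    for job, cpu in rank_jobs_by_cpu(sample, top_n=2):
        print(job, cpu)
```

CPU time is only one possible ranking criterion; elapsed time, I/O volume or SLA sensitivity could be weighed in the same way.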

How to Start Offloading Data to Hadoop

When you approach offloading, first take baby steps in order to build your expertise. Organizations can start by creating an “active archive,” that is, copies of selected mainframe datasets in HDFS. The next step involves fully migrating larger amounts of data, including data from other sources such as relational databases and semi-structured sources. The final step is shifting expensive mainframe batch workloads to Hadoop.

Offloading to Hadoop is not trivial, but the potential benefits are significant. Many organizations today spend more than $100,000 per TB per year just to lock up their data by backing it up to tape. The opportunity cost is even greater, given all the insights that will never see the light of day. Managing the same amount of data in Hadoop costs approximately $1,000 to $4,000, and it also leaves the data readily available for you to explore and identify new business opportunities.
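The arithmetic behind these figures can be sketched in a few lines of Python. The sketch assumes the Hadoop figure is also a per-TB, per-year cost, which the comparison implies but does not state outright; the 10 TB volume and the function name are made up for illustration.

```python
# Figures quoted in the article, in dollars per TB per year.
MAINFRAME_COST_PER_TB = 100_000   # stated as "more than", so a lower bound
HADOOP_COST_PER_TB_LOW = 1_000
HADOOP_COST_PER_TB_HIGH = 4_000   # assumed to be per TB per year as well


def annual_savings_range(terabytes):
    """Return (minimum, maximum) yearly savings for a given data volume."""
    mainframe = terabytes * MAINFRAME_COST_PER_TB
    worst_case = mainframe - terabytes * HADOOP_COST_PER_TB_HIGH
    best_case = mainframe - terabytes * HADOOP_COST_PER_TB_LOW
    return worst_case, best_case


if __name__ == "__main__":
    low, high = annual_savings_range(10)  # hypothetical 10 TB archive
    print(low, high)  # 960000 990000
    # Cost ratio between the two platforms: 25x to 100x.
    print(MAINFRAME_COST_PER_TB // HADOOP_COST_PER_TB_HIGH,
          MAINFRAME_COST_PER_TB // HADOOP_COST_PER_TB_LOW)
```

Because the mainframe figure is a lower bound, the actual gap would be at least this large under the same assumptions.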

About the author:

Jorge Lopez, Director of Product Marketing at Syncsort, has more than 14 years of experience in Business Intelligence and Data Integration. Syncsort provides fast, secure, enterprise-grade software spanning Big Data solutions in Hadoop to Big Iron on mainframes.