Is the day of reckoning for big data upon us? To many observers, the growth in data is nothing short of incomprehensible. Data is streaming into, out of, and through enterprises from a dizzying array of sources: transactions, remote devices, partner sites, websites, and nonstop user-generated content. Not only are the resulting data stores driving databases to scale into the terabyte and petabyte range, but the data arrives in an unfathomable range of formats as well, from traditional structured, relational data to messages, documents, graphics, videos, and audio files.
Just a few years ago, it was eye-opening when some organizations reported having databases topping the one-terabyte (TB) mark. Now, for many organizations, data is going into the hundreds of terabytes. In a recent survey conducted by Unisphere Research among members of the Independent Oracle Users Group (IOUG), one out of five respondents report that the total amount of online (disk-resident) data they manage today-taking into account all clones, snapshots, replicas, and backups-tops 100TB.
Data managers need to recognize and put in place several courses of action to help their organizations bring this massive surge of information under control. That means throwing out outdated assumptions about relying exclusively on relational database management platforms and constantly beefing up vast centralized storage arrays. It means actively looking for ways to make as much data as possible actionable from a business perspective, and implementing smart data management strategies, from virtualization to information lifecycle management.
Big Data Sources
First of all, where is all this data coming from? Much of it is flowing in from the web, social media, and other new sources. "A huge piece of the Big Data pie is unstructured information, all of the data that doesn't fit into traditional columns and rows," Ken Bado, CEO of MarkLogic Corporation, tells DBTA. Such data includes emails, images, log files, cables, user-generated content, documents, videos, blogs, contracts, wikis, and web content, he adds. "Organizations are just beginning to realize the potential challenges resulting from the explosion of this type of data. Increasingly, it is these information mediums that governments need to account for to keep their citizens and soldiers safe, financial organizations need to integrate to make wise investments, and media companies need to conquer to reinvent their businesses."
That's one part of the story. Call it what you will, but much of the data streaming into organizations can be classified as "machine data"-arising from growing networking infrastructure, new edge devices, and expanding sensor networks, as Tim Negris, vice president of marketing for 1010data, describes it.
Some industry experts also argue that the challenge isn't just about the sheer enormity of data, but also the way in which businesses handle this information. "I would argue that the term 'extreme data' is far more applicable than 'big data' for most businesses," Sean Jackson, vice president of marketing for Kognitio, tells DBTA. "It's not just about the volume, it's about the speed at which you can get value from the data and how quickly you can pull your various data sets together and derive value. However you say it, businesses today are no doubt dealing with an influx of data that could be, and should be, analyzed if an organization wants to truly understand its operations, customers, and potential areas of growth."
Oftentimes, there's nothing new about the data that is encroaching on enterprise systems and process stability. There's just lots more of it. "Companies are identifying more and more kinds of metrics whose tracking simply was not possible 10 years ago," Jason Wisdom, a data management consultant, tells DBTA. "They want to put all this data into a database, and then run analyses or mining to extract trends, run QA checks, and generate reports."
Wisdom offers examples of how data can quickly grow beyond the bounds of data environments. "An electrical engineering company may run chipset tests by the thousands, and gather statistics and metrics that equal half a gigabyte of text per test," he relates. "Multiply that half a gigabyte by 2,000 tests, and you have a lot of data. This volume is getting entered into the database on a regular basis." That works out to roughly a terabyte per round of testing. In another example of data suddenly bursting at the seams, a hospital may have "ridiculous amounts of data coming in from insurance carriers," Wisdom explains. "Whereas before they would store the data in hundreds of Excel files on some fileshare, now they want to import it all into a single database and use the database to spit out results."
The scenarios in which relatively small increments of information can add up to staggeringly big data are numerous, and some can even be mind-boggling. "A single company may only generate a moderate number of sales transactions, but imagine a government agency that wants to look at all transactions by all companies," Sid Probstein, CTO of Attivio, Inc., tells DBTA. "Big data often occurs because of national or global scale. For example, if that government agency is interested only in the U.S., then you are talking 300 million people with an average of, say, three transactions per day. That's 900 million transactions per day to record. If an analyst wants to crunch 5 years of data, that's more than 1.5 trillion transactions."
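Probstein's numbers are easy to sanity-check with a few lines of arithmetic; the population and per-person transaction figures below are simply the ones quoted above:

```python
# Back-of-envelope check of the transaction-volume figures quoted above.
population = 300_000_000         # U.S. population, per the quote
txns_per_person_per_day = 3      # assumed average, per the quote

daily = population * txns_per_person_per_day
five_years = daily * 365 * 5     # ignoring leap days

print(f"{daily:,} transactions per day")       # 900,000,000 transactions per day
print(f"{five_years:,} over five years")       # 1,642,500,000,000 over five years
```

That comes to roughly 1.6 trillion rows, consistent with the "more than 1.5 trillion" in the quote.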
The 2010 IOUG Database Growth Survey of 581 companies, conducted by Unisphere Research and sponsored by Oracle, confirms that managing data growth is a priority for many companies, but smarter responses are needed to address the challenge. In addition, there is considerable pressure on organizations to retain much of this data for as long as possible and to be able to make the information available as users require. But it's increasingly clear that these enterprises are having difficulties managing the growing volumes of data, and there has also been an impact on application performance.
A majority of respondents in the IOUG survey report having performance and budget issues due to exponential data growth. Those companies with the highest rates of data growth, in fact, are eight times more likely than slow-data-growth sites to be seeing significant increases in their storage budgets. New processes and tools are needed to help organizations take control of the massive volumes of information now moving through their systems. Data is growing rapidly at nine out of 10 respondents' companies, and business growth is driving this expansion in data stores. Sixteen percent of companies are experiencing data growth at a clip exceeding 50% a year.
"Many applications and systems simply can't handle all this new data, or can't handle it cost-effectively," says Probstein. "Some systems can't keep up with the incoming data, others can't process the bulk of the data quickly enough to meet users' needs. Part of the problem is that the mainstay of data warehousing, the relational database, isn't necessarily well suited to store all of these types of data."
The greatest costs that arise from Big Data challenges may not be system costs, but people costs, especially as organizations attempt to cobble together their own solutions and configurations to meet the challenge, Bill Jacobs, director of product management at Netezza, an IBM Company, points out. "With no prior experience in Big Data, it is easy for new users to misjudge the staffing demands that accompany build-it-yourself and open-source approaches to Big Data. CIOs should also be concerned about the risk of becoming dependent on non-commercial platforms while analytic applications become mission-critical to the business."
An overwhelming majority of respondents in the IOUG survey, in fact, say growing volumes of data are inhibiting application performance to some degree. The problem is even more acute at enterprises with the highest levels of data growth. However, most still attempt to address the problem with more hardware, versus more sophisticated or efficient approaches. For example, many organizations engage in "rapid scale-up and scale-out of COTS multicore processors, SANs, and network fabric," Negris tells DBTA. "We also see increasing demand for virtualization, solid state storage, compression, and other technologies."
However, the pace of hardware adaptation and adoption often does not keep pace with information demands from the business. "There is a lot of useful raw data being thrown away or summarized rather than being retained and used," he says. Lately, he adds, companies have been attempting to leverage cloud services but often in a dysfunctional way. "In many organizations, IT ends up fighting, rather than mastering, the use of cloud computing for core applications," he points out. "Business managers end up thinking that the cloud will solve problems that it can't."
Also fueling the relentless growth of big data are legal requirements, or even just the fear of associated litigation, which are compelling organizations to find ways to store all forms of data for extended periods of time. The IOUG survey finds that many companies feel compelled to retain data for long stretches, forever in some cases, and are having difficulty making it accessible to end users. "The belief that every single bit and byte of data needs to be stored forever leads to the questions, 'What is the point of big data?' 'Is it to store it, or to derive value from it?'" says Jackson. "I would make the point that big data is more about value, worth, and money rather than just storing the information forever. After all, do you just want to store your data, or do something interesting with it?"
There are even organizations that now feel compelled to preserve data for 100 years or longer, Deirdre Mahon, vice president of marketing for RainStor, tells DBTA. "Not only are data volume growth rates significant, but also keeping enterprise data online for longer time-periods causes even greater complexity for today's IT group," she says. "With these very high growth rates, applications become bloated, and as a result, require additional storage and processing power in order to keep up with performance expectations."
There are data security implications, as well. "Mobile devices such as iPads, iPhones, and BlackBerries have grown more sophisticated and are penetrating the workplace in increasing numbers, enabling access to information from anywhere, anytime, on nearly any device," Paula Skokowski, chief marketing officer of Accellion, tells DBTA. "IT departments need to get ahead of the issue of how employees share information and collaborate; otherwise, users are left to find their own alternative solutions. Organizations are putting themselves at risk for a data breach or non-compliance, as users look to unsecure IT workarounds for sharing digital files, such as Dropbox-type applications, instant messaging, thumb drives, and CDs."
The bottom line is that the relational database model for data management and storage, perfected over the decades, can no longer carry the entire load. "Retaining large volumes of historical data in an expensive RDBMS or data warehouse is not the most cost-effective approach," says Mahon. "It ties up expensive DBAs, and the business value of older data rarely justifies the cost of high-end BI applications and per-terabyte storage."
The result is often a hodgepodge of data stores and siloed approaches to moving data through the enterprise. "It's not uncommon for a company to have a combination of home-grown solutions-often insecure, undocumented, and hard to maintain scripts-combined with a variety of tactical and outdated file transfer servers scattered throughout the organization," Hugh Garber, CTO of Ipswitch, tells DBTA. "Not only does this create an administrative nightmare, it also leaves a major gap in terms of the need for company-wide governance and visibility into file movement."
Still, management must be ready for the arrival of big data in the enterprise, and it's urgent that the "inherent tension between IT and business" be overcome, Ernest Martinez, regional practice head of Wipro Consulting's Business Intelligence and Information Management Practice, tells DBTA. "Traditional IT shops rely heavily on standard delivery lifecycles that call for little to no business interaction. Conversely, many businesses would rather stick their head in the sand than understand their data and the sources from which it originates. Optimal use of Big Data requires both partners acting in unison as it is just as important to understand the technologies as it is to understand the business implications of the data. Both sides need to become more agile as they face the influx of data via social media and other sources; the two must work together, like it or not."
Nevertheless, big data does not have to mean hard work just to stave off doom and gloom. The rise of big data provides business opportunities both inside and outside of organizations, says 1010data's Negris. These include "improving effectiveness of business processes, decisions, interactions, and collaboration" and "providing new business assets that can be monetized and improve decisions." In addition, he adds, "Big data in the cloud is enabling small companies to better compete with big ones."
There are benefits from a business intelligence perspective, as well. "Big data, combined with new data management technologies, enables business intelligence vendors to operate on large volumes of data and even virtually integrate the data with traditional data warehouses," Steve Yaskin, CTO of Queplix, tells DBTA. "IT departments integrating big data with already-stored data can enable new forms of analysis such as forecasting and predictive modeling. However, it is not possible to achieve the desired levels of scalability and performance using traditional RDBMSs when working with large volumes of data."
Consider the case of one of the world's largest internet companies, Charles Zedlewski, vice president of products at Cloudera, says. "In a little over 10 years Google has built a $30 billion business predicated nearly entirely on how they store, process, and analyze petabytes of web data," he tells DBTA. "That's just a hint of the sorts of new business opportunities presented by big data. We've seen Fortune 500 organizations use big data to increase their customer purchase rates, lower their fraud expenses, and provide new value-added data services to their end customers."
Big Data Storage
Leveraging the advantages big data has to offer in terms of global insight and better customer understanding requires smarter data management practices. Consider the storage side of the issue, another matter that makes big data perplexing to many organizations and administrators. A sizable segment of companies with fast-growing data stores in the IOUG survey spend more than one-fourth of their IT budgets on storage requirements.
"Data growth is quickly becoming out of control and drives over-spending when it comes to data storage," Mike Ivanov, vice president of marketing at Permabit, tells DBTA. "Despite increasing IT budgets, the growing costs associated with storing more data create a data affordability gap that cannot be ignored. The response to this growth has been to continue funding expansion and add hardware. Technology advances have enabled us to gather more data faster than at any other time in our history, which has been beneficial in many ways. Unfortunately, in order to keep pace with data growth, businesses have to provision more storage capacity, which costs millions."
While there has been a relentless push in recent years to store multiplatform data on ever-larger disk arrays, big data demands moving in a different direction. "In contrast to years past, where information was neatly compartmentalized, big data has become widely distributed, scattered across many sites on different generations of storage devices and equipment brands-some up in the clouds, others down in the basement," George Teixeira, president and CEO of DataCore Software, tells DBTA.
As a result, centralized storage is "both impractical and flawed," Teixeira says. "It's impractical because it's incredibly difficult to lump all that gargantuan data in one neat little bucket. It's flawed because doing so would expose you to catastrophic single points of failure and disruption." To address widely distributed data storage, he recommends approaches such as storage virtualization software, backed up by mirror images of data that are kept updated in at least two different locations. "This allows organizations to harness big data in a manner that reduces operational costs, improves efficiency, and non-disruptively swaps hardware in and out as it ages," he says.
Many industry experts urge greater data optimization-which entails smart strategies such as tiered storage plans in which data is moved off online systems but is still accessible, as well as fundamental approaches such as deduplication, compression, and virtual tape technology. "Learning how data can be a corporate asset and planning accordingly to use the right technology to store and access data would help change the mindset of tolerating data growth to optimizing data usage, enabling increased efficiency," says Ivanov. "Today's data optimization technology significantly reduces the storage needed for big data by eight to 10 times, delivering IT budget relief along with the performance and scalability of data deduplication solutions."
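To see why deduplication delivers those savings, consider a minimal sketch of fixed-size, content-hash chunk deduplication; the 4KB chunk size and in-memory dictionary store are illustrative assumptions, not how any particular product works:

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks; real products often use variable-size chunking

def dedup_store(data: bytes, store: dict) -> list:
    """Split data into chunks, store each unique chunk once, return the recipe."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # write the chunk only if it is unseen
        recipe.append(digest)             # the recipe records how to rebuild the data
    return recipe

def rebuild(recipe: list, store: dict) -> bytes:
    return b"".join(store[d] for d in recipe)

store = {}
data = b"A" * 8192 + b"B" * 4096  # two identical 4KB chunks plus one unique chunk
recipe = dedup_store(data, store)
assert rebuild(recipe, store) == data
print(len(recipe), "chunks referenced,", len(store), "stored")  # 3 chunks referenced, 2 stored
```

The duplicate chunk is stored only once, so the more repetition in backups, clones, and snapshots, the bigger the savings.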
Information lifecycle management, or ILM, is another approach recommended to keep tabs on big data. In the IOUG survey, two out of five respondents' companies recognize the value of ILM to better manage storage growth. However, these are the early stages for ILM strategies at most companies. ILM approaches are more common at companies with high levels of data growth, though the most common approach continues to be that of buying new hardware to address the problem.
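The core idea of an ILM tiering policy can be sketched in a few lines; the tier names and age thresholds below are hypothetical, chosen only to show the mechanics of moving aging data off expensive online storage:

```python
from datetime import date, timedelta

# Hypothetical tiering policy: thresholds in days are illustrative, not from the article.
TIERS = [("online", 90), ("nearline", 365), ("archive", None)]  # None = no upper bound

def assign_tier(last_accessed: date, today: date) -> str:
    """Pick the cheapest tier whose age window still covers this data."""
    age = (today - last_accessed).days
    for tier, max_age in TIERS:
        if max_age is None or age <= max_age:
            return tier
    return "archive"

today = date(2011, 6, 1)
print(assign_tier(today - timedelta(days=10), today))    # online
print(assign_tier(today - timedelta(days=200), today))   # nearline
print(assign_tier(today - timedelta(days=1000), today))  # archive
```

In practice the policy would be driven by business rules and retention requirements rather than access age alone, but the principle is the same: keep only the hot fraction of data on the most expensive storage.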
Ultimately, the ability to bring big data to manageable proportions is a management challenge. The question that needs to be asked up and down the organization is "What do you want to do with all this data?" As Mahon explains, "There are many different technology approaches to solving big data, not only from a storage efficiency standpoint such as leveraging the cloud but also using advanced in-database analytics." She adds, "The most important question is really around the overall use-case and purpose and how you can cost effectively scale and manage for future growth."
Oftentimes, it's simple solutions, such as judicious planning, that may help address what may seem like an insurmountable problem. "People expect an old system to adequately perform, maybe by slapping on a new disk array. Better yet, they attempt to put it on the cloud and cause the entire cloud to slow to a halt. What worked for a system eight years ago, holding 20GB worth of data, will not work for comprehensive inputting of 2TB worth of data. New data models are required, and scalable technologies-such as memcache and data warehousing-usually are, as well."
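The memcache-style scaling mentioned in that quote boils down to the cache-aside pattern, sketched below with a plain dictionary standing in for the cache server and a hypothetical fetch_from_db placeholder standing in for the real query:

```python
# Minimal cache-aside sketch: a dict stands in for memcached, and
# fetch_from_db is a hypothetical placeholder for a slow database lookup.
cache = {}

def fetch_from_db(key: str) -> str:
    return f"row-for-{key}"  # stand-in for an expensive query

def get(key: str) -> str:
    if key in cache:              # cache hit: skip the database entirely
        return cache[key]
    value = fetch_from_db(key)    # cache miss: query the database...
    cache[key] = value            # ...then populate the cache for next time
    return value

print(get("user:42"))  # first call misses and hits the "database"
print(get("user:42"))  # second call is served from the cache
```

Offloading repeated reads this way is what lets a database designed for 20GB keep serving a workload that has grown far past it.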
Chris Miller, CIO at Avanade, urges organizations to take a holistic approach to data management. "It's clear many companies lack the basic measures to manage big data, but see huge potential benefits if they can learn to leverage it effectively," he tells DBTA. "Businesses must employ a holistic approach to data management - a new approach for many - and one that focuses on the stages in the data lifecycle. Today, every business is a digital company, and every customer or employee is a content producer." Organizations need to be able to accommodate and understand the value of such data, then to be able to "determine what information is important and what information does not matter," he adds.
Ultimately, it's important to keep data growth in its proper perspective. "The biggest mistake occurs when an enterprise accepts that big data is revolutionary, rather than merely evolutionary," Joe Maguire, principal analyst/consultant with O'Kelly Associates, tells DBTA. "Few ideas are truly revolutionary; a genuine revolution dismantles everything that precedes it. By contrast, big data will coexist with and supplement the workhorses of information management, relational databases."