Learning to Navigate in an Increasingly Varied Database Landscape
Today's organizations must capture, track, analyze and store more information than ever before - everything from mass quantities of transactional, online and mobile data, to growing amounts of "machine-generated data" such as call detail records, gaming data or sensor readings. And just as volumes are expanding into the tens of terabytes, and even the petabyte range and beyond, IT departments are facing increasing demands for real-time analytics.
In this era of "big data," the challenges are as varied as the solutions available to address them. How can businesses store all their data? How can they mitigate the impact of data overload on application performance, speed and reliability? How can they manage and analyze large data sets both efficiently and cost-effectively? While the names of hardware options, new software solutions and open source projects get bandied about, the answer is that there is no one-size-fits-all approach. Figuring out which technologies (or combination of technologies) best fit your data challenge requires a close examination of the pros and cons of everything from row-oriented databases to NoSQL. Here are some key considerations:
To row or not to row? For a long time now, the row-based database has been the standard approach to organizing data. Common examples include solutions from Oracle, MS SQL Server, DB2 and MySQL. When it comes to transactional requirements - billing and invoicing for example, or inventory management - these databases are a good fit. But they run into trouble when it comes to high-volume, high-speed analytics, especially when business intelligence requirements are dynamic and unpredictable. Why? Well, it's not what they were designed for. The nature of row-oriented databases means that all the columns associated with each row of data being analyzed must be read to run a specific query (e.g., "How many pairs of X brand boots did we sell during our flash sale last week?"). If you only have a few columns of data this may not be a big problem, but what if you have 100? Multiply those 100 columns by millions of rows and disk I/O becomes a substantial limiting factor. Businesses can then either throw money (and hardware) at the challenge through expanded disk storage subsystems or more servers, or they can throw people (i.e., database administrators) at it by creating indices or partitioning data to optimize queries. But when the questions executives are asking of the data are constantly changing (and time sensitive), manual data configuration is simply not practical. So for static, pre-defined reports and transactional environments, row is probably still the way to go. Other data challenges require different, more flexible technologies.
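To make the I/O penalty concrete, here is a toy sketch in Python. This is an illustration only, not how any vendor's engine is implemented, and the table columns and values are invented for the example: the boots query needs just three fields, yet a row-at-a-time scan touches every field of every row.

```python
# Toy sketch of a row store's scan cost (an illustration, not how any
# vendor's engine works). Each row is stored as one unit, so a query
# that needs three fields still reads all six.
rows = [
    {"order_id": i, "brand": "X" if i % 5 == 0 else "Y",
     "product": "boots", "qty": 1 + i % 3, "price": 79.99, "region": "US"}
    for i in range(100_000)
]

# "How many pairs of X brand boots did we sell?" -- only 'brand',
# 'product' and 'qty' matter, but the scan touches whole rows.
total = sum(r["qty"] for r in rows
            if r["brand"] == "X" and r["product"] == "boots")
```

With six columns the waste is modest; with 100 columns, roughly 97 percent of the bytes read from disk contribute nothing to the answer, which is exactly the overhead the paragraph above describes.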
Columnar databases offer a changed perspective. As the name implies, columnar databases store data column-by-column rather than row-by-row, enabling the delivery of faster query responses against large amounts of data. Most analytic queries only involve a subset of the columns in a table, so a columnar database has to retrieve much less data to answer a query than a row database, which must retrieve all the columns for each row. This simple pivot in perspective - looking down rather than across - has profound implications for analytic speed. In addition, most columnar databases provide data compression. This combination of reduced I/O and data compression has several benefits in addition to query speed, including the need for less storage hardware, which also translates into lower costs. Depending on the solution chosen and the mix of capabilities required, there are technologies out there that can achieve data compression of 3:1 or 4:1 all the way up to 10:1, 20:1, and even 30:1. My company Infobright's solution is a columnar database, as are those from others, such as Sybase IQ and Vertica.
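A rough sketch of the columnar idea, again in Python rather than any particular product (the table and values are invented for illustration): each column is stored as its own array, so a single-column aggregate never reads the other columns, and run-length encoding, one simple columnar compression scheme, collapses repetitive values.

```python
# Toy sketch of a column store (an illustration, not Infobright's or any
# other vendor's engine): each column lives in its own array, so a query
# touching one column reads only that column, and low-cardinality
# columns compress well.
from itertools import groupby

# The same table stored as separate columns rather than a list of rows.
columns = {
    "brand":  ["X", "X", "Y", "Y", "Y", "X", "Y", "Y"],
    "qty":    [1, 2, 1, 3, 1, 2, 2, 1],
    "region": ["US", "US", "US", "EU", "EU", "EU", "EU", "US"],
}

# SUM(qty) touches one array; 'brand' and 'region' are never read.
total_qty = sum(columns["qty"])

# Run-length encoding: clustered values collapse into
# (value, run_length) pairs, shrinking the data that must be stored.
def rle(values):
    return [(v, len(list(g))) for v, g in groupby(values)]

compressed_region = rle(sorted(columns["region"]))  # [('EU', 4), ('US', 4)]
```

Real columnar engines use far more sophisticated encodings than this, but the principle is the same: repetitive, low-cardinality columns are where the 10:1 and higher compression ratios mentioned above come from.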
Where does NoSQL come in? A broad, emerging class of non-relational solutions has also evolved to address specific business needs that row technologies can't scale to meet and column technologies are unsuited to address, including things like real-time data logging used in finance, improving Web app performance or storing frequently requested Web app data. While the 100+ products and open source projects currently in the NoSQL space address different issues, they all share certain attributes: huge volumes of data and transaction rates; a distributed architecture; and the ability to handle unstructured (or semi-structured) data with heavy read/write workloads. Examples of these technologies include MapReduce (a programming model for distributed processing of large datasets) and implementations of it such as Hadoop (an open source system for distributed data storage and processing). Because they are highly scalable, NoSQL technologies can deal with the "biggest of the big" data volumes and handle streaming data, but they also generally require specialized skills for setup and administration. In addition, NoSQL technologies can be limited in terms of their ability to execute complex queries. But because many NoSQL variants are open source, they can also be more easily integrated with other data solutions (like columnar databases) that can handle ad-hoc reporting and analysis.
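For readers unfamiliar with the model, here is a minimal single-machine sketch of the two MapReduce phases in Python (Hadoop's contribution is running these same phases across a cluster of machines); the log records and the counting task are invented for illustration.

```python
# Minimal single-machine sketch of the MapReduce model (Hadoop
# distributes these same two phases across many machines).
from collections import defaultdict

def map_phase(records):
    # Map: turn each input record into a stream of (key, value) pairs.
    for line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle/reduce: group the pairs by key, then aggregate each group.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

logs = ["click ad_1", "click ad_2", "click ad_1"]
counts = reduce_phase(map_phase(logs))  # {'click': 3, 'ad_1': 2, 'ad_2': 1}
```

Because both phases operate on independent records and independent keys, each can be parallelized across machines, which is what lets this style of system handle the "biggest of the big" data volumes, while the flip side is that ad-hoc, multi-way analytic queries do not decompose as neatly.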
Finding the right match. There's a reason that businesses use purpose-built tools for certain jobs. You don't want your business solutions to use a standard relational database for everything, just as you wouldn't use a screwdriver when you really need a power drill. As the information management landscape evolves, careful consideration should be given to the objectives of an intended data solution and how the underlying computing components need to come together to achieve them. There's no silver bullet, but understanding and then linking project objectives to the right architecture can mean the difference between a costly failure and an efficient success.
In the case of adMarketplace, an online advertising provider, the extreme data growth that accompanied the success of its business was overwhelming the capability of its MySQL database. The company needed to capture and analyze large volumes of text search, clickstream and conversion data related to the advertising services that it delivers. To deal with the volume and enable fast analysis of the data to optimize ad performance and spend for its customer base, adMarketplace deployed the NoSQL database MongoDB alongside MySQL and implemented Infobright's columnar solution to get the scalable analytics platform it needed. Today, data streams into the MySQL and MongoDB databases and is then aggregated into Infobright hourly. adMarketplace analysts use it for sophisticated predictive analytics that enable their advertisers to maximize the return on their ad spend, and to optimize yield for their publishers. This is a great example of the best capabilities of various solutions coming together to achieve a specific business goal.
Dealing with big data is going to require a targeted, rather than "one-size-fits-all" approach. What is clear is that many businesses need to think more carefully about their solution set and look at new ways to load data faster, store it more compactly and reduce the cost, resources and time involved in analyzing and managing it. This requires a willingness to try out specialized tools and even experiment with different combinations of data solutions. A little homework in advance will help ensure that companies find the right fit for their data management needs, resources and budget.
About the author
Susan Davis, vice president, marketing and product management, at Infobright, is responsible for the company's marketing strategy and execution, as well as product management. Davis brings more than 25 years of experience in marketing, product management and software development to her role at Infobright. Prior to joining the company, she was vice president of marketing at Egenera, a pioneer in the virtualization of data center infrastructure. At Egenera, she launched the marketing and product management functions and helped grow the company to more than $100 million in annual sales in 5 years.
Prior to Egenera, Davis was director of product management at Lucent Technologies/Ascend Communications where she was responsible for the release and launch of the telecommunications industry's first commercially available softswitch. Earlier, she held numerous positions in product marketing, ISV relations and software development and support at Stratus Computer and Honeywell/Bull. She holds a B.S. in economics from Cornell University.