Big data is one of those terms that is quickly gaining momentum among technologists. If you watch closely, you'll notice that everyone seems to have an opinion on what "big data" means and wants to own the term. While industry experts debate what to call the problem, companies in 2011 will be tasked with moving big data from back-office, offline analytics to customer-facing, 24x7 production systems. Customers are paying attention, and they need solutions that support not only massive data sets but also mixed information types, extended feature sets, and real-time processing, and that can be run by technical teams that did not hand-code these systems from the ground up. Here are five big data solution trends we see developing as our customers work hard to solve "big data" or "big information" problems.
We've Pushed Past the Capacity
Data is bursting at the seams, out of runway, filled to the brim - however one states it, we have pushed past the capacity of traditional information management systems. Incremental improvements will and do help, but information growth is fast outpacing those improvements. The impact can be seen across the commodity infrastructure stack: transactional data stores, search engines, and data warehouses are simply unable to manage massive information sets while still providing the service levels customers have come to expect. As a result, the management of many massive data sets is being pushed to non-transactional systems and, in many cases, to systems that were never designed for data management at all. Hadoop, for instance, has garnered substantial attention as a way to process all this growing information, but it is also slowly being pushed to support data management. Good processing systems are not always good management systems, and most companies need a combination of both. This has led to confusion about what different solutions should actually be used for. As the capacity story continues to be the pointy tip of the big data problem, traditional solutions in the market will either introduce versions that can play at this scale (MPP architecture, anyone?) or position themselves out of the big data space. At the same time, non-traditional systems will quickly backfill their functionality and provide traditional features, like management and transactions, which are needed in many critical-path production systems.
Handling a Huge Bag of Mixed Information
Customers are working with many different types of information, from raw text and precise tables to complex denormalized structures and huge video files. Traditional systems can usually store this information if they know what it is ahead of time, but they have a hard time leveraging the information in a flexible way once it is in the system. Similarly, some of today's big data solutions do very well at solving problems involving a single type of information, such as analysis of log files, but cannot easily support arbitrary types of information. Big data solutions will begin to support whatever mix of information a customer throws at them and allow flexible use of this information. This doesn't mean just storing and retrieving different types of information. It means giving customers the flexibility to leverage the information in the way they need to today and in the future, without redesigning the data store. An example of this approach is the "schemaless" database available in some systems, which allows one to store information without declaring exactly what it looks like ahead of time. When dealing with massive data sets, rigid schemas that force one to rebuild a database in order to provide a slightly different or extended view are not feasible to manage.
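To make the "schemaless" idea concrete, here is a minimal sketch in Python. It is a toy in-memory store, not any vendor's actual product or API: the class name `SchemalessStore` and its `put` and `find` methods are hypothetical names chosen for illustration. Records are plain dictionaries, no structure is declared ahead of time, and a new field becomes queryable the moment a record carrying it arrives.

```python
from collections import defaultdict


class SchemalessStore:
    """Toy in-memory store (hypothetical): records are dicts with no declared schema."""

    def __init__(self):
        self._records = []
        # field name -> value -> positions of the records holding that value
        self._index = defaultdict(lambda: defaultdict(list))

    def put(self, record):
        """Store a record and index every hashable top-level field it carries."""
        position = len(self._records)
        self._records.append(record)
        for field, value in record.items():
            try:
                self._index[field][value].append(position)
            except TypeError:
                # Unhashable values (lists, nested dicts) are stored but not indexed.
                pass
        return position

    def find(self, field, value):
        """Return all records whose top-level field equals value."""
        positions = self._index.get(field, {}).get(value, [])
        return [self._records[i] for i in positions]


store = SchemalessStore()
store.put({"type": "log", "host": "web-1", "status": 500})
store.put({"type": "video", "title": "demo", "tags": ["launch", "2011"]})
# A later record introduces a new field; nothing has to be rebuilt.
store.put({"type": "log", "host": "web-2", "status": 200, "latency_ms": 12})

log_records = store.find("type", "log")
slow_records = store.find("latency_ms", 12)
```

The third record adds `latency_ms` without any migration step, which is the flexibility the paragraph above describes. A real system would also have to handle persistence, nested fields, and scale, all of which this sketch ignores.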
Providing Extended Features
The more information that is put into a big data store, the more customers want to utilize the information in one place. It is simply too painful to move and sync massive data sets between specialized systems. This is putting pressure on big data systems to provide services beyond their core functions. Customers want a single system that can handle transactions, processing, search, and analytics across their entire data set. While this may sound like finding a holy grail of sorts, big data systems have already started to embed sub-systems that provide a different way to work with today's information. In many cases, Hadoop's technology is becoming a feature of big data systems. This is similar to how full-text search, often a stand-alone system, has become an embedded feature of many relational database management systems. While a big data system that can be all things to all people may not be in sight, or even possible, solutions will continue to embed and integrate with other best-of-breed processes to provide more complete solutions. These systems will have a dominant specialty with multiple embedded extensions that round out the solution to support a broader class of problems.
Real-Time is Required
As big data solutions move from back-office analytics to customer-facing production systems, it is no longer acceptable to wait hours, or days, for an update to be available and usable across the information management infrastructure. Much of what is interesting about big data is the ability to understand what is happening right now. Whether providing trending topics for Twitter, price momentum for stocks, or fraud detection for financial transactions, management and processing of massive data sets in real time is required. This means that big data systems will need to be able to ingest new information and make it available to end users within the same time frame that a relational database does today. This is complicated by the fact that the information being ingested is often a stream, as in the case of stock quotes or tweets, and requests for the new content are also very high volume. Production big data systems need to be able to support heavy reads and writes simultaneously. While industries like financial services, with their deep technical specialists, have been able to craft custom solutions to fit this need, the broad array of customers facing big data problems haven't figured it out yet. They need commercial software that doesn't require a team of internal engineers to constantly develop it, let alone a team of database administrators to manage it. Big data solutions for frontline applications will be able to support heavy simultaneous read and write loads with real-time response, sans custom development or management.
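As a rough illustration of ingesting a stream while serving reads of what is happening "right now", the sketch below keeps a sliding-window count of recent events, in the spirit of the trending-topics example. It is a single-process toy under stated assumptions: the class name `SlidingWindowCounter` and its methods are hypothetical, timestamps are assumed to arrive in non-decreasing order, and a simple lock stands in for the far more involved concurrency control a production system would need.

```python
import threading
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts keys seen in the last `window_seconds` (hypothetical toy class)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()   # (timestamp, key) pairs, oldest first
        self._counts = Counter()
        self._lock = threading.Lock()  # lets readers and writers share the state

    def _expire(self, now):
        # Drop events that have fallen out of the window ending at `now`.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, key = self._events.popleft()
            self._counts[key] -= 1
            if self._counts[key] == 0:
                del self._counts[key]

    def ingest(self, key, timestamp):
        """Write path: record one event; visible to readers immediately."""
        with self._lock:
            self._events.append((timestamp, key))
            self._counts[key] += 1
            self._expire(timestamp)

    def top(self, n, now):
        """Read path: the n most frequent keys within the current window."""
        with self._lock:
            self._expire(now)
            return self._counts.most_common(n)


trending = SlidingWindowCounter(window_seconds=60)
trending.ingest("hadoop", timestamp=0)
trending.ingest("hadoop", timestamp=5)
trending.ingest("search", timestamp=10)

at_10 = trending.top(2, now=10)
at_65 = trending.top(2, now=65)
at_70 = trending.top(2, now=70)
```

Each write is reflected in the very next read, and old events age out without a batch job. What the sketch leaves out (distribution across machines, durability, out-of-order events) is exactly the hard part the paragraph above says most customers have not yet solved.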
People Make It Happen
We often overlook the most important element required when talking about technology solutions: the talented and well-trained people required to run traditional systems. This pool is already fairly limited and highly specialized. Systems that can manage massive information sets are often built from scratch, which naturally leads to a very small number of people who actually know how the system works. This is a very real and scary proposition for teams running their business on these systems. The developer who meets the proverbial bus generally cannot be replaced for months or years, the time it takes to hire and train an expert. Big data vendors will necessarily strive to simplify the development and operation of their solutions. They will also borrow from traditional information management concepts and functionality in order to provide a smooth on-ramp so that non-experts can master their technology as quickly as possible. Scaling is not just a data problem, but a people problem as well. Without the right people, big data turns into an even bigger problem.