Big Data or Right Data: What Really Matters?

Big data is everywhere today. It fills IT headlines and dominates keynotes at technology conferences. It has become a favorite topic for both industry analysts and technology investors. With abundant computing power and better database storage techniques, it is now practical to store and analyze petabytes of detailed transactional and media data. But despite the headlines, "big data" is not the most compelling data need for the majority of business end users. A far bigger challenge for most people is getting access to the "right data" to help them do their jobs better.

Petabyte-scale databases mostly hold billions of individual transactions, such as individual telephone calls or ATM transactions. No one would argue against analyzing that data to look for nuggets of insight that can only be found at that detailed transaction level. However, this kind of analysis is not easy. It requires sophisticated models and statistical techniques, and in the wrong hands it can lead to all the classic errors of statistical analysis (e.g., mistaking correlation for causation, or forgetting that about 5% of purely random effects will appear statistically significant at the 95% level). In general, analyzing truly big data is best left to professional analysts.
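To make the statistical caution concrete, here is a minimal sketch of the "5% of the time" effect. It assumes a simple two-sided z-test on pure noise with a known standard deviation of 1; the function name, sample size, and trial count are illustrative choices, not anything prescribed above.

```python
import math
import random


def false_positive_rate(trials=2000, sample_size=50, seed=0):
    """Fraction of pure-noise samples that look 'significant' at the 95% level."""
    rng = random.Random(seed)
    critical_z = 1.96  # two-sided cutoff for the 95% level
    hits = 0
    for _ in range(trials):
        # Each sample is random noise with a true mean of 0,
        # so any "effect" that gets flagged is spurious.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        mean = sum(sample) / sample_size
        z = mean * math.sqrt(sample_size)  # standard deviation assumed known (1.0)
        if abs(z) > critical_z:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"Share of noise-only tests flagged as significant: {rate:.3f}")
```

The printed share should land close to 0.05. Anyone who runs thousands of comparisons against a large transaction database will therefore turn up "findings" by chance alone, which is part of why this work belongs with people trained to correct for it.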

The vast majority of workers who rely on data are not analysts; they are better described as decision workers. These are the people who make the numerous day-to-day, sometimes minute-to-minute, decisions that keep a business operating. They may be in finance, sales, customer support, supply chain, or even on the shop floor. They cannot do sophisticated analysis, but they do need the right data, relevant to the decision at hand, in an easy-to-use format.

The data that decision workers need varies dramatically from situation to situation. It might be a complete view of a customer on the phone with a support issue; it might be an end-to-end view of the status of a part flowing through the supply chain; or it might be all the data about a mortgage applicant along with data about a particular housing market, plus real-time mortgage rates. The decision workers may be consumers of insights gained by large-scale analysis of big data, but they want that in the form of easy-to-follow decision rules. They may even want to build their own personalized decision rules based on these insights.

Decision workers need the data that is relevant to them in the context of doing their job. For example, a customer rep dealing directly with a customer needs the full details of that customer's previous interactions with the business that are relevant to the current call. This is less a matter of filtering through petabytes of big data than of bringing together the right data at the right time. The reality, though, is that this data comes from many different sources (purchase histories, support issues, call and email logs, payments, etc.) and is typically not available in one central database. The challenge is combining data not only from multiple large relational databases, but also from small local databases, spreadsheets, documents, emails, and the web.

The key to satisfying decision workers’ data needs is tapping into these disparate data sources, finding and mapping together commensurate data, and presenting that in an easy-to-digest form. Once you’ve accomplished this, you can apply decision rules to help the decision worker make better operational decisions and in many cases even automate this for greater efficiency. Of course, all of this must be done in a timely fashion, because if you need data for a decision today but don’t get it until tomorrow, you may as well have never gotten it at all. The right data depends very much on the situation at hand.
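As a rough illustration of what an easy-to-follow decision rule might look like once the right data has been assembled, consider the sketch below. The fields, thresholds, and recommended actions are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class CustomerView:
    # Hypothetical fields, for illustration only.
    open_support_issues: int
    days_since_last_payment: int
    lifetime_purchases: float


def recommend_action(view: CustomerView) -> str:
    """Apply simple, readable rules in priority order."""
    if view.days_since_last_payment > 60:
        return "refer to billing"
    if view.open_support_issues >= 3 and view.lifetime_purchases >= 10_000:
        return "escalate to senior support"
    return "handle normally"


print(recommend_action(CustomerView(3, 10, 25_000.0)))
```

Rules this simple can be read, questioned, and adjusted by the decision worker, and the same function can be run automatically when the decision is routine.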

What’s needed for Right Data?

Today in most enterprises, IT typically has a functional structure organized around different technology categories. For example, IT may be divided into a database group, an ETL group, a business intelligence group, a communications group, and so on. Each of these groups effectively operates as its own stovepipe, which prevents the seamless integration needed for right data. Getting all of those groups (each with its own priorities and budget) to satisfy the needs of a particular decision worker community is quite problematic, except for the highest-priority use cases. With ever-changing business requirements, this structure often means that most decision workers’ data needs are never satisfied.

What are the steps that must be seamlessly integrated to enable decision workers to get the right data when they need it? They are familiar steps, and most IT shops have (separate) tools for each of them. Today, the tools used for these steps are not well integrated and do not share the same underlying data and process models. Combining these steps to get at data today involves heavy IT intervention, lengthy development cycles, and high costs. Furthermore, decision workers’ need for right data adds unique challenges to many of these steps.

Step 1: Discovering the data. Once you go outside the central large databases, most data is not well documented. There is often no way to even know if the data source exists. This is especially true when it is a spreadsheet, a departmental database, a document, or data external to an organization. The first step then is to start cataloging data sources.

Step 2: Understanding the data. Even if you can find relevant data, it is often a challenge to understand what the data means. Typically, different names (often cryptic ones) are used for the same type of information. Also, in the case of calculated data, the calculations can be done in different ways, and that needs to be understood before the data can be reliably used.

Step 3: Mapping the data. Data needs to be mapped from the meaning in the data source to a meaning that makes sense for the decision worker. Often that is just a name change, but it can require a transformation or a recalculation. In today’s stovepiped environment, mapping the data often requires a great deal of time from both decision workers and IT personnel.

Step 4: Accessing the data. The source data can be anywhere in the enterprise, but most often it is on a network, either internal or web accessible. In today’s environment, connecting to that data and obtaining access often involves a number of steps that require IT support.

Step 5: Integrating the data. Using the mappings defined earlier, the data from the multiple sources needs to be combined so that it looks like one consistent data source to the end user.

Step 6: Using the data. This is often as simple as displaying the data needed, but can also mean more sophisticated reporting, graphing, or even light analysis and decision rules. The most effective way of presenting the data is through interactive dashboards that enable a variety of presentations, data drill-down, and further exploration.
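Steps 3 and 5 in particular can be sketched in a few lines. The example below is a simplified illustration, not a description of any particular product: the source systems, the cryptic field names, and the helper functions are all hypothetical.

```python
# Hypothetical source records; the cryptic field names stand in for real systems.
crm_rows = [{"CUST_NO": "C-17", "CUST_NM": "Acme Corp"}]
billing_rows = [{"acct": "C-17", "bal_cents": 125000}]

# Step 3 (mapping): source field -> (business-friendly name, transform).
CRM_MAP = {
    "CUST_NO": ("customer_id", str),
    "CUST_NM": ("customer_name", str),
}
BILLING_MAP = {
    "acct": ("customer_id", str),
    # A recalculation rather than a simple rename: cents to currency units.
    "bal_cents": ("balance_due", lambda cents: cents / 100),
}


def apply_mapping(row, mapping):
    """Rename and transform the fields of one source record."""
    return {name: transform(row[field]) for field, (name, transform) in mapping.items()}


def integrate(*mapped_sources):
    """Step 5 (integrating): merge mapped records that share a customer_id."""
    combined = {}
    for source in mapped_sources:
        for record in source:
            combined.setdefault(record["customer_id"], {}).update(record)
    return combined


view = integrate(
    [apply_mapping(r, CRM_MAP) for r in crm_rows],
    [apply_mapping(r, BILLING_MAP) for r in billing_rows],
)
print(view["C-17"])  # Step 6 at its simplest: display the combined record
```

Keeping the mappings as plain data rather than burying them in code is a deliberate choice in this sketch. It means a name change or recalculation can be reviewed by the decision worker who knows what the field should mean.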

What then is required to make all of these steps into a rapid, seamless and easy-to-use right data process?

At the heart of the matter is the need for a common way of describing data, no matter its source. Until recently, there wasn’t such a method, but thanks to the work of the World Wide Web Consortium (W3C) there is now an industry-standard semantic data model that can describe data from any source, in any format, including unstructured data (i.e., text). There is also a standard for describing data ontologies (i.e., the relationships between data elements) in a way that is meaningful for the end user. Together, these semantic technologies satisfy steps 2 and 3 above and provide the basis for steps 5 and 6.
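The W3C standards in question are presumably RDF (the Resource Description Framework) and OWL (the Web Ontology Language), although they are not named above. The core idea of such a model is that every fact, whatever its origin, is expressed as a subject-predicate-object triple. The sketch below imitates that idea with plain Python tuples. It is not RDF syntax and uses no RDF library, and the identifiers and predicate names are made up for illustration.

```python
# Each fact is a (subject, predicate, object) triple, whatever its original source.
triples = {
    ("customer:C-17", "hasName", "Acme Corp"),       # e.g., from a relational database
    ("customer:C-17", "hasOpenIssue", "issue:881"),  # e.g., from a support system
    ("issue:881", "hasSummary", "Late delivery"),    # e.g., extracted from an email
}


def match(pattern, facts):
    """Return the facts that fit a pattern; None acts as a wildcard."""
    return sorted(
        fact for fact in facts
        if all(p is None or p == value for p, value in zip(pattern, fact))
    )


# Everything known about one customer, regardless of where it came from.
print(match(("customer:C-17", None, None), triples))
```

Because every source is reduced to the same three-part shape, adding a new spreadsheet or document means adding more triples instead of redesigning a schema.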

With semantic technology as a foundation, decision workers need a platform that adds the capabilities for steps 1 and 4. The activities required in each of these steps exist in various products that have been available for quite a while, but only as standalone products that require a tremendous amount of IT effort to tie together for the end user. What is needed is one seamless platform that ties all the steps together, gives decision workers access to right data, and does not require extensive IT support. The good news is that a number of startup companies are now building such platforms based on semantic technologies. This is a new area that is drawing increasing attention at technology conferences and from industry analysts. Will semantics-enabled right data eclipse big data as the “it” trend in the data arena? Yes, simply because it impacts many more users and can lead to a greater level of operational efficiency across the entire enterprise.

About the author

Jeffrey Stamen is executive vice chairman of the board at Cambridge Semantics, a provider of semantic data management software for the enterprise. Cambridge Semantics’ Anzo software lets business users search for, virtualize, analyze, act on, and make decisions with any internal or external, structured or unstructured data. Based on the flexibility of semantic web technology standards, Anzo provides ease of use, speed of implementation, and operational business process integration for formal or informal business activity.