
What is Data Analysis and Data Mining?


The exponentially increasing amounts of data being generated each year make getting useful information from that data more and more critical. The information frequently is stored in a data warehouse, a repository of data gathered from various sources, including corporate databases, summarized information from internal systems, and data from external sources. Analysis of the data includes simple query and reporting, statistical analysis, more complex multidimensional analysis, and data mining.

Data analysis and data mining are subsets of business intelligence (BI), which also incorporates data warehousing, database management systems, and Online Analytical Processing (OLAP).

The technologies are frequently used in customer relationship management (CRM) to analyze patterns and query customer databases. Large quantities of data are searched and analyzed to discover useful patterns or relationships, which are then used to predict future behavior.

Some estimates indicate that the amount of new information doubles every three years. To deal with the mountains of data, the information is stored in a repository of data gathered from various sources, including corporate databases, summarized information from internal systems, and data from external sources. Properly designed and implemented, and regularly updated, these repositories, called data warehouses, allow managers at all levels to extract and examine information about their company, such as its products, operations, and customers' buying habits.

With a central repository to keep the massive amounts of data, organizations need tools that can help them extract the most useful information from the data. A data warehouse can bring together data in a single format, supplemented by metadata through use of a set of input mechanisms known as extraction, transformation, and loading (ETL) tools. These and other BI tools enable organizations to quickly make knowledgeable business decisions based on good information analysis from the data.

Analysis of the data includes simple query and reporting functions, statistical analysis, more complex multidimensional analysis, and data mining (also known as knowledge discovery in databases, or KDD). Online analytical processing (OLAP) is most often associated with multidimensional analysis, which requires powerful data manipulation and computational capabilities.

With the amount of data produced increasing each year, BI has become a hot topic. The growing focus on BI has led a number of large organizations to increase their presence in the space, producing a consolidation around some of the largest software vendors in the world. Among the notable purchases in the BI market were Oracle's purchase of Hyperion Solutions; Open Text's acquisition of Hummingbird; IBM's acquisition of Cognos; and SAP's acquisition of Business Objects.

Definition

The purpose of gathering corporate information together in a single structure, typically an organization's data warehouse, is to facilitate analysis, so that information collected from a variety of different business activities can be used to enhance the understanding of underlying trends in the business. Analysis of the data can include simple query and reporting functions, statistical analysis, more complex multidimensional analysis, and data mining. OLAP, one of the fastest growing areas, is most often associated with multidimensional analysis. According to The BI Verdict (formerly The OLAP Report), the defining characteristic of an OLAP application is "fast analysis of shared multidimensional information."

Data warehouses are usually separate from production systems, as the production data is added to the data warehouse at intervals that vary, according to business needs and system constraints. Raw production data must be cleaned and qualified, so it often differs from the operational data from which it was extracted. The cleaning process may actually change field names and data characters in the data record to make the revised record compatible with the warehouse data rule set. This is the province of ETL.
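
As a minimal illustration of the kind of cleaning ETL performs, the following Python sketch renames fields and normalizes values before loading; the field names and rules are hypothetical, not drawn from any particular product.

    # Minimal ETL-style cleaning sketch (hypothetical field names and rules).
    import csv

    # Map production field names to the warehouse schema.
    FIELD_MAP = {"cust_nm": "customer_name", "sls_amt": "sales_amount"}

    def clean_row(row):
        """Rename fields and normalize values to fit the warehouse rule set."""
        cleaned = {FIELD_MAP.get(k, k): v for k, v in row.items()}
        cleaned["customer_name"] = cleaned["customer_name"].strip().title()
        cleaned["sales_amount"] = float(cleaned["sales_amount"] or 0)
        return cleaned

    with open("production_extract.csv", newline="") as src:
        warehouse_rows = [clean_row(r) for r in csv.DictReader(src)]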

A data warehouse also contains metadata (structure and sources of the raw data, essentially, data about data), the data model, rules for data aggregation, replication, distribution and exception handling, and any other information necessary to map the data warehouse, its inputs, and its outputs. As the complexity of data analysis grows, so does the amount of data being stored and analyzed; ever more powerful and faster analysis tools and hardware platforms are required to maintain the data warehouse.

A successful data warehousing strategy requires a powerful, fast, and easy way to develop useful information from raw data. Data analysis and data mining tools use quantitative analysis, cluster analysis, pattern recognition, correlation discovery, and associations to analyze data with little or no IT intervention. The resulting information is then presented to the user in an understandable form; these processes are collectively known as BI. Managers can choose between several types of analysis tools, including queries and reports, managed query environments, and OLAP and its variants (ROLAP, MOLAP, and HOLAP). These are supported by data mining, which develops patterns that may be used for later analysis, and completes the BI process.

Business Intelligence Components

The ultimate goal of data warehousing is the production of BI, and analytic tools represent only part of this process. Three basic components are used together to prepare a data warehouse for use and to develop information from it:

  • ETL tools, used to bring data from diverse sources together in a single, accessible structure, and load it into the data mart or data warehouse.
  • Data mining tools, which use a variety of techniques, including neural networks and advanced statistics, to locate patterns within the data and develop hypotheses.
  • Analytic tools, including querying tools and the OLAP variants, used to analyze data, determine relationships, and test hypotheses about the data.

Analytic tools continue to grow within this framework, with the overall goal of improving BI, improving decision analysis, and, more recently, promoting linkages with business process management (BPM), also known as workflow.

Data Mining

Data mining can be defined as the process of extracting data, analyzing it from many dimensions or perspectives, then producing a summary of the information in a useful form that identifies relationships within the data. There are two types of data mining: descriptive, which gives information about existing data; and predictive, which makes forecasts based on the data.
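
As a brief sketch of the two types, the snippet below first summarizes existing data (descriptive) and then fits a simple model to forecast a target (predictive); the file, column names, and use of pandas and scikit-learn are illustrative assumptions, not part of any product discussed here.

    # Descriptive vs. predictive mining on an illustrative customer table.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("customers.csv")  # assumed columns: region, visits, spend, churned

    # Descriptive: summarize what the existing data already says.
    print(df.groupby("region")[["visits", "spend"]].mean())

    # Predictive: learn a model that forecasts future behavior (churn).
    model = LogisticRegression().fit(df[["visits", "spend"]], df["churned"])
    print(model.predict_proba(df[["visits", "spend"]])[:5])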

Basic Requirements

A corporate data warehouse or departmental data mart is useless if that data cannot be put to work. One of the primary goals of all analytic tools is to develop processes that can be used by ordinary individuals in their jobs, rather than requiring advanced statistical knowledge. At the same time, the data warehouse and the information gained from data mining and data analysis need to be compatible across a wide variety of systems. For this reason, products within this arena are evolving toward ease of use and interoperability, although both remain major challenges.

For all analytic tools, it is important to keep business goals in mind, both in selecting and deploying tools and in using them. In putting these tools to use, it is helpful to look at where they fit into the decision-making processes. The five steps in decision-making can be identified as follows:

  • Develop standard reports.
  • Identify exceptions: unusual situations and outcomes that indicate potential problems or advantages.
  • Identify causes of the exceptions.
  • Develop models for possible alternatives.
  • Track effectiveness.

Standard reports are the results of normal database queries that tell how the business is performing and provide details of key business factors. When exceptions occur, the details of the situation must be easily obtainable. This can be done by data mining, or by developing hypotheses and testing them using analytic tools such as OLAP. The conclusions can then be tested using "what-if" scenarios with simple tools such as spreadsheet applications. When a decision is made, and action is taken, the results must then be traced so that the decision-making process can be improved.
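
One simple way to surface exceptions of the kind described above is to flag figures that deviate sharply from their historical norm. The sketch below applies a two-standard-deviation rule to hypothetical monthly sales numbers; the threshold and data are assumptions for illustration only.

    # Flag exception months whose sales deviate strongly from the mean.
    import statistics

    monthly_sales = [102, 98, 105, 101, 97, 160, 99, 103]  # illustrative figures
    mean = statistics.mean(monthly_sales)
    stdev = statistics.stdev(monthly_sales)

    exceptions = [(month, value) for month, value in enumerate(monthly_sales, start=1)
                  if abs(value - mean) > 2 * stdev]
    print("Months needing investigation:", exceptions)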

Although sophisticated data analysis may require the help of specialized data analysts and IT staff, the true value of these tools lies in the fact that they are coming closer to the user. The "dashboard" is becoming the leading user interface, with products such as Informatica's PowerCenter, Oracle's Hyperion Essbase, SAS Enterprise Miner and Arcplan Enterprise server tools designed to provide easily customizable personal dashboards.

One of the recurring challenges for data analysis managers is to disabuse executives and senior managers of the notion that data analysis and data mining are business panaceas. Even when the technology might promise valuable information, the cost and the time required to implement it might be prohibitive.

The 12 Rules

In 1993, E.F. Codd, S.B. Codd, and C.T. Salley presented a paper entitled "Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate" that offered 12 rules for evaluating analytical processing tools. These rules are essentially a list of "must haves" in data analysis, focusing on usability, and they continue to be relevant in evaluating analytic tools:

  • Multidimensional Conceptual View.
  • Transparency.
  • Accessibility.
  • Consistent Reporting Performance.
  • Client/Server Architecture.
  • Generic Dimensionality.
  • Dynamic Sparse Matrix Handling.
  • Multi-user Support.
  • Unrestricted Cross-Dimensional Operations.
  • Intuitive Data Manipulation.
  • Flexible Reporting.
  • Unlimited Dimensions and Aggregation Levels.

Since analytic tools are designed to be used by ordinary employees, or at the very least to have their output understood by them, these rules are likely to remain valid for some time to come.

Current View

The analytic sector of BI can be broken down into two general areas: query and analysis, and data mining. It is important to bear the distinction in mind, although the two areas are often confused. Data analysis looks at existing data and applies statistical methods and visualization to test hypotheses about the data and discover exceptions. Data mining seeks trends within the data, which may be used for later analysis. It is, therefore, capable of providing new insights into the data, which are independent of preconceptions.

Data Analysis

Data analysis is concerned with a variety of different tools and methods that have been developed to query existing data, discover exceptions, and verify hypotheses. These include:

Queries and Reports. A query is simply a question put to a database management system, which then generates a subset of data in response. Queries can be basic (e.g., show me Q3 sales in Western Europe) or extremely complex, encompassing information from a number of data sources, or even a number of databases stored within dissimilar programs (e.g., a product catalog stored in an Oracle database, and the product sales stored under Sybase). A well-written query can extract a precise piece of information; a sloppy one may produce huge quantities of worthless or even misleading data.

Queries are often written in structured query language (SQL), a product-independent command set developed to allow cross-platform access to relational databases. Queries may be saved and reused to generate reports, such as monthly sales summaries, through automatic processes, or simply to assist users in finding what they need. Some products build dictionaries of queries that allow users to bypass knowledge of both database structure and SQL by presenting a drag-and-drop query-building interface. Query results may be aggregated, sorted, or summarized in many ways. For example, SAP's Business Objects unit offers a number of built-in business formulas for queries.
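
A minimal example of a query of the "Q3 sales in Western Europe" kind, written in standard SQL and run here through Python's sqlite3 module; the database file, table, and column names are assumptions made for illustration.

    # Run an illustrative SQL query against an assumed sales(region, quarter, amount) table.
    import sqlite3

    conn = sqlite3.connect("sales.db")
    query = """
        SELECT region, SUM(amount) AS total_sales
        FROM sales
        WHERE quarter = 'Q3' AND region = 'Western Europe'
        GROUP BY region
    """
    for row in conn.execute(query):
        print(row)
    conn.close()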

The presentation of the data retrieved by the query is the task of the report. Presentations may encompass tabular or spreadsheet-formatted information, graphics, cross tabulations, or any combination of these forms. Rudimentary reporting products might simply show the results in a comprehensible fashion; more advanced products can produce output polished enough for inclusion in a glossy annual report. Some products can run queries on a scheduled basis and configure those queries to distribute the resulting reports to designated users through email. Reporting products routinely produce HTML output and are often accessible through a user's Web browser.

Managed Query Environments. The term managed query environment has been adopted by the industry to describe a query and reporting package that allows IT control over users' access to data and application facilities in accordance with each user's level of expertise and business needs. For example, in some organizations, IT may build a set of queries and report structures and require that employees use only the IT-created structures; in other organizations, and perhaps within other areas of the same organization, employees are permitted to define their own queries and create custom reports.

A managed report environment (MRE) is a type of managed query environment. It is a report design, generation, and processing environment that permits the centralized control of reporting. To users, an MRE provides an intelligent report viewer that may contain hyperlinks between relevant parts of a document or allow embedded OLE objects such as Excel spreadsheets within the report. MREs have familiar desktop interfaces; for example, SAP's Business Objects tabbed interface allows employees to handle multiple reports in the same way they would handle multiple spreadsheets in an Excel workbook.

Some MREs, such as Information Builders' FOCUS Report Server, can handle the scheduling and distribution of reports, as well as their processing. For example, SAP BusinessObjects' Crystal Reports can develop reports about previously created reports.

Online Analytical Processing (OLAP). The most popular technology in data analysis is OLAP. OLAP servers organize data into multidimensional hierarchies, called cubes, for high-speed data analysis. Data mining algorithms scan databases to uncover relationships or patterns. OLAP and data mining are complementary, with OLAP providing top-down data analysis and data mining offering bottom-up discovery.

OLAP tools allow users to drill down through multiple dimensions to isolate specific data items. For example, a hypercube (the multidimensional data structure) may contain sales information categorized by product, region, salesperson, retail outlet, and time period, in both units and dollars. Using an OLAP tool, a user need only click on a dimension to see a breakdown of dollar sales by region; an analysis of units by product, salesperson, and region; or to examine a particular salesperson's performance over time.
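
This kind of slicing can be approximated in code with a pivot table. The sketch below aggregates hypothetical sales records by region and product, roughly mirroring one face of the cube described above; pandas is used only as an illustrative stand-in for an OLAP tool.

    # Approximate a drill-down: dollar sales by region, broken out by product.
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["East", "East", "West", "West"],
        "product": ["Widget", "Gadget", "Widget", "Gadget"],
        "dollars": [1200, 800, 950, 1100],
    })

    cube_slice = sales.pivot_table(values="dollars", index="region",
                                   columns="product", aggfunc="sum")
    print(cube_slice)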

Information can be presented in tabular or graphical format and manipulated extensively. Since the information is derived from summarized data, it is not as flexible as information obtained from an ad hoc query; most tools therefore offer a way to drill down to the underlying raw data. For example, Cognos PowerPlay can automatically launch its sister product, Impromptu, to query the database for the records in question.

Although each OLAP product handles data structures and manipulation in its own way, an OLAP API, developed by a group of vendors who form the OLAP Council, standardizes many important functions and allows IT to offer the appropriate tool to each of its user groups. The MD-API specifies how an OLAP server and client connect, and it defines metadata, data fetch functions, and methods for handling status messages. It also standardizes filter, sort, and cube functions; compliant clients are able to communicate with any vendor's compliant server.

OLAP Variants: MOLAP, ROLAP, and HOLAP. OLAP is divided into multidimensional OLAP (MOLAP), relational OLAP (ROLAP), and hybrid OLAP (HOLAP).

ROLAP can be applied both as a powerful DSS product and as a means to aggregate and pre-stage multi-dimensional data for MOLAP environments. ROLAP products optimize data for multi-dimensional analysis using standard relational structures. The advantage of the MOLAP paradigm is that it can natively incorporate algebraic expressions to handle complex, matrix-based analysis. ROLAP, on the other hand, excels at manipulating large data sets and at data acquisition, but is limited to SQL-based functions. Since most organizations require both complex analysis and analysis of large data sets, it may be necessary to develop an architecture and a set of user guidelines that enable implementation of both ROLAP and MOLAP where each is appropriate.

HOLAP is the newest step in the ongoing evolution of OLAP. HOLAP combines the benefits of both ROLAP and MOLAP by storing only the most often used data in multidimensional cube format and processing the rest of the relational data in the standard on-the-fly method. This provides good performance in browsing aggregate data, but slower performance in "drilling down" to further detail.

Data Mining

Databases are growing in size to a stage where traditional techniques for analysis and visualization of the data are breaking down. Data mining and KDD are concerned with extracting models and patterns of interest from large databases. Data mining can be regarded as a collection of methods for drawing inferences from data. The aims of data mining and some of its methods overlap with those of classical statistics. It should be kept in mind that neither data mining nor statistics is a business solution in itself; both are technologies. Additionally, there are still some philosophical and methodological differences between them.

This field is growing rapidly, due in large part to the increasing awareness of the potential competitive business advantage of using such information. Important knowledge has been extracted from massive scientific data as well. What counts as useful information depends on the application. Individual records in a data warehouse are useful for daily operations, as in online transaction processing and traditional database queries. Data mining, by contrast, is concerned with extracting more global information that is generally a property of the data as a whole. Thus, the diverse goals of data mining algorithms include: clustering the data items into groups of similar items, finding an explanatory or predictive model for a target attribute in terms of other attributes, and finding frequent patterns and sub-patterns, as well as finding trends, deviations, and interesting correlations between the attributes.

A problem is first defined, then data source and analytic tool selection are undertaken to decide the best way to approach the data. This involves a wide variety of choices.

Decision trees and decision rules are frequently the basis for data mining. They utilize symbolic and interpretable representations when developing methods for classification and regression. These methods have been developed in the fields of pattern recognition, statistics, and machine learning. Symbolic solutions can provide a high degree of insight into the decision boundaries that exist in the data and the logic underlying them. This aspect makes these predictive mining techniques particularly attractive in commercial and industrial data mining applications.
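
A short sketch of a decision-tree classifier of the kind described here, using scikit-learn on a bundled sample dataset; the library and dataset are illustrative assumptions. Printing the learned rules shows the symbolic, interpretable form that makes these methods attractive.

    # Fit an interpretable decision tree and print its learned rules.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

    # The symbolic rules expose the decision boundaries found in the data.
    print(export_text(tree, feature_names=list(data.feature_names)))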

Applying machine-learning methods to inductively construct models of the data at hand has also proven successful. Neural networks have been successfully applied in a wide range of supervised and unsupervised learning applications. Neural-network methods are not commonly used for data mining tasks, however, because they are among the most likely to produce incomprehensible results and to require long training times. Some neural-network learning algorithms exist, though, that are able to produce good models without excessive training times.

In recent years, significant interest has developed in adapting numerical and analytic techniques from statistical physics to provide algorithms and estimates for good approximate solutions to hard optimization problems. Cluster analysis is an important technique in exploratory data analysis, because there is no prior knowledge of the distribution of the observed data. Partitional clustering methods, which divide the data according to natural classes present in it, have been used in a large variety of scientific disciplines and engineering applications. The goal is to find a partition of a given data set into several compact groups. Each group indicates the presence of a distinct category in the measurements.
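
As a small sketch of partitional clustering, the following uses k-means to divide synthetic, unlabeled observations into compact groups; the number of clusters and the data are assumptions chosen for illustration.

    # Partition unlabeled observations into k compact groups with k-means.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Two synthetic "natural classes" with no labels attached.
    observations = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(observations)
    print("Cluster centers:\n", kmeans.cluster_centers_)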

In all data mining applications, results are considerably subject to interpretation, since it is a search for trends and correlation rather than an examination of hypotheses based on known real-world information. The possibility for spurious results is large, and there are many cases where the information developed will be of little real value for business purposes. Nonetheless, when pay dirt is struck, the results can be extremely useful.

Interest in data mining is growing, and it has recently been spotlighted by attempts to root out terrorist profiles from data stored in government computers. In a more mundane, but lucrative application, SAS uses data mining and analytics to glean insight about influencers on various topics from postings on social networks such as Twitter, Facebook, and user forums.

Data Mining and CRM

CRM is a technology that relies heavily on data mining. Comprising sales, marketing, and service, CRM applications use data mining techniques to support their functionality. Combining the two technology segments is sometimes referred to as "customer data mining." Proponents claim that positive results of customer data mining include improvements in prospecting and market segmentation; increases in customer loyalty, as well as in cross-selling and up-selling; a reduced need for risk management; and the optimization of media spending on advertising.

Recommendations

Since data analysis is such a key method for developing knowledge from the huge amounts of business data collected and stored each day, enterprises need to select the data analysis tools with care. This will help ensure that the tools' strengths match the needs of their business. Organizations must be aware of how the tools are to be used and their intended audience. It is also important to consider the Internet, as well as the needs of mobile users and power users, and to assess the skills and knowledge of the users and the amount of training that will be needed to get the most productivity from the tools.

Visual tools are very helpful in representing complex relationships in formats that are easier to understand than columns of numbers spread across a screen. Key areas of discovery found with visual tools can then be highlighted for more detailed analysis to extract useful information. Visual tools also offer a more natural way for people to analyze information than does mental interpretation of a spreadsheet.

Organizations should also closely consider the tool interface presented to users, because an overly complex or cluttered interface will lead to higher training costs, increased user frustration, and errors. Vendors are trying to make their tools as friendly as possible, but decision-makers should also consider user customization issues, because a push-button interface may not provide the flexibility their business needs. When considering their OLAP processes, companies need to determine which approach is best: a multi-dimensional approach, a relational approach, or a hybrid of the two. The use of a personalized "dashboard" style interface is growing, and ease of use has emerged as a key criterion in corporate purchasing decisions.

While data analysis tools are becoming simpler, more sophisticated techniques will still require specialized staff. Data mining, in particular, can require added expertise because results can be difficult to interpret and may need to be verified using other methods.

Data analysis and data mining are part of BI, and require a strong data warehouse strategy in order to function. This means that attention needs to be paid to the more mundane aspects of ETL, as well as to advanced analytic capacity. The final result can only be as good as the data that feeds the system.

Web Links

Arcplan: http://www.arcplan.com/
IBM Cognos: http://www.cognos.com/
Informatica: http://www.informatica.com/
Information Builders: http://www.informationbuilders.com/
The BI Verdict: http://www.bi-verdict.com/
Open Text: http://www.opentext.com/
OLAP Council: http://www.olapcouncil.org/
Oracle: http://www.oracle.com/
SAP BusinessObjects: http://www.sap.com/solutions/sapbusinessobjects/index.epx
SAS: http://www.sas.com/
SmartDrill: http://www.smartdrill.com/
Sybase: http://www.sybase.com/

This article was adapted from the Faulkner Information Services library of reports covering computing and telecommunications. For more information contact www.faulkner.com. To subscribe to the Faulkner Information Services visit http://www.faulkner.com/showcase/subscription.asp.

