Improving the ROI of Big Data and Analytics

Big data and analytics are everywhere these days. Most companies already have their first analytical models in production and are thinking about further boosting their performance. However, far too often, these companies focus on the analytical techniques rather than on the key ingredient: data. The best way to boost the performance and ROI of an analytical model is by investing in new sources of data that can help further unravel complex customer behavior and improve key analytical insights.

Let’s explore the various types of data sources that could be worth pursuing in order to squeeze more economic value out of your analytical models.

A first option concerns the exploration of network data by carefully studying relationships between customers. These relationships can be explicit or implicit. Examples of explicit networks are calls between customers, shared board members between firms, and social connections, such as family and friends. Explicit networks can be readily distilled from underlying data sources, such as call logs, and their key characteristics can then be summarized using “featurization” procedures resulting in new characteristics which can be added to the modeling dataset. Research has shown network data to be highly predictive for both customer churn prediction and fraud detection.
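As a minimal sketch of such a featurization procedure, the snippet below builds an undirected network from a hypothetical call log (all names and the churn labels are illustrative) and derives two simple network features per customer, which could then be appended to a modeling dataset:

```python
from collections import defaultdict

# Hypothetical call log distilled from call detail records: (caller, callee) pairs.
call_log = [
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("dave", "alice"), ("dave", "bob"),
]
churned = {"carol"}  # customers known to have churned (illustrative label)

# Build an undirected adjacency list for the explicit call network.
neighbors = defaultdict(set)
for a, b in call_log:
    neighbors[a].add(b)
    neighbors[b].add(a)

# "Featurize" each customer: degree and count of churned neighbors are two
# simple network characteristics that can be added to the modeling dataset.
features = {
    c: {"degree": len(ns), "churned_neighbors": len(ns & churned)}
    for c, ns in neighbors.items()
}

print(features["alice"])  # {'degree': 3, 'churned_neighbors': 1}
```

In practice, richer features (triangle counts, PageRank-style scores, weighted degrees) are computed the same way: summarize each node's neighborhood into columns of the dataset.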

However, implicit networks or pseudo networks are more challenging to define and featurize. In one study, a network of customers was built in which links were defined based upon which customers transferred money to the same entities (e.g., retailers) using data from a major bank. When combined with non-network data, this way of defining a network based upon similarity instead of explicit social connections gave a better lift and generated more profit for almost any targeting budget. In another award-winning study, a geosimilarity network was built among users based upon location-visitation data in a mobile environment. In this model, two devices are considered similar, and thus connected, when they share at least one visited location. They are more similar if they have more shared locations, as these are visited by fewer people. This implicit network can then be leveraged to target advertisements to the same user on different devices or to users with similar tastes, thus improving online interactions. Both of these examples illustrate the potential of implicit networks as an important data source. A key challenge here is to creatively think about how to define these networks based upon the goal of the analysis.
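The geosimilarity idea can be sketched in a few lines. Assuming toy location-visitation data (device and location names are made up), the snippet links two devices when they share at least one visited location and weights each shared location by the inverse of its popularity, so rarely visited spots contribute more similarity:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical location-visitation data: device -> set of visited locations.
visits = {
    "dev1": {"cafe", "gym", "office"},
    "dev2": {"cafe", "gym"},
    "dev3": {"stadium"},
}

# Count how many devices visit each location; rarer locations carry more weight.
popularity = defaultdict(int)
for locs in visits.values():
    for loc in locs:
        popularity[loc] += 1

# Implicit network: connect two devices if they share at least one location,
# summing 1 / popularity over shared locations as the edge weight.
edges = {}
for d1, d2 in combinations(visits, 2):
    shared = visits[d1] & visits[d2]
    if shared:
        edges[(d1, d2)] = sum(1 / popularity[loc] for loc in shared)

print(edges)  # dev1 and dev2 are linked via cafe and gym; dev3 stays isolated
```

The resulting edge weights can then drive targeting: strongly connected devices are candidates for being the same user or users with similar tastes.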

Data is often branded as the new oil. Hence, firms, such as Equifax, Experian, Moody’s, S&P, Nielsen, and Dun & Bradstreet, capitalize on this by gathering various types of data, analyzing them in innovative and creative ways, and selling the results. These firms consolidate publicly available data, data scraped from websites or social media, survey data, and data contributed by other firms. By doing so, they can perform all kinds of aggregated analyses, build generic scores, and sell these to interested parties. Because of the low entry barrier in terms of investment, externally purchased analytical models are sometimes adopted by smaller firms to take their first steps in analytics. In addition to commercially available external data, open data—such as industry and government data, weather data, news data, and search data—can also be a valuable source of information. Both commercial and open external data can boost the performance and the economic return of an analytical model.

Macro-economic data is another source of information. Many analytical models are developed using a snapshot of data at a particular moment in time. This is obviously conditional on the external environment at that moment. Macro-economic up- or downturns can have an impact on the performance and, thus, the ROI of the analytical model. The state of the macro-economy can be summarized using measures such as gross domestic product, inflation, and unemployment. Incorporating these effects enables further improvement of the performance of analytical models and makes them more robust against external influences.
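Incorporating the macro-economy can be as simple as joining indicator values onto each customer snapshot by observation period. The sketch below uses made-up quarterly figures and field names purely for illustration:

```python
# Hypothetical quarterly macro-economic indicators, keyed by period.
macro = {
    "2023Q4": {"gdp_growth": 0.4, "inflation": 3.1, "unemployment": 3.7},
    "2024Q1": {"gdp_growth": 0.3, "inflation": 2.8, "unemployment": 3.9},
}

# Customer snapshots taken at different moments in time.
snapshots = [
    {"customer": "c1", "period": "2023Q4", "balance": 1200.0},
    {"customer": "c2", "period": "2024Q1", "balance": 560.0},
]

# Enrich each snapshot with the macro state of its period, so the model can
# separate customer behavior from the external economic environment.
enriched = [{**row, **macro[row["period"]]} for row in snapshots]

print(enriched[0]["inflation"])  # 3.1
```

With these columns present, the model can learn how up- or downturns shift outcomes, making it more robust across economic conditions.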

Another type of data to consider is textual data. Examples include product reviews, Facebook posts, tweets, book recommendations, complaints, and legislation. Textual data is difficult to process analytically because it is unstructured and cannot be directly represented in a matrix format. Moreover, this data depends upon linguistic structure and is typically quite “noisy” due to grammatical or spelling errors, synonyms, and homographs. However, this type of data can contain very relevant information for an analytical modeling exercise. Just as with network data, it is important to find ways to featurize text documents and combine them with other structured data. A popular way of doing this is by using a document term matrix indicating which terms appear, and how frequently, in each document. Such a matrix will be large and sparse, so dimension reduction is essential: represent every term in lowercase; remove uninformative terms, such as stop words and articles; use synonym lists to map synonyms to a single term; stem all terms to their root; and remove terms that occur in only a single document.
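A simplified version of this pipeline (stemming omitted, with tiny illustrative documents, stop-word and synonym lists) might look like this:

```python
import re
from collections import Counter

docs = [
    "The product works great, great battery!",
    "Terrible product: battery died fast.",
]
stop_words = {"the", "a", "an"}   # illustrative stop-word list
synonyms = {"terrible": "bad"}    # map synonyms to one single term

def tokenize(doc):
    # Lowercase, strip punctuation, drop stop words, map synonyms.
    terms = re.findall(r"[a-z]+", doc.lower())
    return [synonyms.get(t, t) for t in terms if t not in stop_words]

counts = [Counter(tokenize(d)) for d in docs]

# Dimension reduction: keep only terms appearing in more than one document.
vocab = sorted(t for t in set().union(*counts)
               if sum(t in c for c in counts) > 1)

# Document term matrix: one row per document, one column per retained term.
dtm = [[c[t] for t in vocab] for c in counts]
print(vocab, dtm)  # ['battery', 'product'] [[1, 1], [1, 1]]
```

Each row of the matrix can then be appended to the structured features already available for that customer or case.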

Even after these activities have been performed, the number of dimensions may still be too large for practical analysis. Singular value decomposition (SVD) offers a more advanced way to do dimension reduction. SVD works in a way that is similar to principal component analysis (PCA) and summarizes the document term matrix into a set of singular vectors, also called latent concepts, which are linear combinations of the original terms. These reduced dimensions can then be added as new features to an existing, structured dataset.
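In formula form, for an n-by-m document term matrix A, the truncated decomposition keeps the k largest singular values:

```latex
% Truncated SVD of the document term matrix A (n documents, m terms),
% retaining the k largest singular values:
A \approx U_k \, \Sigma_k \, V_k^{\top}
```

Here the columns of V_k express the latent concepts as linear combinations of the original terms, and the rows of U_k Σ_k give each document’s coordinates in the k-dimensional concept space—these are the new features added to the structured dataset.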

Besides textual data, other types of unstructured data such as audio, images, videos, fingerprint, GPS, and RFID data can be considered. To successfully leverage these types of data in analytical models, it is critical to carefully think about creative ways of featurizing them. When doing so, it is recommended that any accompanying metadata be taken into account. For example, in fraud detection, not only the image itself may be relevant but also who took it, where, and at what time.

The bottom line is that the best way to boost the performance and ROI of analytical models is by investing in data first. And remember that alternative data sources can contain valuable information about the behavior of customers.
