Leveraging Big Data for Predictive Analytics

To say that the explosion of big data is an example of Moore’s Law (which states that the number of transistors on a chip will double approximately every two years) is an understatement. Given the impact this new wealth of information has on hundreds of millions of business transactions, there’s an urgent need to look beyond traditional insight-generation tools and techniques. It’s critical that we develop new tools and skills to extract the insights that organizations seek through predictive analytics.

According to recent studies, we’ve done an impressive job creating data, producing and replicating 1.8 zettabytes (or 1.8 trillion gigabytes) in 2011 alone. This is nine times the amount of data produced five years earlier. In 2010, we crossed the 1 zettabyte mark for the first time by producing and replicating 1.2 trillion gigabytes of information. The number of files or containers of this information (such as photos, videos, and e-mail messages) is projected to grow 75 times, while the staff tasked with managing the information is projected to increase by only 1.5 times.
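
The unit arithmetic behind these figures can be checked with a short sketch. It assumes decimal (SI) units, in which a zettabyte is 10^21 bytes and a gigabyte is 10^9 bytes, and it reads the ninefold figure as a simple ratio. The function and variable names are illustrative only.

```python
# Decimal (SI) unit sizes in bytes; binary units would give slightly different numbers.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_ZETTABYTE = 10**21


def zettabytes_to_gigabytes(zettabytes):
    """Convert zettabytes to gigabytes using decimal units."""
    return zettabytes * (BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE)


data_2011_zb = 1.8   # figure quoted in the text
growth_factor = 9    # "nine times" read as a plain ratio (an assumption)

gigabytes_2011 = zettabytes_to_gigabytes(data_2011_zb)
implied_zb_five_years_earlier = data_2011_zb / growth_factor

print(f"2011 volume: {gigabytes_2011:.2e} gigabytes")
print(f"Implied volume five years earlier: {implied_zb_five_years_earlier:.2f} zettabytes")
```

Under these assumptions, one zettabyte equals one trillion gigabytes, which is why 1.8 zettabytes and 1.8 trillion gigabytes describe the same quantity.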

In addition, retained information will generate more transient data, which could often be much larger than the size of the data we store. For instance, photos and videos uploaded to social media sites will be downloaded and viewed many times. Thus, if you keep 1 gigabyte of content, you’ll need infrastructure to move that 1 gigabyte of data multiple times.

Implications of New Data Sources for Businesses

The constant quest to better understand business dynamics and take advantage of innovative insights has always been a challenge. Early techniques focused on understanding what happened in the past, followed by efforts to understand what is happening in real time. The most significant business advantages will go to organizations that can predict what will happen in the near future by combining newly available data sets with traditional data sources.

Consider the insurance industry a prime example. Fifteen years ago, few insurance products were offered online; however, as the internet and e-business became a more prevalent commerce channel, several new insurance quotes and products tailored to today’s consumers became available. This activity has generated new streams of insurance data, which have been incorporated into predictive analytics efforts in search of competitive insights. In addition, data sets once reserved for the private sector are now being opened for public use. For example, the federal government has granted the public access to hundreds of thousands of raw and geospatial data files through an online portal.

Other sources of information available today include vehicle telematics data encoded with geographic information system (GIS) coordinates, social media images and videos, and radio frequency identification (RFID). Hardware devices deployed to fleets of automobiles frequently collect GIS-encoded telematics data. This data includes driver events such as sudden braking and swerving, as well as the roads a driver travels. Most telematics devices collect information in intervals of 1 to 10 seconds, generating immense amounts of data.
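
A rough sense of the volumes involved can be had from a back-of-the-envelope estimate. In the sketch below, only the 1-to-10-second sampling interval comes from the text; the fleet size and daily driving time are hypothetical values chosen for illustration.

```python
SECONDS_PER_HOUR = 3600


def readings_per_year(vehicles, driving_hours_per_day, interval_seconds, days=365):
    """Estimate how many telematics readings a fleet produces in a year."""
    readings_per_vehicle_day = (driving_hours_per_day * SECONDS_PER_HOUR) // interval_seconds
    return vehicles * readings_per_vehicle_day * days


# Hypothetical fleet: 10,000 vehicles, each driven 2 hours per day.
for interval in (1, 10):
    total = readings_per_year(
        vehicles=10_000, driving_hours_per_day=2, interval_seconds=interval
    )
    print(f"Sampling every {interval:>2} s: {total:,} readings per year")
```

Even at the slower end of the sampling range, a modest fleet under these assumptions yields billions of readings a year, each of which would carry coordinates and event details.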

Each year, users of social media sites leave behind a trail of digital crumbs in the photos, videos, and comments they post online. These image and video files contain metadata that records the date and time they were created and, in some cases, even the GIS coordinates where they were taken. Since the data is hosted on social networking sites such as Twitter, Facebook, Pinterest, LinkedIn, and so forth, it can be mined to provide better context and risk understanding.

RFID technology consists of a small chip commonly embedded in merchandise and activated when the merchandise is within range of a radio signal (unlike line-of-sight UPC scanners that use lasers for reading UPC bar codes). It’s frequently used in merchandise for inventory tracking and ensuring the authenticity of goods. Organizations seeking to better understand the flow of their goods through the supply chain, or even after the goods are lost or stolen, can use RFID data and predictive analytics to gain valuable business insights. The new data sets are helping to provide more context and richness for analysts today, and all are contributing to the Big Data challenge.

Implications for Predictive Analytics

Access to more data does not necessarily equate to more insights; the insights must still be extracted from the data. Predictive model development involves running through many iterations of the most relevant data to get the best results. It’s possible to garner results through tests on relatively small data sets in a relatively short period of time. However, when the involved data sets grow to hundreds of billions of records, the task becomes very time-consuming. In some cases, the volume of data is so large it’s impractical to use traditional analytics tools.

To solve this issue, a new class of massively parallel processing (MPP) analytics appliances has emerged. These appliances allow users to develop and run predictive models on the same device that hosts the data being queried. This is referred to as in-database analytics. Predictive models are deployed directly on the appliance, where the data is hosted, avoiding the slow process of moving data across networks. The performance improvements offered by these analytics appliances allow for multiple iterations or tests to be conducted in a single day on very large data sets, enabling predictive model development on hundreds of millions of records. The ability to quickly extract more precise meaning out of a lot of new data provides organizations with the information to make better business decisions faster.
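
The idea of running the computation where the data lives can be illustrated on a small scale. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for an analytics appliance, which it is not; the claims table, its columns, and both function names are hypothetical. One function pulls every row to the client before aggregating, while the other lets the database aggregate and returns only the summary.

```python
import sqlite3

# In-memory database standing in for the system that hosts the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (policy_id INTEGER, region TEXT, amount REAL)")
rows = [(i, "north" if i % 2 else "south", 100.0 + i % 50) for i in range(10_000)]
conn.executemany("INSERT INTO claims VALUES (?, ?, ?)", rows)


def average_by_region_client_side(connection):
    """Move every row to the client, then aggregate in application code."""
    totals = {}
    cursor = connection.execute("SELECT policy_id, region, amount FROM claims")
    for _, region, amount in cursor:
        running_sum, count = totals.get(region, (0.0, 0))
        totals[region] = (running_sum + amount, count + 1)
    return {region: s / n for region, (s, n) in totals.items()}


def average_by_region_in_database(connection):
    """Aggregate inside the database; only one row per region is returned."""
    query = "SELECT region, AVG(amount) FROM claims GROUP BY region"
    return dict(connection.execute(query))


print(average_by_region_client_side(conn))
print(average_by_region_in_database(conn))
```

Both functions give the same averages, but the second transfers two rows instead of ten thousand. On data sets of the size discussed here, that difference in data movement is the point of the in-database approach.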

In addition, many analysts share reference data sets in the development of predictive models across several lines of business. In the early days of modeling, the data sets were frequently copied to each modeler’s workstation; however, with the size of reference data sets growing into the hundreds of millions of records, this practice has become expensive and slow. It’s much easier to centralize single copies of large reference data sets on analytics appliances and have many modelers deploy their code to the appliance for fast iterative development.

Implications for Data Privacy, Sharing, and Management

Many insurance companies send their underwriting and claims data to trusted intermediaries for the purposes of centralized reporting to the federal government, actuarial rate research, and other analytics needs. The responsibility of data stewardship is a privilege granted to a select few organizations, where data privacy, security, and usage are integral functions in the companies’ day-to-day processes and procedures. These trusted intermediaries are uniquely positioned to perform analytics in aggregate because a large data set physically resides in one location.

The Human Element

It’s necessary to establish an enterprise data management group that works closely with data owners and legal teams to determine how data can be combined and for what purposes. In addition, the enterprise data management group plays an important role in communicating what information is available within the organization, identifying where data gaps exist, and assimilating newly acquired data assets into the organization.

The emergence of the data scientist, who plays a key role in making sense of big data within the enterprise, is an example of how businesses are evolving to address the urgent need to tame big data. The data scientist is most closely affiliated with the analytics and IT groups within organizations. Another critical aspect of the data scientist’s role is addressing large data volumes, which frequently include many duplicate records that make it challenging to count unique people and places accurately. Data scientists and managers will need strong entity resolution capabilities to resolve redundancy in big data, a capability yet to be developed within today’s analytics appliances.
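
The duplicate-counting problem can be shown with a deliberately simple sketch. It treats two records as the same entity only when their names and addresses match exactly after basic normalization; real entity resolution typically needs fuzzy matching and more attributes. The record fields and function names are hypothetical.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def resolve_entities(records):
    """Group records whose normalized (name, address) keys are identical."""
    groups = defaultdict(list)
    for record in records:
        key = (normalize(record["name"]), normalize(record["address"]))
        groups[key].append(record)
    return list(groups.values())


# Illustrative records: the first two differ only in case, spacing, and punctuation.
records = [
    {"name": "Jane Q. Smith", "address": "12 Main St."},
    {"name": "jane q smith", "address": "12  Main St"},
    {"name": "John Doe", "address": "9 Elm Ave"},
]

entities = resolve_entities(records)
print(f"{len(records)} records resolve to {len(entities)} unique entities")
```

A naive count would report three people here. Grouping on a normalized key reports two, though this exact-match rule would still miss variants such as "Street" versus "St".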

Analytics appliances are also evolving at a rapid pace. This market segment did not exist even five years ago. Today, analytics appliance solutions include data sharing and data policy management systems that make it easier for data scientists to share large amounts of data. This is in contrast to the traditional enterprise data warehouse model, which required several months of infrastructure projects to achieve the task.

Some market leaders of analytics appliances are also leveraging the collaborative power of social media in their solutions so that data scientists can share insights with each other inside the analytics appliance. This collaborative data sharing represents fertile ground for innovation when organizations adopt such analytics appliances. Functionality in this area is evolving to handle the problem of controlling which analysts can see which data sets and which data can be combined when developing predictive models or performing analytics.
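
The control problem described above can be pictured as two checks: whether an analyst may see each requested data set, and whether those data sets may be combined. The sketch below is one hypothetical way to express such a policy and does not reflect any particular appliance. The analyst names, data set names, and rules are invented for illustration.

```python
# Hypothetical policy: which data sets each analyst may see.
ANALYST_ACCESS = {
    "alice": {"claims", "telematics", "social_media"},
    "bob": {"claims"},
}

# Hypothetical policy: sets of data sets that must never be combined in one analysis.
FORBIDDEN_COMBINATIONS = [
    frozenset({"claims", "social_media"}),
]


def can_combine(analyst, datasets):
    """Return True if the analyst may see every data set and may combine them."""
    requested = set(datasets)
    allowed = ANALYST_ACCESS.get(analyst, set())
    if not requested <= allowed:
        return False  # at least one data set is not visible to this analyst
    # Reject the request if it contains any forbidden combination in full.
    return not any(combo <= requested for combo in FORBIDDEN_COMBINATIONS)


print(can_combine("alice", ["claims", "telematics"]))
print(can_combine("alice", ["claims", "social_media"]))
print(can_combine("bob", ["claims", "telematics"]))
```

Keeping visibility and combination rules separate means a legal or data management team could change one without touching the other.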


Big data analytics has matured now that we’re able to get answers in seconds or minutes, answers that once took hours or days, or were impossible to achieve using traditional analytics tools. This ability to iterate allows modelers and data scientists to understand critical insights quickly.

Opportunities exist for organizations to distinguish themselves from their competition by developing solutions that expedite predictive model deployment. The results will enable organizations to operationalize their insights at a faster pace, making them more nimble in gaining even more competitive insights.

Organizations that are unable to successfully address Big Data will struggle to find true meaning and direction in their vast quantities of data.

Companies that adopt modern tools such as analytics appliances will be better positioned to tackle Big Data problems introduced by using significantly larger and new data sets. Trusted data stewards with access to centralized insurance data are uniquely situated to garner valuable insights for the industry. The ability to execute on these insights is a necessary and ongoing process.


About the authors:

Perry Rotella is senior vice president and chief information officer and Nigel DeFreitas is chief application architect at Verisk Analytics, a leading source of information about risk for professionals in many fields, including property/casualty insurance, mortgage, healthcare, government, supply chain, and risk management.