A Closer Look at Data: Algorithms, Analytics, and Accuracy

By Jeffrey Feinstein

Mar 13, 2018

Using Big Data for Good Requires Testing, Transparency, Precision, and Due Diligence

Big data can help drive decisions in almost every aspect of our lives: how financial products are underwritten, learning programs are designed, energy is used, healthcare diagnoses are made, insurance rates are assessed, commerce is conducted, and fraud is prevented. The list is nearly endless. Because these decisions can affect individuals and society as a whole, it is imperative that data experts bring integrity, transparency, and due diligence to the analytics that mine information and to the algorithms that inform our decision making and problem solving.

The first step in analytic development is linking information together from a variety of sources. This is one of the core challenges of a big data program because different sources can describe the same people in very different ways. It is important for these linking algorithms to be precise in order to make sense of the data. If linking algorithms are not precise, the information provided can lead to ill-informed decisions. For instance, consider the billions of records that exist about all facets of our lives: birth records, drivers’ licenses, vehicle registrations, student records, Social Security numbers, liens and judgments, and property addresses. The information in these records is often used to validate that “you are who you say you are” and to enable decisions such as mortgage and car loan approvals and much more. If inaccurate information is associated with an individual’s file because of imprecise linking algorithms, he or she could be unfairly declined for credit products, and financial institutions could miss an opportunity to develop a relationship with a potentially good customer.
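As a rough illustration of what "precise" linking can mean in practice, the sketch below links two records only when the birth date matches exactly and the normalized names are nearly identical. The `Record` fields, the sample records, and the 0.85 similarity threshold are assumptions made up for this example; production linking systems typically use many more attributes and far more careful rules.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout, for illustration only.
    source: str
    name: str
    birth_date: str  # ISO format, e.g. "1980-04-12"


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort the name tokens
    so that "Smith, John A." and "John A Smith" compare as equal."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def is_same_person(a: Record, b: Record, name_threshold: float = 0.85) -> bool:
    """Link two records only if the birth dates match exactly AND the
    normalized names are at least `name_threshold` similar."""
    if a.birth_date != b.birth_date:
        return False
    similarity = SequenceMatcher(None, normalize(a.name), normalize(b.name)).ratio()
    return similarity >= name_threshold


# Invented sample records from three hypothetical sources.
dmv = Record("dmv", "Smith, John A.", "1980-04-12")
loan = Record("loan_application", "John A Smith", "1980-04-12")
court_other_dob = Record("court", "John Smith", "1975-09-30")
court_other_name = Record("court", "Jane Smith", "1980-04-12")

print(is_same_person(dmv, loan))              # same person, formatted differently
print(is_same_person(dmv, court_other_dob))   # similar name, different birth date
print(is_same_person(dmv, court_other_name))  # same birth date, different name
```

Requiring two independent signals to agree trades some missed matches for fewer false links, which is the kind of error that would attach someone else's lien or judgment to an individual's file.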

While linking data requires precision, the application of linked big data needs to predict the way humans behave, even when human behavior itself lacks precision. In these instances, it is important that predictive analytics indicate likely responses in certain situations—such as whether or not a consumer is likely to purchase a product, recover from a disease, respond to a mailing, make a donation, be in a car accident, or engage in recycling. By mining very complex information about human behavior, scientists can make new discoveries, lawyers can win cases, doctors can save lives, and businesses can optimize their processes—all for the good of society.

Three Elements to Consider When Invoking Predictive Analytics with Big Data

1-Data experts should build algorithms that are thorough in their inclusion of all relevant information, not just the pieces that support their conclusions about cause and effect.

Without proper guardrails, data can cause damage, even unintentionally. That is why it is so important for data experts to be able to explain their conclusions. In our data-soaked culture, it is now more important than ever to know what is behind the information because we have learned that information is not credible unless it can be supported. When making policy decisions based on statistical findings, it is important to be aware of the distinction between causation and correlation, as it is essential to fully understanding the implications of a result and recommending the best course of action.

Causation indicates that one event is the result of the occurrence of another; there is a causal relationship between the two events. Causation would seem to provide a clear path to successful problem solving. However, in the case of complex issues, whether they be societal or business-related, the task of identifying causality can be challenging. For example, while we know that losing a job can be a cause of financial hardship, it is difficult to prove this causal link because there are so many moving parts that might be associated with job loss or financial hardship. Rather than a simple path from unemployment to hardship, we often see other contributing circumstances in real-life applications. A debilitating health issue can lead to the inability to perform one’s job, and related medical expenses can lead to financial hardship. Confounding factors such as these can make it difficult to verify a relationship and thus identify appropriate solutions. For this reason, it often takes controlled testing, many observations to “average out the noise,” and a significant amount of time to accurately ascertain the cause of an outcome.

Correlation, by contrast, does not express a definitive relationship. It indicates only that two events tend to occur together, which can happen without one being the cause of the other. For example, there is a strong correlation between the amount of outdoor swimming one does and the number of mosquito bites that one can get, but it would be erroneous to conclude that swimming causes mosquito bites, or that mosquito bites cause people to swim (in fact, the likely driver is a third factor: both happen on warm, sunny days). It would therefore be short-sighted to develop a policy against swimming in summer as a means to prevent the spread of the mosquito-borne Zika virus.
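The swimming and mosquito example can be made concrete with a small simulation. The sketch below uses entirely synthetic numbers (the variable names, noise levels, and sample size are assumptions chosen for illustration): warm weather drives both swimming and bites, so the two are strongly correlated, yet the correlation nearly vanishes once the effect of weather is removed from each.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after subtracting a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data: warmth is the hidden third factor behind both behaviors.
warmth = [rng.gauss(0, 1) for _ in range(5000)]
swimming = [w + rng.gauss(0, 0.5) for w in warmth]
bites = [w + rng.gauss(0, 0.5) for w in warmth]

raw = pearson(swimming, bites)
controlled = pearson(residuals(swimming, warmth), residuals(bites, warmth))

print(f"correlation of swimming and bites:         {raw:.2f}")
print(f"same correlation after removing warmth:    {controlled:.2f}")
```

In this toy setup swimming never influences bites directly, so a policy that reduced swimming would leave the number of bites unchanged.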

Some individuals feel that the act of making correlation-based decisions is risky. However, with the vast amount of data available and the proliferation of advanced technology to mine this information, stronger correlations can now quickly be observed. When combined with intuitive human input, data can help experts better understand the tie between events and the likelihood of associations and then create effective solutions to a problem. In the example of swimming and mosquito bites, a natural solution would be the creation of waterproof insect repellent.

Because strategically useful solutions can be derived from correlational associations, we can make effective correlation-based decisions about the probability and risk of something happening if the correlation is stable, reproducible, and persistent over time.
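One simple way to probe whether a correlation holds up over time is to compute it separately for each time window and require the same sign and a minimum strength in every window. The sketch below is a hypothetical check, not an established standard: the function name `is_stable`, the 0.3 minimum strength, and the tiny hand-made datasets are all assumptions for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def is_stable(periods, min_strength=0.3):
    """Return True if every period shows a correlation of the same sign
    whose absolute value is at least `min_strength`.

    `periods` is a list of (xs, ys) pairs, one pair per time window.
    """
    coefficients = [pearson(xs, ys) for xs, ys in periods]
    same_sign = (all(c > 0 for c in coefficients)
                 or all(c < 0 for c in coefficients))
    strong_enough = all(abs(c) >= min_strength for c in coefficients)
    return same_sign and strong_enough


# Invented observations for three time windows.
year_1 = ([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])  # positive relationship
year_2 = ([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])  # positive relationship
year_3 = ([1, 2, 3, 4, 5], [5, 3, 4, 2, 1])  # relationship reverses

print(is_stable([year_1, year_2]))          # True: consistent in both windows
print(is_stable([year_1, year_2, year_3]))  # False: the sign flips in year 3
```

A real validation would use far larger samples per window and would also test the relationship on data held out from model development.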

Thus, a good modeler will be able to explain and support her conclusions when they are based on correlations that stand on a firm, credible foundation. For example, there is a relationship between responsible credit behavior and the probability of being in a car accident. While there is no readily identifiable reason why the timely payment of credit cards and loans reduces the frequency of accidents, there is a strong and reliable relationship across responsible behaviors such as paying bills on time and driving defensively. Ranking insurance risk by examining other responsibility-driven behaviors can therefore be an effective solution.

2-Machine learning does not apply common sense, but people do.

The point of big data is to provide actionable information. While a data scientist can apply a variety of approaches to any statistical problem, machine learning can augment the efforts of data analysts, making models more insightful. Yet while machine learning can help build more accurate, predictive models, it does not replace human intelligence. Data-driven decisions that can affect an individual’s well-being should not be based on “autopilot analytic approaches” such as machine learning if a data expert is unable to validate the intuitive relationships behind the model.

A data scientist must be able to explain what is being measured and the strength of the correlation for scores and attributes. That work must hold up under scrutiny. A modeler must be certain that there is an intuitive link between predictors, consistency with regulations, and an absence of hidden bias. The modeler must be able to defend the model and validate that it is fair in the context of regulatory oversight.

There is no substitute for human due diligence, as it adds credibility to work. Here are four principles to keep in mind:

In many applications, a computer still needs to be instructed that X is a predictor and Y is a target.
A human has to reason through what makes a good target variable, the data that might answer the question at hand objectively, and whether the outcome of a model is correct.
Ideas, use cases, innovation, stories, and reasonableness are human inventions that are difficult to obtain from an automated algorithm.
Machine learning and all analytic methods are tools that sit between two human-led milestones, model design and model validation, and serve to make those milestones better.

3-Data conclusions must be both ethical and logical.

In today’s world of machine learning, people are becoming more reliant upon technology. Many feel that if a machine tells them something, it must be so. Taking such an approach is a grave mistake that can have a significant impact on our values, the way we conduct business, and society in general. It is important that data scientists question their algorithms and findings to ensure that actions based on those algorithms are ethical and logical.

Importantly, a data scientist needs to question not only the appropriateness of the algorithms but also the fairness of the data that was used to build the algorithm. For example, decisions related to development samples could have a dramatic effect on the conclusions that a model can make. If a U.S. government agency were attempting to determine the relationship between immigration and the increase in taxpayer costs, it could draw data from a variety of sources to make this determination. A natural choice might be social welfare data because it is readily available and can provide an accurate picture of struggling immigrant populations. Using this data, a researcher might conclude that immigrants to the U.S. are struggling to make ends meet and might choose to advise a course of action against immigration. However, this conclusion would be faulty because the researcher based it solely on data from struggling immigrants. Had data been obtained from other sources, such as college graduation records, the researcher might have reached a different conclusion and suggested a different course of action. In the end, the researcher, not the machine, has a responsibility to use data sources that will drive accurate and unbiased conclusions.
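The effect of a one-sided development sample can be shown with synthetic numbers. In the sketch below, every value is invented: a made-up population of household incomes, an assumed eligibility cutoff that decides who appears in a convenient data source, and a random sample standing in for a broader source. The convenient source describes the households in it accurately, yet it badly misstates the population as a whole.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Entirely synthetic population: annual household income, in thousands.
population = [rng.gauss(50, 15) for _ in range(10_000)]

# Convenient but one-sided source: only households below an assumed
# eligibility cutoff ever appear in it.
ELIGIBILITY_CUTOFF = 30
assistance_records = [income for income in population
                      if income < ELIGIBILITY_CUTOFF]

# Broader source: a simple random sample of the whole population.
random_sample = rng.sample(population, 500)


def mean(values):
    return sum(values) / len(values)


population_mean = mean(population)
biased_mean = mean(assistance_records)
representative_mean = mean(random_sample)

print(f"true population mean:          {population_mean:.1f}")
print(f"mean from one-sided source:    {biased_mean:.1f}")
print(f"mean from random sample:       {representative_mean:.1f}")
```

No algorithm run on `assistance_records` alone could detect the problem, because the missing households leave no trace in that source; only the person choosing the data can correct for it.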

Today, data scientists use high-speed computing, big data, and algorithms to generate predictive models in minutes as opposed to the days it would take to build robust models using traditional techniques. While an algorithm may be logical and point in a given direction, its application must be based on an accurate foundation and carried out in a fair manner. There is no replacement for intuition and testing. Humans are required to make the moral choices because machines are unable to do so.

If a modeler were to use only the data that supports a hypothesis that is convenient for that modeler, it could lead to inaccurate conclusions and drive actions that are not in the best interest of those affected by the algorithm—even if the resulting model that was built from the underlying datasets is logically consistent.

It could be argued that, at some level, any given collection of attributes that are taken together could result in potential bias. Therefore, best practices argue for responsible oversight and governance principles to ensure that there is no overt or inadvertent bias making its way into the model. This way, regardless of whether a machine or human constructs the model, there is oversight of the appropriateness of the model.

Given the complexities of data analytics, explaining data and proving algorithms require scrutiny. Data experts must exercise human judgment when utilizing machine-assisted model construction. Experts must be able to stand behind their work.

Big Data for Good Is Good for Us All

When one considers the sheer amount of data and everything else that is required to produce successful outcomes—algorithms, accuracy, precision, correlation, analytics, machine learning—it’s easy to see what an important role data science plays and its value to society.

Over the years, the application of data and analytics has had a benevolent effect on society in countless ways. Today:

health plans maintain accurate network directories that are necessary for consumers to make informed decisions and to find the right type of care when needed.
hospitals can predict a patient’s risk for poor outcomes due to external environmental factors and can identify when someone is in need of more support or community resources.
police departments can forecast violent gun crimes using cutting-edge predictive analysis when identifying trends.
the IRS uses investigative solutions to help identify potentially fraudulent tax refund requests. (In fact, fraud is being mitigated at almost every level in every industry.)
telematics can enable insurance companies to better understand their customers, and consumers can learn more about their driving habits and ultimately get better insurance rates. As a result, 70% of Americans can get cheaper insurance.
front-end identity authentication systems can verify if a person receiving government benefits is who he says he is, preventing fraud and saving millions in taxpayer dollars.
credit has been extended to millions of individuals who are unbanked and do not have a credit history because credit agencies now have the ability to consider significantly more datapoints when assessing credit worthiness.

When used properly, big data can keep consumers safe, streamline processes, help experts uncover suspicious activity, and provide valuable insights for business decisions. There are countless benefits to society when big data is used for benevolent purposes.