Big Data for Credit Scoring: Opportunities and Challenges

By Bart Baesens

Apr 4, 2016

Over the past few decades, banks have gathered plenty of information describing the default behavior of their customers. Examples include a customer’s date of birth, gender, income, and employment status. Classical credit scorecards are typically built from data extracted from traditional transactional systems such as OLTP (online transaction processing), ERP (enterprise resource planning), and CRM (customer relationship management) applications. All of this data is stored in large (relational) databases or data warehouses.

The emergence of big data, characterized in terms of its four V’s—volume, variety, velocity, and veracity—has created both opportunities and challenges for credit scoring.

The online social graph is a recent example of such a big data source. Think about the major social networks such as Facebook, Twitter, LinkedIn, Weibo, and WeChat. Together, these networks capture information about close to 2 billion people, including their friends, preferences, and other behavior, creating a massive digital trail. Then there is the Internet of Things (IoT), the emerging sensor-enabled ecosystem that is going to connect various objects (e.g., homes and cars) with each other and with humans. Finally, we see more and more open or public information, such as data about weather, traffic, maps, and macroeconomic trends.

All of the above data-generating processes can be characterized in terms of the sheer volume of data that is being generated. Clearly, this poses serious challenges in terms of setting up scalable storage architectures combined with a distributed approach to data manipulation and querying.

Big data also comes in a great variety of formats. Traditional data types or structured data, such as customer name and customer birth date, are increasingly being complemented with unstructured data, including images, fingerprints, tweets, emails, Facebook pages, sensor data, and GPS data. Although the former can be easily stored in traditional (e.g., relational) databases, the latter needs to be accommodated using the appropriate database technology facilitating the storage, querying, and manipulation of each of these types of unstructured data. This requires a substantial effort since it is estimated that at least 80% of all data is unstructured.

Velocity refers to the speed at which the data is generated and needs to be stored and analyzed. Think about streaming applications such as online trading platforms, YouTube videos, SMS messages, credit card swipes, and phone calls; all are examples in which high velocity can be a key concern.

Finally, veracity pertains to the quality or trustworthiness of the data. Unfortunately, more data does not automatically mean better data, so the quality of the data-generating process must be closely monitored and guaranteed.

As the volume, variety, and velocity of data continue to grow, along with issues regarding veracity, so do the opportunities for building better credit scoring models. Think about Facebook or Twitter as an example. Knowing a credit applicant’s hobbies, followers, friends, likes, education, and workplace could help quantify their creditworthiness more accurately. Another useful data source is call detail records, or CDR data, which capture the mobile phone usage of an applicant. Web browsing behavior could be a useful addition as well.

The availability of these big data sources is creating new opportunities as well as challenges for credit scoring. First, these sources may be useful for scoring customers who lack borrowing experience (because it’s their first loan or they recently moved to a new country) and who would automatically be perceived as risky by traditional credit scoring models, which rely on historical information. By using alternative data sources, a better assessment of the credit risk can be made. Another example is presented by developing countries, where banks often lack historical credit information and no local credit bureaus may be available. Given the widespread use of social networks and mobile phones (even in developing countries!), the data gathered might be an interesting alternative basis for credit scoring.

Obviously, using these data sources also presents challenges. The first one concerns privacy. It is important that customers are properly informed about what data is used to calculate their credit score. An opt-out option should always be provided. Furthermore, using social network data for credit scoring can trigger new fraud behavior whereby customers strategically construct their social network to artificially and maliciously improve their credit quality. Finally, regulatory compliance might also become an important issue. Many countries prohibit the use of gender, age, marital status, national origin, ethnicity, and beliefs for credit scoring. Since much of this information can be easily scraped from social networks, it may be harder to oversee regulatory compliance when using social network or other data for credit scoring.
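To make the compliance point concrete, here is a minimal sketch of how a scoring pipeline might withhold prohibited attributes from scraped profile data before it reaches a model. The attribute names, the `split_features` helper, and the sample applicant are all hypothetical and chosen only for illustration; the actual list of prohibited characteristics depends on the jurisdiction.

```python
# Hypothetical illustration: the attribute names and this helper are
# assumptions for the sketch, not taken from any regulation or library.
PROHIBITED_ATTRIBUTES = {
    "gender",
    "age",
    "marital_status",
    "national_origin",
    "ethnicity",
    "beliefs",
}


def split_features(applicant: dict) -> tuple:
    """Return (features allowed for scoring, sorted names of withheld attributes)."""
    allowed = {
        name: value
        for name, value in applicant.items()
        if name.lower() not in PROHIBITED_ATTRIBUTES
    }
    withheld = sorted(
        name for name in applicant if name.lower() not in allowed
        and name.lower() in PROHIBITED_ATTRIBUTES
    )
    return allowed, withheld


# Made-up applicant record combining traditional and social network fields.
applicant = {
    "income": 42000,
    "employment_status": "employed",
    "gender": "F",
    "marital_status": "single",
    "follower_count": 310,
}

allowed, withheld = split_features(applicant)
print("Used for scoring:", sorted(allowed))
print("Withheld:", withheld)
```

A name-based filter like this only removes attributes that are explicitly labeled. It does nothing about other fields that may indirectly reveal the same information, which is part of why compliance becomes harder to oversee with social network data.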

