The rise of big data, with new sources of data for analytics, represents a new opportunity to put data to work in organizations for a wide range of uses. One developing use case for analytics on large datasets is fraud discovery.
GOTCHA! is a new fraud detection technique developed by researchers in the U.S. and Belgium that uses social network-based analysis.
To illustrate the technique, let's look at a real-life application of network analysis to tax avoidance detection. The data under consideration is governmental tax data in which companies try to avoid paying their tax obligations.
Companies use resources to perform their activities. Resources may include buyers, suppliers, customers, employees, machinery, and accountants, for example. However, for reasons of interpretability, we do not distinguish between the different types of resources in the analysis.
A sample network is shown in the figure at left. Companies and resources are represented by circles and wedges, respectively. Given the two types of nodes, this network is referred to as a bipartite graph. Rather than connecting the companies directly to each other by means of the resources they share, the resources are included in the network as separate entities. By doing so, the network reflects reality more closely and provides more details about the intensity of the connections between companies and their resources.
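To make the bipartite structure concrete, here is a minimal sketch in Python. The company and resource names are invented for illustration and the helper functions are hypothetical, not part of GOTCHA! itself. It keeps resources as separate nodes and shows how to recover which companies are connected through which shared resources.

```python
from collections import defaultdict

# Hypothetical (company, resource) links; all names are invented for illustration.
links = [
    ("company_A", "resource_1"),
    ("company_A", "resource_2"),
    ("company_B", "resource_2"),
    ("company_B", "resource_3"),
    ("company_C", "resource_3"),
]


def build_bipartite(links):
    """Return two adjacency maps: company -> resources and resource -> companies."""
    company_to_resources = defaultdict(set)
    resource_to_companies = defaultdict(set)
    for company, resource in links:
        company_to_resources[company].add(resource)
        resource_to_companies[resource].add(company)
    return company_to_resources, resource_to_companies


def shared_resources(company, company_to_resources, resource_to_companies):
    """Map every other company to the set of resources it shares with `company`."""
    shared = defaultdict(set)
    for resource in company_to_resources[company]:
        for other in resource_to_companies[resource]:
            if other != company:
                shared[other].add(resource)
    return dict(shared)


c2r, r2c = build_bipartite(links)
neighbors_of_b = shared_resources("company_B", c2r, r2c)
```

Because the resources stay in the graph, the result records not only that two companies are linked but also through which resources, which is the extra detail a company-to-company projection would lose.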
Based on input from tax experts, we understand that companies can set up illegal constructions to avoid tax contributions by working closely together to perpetrate fraud. More specifically, companies that are part of a fraudulent setup are organized in such a way that they do not make enough profit to pay taxes to the government. The real profit is siphoned off by other companies in the setup. Once a company files for bankruptcy and can no longer continue its activities, its resources are moved to other existing companies or to newly founded ones. Network analysis allows the trail of these resources to be followed from one bankrupt company to another in order to uncover the fraudulent setups in the network. These fraudulent setups are the so-called spider constructions: the companies in a spider construction form a web of fraud and are all closely connected to each other through the resources they share or transfer. The analytical task at hand then becomes: given a network of companies and resources, how can we use the labels of a few confirmed fraudulent companies to infer a fraud probability for all the other companies in the network?
GOTCHA! is a fraud detection technique designed to find individual company fraud by combining evidence from two different sources: the isolated environment of the company and its relations with other companies. To do so, GOTCHA! combines three types of data:
Local attributes: These attributes comprise the characteristics of each company examined in isolation. Example attributes include the age of the company, the sector in which the company is operating, financial statement information, and its address.
Direct network attributes: These attributes characterize the direct neighborhood, or EgoNet, of a company. Recall that the company-resource network is a bipartite graph. The direct neighborhood of a company is the company together with its current and past resources. Remember, however, that fraud is only attributed to certain companies, and that the resources and the companies still presumed legitimate are initially unlabeled.
Indirect network attributes: A propagation algorithm is used to infer a fraud probability for each unlabeled resource and company. The propagation algorithm is based on Google's PageRank and treats fraud as a virus moving through the network. A node in the network that is highly exposed to fraud receives a high exposure score.
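The two kinds of network attributes can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the direct attribute is a simple count of known-fraud companies in the EgoNet, and the indirect attribute is plain personalized PageRank (a random walk that restarts at the confirmed fraud cases) on an undirected company-resource graph. The function names, the example edges, and the damping value of 0.85 are all assumptions.

```python
def build_neighbors(edges):
    """Undirected adjacency map for a list of (company, resource) edges."""
    neighbors = {}
    for company, resource in edges:
        neighbors.setdefault(company, set()).add(resource)
        neighbors.setdefault(resource, set()).add(company)
    return neighbors


def egonet_fraud_count(company, neighbors, fraud_seeds):
    """Direct attribute: known-fraud companies sharing a resource with `company`."""
    linked = set()
    for resource in neighbors[company]:
        linked |= neighbors[resource]
    linked.discard(company)
    return len(linked & set(fraud_seeds))


def propagate_fraud(neighbors, fraud_seeds, damping=0.85, iterations=50):
    """Indirect attribute: personalized PageRank restarting at the fraud seeds."""
    seeds = set(fraud_seeds) & set(neighbors)
    if not seeds:
        raise ValueError("need at least one fraud seed present in the graph")
    restart = {node: 0.0 for node in neighbors}
    for seed in seeds:
        restart[seed] = 1.0 / len(seeds)
    scores = dict(restart)
    for _ in range(iterations):
        updated = {}
        for node in neighbors:
            # Each neighbor splits its current score evenly over its own links.
            incoming = sum(scores[nb] / len(neighbors[nb]) for nb in neighbors[node])
            updated[node] = damping * incoming + (1 - damping) * restart[node]
        scores = updated
    return scores


# Hypothetical network: A, B, C form a chain through shared resources; D and E are separate.
edges = [
    ("company_A", "resource_1"), ("company_B", "resource_1"),
    ("company_B", "resource_2"), ("company_C", "resource_2"),
    ("company_D", "resource_3"), ("company_E", "resource_3"),
]
neighbors = build_neighbors(edges)
direct_b = egonet_fraud_count("company_B", neighbors, {"company_A"})
scores = propagate_fraud(neighbors, {"company_A"})
```

In this toy network, company_B shares a resource with the confirmed fraud case, so it receives a higher exposure score than company_C, which is only reachable through company_B. The disconnected companies D and E receive no exposure at all.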
Local attributes are extracted from so-called factual and historical datasets. A factual dataset reflects the current situation of each observation, while the historical datasets log its previous states, so every change to the factual dataset is preserved. The network attributes are distilled from transactional data, which comprise the interactions of a company with other companies by means of their resources. The overall architecture of GOTCHA! is presented in the figure below.
The local attributes and the direct and indirect network attributes are then fed to an analytical model (a random forest in our case), which outputs the fraud probabilities.
As suspicious cases are typically inspected manually, only a limited number of companies can be identified by GOTCHA! for follow-up analysis.
In our tax avoidance setting, GOTCHA! was limited to flagging 100 cases, of which 71% turned out to be fraudulent. From the analysis, it became clear that the social network information significantly contributed to this performance.
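The evaluation under an inspection budget can be sketched as a precision-at-k calculation: rank companies by predicted fraud probability, keep the top k, and measure the share that inspectors confirm as fraud. The scores, labels, and function name below are invented for illustration; in the setting above, k would be 100 and the resulting precision 0.71.

```python
def precision_at_k(fraud_probability, confirmed_fraud, k):
    """Select the k highest-scoring companies and return them with their hit rate."""
    ranked = sorted(fraud_probability, key=fraud_probability.get, reverse=True)
    selected = ranked[:k]
    if not selected:
        return selected, 0.0
    hits = sum(1 for company in selected if company in confirmed_fraud)
    return selected, hits / len(selected)


# Hypothetical model output and inspection results.
fraud_probability = {"c1": 0.91, "c2": 0.15, "c3": 0.78, "c4": 0.66, "c5": 0.05}
confirmed_fraud = {"c1", "c4", "c5"}

selected, precision = precision_at_k(fraud_probability, confirmed_fraud, k=3)
```

Here two of the three inspected companies are fraudulent, so the precision is 2/3. Company c5 is fraudulent but falls outside the budget, which is why the ranking quality at the top of the list matters more than overall accuracy in this setting.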
Given its generic setup, GOTCHA! can easily be mapped to other settings, such as credit card fraud detection. The network in credit card transaction fraud is also a bipartite graph, connecting merchants with credit card holders by means of transactions. The local attributes are again enriched with both direct and indirect network attributes. An interesting approach in credit card fraud is to derive local attributes using the RFM framework (recency, frequency, monetary). Using the past transactional behavior of a customer, RFM assesses for each transaction (a) recency, the time passed since the previous transaction; (b) frequency, the number of transactions made; and (c) monetary value, the average amount of past transactions. A good approach is to derive the RFM variables at different levels of aggregation: merchant level, merchant category level, global level, country level, and currency level. Applied to the frequency dimension, this means that one counts the number of transactions made at the merchant, in the merchant category (for example, groceries), overall, in the country (for example, Belgium), and in the currency (for example, the euro). GOTCHA! then uses both the local and network attributes to learn a fraud classifier. In our empirical analysis, it was capable of reaching an accuracy of up to 98%. Again, the social network information turned out to be highly significant.
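The RFM derivation at several aggregation levels can be sketched as below. The transaction fields, the level names, and the choice to measure recency in days are assumptions made for this example. For a new transaction, each level keeps only the customer's past transactions that match on that level's field (or all of them at the global level) and computes the three RFM values from them.

```python
from datetime import datetime

# Aggregation level -> transaction field to match on (None means no filter).
LEVELS = {
    "merchant": "merchant",
    "merchant_category": "category",
    "global": None,
    "country": "country",
    "currency": "currency",
}


def rfm_features(history, new_tx):
    """Compute recency, frequency and monetary value per aggregation level."""
    features = {}
    for level, field in LEVELS.items():
        past = [
            tx for tx in history
            if tx["time"] < new_tx["time"]
            and (field is None or tx[field] == new_tx[field])
        ]
        if past:
            last_time = max(tx["time"] for tx in past)
            recency_days = (new_tx["time"] - last_time).total_seconds() / 86400
            monetary = sum(tx["amount"] for tx in past) / len(past)
        else:
            recency_days = None  # no earlier transaction at this level
            monetary = 0.0
        features[level] = {
            "recency_days": recency_days,
            "frequency": len(past),
            "monetary": monetary,
        }
    return features


# Hypothetical history of one card holder.
history = [
    {"time": datetime(2024, 1, 1), "merchant": "grocer_1", "category": "groceries",
     "country": "BE", "currency": "EUR", "amount": 40.0},
    {"time": datetime(2024, 1, 5), "merchant": "grocer_2", "category": "groceries",
     "country": "BE", "currency": "EUR", "amount": 60.0},
    {"time": datetime(2024, 1, 8), "merchant": "hotel_1", "category": "travel",
     "country": "FR", "currency": "EUR", "amount": 200.0},
]
new_tx = {"time": datetime(2024, 1, 10), "merchant": "grocer_1", "category": "groceries",
          "country": "BE", "currency": "EUR", "amount": 55.0}

features = rfm_features(history, new_tx)
```

For this customer, the merchant-level frequency is 1 while the global frequency is 3, so the same transaction looks routine at one level and less familiar at another. That contrast is what the multi-level features give the classifier.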
For detection of tax avoidance as well as credit card fraud, social network information is an interesting add-on to local information to further boost the performance of an analytical fraud model. An additional area of application may include money laundering situations, in which people can be linked to cash transactions.
The GOTCHA! network-based fraud detection framework was developed by Véronique Van Vlasselaer, Tina Eliassi-Rad, Leman Akoglu, Monique Snoeck, and Bart Baesens.