Predicting Credit Card Fraud

Photo by Anete Lusina from Pexels

Credit card fraud sucks! It sucks for the financial institution and it sucks for the customer. In 2018, fraudulent transactions resulted in almost $9.5 billion worth of losses in the US alone!

But how can financial institutions predict whether a transaction is likely to be fraudulent? And how can we as consumers protect ourselves from falling victim to fraud?

Let the data speak!

This post won’t cover all the facets of the project. To see my full project, you can download it from my public KNIME Hub space.

The Data

For this exercise, I used synthetic credit card data created by the Sparkov Data Generation tool. The full data can be retrieved here. As financial information is highly sensitive, synthetic data is the usual route for a public exercise such as this. However, the data is meant to model real-life patterns.

I used a subset of the data for training and testing my model.

The training set has 257,868 normal transactions, 1,467 fraudulent transactions, and 973 credit cards.

The testing set has 69,213 normal transactions, 251 fraudulent transactions, and 920 credit cards.

I tested my models with different subsets of the full dataset. Model performance was stable and similar to the results I got using the full dataset.

Here are the variables that came with the original dataset.

I did not use merchant, first, last, trans_num, and unix_time. I filtered these variables out early on in the process.

The Tools

I used KNIME for all my data prepping, feature engineering, and model creation. KNIME is a phenomenal open source end-to-end data science platform. I used Power BI to visualize patterns in the fraud and non-fraud cases. Power BI is a powerful BI and visualization tool from Microsoft.

Data Prepping

The data was relatively clean, and free from missing values. Data prepping revolved around transforming the variables into the appropriate data types needed for my analysis.

I used the KNIME workflow pictured below for my preliminary data type transformations.

Feature Engineering

I had a lot of fun here!

I created new features based on hunches, my experience working in the financial industry, and general curiosity. I figured that the time of day, for instance, would play a role in fraud patterns. Fraudsters should be more active when we are more likely to be sleeping and less alert.

For the credit card (CC) specific variables, I extracted these because I know that the first digit of the card tells you what network it’s on (VISA starts with 4, for example), and the first six digits identify the issuer and card type. I was curious whether the housing_type and job of the customer played a role in fraud, so I extracted these as well.

I also created features at the card level. These include the date of the card’s first transaction in the dataset, whether or not the card had an earlier fraud case, the number of months the card has been in the dataset, and the frequency of transactions in minutes.
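I built these features with KNIME nodes, but for readers who prefer code, here is a minimal Python sketch of the same ideas. The function names, the dictionary keys, and the timestamp format are my assumptions for illustration; they are not taken from the workflow.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "2020-06-21 03:14:25" (an assumption, adjust as needed).
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def card_features(cc_num):
    """Split a card number into its first digit and its first six digits."""
    digits = "".join(ch for ch in str(cc_num) if ch.isdigit())
    return {
        "cc_first_digit": digits[:1],  # network indicator, e.g. "4" for VISA
        "cc_first_six": digits[:6],    # leading six-digit prefix
    }


def transaction_hour(timestamp, fmt=TIME_FORMAT):
    """Return the hour of day (0-23) of a transaction timestamp."""
    return datetime.strptime(timestamp, fmt).hour


def minutes_between_transactions(timestamps, fmt=TIME_FORMAT):
    """Gaps, in minutes, between consecutive transactions of a single card."""
    times = sorted(datetime.strptime(t, fmt) for t in timestamps)
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(times, times[1:])]
```

The gap list can then be averaged per card to get a rough "frequency of transactions in minutes" feature.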

Dealing with Class Imbalance

While fraud cases are a pain, thankfully they are also rare.

This creates a challenge for building predictive models though, as data with class imbalance usually leads to models with poor performance.

To address this, I leveraged the Equal Size Sampling node in KNIME. This node randomly downsamples the non-fraud cases to match the number of fraud cases, so that the ratio of fraud to non-fraud data is 1:1.

So, armed with 2,934 rows of perfectly balanced data, I was ready to create some fraud predictive algorithms!
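To put the imbalance in numbers: 1,467 fraud cases out of 259,335 training transactions is roughly 0.57%. Keeping all 1,467 fraud rows and sampling 1,467 normal rows gives the 2,934 balanced rows. Here is a rough Python sketch of random undersampling. It illustrates the idea only and is not the KNIME node's actual implementation; the label key is_fraud and the seed are assumptions.

```python
import random


def equal_size_sample(rows, label_key="is_fraud", seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    smallest = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    balanced = []
    for group in by_class.values():
        # sample() draws without replacement, so no row is duplicated
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced
```

The trade-off is that most normal transactions are thrown away, which is why it is worth checking that results stay stable across different samples.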

Algorithm Creation, Feature Selection, and Cross-Validation

I created five models to predict credit card fraud. However, I tossed out the Logistic Regression model and the Random Forest model, as they did not perform well early on.

I instead focused on the Decision Tree, XGBoost Tree Ensemble, and Gradient Boosted Trees models. For each of the models, I ran a cross-validation loop (on the training data) inside a feature selection loop. Here is an image of what that looks like.

The cross-validation loop (the X-Partitioner and X-Aggregator nodes) split the data into 10 different pieces and tested the model on each of these chunks of data, measuring the error in each case.
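For anyone curious what 10-fold cross-validation looks like outside KNIME, here is a small Python sketch. The fit and predict arguments are hypothetical callables standing in for whatever model you plug in, and the shuffling seed is an assumption.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def cross_validated_error(rows, labels, fit, predict, k=10):
    """Error rate on each held-out fold. Assumes len(rows) >= k."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        wrong = sum(predict(model, rows[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return errors
```

Each row is used for testing exactly once, so the ten error rates together give a fairer picture than a single train/test split.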

The Feature Selection Loop End node keeps track of the amount of accuracy gained by adding a new feature to the model. Note that I held the transaction amount constant, so it was always included as a base predictive variable. Here is an image of the accuracy table with each new added feature.

Finally, the feature selection loop showcases the combination of features that results in the best model performance.
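The loop described above behaves like greedy forward selection, so here is a hedged Python sketch of that general technique. It is not a reproduction of the KNIME nodes. The score argument is a hypothetical callable that returns an accuracy-like number for a list of features (for example, a cross-validated accuracy), and base holds the features that are always kept.

```python
def forward_selection(candidates, score, base=()):
    """Greedy forward selection: add the most helpful feature at each step.

    Returns the best-scoring feature subset, its score, and the full history
    of (feature_subset, score) pairs, one per added feature.
    """
    selected = list(base)
    remaining = [f for f in candidates if f not in selected]
    if not remaining:
        return selected, score(selected), []

    history = []
    while remaining:
        # Try each remaining feature on top of the current selection.
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial_scores, key=trial_scores.get)
        selected.append(best_feature)
        remaining.remove(best_feature)
        history.append((list(selected), trial_scores[best_feature]))

    best_subset, best_score = max(history, key=lambda step: step[1])
    return best_subset, best_score, history
```

The history plays the role of the accuracy table: one row per added feature, from which the best-scoring combination is picked.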

All the images above are from my chosen model, the XGBoost model. To see the details for the other two models, check out this project on my KNIME Hub.

Model Selection

After feature selection and cross-validation, I tested the models on the testing dataset. Here are the accuracy statistics for all three models.

Decision Tree
XGBoost Tree Ensemble
Gradient Boosted Trees

From the results above, we can see that the XGBoost model missed only 6 of the 251 fraud cases, performing better than the Decision Tree and Gradient Boosted Trees models when it comes to limiting false negatives (predictions of no fraud when there is in fact fraud).

The XGBoost model performs worse than the Gradient Boosted Trees model when it comes to false positives (predictions of fraud when there is in fact no fraud).
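To make the false negative and false positive definitions concrete, here is a short Python sketch that counts them from actual and predicted labels. The function names are mine, and I assume fraud is coded as 1.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives, treating `positive` as fraud."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["tp"] += 1   # fraud, flagged as fraud
        elif a == positive:
            counts["fn"] += 1   # fraud, missed (false negative)
        elif p == positive:
            counts["fp"] += 1   # normal, flagged as fraud (false positive)
        else:
            counts["tn"] += 1   # normal, left alone
    return counts


def recall(counts):
    """Share of actual fraud cases that the model caught."""
    return counts["tp"] / (counts["tp"] + counts["fn"])


def precision(counts):
    """Share of flagged transactions that were actually fraud."""
    return counts["tp"] / (counts["tp"] + counts["fp"])
```

Missing 6 of 251 fraud cases means 245 were caught, so recall works out to 245 / 251, or about 97.6%.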

A missed fraud case is far more expensive and stressful than an assumed fraud case. When there is fraud, the bank has to investigate the case, loop in the merchant, stress out the customer, involve the authorities, try to recover the stolen money, and either pay for it out of pocket or pass the liability to the network, merchant, or customer. Not fun! 😱

When a transaction is suspected of being fraudulent, it can be blocked until the customer verifies its legitimacy. This usually takes a few minutes, which is nothing compared to the months of effort and potential losses a missed fraudulent transaction can cause.

For this reason, I favor the XGBoost model. 🥇

Which Variables are Associated with a Higher Fraud Likelihood?

The variables that went into the final XGBoost model, as prescribed by the feature selection process, are:

Transaction amount

Transaction category

Transaction hour

Age of the card holder

I made some simple visuals, using the full training file, to explore how these variables relate to fraud.

Amount and Fraud

Notice that normal transactions appear to be heavily concentrated between a few dollars and just over $5,000. There are a few big-ticket transactions over $10k.

Fraudulent transactions are most heavily concentrated between about $600 and $1,100.

Transaction Category and Fraud

Category and Fraud

Fraudsters love to commit their dirty crimes with online shopping! The lack of a human agent to spot any suspicious behavior or seek ID verification makes it easier to make a fraudulent transaction. Hence it is not surprising that shopping_net (online shopping) is the top category of fraudulent transactions when it is not even in the top five for regular ones.

Consequently, customers should be careful about which online retailers and websites they expose their credit card details to. Here are some tips on shopping safely online.

Time and Fraud

Hour and Fraud

Regular transactions tend to occur in the afternoon and at night. Fraudulent transactions usually occur from night to early morning. Criminals like to hide their crimes in the shadow of darkness, when people are less vigilant and responsive to transaction alerts, and perhaps more careless with their credit cards.

Let’s observe the interplay between the transaction hour and category.

Hour, Category and Fraud

Online fraud tends to take place late at night, while grocery point-of-sale fraud is predominant from midnight to 3am. Cashiers should be more vigilant and aware of the added risk of fraud during these times.
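I made these charts in Power BI, but the same comparison can be sketched in a few lines of Python. The idea is to compute, separately for fraud and non-fraud, the share of transactions that falls in each hour (or each hour and category pair). The dictionary keys hour, category, and is_fraud, and the example category label grocery_pos, are assumptions for illustration.

```python
from collections import Counter


def distribution_by_key(transactions, key):
    """For each class (0 = normal, 1 = fraud), the share of that class's
    transactions falling under each value returned by `key`."""
    counts = {0: Counter(), 1: Counter()}
    for tx in transactions:
        counts[int(tx["is_fraud"])][key(tx)] += 1

    shares = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        shares[label] = {k: n / total for k, n in counter.items()} if total else {}
    return shares


# Usage sketch:
#   by_hour = distribution_by_key(txs, key=lambda tx: tx["hour"])
#   by_hour_and_category = distribution_by_key(
#       txs, key=lambda tx: (tx["hour"], tx["category"]))
```

Comparing shares within each class, rather than raw counts, keeps the far more numerous normal transactions from drowning out the fraud pattern.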

Age and Fraud

It is heartbreaking, but dubious characters target older individuals for scams and fraud.

Age and Fraud

I used Power BI’s auto bin function to create my age bins. Each labeled bin includes its own age as well as the next 14 years. So bin 15 covers ages 15 through 29.

Notice how fewer people between the ages of 15 and 45 are present in the fraudulent set, while people from 45 to 75 make up a greater proportion of fraud victims. It appears that the older you are, the more likely you are to be a victim of credit card fraud.
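As a rough illustration of this kind of binning, here is a tiny Python sketch. It assumes fixed-width bins of 15 years labeled by their lower edge; it does not reproduce how Power BI chooses bin sizes automatically.

```python
def age_bin(age, width=15):
    """Label an age with the lower edge of its fixed-width bin."""
    return (age // width) * width


def bin_range(label, width=15):
    """Inclusive age range covered by a bin label."""
    return label, label + width - 1
```

Under these assumptions, ages 15 through 29 all land in bin 15, and age 30 starts the next bin.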

The Take Away

This has been a fun and exciting exercise, but since the data set I used is synthetic, these findings should be interpreted cautiously. That said, they do seem to follow trends around credit card fraud.

While the data is synthetic, I think my approach is scientific and solid. Hence, if you work in credit card fraud detection, I’d encourage you to apply my workflow and methods to your dataset. If you do, let me know what you find! 😀

That’s all for now guys!

See you on the blog!

P.S. Forgive any typos you may find. It’s been a long week!

-Tosin Adekanye-


Published by Tosinlitics

Hello! I'm Tosin and I love analyzing stuff and using data science as a crystal ball. Follow me to see my cool dashboards, data science, and analytics projects.
