Credit Card Fraud Detection

Vitor dos Santos
11 min read · Dec 3, 2019


Co-authors: Marcos Junior and Karine Costa

Figure by Peng Wang on Ai-Journey

Nowadays, most of our financial transactions are virtual. Naturally, credit card fraud is a big problem [Joan Weber]. People only use virtual payment methods because they can be trusted, so it is essential to find solutions that bring more security to credit card users. The current volume of credit card transactions makes it impossible for humans to analyse each transaction. Machine learning models, however, make this task feasible.

On Kaggle, there is a dataset containing transactions made with credit cards by European cardholders in September 2013. This data has already been studied in other Kaggle notebooks [Janio Martinez, Joparga].

In this article we provide an extensive analysis of the best options for dealing with imbalanced datasets and how they apply to the studied dataset. We focus on testing the model’s performance using the entire dataset and using two common techniques for handling unbalanced data: Random Undersampling and SMOTE (Synthetic Minority Over-sampling Technique). You may access the code used to produce the results in this article, along with some additional comments, on Kaggle.

Dataset

Our dataset has 30 features, of which 28 were anonymized with a PCA transformation (a dimensionality reduction technique) to preserve confidentiality. The only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. The ‘Time’ feature contains the seconds elapsed between each transaction and the first transaction in the dataset, and ‘Amount’ is the transaction amount.

The dataset is highly unbalanced. There are only 492 fraudulent transactions, which represent just 0.172% of all transactions! The image below illustrates the difference in the number of samples between the fraudulent and non-fraudulent classes.

Target distribution

Metrics

Before we start to build our model, it is important to describe the metrics used in this work to evaluate it. The labels of the confusion matrix represent the following configurations:

True Positive (TP): Fraudulent transactions predicted as fraudulent.
True Negative (TN): Non-fraudulent transaction predicted as non-fraudulent.
False Positive (FP): Non-fraudulent transactions predicted as fraudulent.
False Negative (FN): Fraudulent transactions predicted as non-fraudulent.
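
As a quick illustration (a sketch, not the original project code), these four counts can be extracted from scikit-learn’s confusion_matrix; y_test and y_pred are assumed to be 0/1 arrays where 1 marks a fraudulent transaction:

from sklearn.metrics import confusion_matrix

# For binary 0/1 labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")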

Accuracy

Usually, machine learning projects employ accuracy as the metric to evaluate their models. This metric is mathematically defined as follows:

Accuracy = (TP+TN)/(TP+TN+FP+FN)

Accuracy represents the fraction of predictions that the model got right. While it might seem like a good metric for evaluating how well a model performs, it has a huge downside on this unbalanced dataset [Randy Macaraeg].

Our dataset has precisely 284,807 samples, of which 284,315 are non-fraudulent. If we simply labelled every transaction as non-fraudulent, we would have an accuracy of 99.83%! Wonderful, isn’t it?

No! The problem is that we would miss every fraudulent transaction, and catching them is the main goal of the project. Thus, this metric is not advisable for unbalanced datasets and will not be used in this project.
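
To make the point concrete, here is the arithmetic of that trivial “predict everything as non-fraudulent” baseline:

# All 284,315 non-fraudulent samples are classified correctly,
# while all 492 fraudulent ones are missed.
tp, tn, fp, fn = 0, 284315, 0, 492
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"{accuracy:.4%}")  # about 99.83%, despite catching zero frauds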

Recall

Recall = (TP)/(TP+FN)

We may understand recall as the percentage of fraudulent transactions that were correctly identified by our model. Ideally, we want our model to detect every fraudulent transaction (a recall equal to one), so if we simply label all transactions as fraudulent, the recall will be one. However, this is clearly not a good approach, since all the non-fraudulent transactions would be incorrectly labelled. Therefore, we cannot use recall alone to evaluate our model.

Precision

Precision = (TP)/(TP+FP)

We may understand precision as the percentage of transactions predicted to be fraudulent that are actually fraudulent. If we label only one transaction as fraudulent and it really is fraudulent, our precision will be equal to one. However, in the example described in the recall section (labelling all transactions as fraudulent), the precision would be very low.

Thus, we cannot consider only recall or only precision to evaluate our model, since each of these metrics can be misleading on its own. We must consider both of them simultaneously.

G-mean

G-mean = √(Sensitivity × Specificity)

Another interesting metric used in this project is the G-mean score. The Geometric Mean (G-mean) measures the balance between classification performance on the majority and minority classes. A low G-mean indicates poor performance on the positive cases, even if the negative cases are correctly classified. This measure is important to avoid overfitting the negative class and underfitting the positive class [Josephine S Akosa].
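
As a sketch of how these three metrics can be computed, assuming the imbalanced-learn package is available and y_test and y_pred are the same 0/1 arrays as before (the original notebook may compute them differently):

from sklearn.metrics import recall_score, precision_score
from imblearn.metrics import geometric_mean_score  # provided by imbalanced-learn

recall = recall_score(y_test, y_pred)          # TP / (TP + FN)
precision = precision_score(y_test, y_pred)    # TP / (TP + FP)
g_mean = geometric_mean_score(y_test, y_pred)  # sqrt(sensitivity * specificity)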

Scaling

As previously stated, two features were not transformed with PCA: ‘Time’ and ‘Amount’. To bring both features to a comparable scale, we used the RobustScaler from scikit-learn, which removes the median and scales the data according to the quantile range.

It is worth highlighting the importance of scaling in machine learning pipelines. If we do not scale, features with a large magnitude range may receive a higher weight in the modeling process than features with a small magnitude range. More information about the importance of scaling may be found in [Sudharsan Asaithambi].
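
A minimal sketch of this step, assuming the Kaggle CSV has been loaded into a pandas DataFrame named df:

import pandas as pd
from sklearn.preprocessing import RobustScaler

df = pd.read_csv("creditcard.csv")
scaler = RobustScaler()  # subtracts the median and divides by the interquartile range
df[["Time", "Amount"]] = scaler.fit_transform(df[["Time", "Amount"]])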

Building a model with the Original Dataset

In order to find the best model for our project, we tested many different configurations, which include selecting the features most correlated with the target class, trying different classifiers and removing outliers.

Correlation Analysis

Correlation matrices are essential for understanding our data. We want to know whether there are features that heavily influence whether a specific transaction is a fraud or not. The image below shows a heat map of the correlation between each pair of features in the original dataset.

It is interesting to observe that the ‘V’ features have near-zero correlation with each other, which makes sense, since they were produced by a PCA transformation.

We want to use features that have a high correlation (positive or negative) with the ‘Class’ feature. If a feature is not correlated with the target variable, changes in its values do not correspond to changes in the target. So, instead of helping the modeling process, such a feature may harm the classification results.
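
One simple way to perform this selection is sketched below, under the assumption that the scaled data lives in the DataFrame df from the scaling sketch; the original notebook may do it differently:

k = 15  # number of features to keep
corr_with_class = df.corr()["Class"].drop("Class")
best_features = corr_with_class.abs().sort_values(ascending=False).head(k).index.tolist()
X = df[best_features]
y = df["Class"]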

Modeling tests

To evaluate the model, we split the dataset into 80% for training and 20% for testing, with the test set held out for final evaluation. During training we also performed five-fold cross-validation.
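
A sketch of this protocol with scikit-learn (the stratified split and the fixed random_state are assumptions made for reproducibility, not details stated in the article):

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# 80/20 hold-out split, preserving the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring="recall")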

Now, we provide a graphical analysis of all the tests performed on the original dataset. For each configuration, we tried different threshold values to see which one provides the best precision/recall trade-off (a sketch of this threshold sweep follows the list below). The classifier used is written in the title of each figure.

All features: best_threshold = 0.4 (recall = 0.867347, precision = 0.894737, g-mean = 0.931216).
15 best features (higher correlation with Class): best_threshold = 0.4 (recall = 0.887755, precision = 0.915789, g-mean = 0.942141).
10 best features (higher correlation with Class): best_threshold = 0.4 (recall = 0.887755, precision = 0.887755, g-mean = 0.942116).
5 best features (higher correlation with Class): best_threshold = 0.5 (recall = 0.857143, precision = 0.913043, g-mean = 0.925755).
15 best features (higher correlation with Class): best_threshold = 0.1 (recall = 0.877551, precision = 0.796296, g-mean = 0.936596).
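
The threshold sweep mentioned above might look roughly like the sketch below, reusing clf, X_train, y_train, X_test and y_test from the earlier sketch; the exact grid of thresholds used in the notebook may differ:

import numpy as np
from sklearn.metrics import recall_score, precision_score

clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of the fraud class

for threshold in np.arange(0.1, 1.0, 0.05):
    y_pred = (proba >= threshold).astype(int)
    print(round(threshold, 2),
          recall_score(y_test, y_pred),
          precision_score(y_test, y_pred, zero_division=0))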

From this initial set of tests (and others not described here), we can conclude that Random Forest performs better than Logistic Regression. Furthermore, removing outliers does not improve the model’s performance.

The best configuration was achieved using the 15 features with the highest correlation with the class feature and the Random Forest classifier, which resulted in recall = 0.887755, precision = 0.915789 and g-mean = 0.942141.

Random Undersampling

This technique undersamples the majority class randomly and uniformly. It can potentially lead to loss of information, but if the examples of the majority class are close to one another, it may yield good results [Dataman].

Random Undersampling.

Our training set has 388 fraudulent samples. Therefore, by applying the random undersampling technique, our new training dataset will have 388 + 388 = 776 samples (considering a 50/50 ratio).
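
A sketch of this step with imbalanced-learn, resampling only the training split so that the test set keeps its original distribution (the random_state is again an assumption for reproducibility):

from imblearn.under_sampling import RandomUnderSampler

# sampling_strategy=1.0 keeps one majority sample per minority sample (50/50)
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)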

Correlation Analysis

The image below shows a heat map of the correlation between each feature in this new training dataset.

Now that the number of samples has decreased significantly, the correlation between the variables has increased. In this sense, we have to choose carefully the features we intend to use in the training procedure in order to avoid multicollinearity.

However, the features with the highest correlation with the Class feature also have a high correlation with one another, which can harm the training process.

Modeling tests

Using initially this 50/50 ratio dataset, we got the following results:

All features: best_threshold = 0.65 (recall = 0.908163, precision = 0.096009, g-mean = 0.945928).
All features: best_threshold = 0.9 (recall = 0.918367, precision = 0.206422, g-mean = 0.955395).
15 best features (higher correlation with Class): best_threshold = 0.65 (recall = 0.908163, precision = 0.088822, g-mean = 0.945295).
10 best features (higher correlation with Class): best_threshold = 0.65 (recall = 0.918367, precision = 0.087294, g-mean = 0.950353).
3 best features with low multicollinearity: best_threshold = 0.75 (recall = 0.867347, precision = 0.067088, g-mean = 0.921584).

As we can see, the results obtained with Random Undersampling are quite different from the ones presented previously with the original data. Even though they may not look good due to the low precision, we should not simply discard these results! So let’s analyse them.

The best results here were obtained with the Random Forest classifier using 10 features, which provided recall = 0.918367, precision = 0.087294 and g-mean = 0.950353, and with the Logistic Regression classifier using all features, which provided recall = 0.918367, precision = 0.206422 and an impressive g-mean = 0.955395! This g-mean is higher than that of the best solution using the original dataset.

We have to consider that, depending on the application, trading precision for a higher recall is an interesting option. In this project, for example, one may consider it better to predict that a non-fraudulent transaction is fraudulent than the opposite (recall prioritized over precision). In other words, it is better to have the annoyance of a rejected payment than to lose quite a lot of money.

Another good example of this trade-off comes from medical applications: it is better for a machine learning model to wrongly predict that someone has cancer than to miss the diagnosis when the person really does.

However, if you are not happy with the low precision, you may increase the number of non-fraudulent samples used in the training process, so that the non-fraudulent/fraudulent ratio differs from 50/50 (a code sketch follows the results below). Let’s try this out.

10 best features with 75/25 ratio: best_threshold = 0.55 (recall = 0.918367, precision = 0.149007, g-mean = 0.953974).
10 best features with 90/10 ratio: best_threshold = 0.55 (recall = 0.908163, precision = 0.497207, g-mean = 0.952222).
10 best features with 99/1 ratio: best_threshold = 0.10 (recall = 0.908163, precision = 0.088206, g-mean = 0.945235).
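
The sketch below shows how such ratios can be obtained with imbalanced-learn, reusing X_train and y_train from before: sampling_strategy is the desired minority/majority ratio after resampling, so a 75/25 (non-fraud/fraud) split corresponds to 25/75 = 1/3.

from imblearn.under_sampling import RandomUnderSampler

# Keep three non-fraudulent samples per fraudulent one (75/25 ratio)
rus_75_25 = RandomUnderSampler(sampling_strategy=1/3, random_state=42)
X_train_75, y_train_75 = rus_75_25.fit_resample(X_train, y_train)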

From the three plots just above, it can be seen that the results become more similar to the results obtained with the original, unbalanced data shown in the previous section, which is expected, as we increase the ratio and therefore the imbalance in the training data.

Furthermore, considering Random Forest classifiers, we got a g-mean score even higher than before! Using a 75/25 ratio, we got recall = 0.918367, precision = 0.149007 and g-mean = 0.953974. Compared with the previous results in this section, our recall remained the same, but our precision almost doubled! So keep in mind that changing the dataset proportion may improve your results.

SMOTE

SMOTE is a statistical technique for increasing the number of cases in your dataset in a balanced way. It works by generating new instances from existing minority cases that you supply as input. This implementation of SMOTE does not change the number of majority cases [Microsoft].

Oversampling method.

The new instances are not just copies of existing minority cases; instead, the algorithm takes samples of the feature space for each target class and its nearest neighbours, and generates new examples that combine features of the target case with features of its neighbours. This approach increases the features available to each class and makes the samples more general [Microsoft].

Our training set has 388 fraudulent samples and 227,451 non-fraudulent samples. Therefore, by applying the SMOTE technique, our new training dataset will have 454,902 samples (considering a 50/50 ratio).
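
A sketch of this step with imbalanced-learn, again applied only to the training split and with random_state as an assumption:

from imblearn.over_sampling import SMOTE

# Synthesize fraud samples until both classes have the same size (50/50)
smote = SMOTE(sampling_strategy=1.0, random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)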

Modeling tests

Using this 50/50 ratio dataset, we got the following results:

All features: best_threshold = 0.2 (recall = 0.918367, precision = 0.312500, g-mean = 0.956645).
15 best features (higher correlation with Class): best_threshold = 0.1 (recall = 0.938776, precision = 0.061662, g-mean = 0.956903).
15 best features (higher correlation with Class): best_threshold = 0.8 (recall = 0.908163, precision = 0.096425, g-mean = 0.945962).
10 best features (higher correlation with Class): best_threshold = 0.2 (recall = 0.887755, precision = 0.213235, g-mean = 0.939544).

Similar to what happened when applying the Random Undersampling technique, the results in this section show poor precision values.

We may highlight two output results:
(1) Random Forest classifier with all features, which provided recall = 0.918367, precision = 0.312500 and g-mean = 0.956645.
(2) Random Forest classifier with 15 features, which provided recall = 0.938776, precision = 0.061662 and g-mean = 0.956903.

Output (1) is better than the best output obtained with Random Undersampling! We maintained the same recall, but our precision increased from 0.206422 to 0.312500 and our g-mean from 0.955395 to 0.956645! If you want a high recall with a not-so-low precision, go for output (1).

If you want a high recall no matter what the precision is, output (2) is the best option so far. However, we do not recommend it, since the precision is far too low and the number of non-fraudulent transactions labelled as fraud is considerably high.

Conclusion

This article provided an extensive analysis of results regarding techniques for dealing with unbalanced datasets. Depending on which metric you want to optimize, applying these techniques may improve your model’s performance.

During the discussion of the results, we tried to explain the consequences of choosing one model over another. Ultimately, the metric you want to optimize depends exclusively on your application.

Between Random Undersampling and SMOTE techniques, SMOTE provided better results but with the disadvantage of taking much longer to build the models.

Future work includes testing more SMOTE configurations as well as adding other techniques for dealing with unbalanced datasets.

Hope you have liked this article! See ya!

References

[https://www.javelinstrategy.com/press-release/identity-fraud-hits-all-time-high-167-million-us-victims-2017-according-new-javelin]

[https://www.kaggle.com/mlg-ulb/creditcardfraud]

[https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets]

[https://www.kaggle.com/joparga3/in-depth-skewed-data-classif-93-recall-acc-now]

[https://support.sas.com/resources/papers/proceedings17/0942-2017.pdf]

[https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e]

[https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59]

Vitor dos Santos

PhD student in Computer Science at Dublin City University. Interested in Computer Vision, Deep Learning and Data Science.