

Using Machine Learning to Detect Credit Card Fraud

--

Python · Jupyter Notebook · scikit-learn · Pandas · Plotly

--

Introduction

Every day, billions of credit card transactions are made all over the world. With the widespread use of smartphones and the Internet, more and more people are using their credit cards to make purchases online, pay through apps, and so on.

In a scenario such as this, it is extremely important that credit card companies are able to recognize whether a transaction is fraudulent or genuine, so that customers do not end up being charged for items they did not purchase.

In this project, I used the scikit-learn library to develop a prediction model that learns to detect whether a transaction is fraudulent or genuine. I tested four different classification algorithms, Decision Tree, Random Forest, AdaBoost, and Gradient Boosting, to identify which of them would achieve the best results on our dataset.

Development

To develop the predictive model, I used the scikit-learn library and tested four different classification algorithms to see which of them would achieve the best evaluation metrics on the dataset.

The Pandas, NumPy, Matplotlib, Seaborn, and Plotly libraries were also used to explore, handle, and visualize the relevant data for this project.

The dataset used for this project is the Credit Card Fraud Detection dataset posted on Kaggle, which contains credit card transactions made by European cardholders over two days in September 2013.

The dataset has the feature Time, which shows the seconds elapsed between each transaction and the first transaction in the dataset; the feature Amount, containing the transaction amount; and the feature Class, which tells us whether a given transaction is genuine or fraudulent, where 1 = fraud and 0 = genuine.

Features V1, V2, ..., V28 are numerical input variables resulting from a PCA transformation; their original content cannot be disclosed due to confidentiality.
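
A minimal sketch of loading and inspecting the dataset (the creditcard.csv file name and path are assumptions based on how the dataset is distributed on Kaggle):

```python
import pandas as pd

# Load the Credit Card Fraud Detection dataset
# (adjust the path to wherever creditcard.csv lives locally).
df = pd.read_csv("creditcard.csv")

# Columns: Time, V1 ... V28, Amount, Class
print(df.columns.tolist())

# Class distribution: 0 = genuine, 1 = fraud. The dataset is heavily imbalanced.
print(df["Class"].value_counts())
print(df["Class"].value_counts(normalize=True))
```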

During the development of this project, I used certain tools from the sklearn library to help the models achieve higher performance metrics, such as StandardScaler, used to rescale the Amount variable, and SMOTE, from the imblearn library, used to deal with the class imbalance. Both tools help avoid creating a model that is biased towards a particular variable or class.
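
A minimal sketch of this preprocessing step and of fitting the four classifiers, assuming the DataFrame df loaded above; the hyperparameters, split ratio, and random seeds are assumptions rather than the exact values used in the notebook:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
)
from sklearn.metrics import recall_score
from imblearn.over_sampling import SMOTE

# Rescale Amount so it is on a scale comparable to the PCA features V1..V28.
df["Amount"] = StandardScaler().fit_transform(df[["Amount"]]).ravel()

X = df.drop(columns=["Class"])
y = df["Class"]

# Split first, then oversample only the training set with SMOTE so the
# test set keeps the original (imbalanced) class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train_res, y_train_res)
    y_pred = model.predict(X_test)
    print(f"{name}: recall = {recall_score(y_test, y_pred):.4f}")
```

Since fraud is the positive class, recall on the untouched test set reflects how many frauds each model actually catches.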

Evaluation Metrics for Classification Models

When dealing with classification models, there are some evaluation metrics that we can use to assess how well our models perform.

One of those evaluation metrics is the confusion matrix which is a summary of predicted results compared to the actual values of our dataset. This is what a confusion matrix looks like for a binary classification problem:
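
In table form, with rows as the actual classes and columns as the predicted classes, the matrix looks like this:

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | TP | FN |
| Actual negative | FP | TN |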



  • TP stands for True Positive: instances of the positive class that the model correctly predicted as positive.
  • FP stands for False Positive: instances of the negative class that the model incorrectly predicted as positive.
  • FN stands for False Negative: instances of the positive class that the model incorrectly predicted as negative.
  • TN stands for True Negative: instances of the negative class that the model correctly predicted as negative.

Beyond the confusion matrix, we also have some other relevant metrics. They are:

Accuracy

Accuracy simply tells us the proportion of correct predictions. This is how we calculate it:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision

Precision tells us the proportion of positive predictions that are actually correct. This is how we calculate it:

Precision = TP / (TP + FP)

Recall

Recall, which can also be referred to as sensitivity, tells us the proportion of actual positives that our model correctly identifies. This is how we calculate it:

Recall = TP / (TP + FN)

F1 Score

Lastly, F1 Score is the harmonic mean of precision and recall. This is how we calculate it:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

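As a small sketch of how these metrics can be computed with scikit-learn for any of the fitted models (variable names follow the earlier sketches):

```python
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

# Predictions from one of the fitted models, e.g. the AdaBoost classifier.
y_pred = models["AdaBoost"].predict(X_test)

# For binary labels 0/1, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 Score :", f1_score(y_test, y_pred))
```
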
Conclusion

When we work with a machine learning model, we must always be clear about what we are trying to get from that model.

In this project, our goal is to detect fraudulent transactions when they occur, and the model that performed this task best was the AdaBoost Classifier, with a recall of 91.87%, correctly detecting 147 out of 160 fraudulent transactions. However, it is also important to note that the AdaBoost Classifier had the largest number of false positives: 1,321 genuine transactions were mistakenly labeled as fraud, which is 1.54% of all genuine transactions.

A genuine purchase being incorrectly identified as a fraud could be a problem.

In this scenario, it is necessary to understand the business and ask a few questions, such as:

  • How costly would a false positive be?

  • Should we keep the AdaBoost Classifier, which had the best performance in detecting frauds but also produced by far the most false positives, or should we use the Random Forest Classifier, which also identified frauds quite well (82.50% recall) while sharply reducing false positives (0.02% of genuine transactions flagged as fraud)? The latter choice, however, would also mean more fraudsters getting away with it and more customers being mistakenly charged.

These questions, and a deeper understanding of how the business works and how we want to approach the problem with machine learning, are fundamental to deciding whether we are willing to accept a larger number of false positives in order to detect as many frauds as possible.


Kaggle

I've also uploaded this notebook to Kaggle, where the Plotly graphics are interactive: https://www.kaggle.com/code/lusfernandotorres/91-87-recall-with-ada-boost-cc-fraud-detection/notebook

Author

Luís Fernando Torres
