Coder Social home page Coder Social logo

randmose / ieee-cis-fraud-detection Goto Github PK

View Code? Open in Web Editor NEW

This project forked from emisaycheese/ieee-cis-fraud-detection

0.0 0.0 0.0 4.73 MB

Kaggle Fraud Transaction Detection competition

Home Page: https://www.kaggle.com/c/ieee-fraud-detection/

Python 6.05% Jupyter Notebook 93.95%

ieee-cis-fraud-detection's Introduction

IEEE-CIS-Fraud-Detection

Kaggle Fraud Transaction Detection competition


alt text

This is a private repo for winning the Kaggle Fraud Transaction Detection competition](https://www.kaggle.com/c/ieee-fraud-detection/) at 217 out of 6300 teams (3%) !

Screenshot place holder.

Account Id

In this repo, we organize our Kaggle core codes into 2 parts: feature engineering (data pipline) and model training.

Feature Engineering

We did the minimal data cleaning work for this project, as LightGBM model can automatically handle quite a lot, e.g., missing values, outliers. The current optimal way would be the default choice of the model.

We have different methods for continuous and categorical data correspondingly.

Continuous Data.

We have done a variety of mathematical transformation based on the data distributions.

Categorical Data

We spent the majority of our time on this.

  • Risk Mapping and feature embeding. We understood there was a risk of overfitting but we took extra care to deal with this potential issue.

  • Label Transformation. For some categorical features with unknown meanings and less than their unique values was less than a threshold, as this suggested, we used the label transformation function directly from sklearn. We are aware of the bad implications of this but maybe for tree model, the adversarial impacts are trivial or not significant.

Domain Knowledge: this technique incoporates the common knowledge of our daily life, like a transaction made online is more dangerous than local store, or the hours of transaction made and so on.

Interaction Terms: we used the interactions among features depending on the sole critieror that after the interaction, we could see the boundary of two classes more clearly

The most surprising features we found being useful was countings for the nulls of each row, which makes sense and simple, "KISS" as our law.

We used the H2O autoML tool to generate insights/reduce dimensions from some not so important features, e.g., V-type features, features with too many missing values, invarient features.

Data Pipline

It is coming soon!

Model Training

Model

For lightGBM, we had to choose a booster as the core model, we chose DART for the following reasons:

Graph place holder (from qishi ml journal club teng)

  • Combine the advantages of random forest and gradient boosting tree models
  • Our first few trees are not so important thus reducing the effects of overfitting.
  • But really slow to train.

(Not ready yet, T will update this)

Model Segmentation

We have two types of modeling approaches, one is to separate the data via the most important feature - transaction product type and the other one is not. The reason we took these approaches are that we found the product type had dramatic impacts on other features' distributions, missing percentages.

Please find more visualzation evidence in [notebook]. ( T will update this notebook)

Model Training

For both approaches, we used the CV modules to address the overfiting issues.

  • Cross-validation module - dealing with overfitting issues
  • Time-series module - dealing with the time problem. Remember, the train and test dataset are sequential, which test dataset comes later in time

Hyper-Parameter Optimization

  • Grid Search module
  • Random Search module
  • Bayes optimization module - dealing with high dimensional hyper-parameter space searching

Why do we use different methods to search parameters?

Because the search space for LightGBM is so huge and we try to use different methods to find diversified local optimal solutions. (T maybe updates this possibly)

Model stacking

We built the models from a diversified background to reduce the results correlation and thus achieve a better performance in model stacking.

  • Model stacking module - dealing with overfitting issues

In addition, after an in-depth research into the computational mechanisms of ROC AUC scores, we found a clever way to play with it and increase the performance.


What are the differentiating factors?

Techonology & Experiences

  • We utilize the cutting-edge machine learning modeling framework and tracking system to greatly increase our collaborating efficiency. Featuretools that let us to focus on the most important features. In addition, we also used mlflow to track our ML experiments scientifically and control the submissions so that we do not repeat the experiments and avoid overfiting.
  • We are a diversified team of data scientist veterans from consulting, banking and hedgefund industry.

Work for us?

Write separate functions in py file as utils and use notebook to call the necessary

  • Model Stack module V 3
  • Model Segmentation module V 2
  • Parameter optimizations: 3 module E 1
  • CV and times series E 1
  • Data Visual T&E 1
  • Show the fancy techs (gradually) T 4

ieee-cis-fraud-detection's People

Contributors

emisaycheese avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.