All input files and created datasets are stored in /input but ignored by git.
All metafeature datasets are stored in metafeatures and ignored by git; a separate gDrive account is used to share these (rather than rerun and reproduce them).
All submissions are stored in submissions but ignored by git.
- numpy (<1.10.0)
- scipy (<0.17.0)
- scikit-learn (<0.16.1)
- bayesian-optimization (`pip install git+https://github.com/fmfn/BayesianOptimization.git`)
In this challenge, BNP Paribas Cardif is providing an anonymized database with two categories of claims:
- claims for which approval could be accelerated, leading to faster payments.
- claims for which additional information is required before approval.
This means we are dealing with a binary classification task, evaluated on the log-loss metric.
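For reference, log loss heavily penalizes confident wrong predictions. A minimal implementation for binary targets (numerically close to `sklearn.metrics.log_loss`):

```python
import numpy as np

def logloss(y_true, y_pred, eps=1e-15):
    # Clip predictions away from 0 and 1 to avoid log(0)
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
```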
In order to divide work among team members, it is important to describe the high-level framework we intend to use (specifics can be found in subsequent sections), so we are able to optimize each stage.
- Dataset and Feature generation - We aim to create multiple datasets that are diverse in nature
- Stacking of level 1 models - for each dataset we create metafeatures based on a variety of models (XGBoost, ExtraTrees, Factorization Machines, etc.)
- Feature selection from level 1 stacked metafeatures - from the potential hundreds of new features created, we must eliminate features before second-level stacking and ensembling.
- Final blending of level 2 stacked features (based on classic train / validation)
- Submission.
The datasets are generated using two scripts: ./R/build_datasets.R and ./python/build_datasets.py. Each dataset is described in a section below; in most cases the file naming convention follows the dataset name, so e.g. MP1 is stored as ./input/{xtrain,xtest}_MP1.csv. In the remainder of this section, brackets next to a dataset name indicate which script was used to generate it.
- count missing values per row
- replace all NA with -1
- map all characters to integers - this means the MP1 dataset makes sense as input only for tree-based models
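The steps above can be sketched in pandas (column names and the `na_count` feature name are illustrative, not taken from the actual scripts):

```python
import pandas as pd

def build_mp1(df):
    out = df.copy()
    # count missing values per row (before imputation)
    out["na_count"] = out.isnull().sum(axis=1)
    # replace all NA with -1
    out = out.fillna(-1)
    # map all character columns to integers - valid input for tree models only
    for col in out.select_dtypes(include="object").columns:
        out[col] = pd.factorize(out[col])[0]
    return out
```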
- count missing values per row
- replace all NA with -1
- addition of quadratic factors (all pairwise combinations of categorical variables)
- map all factors to integers - this means the KB1 dataset makes sense as input only for tree-based models
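The quadratic-factor step can be sketched by concatenating each pair of categorical columns before integer-encoding (the helper name, column naming, and separator are our own):

```python
from itertools import combinations
import pandas as pd

def add_quadratic_factors(df, cat_cols):
    out = df.copy()
    # all pairwise combinations of the categorical variables
    for c1, c2 in combinations(cat_cols, 2):
        out[f"{c1}_x_{c2}"] = out[c1].astype(str) + "_" + out[c2].astype(str)
    return out
```

The cubic factors used elsewhere are the same idea with `combinations(cat_cols, 3)`.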
- count missing values per row
- replace all NA with -1
- addition of quadratic factors (all pairwise combinations of categorical variables)
- addition of cubic factors (all three-way combinations of categorical variables)
- map all factors to integers - this means the KB2 dataset makes sense as input only for tree-based models
- count missing values per row
- replace all NA with -1
- addition of quadratic factors (all pairwise combinations of categorical variables)
- all factors mapped to response rates
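Mapping factors to response rates amounts to out-of-fold target (mean) encoding. A simplified sketch - the fold scheme and the fallback to the global rate are assumptions, not necessarily what the build scripts use:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def response_rate_encode(x, y, n_splits=5, seed=0):
    """Out-of-fold mean of the binary target per category level."""
    x = pd.Series(x).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True)
    enc = pd.Series(np.nan, index=x.index)
    for tr, va in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(x):
        # response rate per level, computed on the training fold only
        means = y.iloc[tr].groupby(x.iloc[tr]).mean()
        enc.iloc[va] = x.iloc[va].map(means).to_numpy()
    # levels unseen in a training fold fall back to the global response rate
    return enc.fillna(y.mean())
```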
- count missing values per row
- replace all NA with -1
- addition of quadratic factors (all pairwise combinations of categorical variables)
- addition of cubic factors (all three-way combinations of categorical variables)
- all factors mapped to response rates via (cross-validated) linear mixed-effects models fitted with lmer (from the lme4 package) in R
- KB6099 as basis
- SVD (via `sklearn.decomposition.TruncatedSVD`) with `n_components` as a function argument
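The SVD step might look like this (the wrapper name and fixed `random_state` are our own):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def svd_features(X, n_components):
    # Reduce the (sparse or dense) feature matrix to n_components columns
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    return svd.fit_transform(X)
```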
- In `dataset_creation`, run `data_preperation.R`
- Run all `build_meta_XX.py` scripts in the python subdir - we have now produced many metafiles that need to be joined into a single dataset
- `build_linear_combo_selection.R` is an R script that merges all metafiles and removes any linear combinations from the dataset
- `build_2ndlLvl_selection.py` takes the output from the above and builds more features by ranking the top N results and taking interactions of these variables
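For intuition, removing linear combinations can be done with a greedy rank check on the merged metafeature matrix - a sketch only; the R script may well use a different method (e.g. `caret::findLinearCombos`):

```python
import numpy as np

def drop_linear_combos(X, tol=1e-10):
    """Greedily keep only columns that increase the matrix rank."""
    keep = []
    for j in range(X.shape[1]):
        cand = X[:, keep + [j]]
        # column j is a linear combination if rank does not grow
        if np.linalg.matrix_rank(cand, tol=tol) == len(keep) + 1:
            keep.append(j)
    return keep
```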
- TODO: PRODUCE SECOND LEVEL MODELS -> NN / XGB / RF / ET (Only best models)
- Final stage is to blend the above models by optimizing their weights (Python L-BFGS-B or other optimization methods in SciPy; mpearmian to produce.)
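The blending step could look like the following with `scipy.optimize.minimize` and L-BFGS-B; the log-loss objective, bounds, and simplex normalization are assumptions, not the final implementation:

```python
import numpy as np
from scipy.optimize import minimize

def blend_weights(preds, y):
    """Find non-negative model weights minimizing log loss of the blend.

    preds: (n_samples, n_models) out-of-fold predictions
    y:     (n_samples,) binary target
    """
    preds, y = np.asarray(preds, float), np.asarray(y, float)
    n_models = preds.shape[1]

    def loss(w):
        # normalize to a convex combination of model predictions
        w = np.maximum(w, 0)
        w = w / max(w.sum(), 1e-12)
        p = np.clip(preds @ w, 1e-15, 1 - 1e-15)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    res = minimize(loss, np.full(n_models, 1.0 / n_models),
                   method="L-BFGS-B", bounds=[(0, 1)] * n_models)
    w = np.maximum(res.x, 0)
    return w / max(w.sum(), 1e-12)
```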
We follow the convention adopted in Kaggle scripts, so R scripts should be executed from within the R subfolder (relative paths are given as `../submissions` etc.).