
Analyze-This-17

(American Express Flagship Data Science Competition)

Declared 'Outstanding Performer' by American Express.



Estimation Technique Used

We used a Gradient Boosting Machine (GBM), implemented with the xgboost Python library.

A Gradient Boosting Machine iteratively trains decision trees and ensembles them (essentially combining their predictions), minimizing the error function (in our case, the ‘softmax’ function) at each step. The xgboost library provides an out-of-the-box implementation of a GBM; it is a form of extreme gradient boosting.
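
As a minimal, illustrative sketch of this setup (the data here is random; `X`, `y`, and all other names are placeholders, not the repo's variables):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Illustrative stand-ins: X is the engineered feature matrix, y holds
# integer labels 0..3 for "None", "Supp", "Elite", "Credit".
X = np.random.rand(1000, 21)
y = np.random.randint(0, 4, size=1000)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Multiclass GBM; "multi:softprob" makes predict_proba return one
# probability per class instead of a single hard label.
clf = xgb.XGBClassifier(objective="multi:softprob")
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_val)  # shape (200, 4)
```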

We transformed the features (log or cube root) to make their distributions approximately Gaussian, and normalized the 21 feature columns we obtained after removing and combining some of the input data, as explained below.

Strategy to decide the final list

The output of the xgboost classifier is an array of 4 probabilities per entry, corresponding to how likely the model considers the classes “None”, “Supp”, “Elite”, and “Credit” respectively.

We removed the “None” probability from the entire array of predictions (called y_final, corresponding to the input data stored in X_final) and sorted y_final by the maximum of the remaining three probabilities in each entry.

This ensured that the entries the xgboost classifier was most sure about sat at the top of the calling list. For each entry, we predicted the class with the largest of the remaining three probabilities.
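
A minimal sketch of this ranking step, assuming y_final is the (n, 4) probability array from the classifier with column 0 holding the “None” probability (the column order is an assumption):

```python
import numpy as np

# y_final: (n, 4) array of class probabilities per entry,
# columns assumed to be ["None", "Supp", "Elite", "Credit"].
y_final = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.20, 0.10, 0.10],
    [0.05, 0.10, 0.05, 0.80],
])

card_probs = y_final[:, 1:]          # drop the "None" column
confidence = card_probs.max(axis=1)  # best remaining card probability
order = np.argsort(-confidence)      # most confident entries first

# Predicted card for each entry: argmax over the three card classes.
card_names = np.array(["Supp", "Elite", "Credit"])
predictions = card_names[card_probs.argmax(axis=1)]

for i in order:
    print(i, predictions[i], confidence[i])
```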

Details of each variable used in the logic/model/strategy

  • We removed outliers from the dataset based on very high values of Electronics, Travel, Household, Car, Retail, and Total Expenditures. We also removed samples with unusually small income.

  • We combined the quarterly Electronics, Travel, Household, Car, Retail, and Total Expenditures figures into yearly features by summing them (see the feature-engineering sketch after this list).

  • We imputed the missing values of Income.

  • We dropped the Industry Code column because it had too many missing values, which threw our classifier off. Card Product Type had ‘Charge’ in every entry, so it carried no information and was dropped as well. Customer Spending Capacity was missing ~24,000 of ~40,000 values; we did not impute it with its mean, median, or mode because it had too much variability, and estimating 24k values from only 16k observed ones would give poor estimates. So we dropped this column too.

  • Indicators of extension/acceptance for supplementary, elite, and credit cards were combined into three features, one per card type: (number of times the customer accepted) / (number of times an offer was extended). This gives an intuition of the probability that the customer will accept a specific card if we offer it. Where the ratio exceeded 1 (accepted cannot be more than extended), we took its reciprocal, reasoning that such entries were likely data-entry mistakes (see the sketch after this list).

  • All variables except mvar3, mvar10, and mvar41-51 were transformed using log, while mvar3 was transformed using the cube-root function. Each column was then standardized: its mean was subtracted and the result divided by its standard deviation. This produced normalized, approximately Gaussian feature distributions for better optimization and classification (a sketch follows the list).

  • The xgboost classifier worked best with the following parameters (see the final sketch after this list):

    • Objective: softprob (xgboost's multiclass objective, similar to the softmax function but returning per-class probabilities)
    • Learning rate: 0.1
    • Number of estimators: 1000
    • min_child_weight: 5
    • Maximum depth (per tree): 5
    • subsample: 0.8
    • col_pos_weight: 0.8
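
A minimal pandas sketch of the yearly aggregation and the acceptance-ratio features; every column name below is a hypothetical stand-in for the competition's mvar* variables:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # Hypothetical quarterly spend columns for one category.
    "travel_q1": [100, 50], "travel_q2": [80, 60],
    "travel_q3": [90, 40], "travel_q4": [70, 55],
    # Hypothetical offer counters for the supplementary card.
    "supp_extended": [4, 2],
    "supp_accepted": [1, 3],  # 3 > 2: assumed to be a typo in the data
})

# Quarterly -> yearly: sum the four quarters into one feature.
quarters = ["travel_q1", "travel_q2", "travel_q3", "travel_q4"]
df["travel_yearly"] = df[quarters].sum(axis=1)

# Acceptance ratio, guarding against division by zero.
ratio = df["supp_accepted"] / df["supp_extended"].replace(0, np.nan)

# Accepted cannot exceed extended, so a ratio > 1 is treated as a
# data-entry mistake and replaced by its reciprocal.
df["supp_accept_rate"] = ratio.where(ratio <= 1, 1 / ratio)

print(df[["travel_yearly", "supp_accept_rate"]])
```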
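A sketch of the transformation/standardization step, under stated assumptions: cuberoot_cols and skip_cols are filled with the names from the bullet above, and log1p stands in for a bare log to sidestep log(0):

```python
import numpy as np
import pandas as pd

def gaussianize(df, cuberoot_cols=("mvar3",),
                skip_cols=("mvar10",)):  # mvar41-51 would go here too
    """Log-transform most columns, cube-root the listed ones,
    then standardize every column to zero mean, unit variance."""
    out = df.copy()
    for col in out.columns:
        if col in cuberoot_cols:
            out[col] = np.cbrt(out[col])
        elif col not in skip_cols:
            out[col] = np.log1p(out[col])  # assumption: repo may use plain log
        # Standardization applies to every column, transformed or not.
        out[col] = (out[col] - out[col].mean()) / out[col].std()
    return out
```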
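And a sketch of the classifier with these settings. Note that 'col_pos_weight' is not an actual xgboost parameter name; colsample_bytree is used below as the closest match, and that mapping is a guess:

```python
import xgboost as xgb

clf = xgb.XGBClassifier(
    objective="multi:softprob",  # per-class probabilities
    learning_rate=0.1,
    n_estimators=1000,
    min_child_weight=5,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,  # assumed reading of "col_pos_weight=0.8"
)
# clf.fit(X_final, y)  # hypothetical training-data names
```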


Github repos of similar Data Science Competitions:

Please star the repo if you found its materials useful :)

