zhwj7552 / kaggle_credit_default_loan

This project forked from jk34/kaggle_credit_default_loan

My work using Python on data from a Kaggle competition on credit scoring to predict defaults

Kaggle_Credit_Default_Loan

I am interested in the data sets that banks use to score their customers' credit and to predict the likelihood that a customer will default on a potential loan. The data set was obtained from: https://www.kaggle.com/c/GiveMeSomeCredit

I used Python to work on this data set, starting with Logistic Regression. Because the training set has many NA values, I first drop every entry that contains any NA value.
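The row-dropping step could look roughly like the sketch below. It assumes the data is held in a pandas DataFrame; the toy frame, its values, and the variable names `toy` and `clean` are made up for illustration, and the column names are taken from the Kaggle data set rather than from this README.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training set; all values are invented for illustration.
toy = pd.DataFrame({
    "SeriousDlqin2yrs": [0, 1, 0, 0],
    "age": [45, 40, 38, 30],
    "MonthlyIncome": [9120.0, np.nan, 3042.0, 3300.0],
    "NumberOfDependents": [2.0, 1.0, np.nan, 0.0],
})

# Count missing values per column before deciding how to handle them.
na_counts = toy.isna().sum()

# Drop every row that has at least one missing value.
clean = toy.dropna(how="any").reset_index(drop=True)

print(na_counts)
print(len(toy), "rows before,", len(clean), "rows after")
```

Dropping rows is the simplest option, but it discards every other field of the affected entries, so it is worth checking how many rows are lost before committing to it.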

The "NumberOfTime30-59DaysPastDueNotWorse" variable contains the values 96 and 98, which I treat as typos, so I replace them with the median of that variable.
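A minimal sketch of that replacement, assuming a pandas DataFrame. The toy values are invented, and computing the median only from the rows that do not hold 96 or 98 is my assumption; the README does not say which rows the median was taken over.

```python
import pandas as pd

col = "NumberOfTime30-59DaysPastDueNotWorse"

# Toy column with two suspect entries (96 and 98); values are made up.
toy = pd.DataFrame({col: [0, 1, 0, 98, 2, 96, 0]})

# Flag the suspect codes, then take the median of the remaining rows
# (assumption: the suspect values are excluded from the median).
suspect = toy[col].isin([96, 98])
median_value = toy.loc[~suspect, col].median()

# Replace the flagged entries with that median.
toy[col] = toy[col].mask(suspect, median_value)

print(toy[col].tolist())
```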

I then generate histograms of the 'age' (hist_age.pdf) and 'NumberOfTime30-59DaysPastDueNotWorse' (hist_late.pdf) variables.

I can also visualize the data with a KDE (kernel density estimate) plot, which is similar to a histogram but smooth: it places a Gaussian kernel on each data point and sums the kernels to estimate the probability density. The resulting plot is "KDE.pdf".
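To make the idea behind a KDE concrete, here is a small NumPy-only sketch. The function name `gaussian_kde_1d`, the sample ages, and the bandwidth of 5 are all illustrative assumptions; plotting libraries choose the bandwidth automatically and this is not the code that produced KDE.pdf.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] is the standardized distance from grid point i to sample j.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # Averaging (rather than summing) keeps the total area equal to 1.
    return kernels.mean(axis=1)

ages = np.array([25, 31, 38, 40, 45, 52, 60, 67])  # made-up ages
grid = np.linspace(0, 100, 1001)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# A density should integrate to roughly 1 over a wide enough grid.
area = np.sum(density) * (grid[1] - grid[0])
print(area)
```

A larger bandwidth gives a smoother curve that hides detail, while a smaller one follows individual points more closely.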

I also want to generate plots that show how many of the entries are defaults and non-defaults, along with factor plots (using the Seaborn package) that show how the defaults vary with the age and number of dependents of each person. The plot that just shows the number of defaults vs non-defaults is "default.pdf", the plot showing the defaults vs age is "default_age.pdf", and the plot showing the defaults vs number of dependents is "default_dependents.pdf".

I then use linear plots from Seaborn to see how the number of defaults correlates with the 'NumberOfTime30-59DaysPastDueNotWorse' variable. The plot showing the defaults vs "NumberOfTime30-59DaysPastDueNotWorse" is "late.pdf", the plot with bins is "late_bins.pdf", and the plot of defaults vs age with NumberOfTime30-59DaysPastDueNotWorse as the hue is "late_age_linear.pdf".

Now I will proceed with Logistic Regression. We first set the dependent variable to the default indicator (SeriousDlqin2yrs) and convert it into a 1-d array, as required by Scikit-learn:

    import numpy as np

    Y = trainDF.SeriousDlqin2yrs
    print("Y.head is ", Y.head())
    # need to convert this to a 1-d array to use with scikit-learn
    Y = np.ravel(Y)
    print("Y is ", Y)

We then compute the score when using Logistic Regression on the entire training set. It turns out to be 93.06%. However, this is only a marginal improvement over the actual percentage of non-defaults in the dataset, which is 93.05%, so the model does barely better than always predicting "no default".

To improve on this score, I will try regularization with the Lasso (l1) penalty. We now split the training set into a training set and a validation/test set. With Scikit-learn's default settings, 75% of the original set becomes the new training set and the remaining 25% becomes the validation set.

The resulting plot ("l1_coefs.pdf") shows the coefficients as a function of the log of C, where C = 1/lambda and lambda is the strength of the penalty. The greater the lambda, the more the coefficients of the predictors tend towards 0, thus eliminating the irrelevant predictors. Many of the coefficients go towards 0 as C approaches 0 (or lambda approaches infinity). The accuracy scores are 1.0 for C values = 1, 316.2, 100000, 3.16e7, etc. However, the score is .9301 when C=1e-5 and .99963 when C=.003 (logC = -2.5). From the plot, it is hard to determine which predictors become 0 as C decreases. The most relevant predictors appear to be DebtRatio, age, NumberRealEstateLoansOrLines, and NumberOfOpenCreditLinesAndLoans.
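The regularization path described above can be sketched as follows. This is not the original script: it runs on synthetic data from `make_classification` (roughly 93% negatives, to mimic the class imbalance), and the feature standardization, the `liblinear` solver, and the grid of C values are my assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit data (not the Kaggle set).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           n_redundant=0, weights=[0.93], random_state=0)

# With no size given, train_test_split holds out 25% of the rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize so the l1 penalty treats all features on the same scale
# (an assumption; the README does not mention scaling).
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Accuracy of always predicting the majority class, for comparison.
baseline = max(np.mean(y_test == 0), np.mean(y_test == 1))

results = []
for C in np.logspace(-5, 5, 6):
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X_train_s, y_train)
    n_nonzero = int(np.count_nonzero(model.coef_))
    results.append((C, n_nonzero, model.score(X_test_s, y_test)))

print("majority-class baseline:", baseline)
for C, n_nonzero, acc in results:
    print("C=%g  nonzero coefficients=%d  accuracy=%.4f" % (C, n_nonzero, acc))
```

Counting the nonzero coefficients at each C gives a numeric answer to the question the plot leaves open, namely which predictors survive as the penalty grows. Comparing each score with the majority-class baseline also shows how much of the accuracy comes from class imbalance alone.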

I will then use only these relevant predictors in another Logistic Regression. However, with just these predictors, the accuracy drops to .9277, which is below the 93.05% share of non-defaults.
