
Tracking, notes and programming snippets while learning predictive analytics


Predictive Analytics with Python

These are my notes from working through the book Learning Predictive Analytics with Python by Ashish Kumar, published in February 2016.

General

###Chapter 1: Getting Started with Predictive Modelling

  • Installed the Anaconda package.
  • Python 3.5 is installed.
  • The book follows Python 2, so some code is modified along the way for Python 3.

###Chapter 2: Data Cleaning

  • Reading the data: variations and examples
  • Data frames and delimiters.

####Case 1: Reading a dataset using the read_csv method

  • File: titanicReadCSV.py
  • File: titanicReadCSV1.py
  • File: readCustomerChurn.py
  • File: readCustomerChurn2.py
  • File: changeDelimiter.py

####Case 2: Reading a dataset using the open method of Python

  • File: readDatasetByOpenMethod.py

####Case 3: Reading data from a URL

  • Modified the code so that it works and prints the dataset line by line as dictionaries.
  • File: readURLLib2Iris.py
  • File: readURLMedals.py

####Case 4: Miscellaneous cases

  • File: readXLS.py
  • Created the file above to read from both .xls and .xlsx

####Basics: Summary, dimensions, and structure

  • File: basicDataCheck.py
  • Created the file above to read from both .xls and .xlsx

####Handling missing values

  • File: basicDataCheck.py
  • Treating missing data such as NaN or None
  • Deletion or imputation
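
The two treatments noted above, deletion and imputation, can be sketched with pandas. This is a minimal illustration and not the contents of basicDataCheck.py; the toy data frame and its values are made up, and mean imputation is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names resemble the Titanic dataset, values are invented.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
})

# Count the missing values in each column.
print(df.isnull().sum())

# Deletion: drop every row that has at least one missing value.
dropped = df.dropna(how="any")

# Imputation: replace each missing value with the mean of its column.
imputed = df.fillna(df.mean())

print(dropped)
print(imputed)
```

Deletion is simple but discards rows; imputation keeps every row at the cost of inventing values, so the choice depends on how much data is missing.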

####Creating dummy variables

  • File: basicDataCheck.py
  • Split into new variables 'sex_female' and 'sex_male'
  • Remove column 'sex'
  • Add both dummy columns created above.
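
The three steps above map onto a few pandas calls. The sketch below assumes a tiny made-up data frame with a 'sex' column; the actual script works on the Titanic data and may differ in detail.

```python
import pandas as pd

# Hypothetical miniature dataset with one categorical column.
df = pd.DataFrame({"name": ["A", "B", "C"], "sex": ["female", "male", "female"]})

# Step 1: split 'sex' into the dummy columns 'sex_female' and 'sex_male'.
dummy_sex = pd.get_dummies(df["sex"], prefix="sex")

# Step 2: remove the original 'sex' column.
df = df.drop(columns=["sex"])

# Step 3: add both dummy columns to the data frame.
df = pd.concat([df, dummy_sex], axis=1)

print(df.columns.tolist())
```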

####Visualizing a dataset by basic plotting

  • File: plotData.py
  • Figure file: ScatterPlots.jpeg
  • Plot types: scatter plots, histograms, and box plots

###Chapter 3: Data Wrangling

####Subsetting a dataset

  • Selecting Columns
  • File: subsetDataset.py
  • Selecting Rows
  • File: subsetDatasetRows.py
  • Selecting a combination of rows and columns
  • File: subsetColRows.py
  • Creating new columns
  • File: subsetNewCol.py

####Generating random numbers and their usage

  • Various methods for generating random numbers
  • File: generateRandomNumbers.py
  • Seeding a random number
  • File: generateRandomNumbers.py
  • Generating random numbers following probability distributions
  • File: generateRandomProbDistr.py
  • Probability density function (PDF): Prob(X = x) for a discrete variable; for a continuous variable, the density whose area over an interval gives the probability
  • Cumulative distribution function: CDF(x) = Prob(X <= x)
  • Uniform distribution: every value in the range occurs with the same (uniform) probability
  • Normal distribution: the bell curve, the most ubiquitous and versatile probability distribution
  • Using the Monte-Carlo simulation to find the value of pi
  • File: calcPi.py
  • Geometry and mathematics behind the calculation of pi
  • Generating a dummy data frame
  • File: generateDummyDataFrame.py
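
The Monte Carlo estimate of pi mentioned above rests on a simple ratio: points drawn uniformly in the unit square fall inside the quarter circle of radius 1 with probability pi/4. The following is a sketch using only the standard library, not the contents of calcPi.py; the function name and the seed are assumptions chosen for reproducibility.

```python
import random


def estimate_pi(n_points, seed=0):
    """Estimate pi from the share of random points inside a quarter circle."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()  # uniform point in the unit square
        if x * x + y * y <= 1.0:           # inside the quarter circle of radius 1
            inside += 1
    # (quarter circle area) / (square area) = pi / 4
    return 4.0 * inside / n_points


pi_estimate = estimate_pi(100_000)
print(pi_estimate)
```

The error shrinks roughly with the square root of the number of points, so each extra digit of accuracy costs about 100 times more samples.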

####Grouping the data – aggregation, filtering, and transformation

  • File: groupData.py
  • Grouping
  • Aggregation
  • Filtering
  • Transformation
  • Miscellaneous operations
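
As a rough illustration of the grouping, aggregation, filtering, and transformation steps listed above, here is a small pandas sketch. The data frame, the column names, and the threshold of 170 are made up for the example and are not taken from groupData.py.

```python
import pandas as pd

# Hypothetical data: two groups with a numeric measurement.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "height": [160.0, 175.0, 165.0, 180.0, 170.0],
})

# Grouping: split the rows by the values of one column.
grouped = df.groupby("gender")

# Aggregation: one summary row per group.
agg = grouped["height"].agg(["mean", "min", "max"])

# Filtering: keep only the groups whose mean height exceeds 170.
tall = grouped.filter(lambda g: g["height"].mean() > 170)

# Transformation: center each height on its own group mean (same length as df).
centered = grouped["height"].transform(lambda s: s - s.mean())

print(agg)
print(tall)
print(centered)
```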

####Random sampling – splitting a dataset into training and testing datasets

  • File: splitDataTrainTest.py
  • Method 1: using the Customer Churn Model
  • Method 2: using sklearn
  • Method 3: using the shuffle function
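
The shuffle-based approach (Method 3) can be sketched with the standard library alone. The helper name, the 80/20 ratio, and the seed below are assumptions for illustration; splitDataTrainTest.py may shuffle a data frame rather than a plain list.

```python
import random


def train_test_split_shuffle(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into training and testing parts."""
    shuffled = list(rows)                     # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)     # seeded shuffle for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for the records of a dataset
train, test = train_test_split_shuffle(rows)
print(len(train), len(test))
```

Shuffling before cutting matters when the file is sorted, for example by date or by class label, because a plain head/tail split would then give unrepresentative sets.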

####Concatenating and appending data

  • File: concatenateAndAppend.py
  • File: appendManyFiles.py

####Merging/joining datasets

  • File: mergeJoin.py
  • Inner Join
  • Left Join
  • Right Join
  • An example of the Inner Join
  • An example of the Left Join
  • An example of the Right Join
  • Summary of Joins in terms of their length
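
The difference in result length between the three joins is easiest to see on two tiny tables. This sketch uses made-up frames with a shared 'id' key and is not the code in mergeJoin.py.

```python
import pandas as pd

# Hypothetical tables: ids 3 and 4 appear in both, 1 and 2 only on the left, 5 only on the right.
left = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]})
right = pd.DataFrame({"id": [3, 4, 5], "score": [30, 40, 50]})

# Inner join: only keys present in both tables.
inner = pd.merge(left, right, on="id", how="inner")

# Left join: every key from the left table, with NaN where the right has no match.
left_join = pd.merge(left, right, on="id", how="left")

# Right join: every key from the right table, with NaN where the left has no match.
right_join = pd.merge(left, right, on="id", how="right")

print(len(inner), len(left_join), len(right_join))
```

With unique keys, the inner join is never longer than either input, while the left and right joins match the length of the left and right table respectively.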

###Chapter 4: Statistical Concepts for Predictive Modelling

####Random sampling and central limit theorem

####Hypothesis testing

  • Null versus alternate hypothesis
  • Z-statistic and t-statistic
  • Confidence intervals, significance levels, and p-values
  • Different kinds of hypothesis test
  • A step-by-step guide to do a hypothesis test
  • An example of a hypothesis test
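
A two-sided one-sample z-test is one way to make the steps above concrete. The numbers below are invented for illustration and are not the book's example; the sketch assumes the population standard deviation is known, which is what separates the z-statistic from the t-statistic.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, pop_mean, pop_std, n):
    """Return the z-statistic and two-sided p-value for H0: mean == pop_mean."""
    z = (sample_mean - pop_mean) / (pop_std / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # area in both tails
    return z, p_value


# Hypothetical figures: sample of 25 with mean 52, against a claimed mean of 50 (sigma = 5).
z, p = one_sample_z_test(sample_mean=52.0, pop_mean=50.0, pop_std=5.0, n=25)

alpha = 0.05          # significance level
reject = p < alpha    # reject the null hypothesis when p is below alpha
print(z, p, reject)
```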

####Chi-square testing

####Correlation

  • File: linearRegression.py
  • File: linearRegressionFunction.py
  • Picture: TVSalesCorrelationPlot.png
  • Picture: RadioSalesCorrelationPlot.png
  • Picture: NewspaperSalesCorrelationPlot.png

###Chapter 5: Linear Regression with Python

####Understanding the maths behind linear regression

  • Linear regression using simulated data
  • File: linearRegression.py
  • Picture: CurrentVsPredicted1.png
  • Picture: CurrentVsPredictedVsMean1.png
  • Picture: CurrentVsPredictedVsModel1.png

####Making sense of result parameters

  • File: linearRegression.py
  • p-values
  • F-statistics
  • Residual Standard Error (RSE)

####Implementing linear regression with Python

  • File: linearRegressionSMF.py
  • Linear regression using the statsmodels library
  • Multiple linear regression
  • Multicollinearity: correlated predictors lead to sub-optimal performance of the model
  • Variance Inflation Factor
  • It quantifies the rise in the variability of a variable's coefficient estimate caused by high correlation between two or more predictor variables.
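
The Variance Inflation Factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes it with NumPy on simulated data; the helper name and the data are assumptions, and linearRegressionSMF.py may compute it differently.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]                                   # predictor treated as the response
        others = np.delete(X, j, axis=1)              # all remaining predictors
        A = np.column_stack([np.ones(n), others])     # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares fit
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()              # R^2 of the auxiliary regression
        vifs.append(1.0 / (1.0 - r2))
    return vifs


# Simulated predictors: x2 is almost a copy of x1, x3 is independent of both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

A VIF near 1 means the predictor is hardly explained by the others; large values flag predictors that are candidates for removal.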

####Model validation

  • Training and testing data split
  • File: linearRegressionSMF.py
  • Linear regression with scikit-learn
  • File: linearRegressionSKL.py
  • Feature selection with scikit-learn
  • Recursive Feature Elimination (RFE)
  • File: linearRegressionRFE.py

####Handling other issues in linear regression

  • Handling categorical variables
  • File: linearRegressionECom.py
  • Transforming a variable to fit non-linear relations
  • File: nonlinearRegression.py
  • Picture: MPGVSHorsepower.png
  • Picture: MPGVSHorsepowerVsLine.png
  • Picture: MPGVSHorsepowerModels.png
  • Handling outliers
  • Other considerations and assumptions for linear regression

###Chapter 6: Logistic Regression with Python

####Linear regression versus logistic regression

####Understanding the math behind logistic regression

  • File: logisticRegression.py
  • Contingency tables
  • Conditional probability
  • Odds ratio
  • Moving on to logistic regression from linear regression
  • Estimation using the Maximum Likelihood Method
  • Building the logistic regression model from scratch
  • File: logisticRegressionScratch.py
  • Read above again.
  • Making sense of logistic regression parameters
  • Wald test
  • Likelihood Ratio Test statistic
  • Chi-square test
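
The chain from contingency table to conditional probability to odds ratio, and from there to the logistic function, can be followed on a small example. The table below is hypothetical and the helper names are assumptions; this is not the code in logisticRegression.py.

```python
from math import exp, log

# Hypothetical 2x2 contingency table: counts of (group, outcome).
table = {
    ("male", "buy"): 30, ("male", "no_buy"): 70,
    ("female", "buy"): 60, ("female", "no_buy"): 40,
}


def conditional_probability(table, group, outcome):
    """P(outcome | group) computed from the counts in the table."""
    total = sum(count for (g, _), count in table.items() if g == group)
    return table[(group, outcome)] / total


def odds(p):
    """Odds of an event that has probability p."""
    return p / (1 - p)


def logistic(z):
    """Map log-odds z back to a probability between 0 and 1."""
    return 1 / (1 + exp(-z))


p_male = conditional_probability(table, "male", "buy")
p_female = conditional_probability(table, "female", "buy")
odds_ratio = odds(p_female) / odds(p_male)

print(p_male, p_female, odds_ratio)
print(logistic(log(odds(p_female))))  # recovers p_female
```

Logistic regression models the log-odds as a linear function of the predictors, which is why the logistic function appears as the inverse of the log-odds transform.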

####Implementing logistic regression with Python

  • File: logisticRegressionImplementation.py
  • Processing the data
  • Data exploration
  • Data visualization
  • Creating dummy variables for categorical variables
  • Feature selection
  • Implementing the model

####Model validation and evaluation

  • File: logisticRegressionImplementation.py
  • Cross validation

####Model validation

  • File: logisticRegressionImplementation.py
  • The ROC curve {see terms}

###Chapter 7: Clustering with Python

####Introduction to clustering – what, why, and how?

  • What is clustering?
  • How is clustering used?
  • Why do we do clustering?

####Mathematics behind clustering

  • Distances between two observations
  • Euclidean distance
  • Manhattan distance
  • Minkowski distance
  • The distance matrix
  • Normalizing the distances
  • Linkage methods
  • Single linkage
  • Complete linkage
  • Average linkage
  • Centroid linkage
  • Ward's method (based on analysis of variance, ANOVA)
  • Hierarchical clustering
  • K-means clustering
  • File: kMeanClustering.py
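
The three distances listed above are related: the Minkowski distance with p = 1 is the Manhattan distance, and with p = 2 it is the Euclidean distance. A minimal sketch with made-up points follows; the function names are assumptions and not taken from kMeanClustering.py.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def distance_matrix(points, p=2):
    """Symmetric matrix of pairwise distances between all points."""
    return [[minkowski_distance(a, b, p) for b in points] for a in points]


a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski_distance(a, b, 1)  # |3| + |4|
euclidean = minkowski_distance(a, b, 2)  # sqrt(3^2 + 4^2)

matrix = distance_matrix([a, b, (3.0, 0.0)])
print(manhattan, euclidean)
print(matrix)
```

Because these distances depend on the scale of each variable, the values are usually normalized first, as the list above notes.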

####Implementing clustering using Python

  • File: clusterWine.py
  • Importing and exploring the dataset
  • Normalizing the values in the dataset
  • Hierarchical clustering using scikit-learn
  • K-Means clustering using scikit-learn
  • Interpreting the cluster

####Fine-tuning the clustering

  • The elbow method
  • Silhouette Coefficient

###Chapter 8: Trees and Random Forests with Python

####Introducing decision trees

  • A decision tree

####Understanding the mathematics behind decision trees

  • Homogeneity
  • Entropy
  • Information gain
  • ID3 algorithm to create a decision tree
  • Gini index
  • Reduction in Variance
  • Pruning a tree
  • Handling a continuous numerical variable
  • Handling a missing value of an attribute
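
Entropy, information gain, and the Gini index all measure how mixed the class labels in a node are. The sketch below computes them for a hypothetical split of ten labels; it illustrates the quantities only and is not an implementation of ID3.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 5 'yes' and 5 'no', split into two mostly pure children.
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

gain = information_gain(parent, children)
print(entropy(parent), gini(parent), gain)
```

A perfectly homogeneous node has entropy and Gini index of 0, so a tree-building algorithm prefers the split with the largest gain.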

####Implementing a decision tree with scikit-learn

  • File: decisionTreeIris.py
  • Visualizing the tree
  • Picture: dtree2.png
  • File: dtree2.dot
  • Cross-validating and pruning the decision tree

####Understanding and implementing regression trees

  • File: regressionTree.py
  • Regression tree algorithm
  • Implementing a regression tree using Python

####Understanding and implementing random forests

  • File: randomForest.py
  • The random forest algorithm
  • Implementing a random forest using Python
  • Why do random forests work?
  • Important parameters for random forests

###Chapter 9: Best Practices for Predictive Modelling

####Best practices for coding

  • Commenting the code
  • Defining functions for substantial individual tasks
  • Example 1
  • Example 2
  • Example 3
  • Avoid hard-coding of variables as much as possible
  • Version control
  • Using standard libraries, methods, and formulas

####Best practices for data handling

####Best practices for algorithms

####Best practices for statistics

####Best practices for business contexts
