Ordinary Least Squares in Statsmodels (OLS) - Lab

Introduction

Previously, you looked at all the requirements for running an OLS simple linear regression using Statsmodels. You worked with the height-weight dataset to understand the process and the steps that must be performed. In this lab, you'll explore a slightly more complex example to study how spending on different advertising channels affects total sales.

Objectives

You will be able to:

  • Perform a linear regression using statsmodels
  • Evaluate a linear regression model using statistical performance metrics for the overall model and for specific parameters
  • Determine if a particular set of data exhibits the assumptions of linear regression

Let's get started

In this lab, you'll work with the "Advertising Dataset", which is a very popular dataset for studying simple regression. The dataset is available on Kaggle, but we have downloaded it for you. It is available in this repository as advertising.csv. You'll use this dataset to answer this question:

Which advertising channel has the strongest relationship with sales volume, and can be used to model and predict the sales?

Step 1: Read the dataset and inspect its columns and 5-point statistics

# Load necessary libraries and import the data
# Check the columns and first few rows
# Get the 5-point statistics for data 
# Describe the contents of this dataset
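
One possible way to approach this step is sketched below. It assumes the CSV has columns named TV, radio, newspaper and sales (the exact names and casing in advertising.csv are an assumption). The rows are made up so the snippet runs without the file; in the lab you would pass the file path to pd.read_csv instead.

```python
import io
import pandas as pd

# Made-up rows standing in for advertising.csv (NOT the real data).
csv_text = """TV,radio,newspaper,sales
100,10,20,12
200,20,5,18
50,30,40,9
250,5,15,20
150,25,60,16
"""

# In the lab, use the file instead: pd.read_csv("advertising.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())  # column names
print(df.head())            # first few rows

# describe() reports count, mean, std, min, quartiles and max;
# the five-number summary is the min, 25%, 50%, 75% and max rows.
summary = df.describe()
five_point = summary.loc[["min", "25%", "50%", "75%", "max"]]
print(five_point)
```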

Step 2: Plot histograms with kde overlay to check the distribution of the predictors

# For all the variables, check distribution by creating a histogram with kde
# Record your observations here 
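
A minimal sketch of a histogram with a KDE overlay, assuming matplotlib and SciPy are available. The data is synthetic (newspaper is drawn from a right-skewed distribution on purpose) and the column names are assumptions; it only illustrates the plotting pattern, not the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Synthetic stand-in for the predictors (NOT the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "radio": rng.uniform(0, 50, 200),
    "newspaper": rng.exponential(30, 200),  # right-skewed on purpose
})

fig, axes = plt.subplots(1, len(df.columns), figsize=(12, 3))
for ax, col in zip(axes, df.columns):
    values = df[col].to_numpy()
    # density=True puts the histogram on the same scale as the KDE curve
    ax.hist(values, bins=20, density=True, alpha=0.5)
    grid = np.linspace(values.min(), values.max(), 200)
    ax.plot(grid, gaussian_kde(values)(grid))
    ax.set_title(f"{col} (skew = {df[col].skew():.2f})")
fig.tight_layout()
```

Printing the sample skewness in each title gives a number to record alongside the visual impression.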

Step 3: Test for the linearity assumption

Use scatterplots to plot each predictor against the target variable

# visualize the relationship between the predictors and the target using scatterplots
# Record your observations on linearity here 
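
The scatterplot check could look roughly like this. The data is synthetic, built so that sales depends on TV and radio but not on newspaper; the column names and coefficients are illustrative assumptions. The Pearson correlations are an optional numeric companion to the plots and only capture linear association.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "radio": rng.uniform(0, 50, n),
    "newspaper": rng.exponential(30, n),
})
df["sales"] = 7 + 0.05 * df["TV"] + 0.1 * df["radio"] + rng.normal(0, 2, n)

predictors = ["TV", "radio", "newspaper"]
fig, axes = plt.subplots(1, len(predictors), figsize=(12, 3), sharey=True)
for ax, col in zip(axes, predictors):
    ax.scatter(df[col], df["sales"], s=10)  # one panel per predictor
    ax.set_xlabel(col)
axes[0].set_ylabel("sales")
fig.tight_layout()

# Linear correlation of each predictor with the target
correlations = df[predictors].corrwith(df["sales"])
print(correlations.round(2))
```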

Conclusion so far

Based on the initial checks above, TV and radio appear to be reasonable predictors for our regression analysis. Newspaper is very heavily skewed and doesn't show any clear linear relationship with the target.

We'll move ahead with our analysis using TV and radio, and rule out newspaper because we believe it violates the linearity assumption.

Note: Skewness can be dealt with using techniques like log transformation, which pulls in a long tail and "pushes" the peak towards the center of the distribution. You'll learn about this later on.
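
As a rough illustration of the note above, the sketch below applies a log transform to a synthetic right-skewed sample and compares the skewness before and after. The skewness helper and the sample are hypothetical; log1p is used here only because it stays defined when a value is zero.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment (population form)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed "spend" values (NOT the real newspaper column).
rng = np.random.default_rng(42)
spend = rng.lognormal(mean=3.0, sigma=0.8, size=1000)

log_spend = np.log1p(spend)  # log(1 + x)

print("skew before:", round(skewness(spend), 2))
print("skew after: ", round(skewness(log_spend), 2))
```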

Step 4: Run a simple regression in Statsmodels with TV as a predictor

# import libraries

# build the formula 

# create a fitted model in one line
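
A minimal sketch of fitting the model with the Statsmodels formula API. The data is synthetic, generated with a known slope of 0.05 and intercept of 7 so that the fitted values can be sanity-checked; the column names sales and TV are assumptions about the real file.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with a known linear relationship (NOT the real dataset).
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, 200)
df = pd.DataFrame({"TV": tv, "sales": 7 + 0.05 * tv + rng.normal(0, 2, 200)})

# Formula syntax is "target ~ predictor"; an intercept is included automatically.
formula = "sales ~ TV"

# Build and fit the model in one line
model = smf.ols(formula=formula, data=df).fit()

print(model.params)  # Intercept and TV slope
```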

Step 5: Get Regression Diagnostics Summary

Note that the coefficients represent associations, not causation.
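
The diagnostics can be read from the full summary table or pulled out as individual attributes. The sketch below uses synthetic data again (names and numbers are illustrative assumptions), so the values it prints say nothing about the real advertising data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, 200)
df = pd.DataFrame({"TV": tv, "sales": 7 + 0.05 * tv + rng.normal(0, 2, 200)})
model = smf.ols("sales ~ TV", data=df).fit()

# Full diagnostics table
print(model.summary())

# Individual pieces that are often recorded
r_squared = model.rsquared                    # coefficient of determination
slope = model.params["TV"]                    # estimated change in sales per unit of TV
slope_pvalue = model.pvalues["TV"]            # p-value for the null hypothesis slope == 0
ci_low, ci_high = model.conf_int().loc["TV"]  # 95% confidence interval by default
print(r_squared, slope, slope_pvalue, (ci_low, ci_high))
```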

Step 6: Draw a prediction line with data points on a scatter plot for X (TV) and Y (Sales)

Hint: You can use the model.predict() function to predict the start and end points of the regression line for the minimum and maximum values in the 'TV' variable.

# create a DataFrame with the minimum and maximum values of TV

# make predictions for those x values and store them


# first, plot the observed data and the least squares line
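
Following the hint, one way to draw the line is to predict at the two extremes of TV and connect them. This is a sketch on synthetic data; X_new and preds are illustrative names, and the column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, 200)
df = pd.DataFrame({"TV": tv, "sales": 7 + 0.05 * tv + rng.normal(0, 2, 200)})
model = smf.ols("sales ~ TV", data=df).fit()

# DataFrame with the minimum and maximum values of TV
X_new = pd.DataFrame({"TV": [df["TV"].min(), df["TV"].max()]})

# Predictions at those two x values; two points are enough to define a line
preds = model.predict(X_new)

# Observed data plus the least squares line
fig, ax = plt.subplots()
ax.scatter(df["TV"], df["sales"], s=10, label="observed")
ax.plot(X_new["TV"], preds, color="red", linewidth=2, label="least squares line")
ax.set_xlabel("TV")
ax.set_ylabel("sales")
ax.legend()
```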

Step 7: Visualize the error term for variance and heteroscedasticity

# Record your observations on heteroscedasticity
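
A simple way to look at the error term is to plot the residuals against the predictor and check whether their spread changes. The sketch below builds synthetic data whose noise grows with TV, so the funnel shape is there by construction. The Breusch-Pagan test at the end is an optional extra that this lab does not ask for; it is included only as a numeric companion to the plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data whose noise standard deviation increases with TV (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, n)
noise = rng.normal(0, 1, n) * (0.5 + 0.02 * tv)
df = pd.DataFrame({"TV": tv, "sales": 7 + 0.05 * tv + noise})
model = smf.ols("sales ~ TV", data=df).fit()

# Residuals vs. predictor: a widening band suggests heteroscedasticity
fig, ax = plt.subplots()
ax.scatter(df["TV"], model.resid, s=10)
ax.axhline(0, color="red", linewidth=1)
ax.set_xlabel("TV")
ax.set_ylabel("residual")

# Optional: Breusch-Pagan test (null hypothesis: constant residual variance)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan LM p-value:", lm_pvalue)
```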

Step 8: Check the normality assumptions by creating a QQ-plot

# Code for QQ-plot here
# Record your observations on the normality assumption

Step 9: Repeat the above for radio and record your observations

# code for model, prediction line plot, heteroscedasticity check and QQ normality check here
model.summary()
# Record your observations here for goodness of fit 
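
To avoid copying the whole workflow for each predictor, the fit can be wrapped in a small helper and the models compared side by side. The helper name fit_simple is hypothetical, and the data is synthetic (built so that TV explains more of sales than radio does), so the printed numbers are not results for the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_simple(data, predictor, target="sales"):
    """Fit a one-predictor OLS model (hypothetical helper for this sketch)."""
    return smf.ols(f"{target} ~ {predictor}", data=data).fit()

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "radio": rng.uniform(0, 50, n),
})
df["sales"] = 7 + 0.05 * df["TV"] + 0.1 * df["radio"] + rng.normal(0, 2, n)

# One fitted model per predictor
results = {name: fit_simple(df, name) for name in ["TV", "radio"]}

for name, fitted in results.items():
    print(name, "R-squared:", round(fitted.rsquared, 3),
          "slope:", round(fitted.params[name], 3))
```

The same residual and QQ-plot checks can then be run on results["radio"] exactly as they were for TV.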

The Answer

Based on the above analysis, you can conclude that neither of the two chosen predictors is ideal for modeling a relationship with sales volume. Newspaper clearly violated the linearity assumption. Neither TV nor radio produced a high coefficient of determination, although TV performed slightly better than radio. There is obvious heteroscedasticity in the residuals for both variables.

We can look for further data, perform extra preprocessing, or use more advanced techniques.

Remember that there are many techniques we can employ to address these problems in the data.

Whether we should call TV the "best predictor" or label all of them "equally useless" is a domain-specific question, and a marketing manager would have a better opinion on how to move forward in this situation.

In the following lesson, you'll look in more detail at interpreting the regression diagnostics and at confidence in the model.

Summary

In this lab, you ran a complete regression analysis with a simple dataset. You used Statsmodels to perform linear regression and evaluated your models using statistical metrics. You also checked the regression assumptions before and after the analysis phase. Finally, you created some visualizations of your models and checked their goodness of fit.
