In this lab we shall perform an exploratory data analysis (EDA) task using the statistical and visual EDA skills we have seen so far. We shall continue with the Walmart sales dataset that we acquired and cleaned in the previous labs.
You will be able to:
- Check the distribution of various columns
- Examine the descriptive statistics of our data set
- Create visualizations to help us better understand our data set
In the previous lab, we performed data cleaning and scrubbing activities to create a data subset, deal with null values, handle categorical variables, etc. In this lab, we shall perform basic data exploration to better understand the distributions of our variables. We shall keep the regression assumptions seen earlier in mind to guide the modeling process.
The dataset for this lab was taken from our data scrubbing lab, just before we one-hot encoded the categorical variables. This keeps the number of columns the same as in the original dataset, which is more convenient during exploration.
# Your code here
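As a sketch of this step: in the lab you would read the cleaned file from disk (the filename below is an assumption) and preview it with `.head()`. To keep the example self-contained, we parse a tiny in-memory sample that mirrors the first columns of the dataset instead:

```python
import io
import pandas as pd

# In the lab you would load the cleaned file from the previous lab, e.g.:
# df = pd.read_csv('walmart_dataset_cleaned.csv')  # filename is an assumption
# Self-contained stand-in: a tiny in-memory CSV with the same column names
sample = io.StringIO(
    "Store,Dept,Weekly_Sales,IsHoliday,Type,Size,Temperature\n"
    "1,1,24924.50,False,A,0.283436,-1.301205\n"
    "1,2,50605.27,False,A,0.283436,-1.301205\n"
)
df = pd.read_csv(sample)
print(df.head())   # first rows of the frame
print(df.shape)    # (rows, columns) of the sample
```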
| | Store | Dept | Weekly_Sales | IsHoliday | Type | Size | Temperature | Fuel_Price | CPI | Unemployment | binned_markdown_1 | binned_markdown_2 | binned_markdown_3 | binned_markdown_4 | binned_markdown_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 24924.50 | False | A | 0.283436 | -1.301205 | -1.56024 | 0.40349 | 0.913194 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 2 | 50605.27 | False | A | 0.283436 | -1.301205 | -1.56024 | 0.40349 | 0.913194 | NaN | NaN | NaN | NaN | NaN |
| 2 | 1 | 3 | 13740.12 | False | A | 0.283436 | -1.301205 | -1.56024 | 0.40349 | 0.913194 | NaN | NaN | NaN | NaN | NaN |
| 3 | 1 | 4 | 39954.04 | False | A | 0.283436 | -1.301205 | -1.56024 | 0.40349 | 0.913194 | NaN | NaN | NaN | NaN | NaN |
| 4 | 1 | 5 | 32229.38 | False | A | 0.283436 | -1.301205 | -1.56024 | 0.40349 | 0.913194 | NaN | NaN | NaN | NaN | NaN |
# your code here
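A minimal sketch of the descriptive-statistics step. A small synthetic frame stands in for the full dataset here so the example runs on its own; in the lab you would call `.describe()` on the loaded dataframe:

```python
import pandas as pd

# Small stand-in frame; in the lab, call .describe() on the full dataset
df = pd.DataFrame({
    'Weekly_Sales': [24924.50, 50605.27, 13740.12, 39954.04],
    'Temperature': [-1.30, -1.30, 0.50, 1.10],
})
stats = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(stats)
print(stats.loc['count', 'Weekly_Sales'])  # number of non-null Weekly_Sales values
```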
| | Store | Dept | Weekly_Sales | Size | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|---|---|
| count | 97839.000000 | 97839.000000 | 97839.000000 | 9.783900e+04 | 9.783900e+04 | 9.783900e+04 | 9.783900e+04 | 9.783900e+04 |
| mean | 5.474545 | 43.318861 | 17223.235591 | -8.044340e-14 | 2.339480e-13 | 4.784098e-13 | -9.181116e-15 | 1.795967e-12 |
| std | 2.892364 | 29.673645 | 25288.572553 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 |
| min | 1.000000 | 1.000000 | -1098.000000 | -1.611999e+00 | -3.843452e+00 | -1.691961e+00 | -1.958762e+00 | -2.776898e+00 |
| 25% | 3.000000 | 19.000000 | 2336.485000 | -1.028620e+00 | -7.087592e-01 | -1.053793e+00 | -1.266966e-01 | -6.503157e-01 |
| 50% | 6.000000 | 36.000000 | 7658.280000 | 2.834360e-01 | 1.340726e-01 | 1.180741e-01 | 4.995210e-01 | -4.621274e-02 |
| 75% | 8.000000 | 71.000000 | 20851.275000 | 1.113495e+00 | 8.680410e-01 | 8.243739e-01 | 6.346144e-01 | 7.089160e-01 |
| max | 10.000000 | 99.000000 | 693099.360000 | 1.171380e+00 | 1.738375e+00 | 2.745691e+00 | 8.517705e-01 | 2.361469e+00 |
# Your observations here
# Your code here
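One common way to check column distributions is a histogram per numeric column. The sketch below uses synthetic data (a right-skewed column imitating sales and a roughly normal one) so it runs on its own; in the lab you would call `.hist()` on the real dataframe:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Weekly_Sales': rng.lognormal(mean=9, sigma=1, size=500),  # right-skewed, like sales
    'Temperature': rng.normal(size=500),                        # roughly normal
})
axes = df.hist(bins=30, figsize=(10, 4))  # one histogram per numeric column
plt.tight_layout()
plt.savefig('distributions.png')
```

A skewed `Weekly_Sales` histogram like this is a typical sign that the normality assumption will be strained.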
# Your observations here
# Your code here
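Box plots are a quick complement to histograms for spotting outliers. Again a sketch over synthetic data; in the lab you would pass the real columns:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({'Weekly_Sales': rng.lognormal(9, 1, 300)})
ax = df.boxplot(column='Weekly_Sales')  # whiskers and points flag extreme values
plt.savefig('boxplot.png')
```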
# State your observations here
Let's use a slightly more advanced plotting technique in seaborn that combines a scatter plot, marginal distributions, a KDE, and a simple regression line, all in a single call. It's called a jointplot. See the official documentation for this method.
Here is how you would use it:
sns.jointplot(x=<column>, y=<column>, data=<dataset>, kind='reg')
A joint plot lets us visually inspect the linearity and normality assumptions in a single step.
# Your code here
# Provide your observations here
OK, so our key assumptions do not hold very strongly at this stage. But that does not mean we should give up and call it a poor dataset. There are a lot of pre-processing techniques we can still apply to further clean the data and make it more suitable for modeling.
For our initial model, we shall use this dataset in a multiple regression experiment; after inspecting the combined effect of all the predictors on the target, we may want to further pre-process the data and take it for another analytical ride.
The key takeaway here is that we will hardly ever come across a real-world dataset that meets all our expectations. Another reason to move ahead with this dataset is to help us realize the importance of pre-processing for improved model building. And we must always remember:
Model development is an iterative process. It hardly ever gets done in the first attempt.
With this in mind, we shall look at some guidelines on model building and validation in the upcoming lessons before we move on to our regression experiment.
Have a look at the following resources on how to deal with complex datasets that don't meet our initial expectations.
What to Do When Bad Data Thwarts Machine Learning Success
Practical advice for analysis of large, complex data sets
Data Cleaning Challenge: Scale and Normalize Data
In this lesson we performed some basic EDA on the Walmart dataset to check regression assumptions. Initially our assumptions don't hold very strongly, but we decided to move ahead with building our first model using this dataset and to plan further pre-processing in the following iterations.