In this lab we shall perform an exploratory data analysis (EDA) task using the statistical and visual EDA skills we have seen so far. We shall continue with the Walmart sales dataset that we acquired and cleaned in the previous labs.
You will be able to:
- Check the distribution of various columns
- Examine the descriptive statistics of our data set
- Create visualizations to help us better understand our data set
In the previous lab, we performed data cleaning and scrubbing activities to create a data subset, deal with null values, handle categorical variables, etc. In this lab, we shall perform basic data exploration to better understand the distributions of our variables. We shall keep the regression assumptions seen earlier in mind to guide the modeling process.
The dataset for this lab was taken from our data scrubbing lab, just before we one-hot encoded the categorical variables. This keeps the number of columns the same as in the original dataset, which is more convenient during exploration.
# Your code here
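As a sketch of this step: in the lab you would read the cleaned file from disk (the filename below is an assumption) and preview it with `.head()`. To keep the example self-contained, we parse a tiny in-memory sample that mirrors the first columns of the dataset instead:

```python
import io
import pandas as pd

# In the lab you would load the cleaned file from the previous lab, e.g.:
# df = pd.read_csv('walmart_dataset_cleaned.csv')  # filename is an assumption
# Self-contained stand-in: a tiny in-memory CSV with the same column names
sample = io.StringIO(
    "Store,Dept,Weekly_Sales,IsHoliday,Type,Size,Temperature\n"
    "1,1,24924.50,False,A,0.283436,-1.301205\n"
    "1,2,50605.27,False,A,0.283436,-1.301205\n"
)
df = pd.read_csv(sample)
print(df.head())   # first rows of the frame
print(df.shape)    # (rows, columns) of the sample
```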
| | Store | Dept | Weekly_Sales | IsHoliday | Type | Size | Temperature | Fuel_Price | CPI | Unemployment | binned_markdown_1 | binned_markdown_2 | binned_markdown_3 | binned_markdown_4 | binned_markdown_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 24924.50 | False | A | 0.283436 | -1.301205 | -1.56024 | 0.40349 | 0.913194 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 2 | 50605.27 | False | A | 0.283436 | -1.301205 | -1.56024 | 0.40349 | 0.913194 | NaN | NaN | NaN | NaN | NaN |
| 2 | 1 | 3 | 13740.12 | False | A | 0.283436 | -1.301205 | -1.56024 | 0.40349 | 0.913194 | NaN | NaN | NaN | NaN | NaN |
| 3 | 1 | 4 | 39954.04 | False | A | 0.283436 | -1.301205 | -1.56024 | 0.40349 | 0.913194 | NaN | NaN | NaN | NaN | NaN |
| 4 | 1 | 5 | 32229.38 | False | A | 0.283436 | -1.301205 | -1.56024 | 0.40349 | 0.913194 | NaN | NaN | NaN | NaN | NaN |
# your code here
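A minimal sketch of the descriptive-statistics step. A small synthetic frame stands in for the full dataset here so the example runs on its own; in the lab you would call `.describe()` on the loaded dataframe:

```python
import pandas as pd

# Small stand-in frame; in the lab, call .describe() on the full dataset
df = pd.DataFrame({
    'Weekly_Sales': [24924.50, 50605.27, 13740.12, 39954.04],
    'Temperature': [-1.30, -1.30, 0.50, 1.10],
})
stats = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(stats)
print(stats.loc['count', 'Weekly_Sales'])  # number of non-null Weekly_Sales values
```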
| | Store | Dept | Weekly_Sales | Size | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|---|---|
| count | 97839.000000 | 97839.000000 | 97839.000000 | 9.783900e+04 | 9.783900e+04 | 9.783900e+04 | 9.783900e+04 | 9.783900e+04 |
| mean | 5.474545 | 43.318861 | 17223.235591 | -8.044340e-14 | 2.339480e-13 | 4.784098e-13 | -9.181116e-15 | 1.795967e-12 |
| std | 2.892364 | 29.673645 | 25288.572553 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 |
| min | 1.000000 | 1.000000 | -1098.000000 | -1.611999e+00 | -3.843452e+00 | -1.691961e+00 | -1.958762e+00 | -2.776898e+00 |
| 25% | 3.000000 | 19.000000 | 2336.485000 | -1.028620e+00 | -7.087592e-01 | -1.053793e+00 | -1.266966e-01 | -6.503157e-01 |
| 50% | 6.000000 | 36.000000 | 7658.280000 | 2.834360e-01 | 1.340726e-01 | 1.180741e-01 | 4.995210e-01 | -4.621274e-02 |
| 75% | 8.000000 | 71.000000 | 20851.275000 | 1.113495e+00 | 8.680410e-01 | 8.243739e-01 | 6.346144e-01 | 7.089160e-01 |
| max | 10.000000 | 99.000000 | 693099.360000 | 1.171380e+00 | 1.738375e+00 | 2.745691e+00 | 8.517705e-01 | 2.361469e+00 |
# Your observations here
# Your code here
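One common way to check column distributions is a histogram per numeric column. The sketch below uses synthetic data (a right-skewed column imitating sales and a roughly normal one) so it runs on its own; in the lab you would call `.hist()` on the real dataframe:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Weekly_Sales': rng.lognormal(mean=9, sigma=1, size=500),  # right-skewed, like sales
    'Temperature': rng.normal(size=500),                        # roughly normal
})
axes = df.hist(bins=30, figsize=(10, 4))  # one histogram per numeric column
plt.tight_layout()
plt.savefig('distributions.png')
```

A skewed `Weekly_Sales` histogram like this is a typical sign that the normality assumption will be strained.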
# Your observations here
# Your code here
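Box plots are a quick complement to histograms for spotting outliers. Again a sketch over synthetic data; in the lab you would pass the real columns:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({'Weekly_Sales': rng.lognormal(9, 1, 300)})
ax = df.boxplot(column='Weekly_Sales')  # whiskers and points flag extreme values
plt.savefig('boxplot.png')
```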
# State your observations here
Let's use a slightly more advanced plotting technique in seaborn that combines a scatter plot, marginal distributions, a KDE, and a simple regression line, all in a single call. It's called a jointplot. See the official documentation for this method.
Here is how you would use it:
sns.jointplot(x=<column>, y=<column>, data=<dataset>, kind='reg')
A joint plot lets us visually inspect the linearity and normality assumptions in a single step.
# Your code here
# Provide your observations here
OK, so our key assumptions do not hold very strongly at this stage. But that does not mean we should give up and call it a poor dataset. There are a lot of pre-processing techniques we can still apply to further clean the data and make it more suitable for modeling.
For our initial model, we shall use this dataset in a multiple regression experiment; after inspecting the combined effect of all the predictors on the target, we may want to further pre-process the data and take it for another analytical ride.
The key takeaway here is that we will hardly ever come across a real-world dataset that meets all our expectations. Another reason to move ahead with this dataset is to help us realize the importance of pre-processing for improved model building. And we must always remember:
Model development is an iterative process. It hardly ever gets done in the first attempt.
With this in mind, we shall look at some guidelines on model building and validation in the upcoming lessons before we move on to our regression experiment.
Have a look at the following resources on how to deal with complex datasets that don't meet our initial expectations.
What to Do When Bad Data Thwarts Machine Learning Success
Practical advice for analysis of large, complex data sets
Data Cleaning Challenge: Scale and Normalize Data
In this lesson we performed some basic EDA on the Walmart dataset to check regression assumptions. Initially our assumptions don't hold very strongly, but we decided to move ahead with building our first model using this dataset and to plan further pre-processing in the following iterations.