Section Recap - Data Cleaning in Pandas

Introduction

In this section you saw you to wrangle and clean some data in Pandas! This will be a baseline skill that you will use consistently in your work whether its doing sanity checks, cleaning messy data or transforming raw datasets into useful aggregates and views. Having an understanding of the format of your data is essential to critically thinking about how you can manipulate and shape it into new and interesting forms.

Objectives

You will be able to:

Recall various data cleaning techniques
Explain appropriate use cases for various data cleaning techniques

Lambda Functions

We started out by introducing lambda functions. These are quick throw away functions that you can write on the fly. They're very useful for transforming a column feature. For example, you might want to extract the day from a date.

import pandas as pd
dates = pd.Series(['12-01-2017', '12-02-2017', '12-03-2017', '12-04-2017'])
dates.map(lambda x: x.split('-')[1])

0    01
1    02
2    03
3    04
dtype: object

Combining DataFrames

You can combine dataframes by merging them (joining data by a common field) or concatenating them (appending data at the beginning or end).

df1 = pd.DataFrame(dates)
df2 = pd.DataFrame(['12-05-2017', '12-06-2017', '12-07-2017'])
pd.concat([df1, df2])

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	0
0	12-01-2017
1	12-02-2017
2	12-03-2017
3	12-04-2017
0	12-05-2017
1	12-06-2017
2	12-07-2017

Grouping and Aggregating

df = pd.read_csv('titanic.csv')
df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Unnamed: 0	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

grouped = df.groupby(['Pclass', 'Sex'])['Age'].mean().reset_index()
grouped.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Pclass	Sex	Age
0	1	female	34.531646
1	1	male	41.025474
2	2	female	27.757353
3	2	male	30.982234
4	3	female	21.892857

Pivot Tables

pivoted = grouped.pivot(index='Pclass', columns = 'Sex', values='Age')
pivoted

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

Sex	female	male
Pclass
1	34.531646	41.025474
2	27.757353	30.982234
3	21.892857	26.437942
?	32.812500	32.619048

Graphing

import matplotlib.pyplot as plt
%matplotlib inline
pivoted.plot(kind='barh')

<matplotlib.axes._subplots.AxesSubplot at 0x108982198>

Missing Data

print('Top 5 Values before:\n', df.Cabin.value_counts(normalize=True).reset_index()[:5])
#Not a useful means of imputing in most cases, but a simple example to recap
df.Cabin = df['Cabin'].fillna(value="?")
print('Top 5 Values after:\n', df.Cabin.value_counts(normalize=True).reset_index()[:5])

Top 5 Values before:
          index     Cabin
0           G6  0.019608
1  C23 C25 C27  0.019608
2      B96 B98  0.019608
3      C22 C26  0.014706
4         E101  0.014706
Top 5 Values after:
          index     Cabin
0            ?  0.771044
1      B96 B98  0.004489
2  C23 C25 C27  0.004489
3           G6  0.004489
4          F33  0.003367

Summary

In this lesson you started practicing essential ETL skills that you will use throughout your data work to transform and wrangle data into useful forms.

keita-dc / dsc-summary-data-cleaning-pandas-dc-ds-060319 Goto Github PK

dsc-summary-data-cleaning-pandas-dc-ds-060319's Introduction

Section Recap - Data Cleaning in Pandas

Introduction

Objectives

Lambda Functions

Combining DataFrames

Grouping and Aggregating

Pivot Tables

Graphing

Missing Data

Summary

dsc-summary-data-cleaning-pandas-dc-ds-060319's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent