In this section you saw you to wrangle and clean some data in Pandas! This will be a baseline skill that you will use consistently in your work whether its doing sanity checks, cleaning messy data or transforming raw datasets into useful aggregates and views. Having an understanding of the format of your data is essential to critically thinking about how you can manipulate and shape it into new and interesting forms.
You will be able to:
- Recall various data cleaning techniques
- Explain appropriate use cases for various data cleaning techniques
We started out by introducing lambda functions. These are quick throw away functions that you can write on the fly. They're very useful for transforming a column feature. For example, you might want to extract the day from a date.
import pandas as pd
dates = pd.Series(['12-01-2017', '12-02-2017', '12-03-2017', '12-04-2017'])
dates.map(lambda x: x.split('-')[1])
0 01
1 02
2 03
3 04
dtype: object
You can combine dataframes by merging them (joining data by a common field) or concatenating them (appending data at the beginning or end).
df1 = pd.DataFrame(dates)
df2 = pd.DataFrame(['12-05-2017', '12-06-2017', '12-07-2017'])
pd.concat([df1, df2])
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
0 | |
---|---|
0 | 12-01-2017 |
1 | 12-02-2017 |
2 | 12-03-2017 |
3 | 12-04-2017 |
0 | 12-05-2017 |
1 | 12-06-2017 |
2 | 12-07-2017 |
df = pd.read_csv('titanic.csv')
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
grouped = df.groupby(['Pclass', 'Sex'])['Age'].mean().reset_index()
grouped.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Pclass | Sex | Age | |
---|---|---|---|
0 | 1 | female | 34.531646 |
1 | 1 | male | 41.025474 |
2 | 2 | female | 27.757353 |
3 | 2 | male | 30.982234 |
4 | 3 | female | 21.892857 |
pivoted = grouped.pivot(index='Pclass', columns = 'Sex', values='Age')
pivoted
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Sex | female | male |
---|---|---|
Pclass | ||
1 | 34.531646 | 41.025474 |
2 | 27.757353 | 30.982234 |
3 | 21.892857 | 26.437942 |
? | 32.812500 | 32.619048 |
import matplotlib.pyplot as plt
%matplotlib inline
pivoted.plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x108982198>
print('Top 5 Values before:\n', df.Cabin.value_counts(normalize=True).reset_index()[:5])
#Not a useful means of imputing in most cases, but a simple example to recap
df.Cabin = df['Cabin'].fillna(value="?")
print('Top 5 Values after:\n', df.Cabin.value_counts(normalize=True).reset_index()[:5])
Top 5 Values before:
index Cabin
0 G6 0.019608
1 C23 C25 C27 0.019608
2 B96 B98 0.019608
3 C22 C26 0.014706
4 E101 0.014706
Top 5 Values after:
index Cabin
0 ? 0.771044
1 B96 B98 0.004489
2 C23 C25 C27 0.004489
3 G6 0.004489
4 F33 0.003367
In this lesson you started practicing essential ETL skills that you will use throughout your data work to transform and wrangle data into useful forms.