Coder Social home page Coder Social logo

dsc-summary-data-cleaning-pandas-dc-ds-060319's Introduction

Section Recap - Data Cleaning in Pandas

Introduction

In this section you saw you to wrangle and clean some data in Pandas! This will be a baseline skill that you will use consistently in your work whether its doing sanity checks, cleaning messy data or transforming raw datasets into useful aggregates and views. Having an understanding of the format of your data is essential to critically thinking about how you can manipulate and shape it into new and interesting forms.

Objectives

You will be able to:

  • Recall various data cleaning techniques
  • Explain appropriate use cases for various data cleaning techniques

Lambda Functions

We started out by introducing lambda functions. These are quick throw away functions that you can write on the fly. They're very useful for transforming a column feature. For example, you might want to extract the day from a date.

import pandas as pd
dates = pd.Series(['12-01-2017', '12-02-2017', '12-03-2017', '12-04-2017'])
dates.map(lambda x: x.split('-')[1])
0    01
1    02
2    03
3    04
dtype: object

Combining DataFrames

You can combine dataframes by merging them (joining data by a common field) or concatenating them (appending data at the beginning or end).

df1 = pd.DataFrame(dates)
df2 = pd.DataFrame(['12-05-2017', '12-06-2017', '12-07-2017'])
pd.concat([df1, df2])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
0
0 12-01-2017
1 12-02-2017
2 12-03-2017
3 12-04-2017
0 12-05-2017
1 12-06-2017
2 12-07-2017

Grouping and Aggregating

df = pd.read_csv('titanic.csv')
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
grouped = df.groupby(['Pclass', 'Sex'])['Age'].mean().reset_index()
grouped.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Pclass Sex Age
0 1 female 34.531646
1 1 male 41.025474
2 2 female 27.757353
3 2 male 30.982234
4 3 female 21.892857

Pivot Tables

pivoted = grouped.pivot(index='Pclass', columns = 'Sex', values='Age')
pivoted
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Sex female male
Pclass
1 34.531646 41.025474
2 27.757353 30.982234
3 21.892857 26.437942
? 32.812500 32.619048

Graphing

import matplotlib.pyplot as plt
%matplotlib inline
pivoted.plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x108982198>

png

Missing Data

print('Top 5 Values before:\n', df.Cabin.value_counts(normalize=True).reset_index()[:5])
#Not a useful means of imputing in most cases, but a simple example to recap
df.Cabin = df['Cabin'].fillna(value="?")
print('Top 5 Values after:\n', df.Cabin.value_counts(normalize=True).reset_index()[:5])
Top 5 Values before:
          index     Cabin
0           G6  0.019608
1  C23 C25 C27  0.019608
2      B96 B98  0.019608
3      C22 C26  0.014706
4         E101  0.014706
Top 5 Values after:
          index     Cabin
0            ?  0.771044
1      B96 B98  0.004489
2  C23 C25 C27  0.004489
3           G6  0.004489
4          F33  0.003367

Summary

In this lesson you started practicing essential ETL skills that you will use throughout your data work to transform and wrangle data into useful forms.

dsc-summary-data-cleaning-pandas-dc-ds-060319's People

Contributors

mathymitchell avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.