import pandas as pd
df = pd.read_csv('cdc_death_stats.csv')
df.head()
| | Notes | State | State Code | Ten-Year Age Groups | Ten-Year Age Groups Code | Gender | Gender Code | Race | Race Code | Deaths | Population | Crude Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | Alabama | 1 | < 1 year | 1 | Female | F | American Indian or Alaska Native | 1002-5 | 14 | 3579.0 | Unreliable |
| 1 | NaN | Alabama | 1 | < 1 year | 1 | Female | F | Asian or Pacific Islander | A-PI | 24 | 7443.0 | 322.5 |
| 2 | NaN | Alabama | 1 | < 1 year | 1 | Female | F | Black or African American | 2054-5 | 2093 | 169339.0 | 1236.0 |
| 3 | NaN | Alabama | 1 | < 1 year | 1 | Female | F | White | 2106-3 | 2144 | 347921.0 | 616.2 |
| 4 | NaN | Alabama | 1 | < 1 year | 1 | Male | M | Asian or Pacific Islander | A-PI | 33 | 7366.0 | 448.0 |
type(df)
pandas.core.frame.DataFrame
# Attribute (dot) access: pandas shorthand for selecting a column
# Preview a column (a pandas Series)
df.State.head()  # the .head() method works for Series as well!
0 Alabama
1 Alabama
2 Alabama
3 Alabama
4 Alabama
Name: State, dtype: object
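# The dot syntax above only works when the column name is a valid Python identifier
# (no spaces or special characters) and does not clash with an existing DataFrame attribute or method.
# The bracket syntax below always works.
df['State'].tail()  # the general form for selecting a column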
4110 Wyoming
4111 Wyoming
4112 Wyoming
4113 Wyoming
4114 Wyoming
Name: State, dtype: object
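To make the difference between the two access styles concrete, here is a minimal sketch on a toy DataFrame. The frame `toy` and its `count` column are hypothetical stand-ins, not part of the CDC file. It shows that the two forms agree for a simple name, that a name containing a space needs brackets, and that a column named like a DataFrame method is shadowed under dot access.

```python
import pandas as pd

# Toy frame standing in for the CDC data (hypothetical, illustrative values only)
toy = pd.DataFrame({
    "State": ["Alabama", "Wyoming"],
    "State Code": [1, 56],
    "count": [10, 20],  # deliberately named like the DataFrame.count method
})

# For a simple column name, dot access and bracket access return the same Series
same = toy.State.equals(toy["State"])

# A name with a space cannot be written after a dot, so brackets are required
codes = toy["State Code"].tolist()

# Dot access finds the built-in method first, not the column
clash = callable(toy.count)

# Bracket access still reaches the column
values = toy["count"].tolist()
```

Because of cases like `count`, bracket access is the safer default in reusable code.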
df.columns
Index(['Notes', 'State', 'State Code', 'Ten-Year Age Groups',
'Ten-Year Age Groups Code', 'Gender', 'Gender Code', 'Race',
'Race Code', 'Deaths', 'Population', 'Crude Rate'],
dtype='object')
df[df.columns[1:4]].head()
| | State | State Code | Ten-Year Age Groups |
|---|---|---|---|
| 0 | Alabama | 1 | < 1 year |
| 1 | Alabama | 1 | < 1 year |
| 2 | Alabama | 1 | < 1 year |
| 3 | Alabama | 1 | < 1 year |
| 4 | Alabama | 1 | < 1 year |
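Slicing `df.columns` picks columns by position and then looks them up by label. A possible alternative is `.iloc`, which selects by position directly. The sketch below uses a small hypothetical frame named `toy` to check that both approaches return the same result.

```python
import pandas as pd

# Toy frame with a few of the CDC column names (hypothetical values)
toy = pd.DataFrame({
    "Notes": [None, None],
    "State": ["Alabama", "Alabama"],
    "State Code": [1, 1],
    "Ten-Year Age Groups": ["< 1 year", "< 1 year"],
    "Deaths": [14, 24],
})

# Slice the column labels, then select by label (the approach used above)
by_label = toy[toy.columns[1:4]]

# Select all rows and columns 1 through 3 by position
by_position = toy.iloc[:, 1:4]

selected = list(by_position.columns)
```

Both slices exclude the end position, so `1:4` yields the second, third, and fourth columns.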
cols = ['Notes', 'State', 'Population']
df[cols].head()
| | Notes | State | Population |
|---|---|---|---|
| 0 | NaN | Alabama | 3579 |
| 1 | NaN | Alabama | 7443 |
| 2 | NaN | Alabama | 169339 |
| 3 | NaN | Alabama | 347921 |
| 4 | NaN | Alabama | 7366 |
df[['Gender', 'Deaths']].head()
| | Gender | Deaths |
|---|---|---|
| 0 | Female | 14 |
| 1 | Female | 24 |
| 2 | Female | 2093 |
| 3 | Female | 2144 |
| 4 | Male | 33 |
# Keep only the rows where the State column is 'New York' and the Deaths column is greater than 50.
ny_50plus = df[(df['State']=='New York')
& (df['Deaths']>50)]
print(len(df))
print(len(ny_50plus))
ny_50plus.head()
4115
82
| | Notes | State | State Code | Ten-Year Age Groups | Ten-Year Age Groups Code | Gender | Gender Code | Race | Race Code | Deaths | Population | Crude Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2606 | NaN | New York | 36 | < 1 year | 1 | Female | F | Asian or Pacific Islander | A-PI | 485 | 168826.0 | 287.3 |
| 2607 | NaN | New York | 36 | < 1 year | 1 | Female | F | Black or African American | 2054-5 | 3767 | 467735.0 | 805.4 |
| 2608 | NaN | New York | 36 | < 1 year | 1 | Female | F | White | 2106-3 | 6505 | 1456339.0 | 446.7 |
| 2610 | NaN | New York | 36 | < 1 year | 1 | Male | M | Asian or Pacific Islander | A-PI | 626 | 179832.0 | 348.1 |
| 2611 | NaN | New York | 36 | < 1 year | 1 | Male | M | Black or African American | 2054-5 | 4654 | 485909.0 | 957.8 |
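The filter above combines two boolean Series with `&`. Each comparison needs its own parentheses because `&` binds more tightly than `==` and `>`. Here is a minimal sketch on a hypothetical toy frame, which also shows `|` for "or" and the equivalent `DataFrame.query` form.

```python
import pandas as pd

# Toy rows loosely modeled on the CDC columns (hypothetical values)
toy = pd.DataFrame({
    "State": ["New York", "New York", "Alabama", "New York"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Deaths": [485, 12, 2093, 626],
})

# Build the mask separately so it can be inspected or reused
mask = (toy["State"] == "New York") & (toy["Deaths"] > 50)
ny_50plus = toy[mask]

# The same filter written as a query string
ny_query = toy.query("State == 'New York' and Deaths > 50")

# Use | for "or": rows from Alabama, or rows with fewer than 50 deaths
either = toy[(toy["State"] == "Alabama") | (toy["Deaths"] < 50)]
```

Note that filtering keeps the original index labels, which is why the New York rows in the output above start at 2606 rather than 0.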
#Grouping by a single feature
grouped = df.groupby('State')['Deaths'].sum()
grouped.head()
State
Alabama 860780
Alaska 63334
Arizona 838094
Arkansas 522914
California 4307061
Name: Deaths, dtype: int64
# Grouping by multiple features and resetting the index
grouped = df.groupby(['Gender', 'Race'])['Deaths'].sum().reset_index()
grouped.head()
| | Gender | Race | Deaths |
|---|---|---|---|
| 0 | Female | American Indian or Alaska Native | 120827 |
| 1 | Female | Asian or Pacific Islander | 417760 |
| 2 | Female | Black or African American | 2601979 |
| 3 | Female | White | 19427767 |
| 4 | Male | American Indian or Alaska Native | 145492 |
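Grouping by several columns returns a Series whose index is a MultiIndex of the group keys, and `reset_index()` turns those keys back into ordinary columns. The sketch below, on a hypothetical toy frame, shows both steps and the `as_index=False` shortcut, which should give the same flat result in one call.

```python
import pandas as pd

# Toy rows (hypothetical values) with the same grouping columns as above
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male"],
    "Race": ["White", "White", "White", "Asian or Pacific Islander"],
    "Deaths": [10, 20, 5, 7],
})

# The result is a Series indexed by (Gender, Race) pairs
with_index = toy.groupby(["Gender", "Race"])["Deaths"].sum()

# reset_index() moves the group keys back into regular columns
flat = with_index.reset_index()

# as_index=False skips the intermediate MultiIndex
flat_direct = toy.groupby(["Gender", "Race"], as_index=False)["Deaths"].sum()

deaths = flat["Deaths"].tolist()
```

The flat form is usually easier to filter, merge, or plot than the MultiIndex form.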
Thus far we've primarily worked with the pyplot module within matplotlib.
Also recall the IPython magic command for displaying graphs inline within notebooks:
import matplotlib.pyplot as plt
%matplotlib inline
# df.Population = df.Population.astype(int)
to_plot = df.groupby('State').Deaths.sum().sort_values(ascending=False)
to_plot.head(2)
State
California 4307061
Florida 3131111
Name: Deaths, dtype: int64
to_plot.head(10).plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x10da3d198>
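The `.plot()` call returns a matplotlib Axes object (the `AxesSubplot` shown in the output), so the usual Axes methods can be used to label the chart. The following is a minimal sketch that uses three of the state totals printed earlier. The `Agg` backend line is an assumption made so the code runs outside a notebook; inside a notebook with `%matplotlib inline` it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; only needed outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Three state totals copied from the outputs above
totals = pd.Series(
    {"California": 4307061, "Florida": 3131111, "Alaska": 63334},
    name="Deaths",
).sort_values(ascending=False)

# Series.plot returns the Axes it drew on
ax = totals.plot(kind="barh")

# barh draws the first row at the bottom, so flip the axis to put the largest on top
ax.invert_yaxis()
ax.set_xlabel("Deaths")
ax.set_title("Total deaths by state")
plt.tight_layout()
```

Assigning the result to `ax` also keeps the Axes repr from being echoed as the cell output.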
Another very useful package that sits on top of matplotlib is seaborn. Seaborn helps with figure aesthetics and gives your graphs better default styling.
import seaborn as sns
One easy thing to do is change the figure aesthetic of all future graphs. You can do this by setting a seaborn style with one line:
sns.set_style('darkgrid')
Then simply rerun our previous code:
to_plot.head(10).plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x1a1aeb1710>
Voila! Notice that nice background thanks to our seaborn style!
Another nice feature is color palettes! Here are a few examples:
current_palette = sns.color_palette() #Save a color palette to a variable
sns.palplot(current_palette) #Preview color palette
sns.palplot(sns.color_palette("Paired"))
sns.palplot(sns.color_palette("Blues"))
And there are many, many more! For a more complete description of the color palettes available in seaborn, check out the documentation here: https://seaborn.pydata.org/tutorial/color_palettes.html
color_palette = sns.color_palette("RdBu_r", 10)  # the number represents how many colors you want
to_plot.head(10).plot(kind='barh', color = color_palette)
<matplotlib.axes._subplots.AxesSubplot at 0x1a1b4e38d0>
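It can help to look at what `sns.color_palette` actually returns: a list-like object of RGB tuples, one per requested color, which is why it can be passed directly as the `color` argument for the ten bars. A minimal sketch, assuming seaborn is installed:

```python
import seaborn as sns

# Ask for 10 colors from the reversed red-blue diverging palette
palette = sns.color_palette("RdBu_r", 10)

n_colors = len(palette)

# Each entry is an (r, g, b) tuple of floats between 0 and 1
first = palette[0]

# The palette can also be converted to hex strings, e.g. for use in CSS or other tools
hex_codes = palette.as_hex()
```

Requesting the same number of colors as bars gives each bar its own shade.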