import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('output.csv')
print(len(df))
df.head()
885548
| | Year | Month | State | County | Rate |
|---|---|---|---|---|---|
| 0 | 2015 | February | Mississippi | Newton County | 6.1 |
| 1 | 2015 | February | Mississippi | Panola County | 9.4 |
| 2 | 2015 | February | Mississippi | Monroe County | 7.9 |
| 3 | 2015 | February | Mississippi | Hinds County | 6.1 |
| 4 | 2015 | February | Mississippi | Kemper County | 10.6 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 885548 entries, 0 to 885547
Data columns (total 5 columns):
Year 885548 non-null int64
Month 885548 non-null object
State 885548 non-null object
County 885548 non-null object
Rate 885548 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 33.8+ MB
state_avg = df.groupby('State')['Rate'].mean() #Aggregate the data
state_avg = state_avg.sort_values() #Sort the Aggregation
state_avg.head() #Preview the series
State
Nebraska 3.109903
North Dakota 3.848084
South Dakota 4.097629
Kansas 4.178851
Iowa 4.236744
Name: Rate, dtype: float64
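To check the aggregation by hand, here is a minimal sketch of the same groupby, mean, and sort steps on a tiny made-up frame. The name `df_toy` and its values are invented for illustration and are not taken from output.csv.

```python
import pandas as pd

# Made-up data: two "states" with a handful of county rates each
df_toy = pd.DataFrame({
    'State': ['A', 'A', 'B', 'B', 'B'],
    'Rate':  [4.0, 6.0, 3.0, 3.0, 6.0],
})

# Same three steps as above: group, average, sort ascending
toy_avg = df_toy.groupby('State')['Rate'].mean().sort_values()
print(toy_avg)
```

Note that every row counts equally in the mean, so a state's average here is an unweighted average over its county-month rows.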
#Create a bar graph from the series
state_avg.plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x1e8269b0c88>
import matplotlib.pyplot as plt
plt.figure(figsize=(10,8)) #Manually creates a figure object and specifies the size (will be useful for subplots later too)
state_avg.plot(kind='barh') #Same Visual code as before
plt.title('Average County Unemployment Rates by State', fontsize=16)
plt.xlabel('Average Unemployment Rate') #Add Axis Label (y already labelled)
<matplotlib.text.Text at 0x1e826f139e8>
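As an aside, the same figure can probably also be built with matplotlib's object-oriented interface, which tends to be easier to manage once several subplots are involved. The sketch below is one possible equivalent, not the notebook's own approach; `toy_avg` is a stand-in series with values rounded from the preview above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for state_avg (values rounded from the preview above)
toy_avg = pd.Series(
    {'Nebraska': 3.1, 'North Dakota': 3.8, 'South Dakota': 4.1},
    name='Rate',
)

# plt.subplots returns both the figure and an axes object to draw on
fig, ax = plt.subplots(figsize=(10, 8))
toy_avg.plot(kind='barh', ax=ax)  # draw the bars on that specific axes
ax.set_title('Average County Unemployment Rates by State', fontsize=16)
ax.set_xlabel('Average Unemployment Rate')
```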
Again, seaborn can also improve the aesthetics of our visual.
import seaborn as sns
sns.set_style('darkgrid')
#Same code as above
plt.figure(figsize=(10,8)) #Manually creates a figure object and specifies the size (will be useful for subplots later too)
state_avg.plot(kind='barh') #Same Visual code as before
plt.title('Average County Unemployment Rates by State', fontsize=16)
plt.xlabel('Average Unemployment Rate') #Add Axis Label (y already labelled)
<matplotlib.text.Text at 0x1e82739b0f0>
ny_rates = df[df.State=='New York'].Rate
print(len(ny_rates), type(ny_rates), ny_rates[:5])
20088 <class 'pandas.core.series.Series'> 2453 6.6
2454 6.9
2455 5.8
2456 8.1
2457 7.4
Name: Rate, dtype: float64
plt.boxplot(list(ny_rates))
{'boxes': [<matplotlib.lines.Line2D at 0x1e8282862b0>],
'caps': [<matplotlib.lines.Line2D at 0x1e8282a8710>,
<matplotlib.lines.Line2D at 0x1e8282a8d68>],
'fliers': [<matplotlib.lines.Line2D at 0x1e827739a58>],
'means': [],
'medians': [<matplotlib.lines.Line2D at 0x1e8282a8eb8>],
'whiskers': [<matplotlib.lines.Line2D at 0x1e828286a20>,
<matplotlib.lines.Line2D at 0x1e828286b70>]}
The rectangular box of the box-and-whisker plot is bounded by the 25th percentile at the bottom and the 75th percentile at the top; the colored line inside the box marks the median.
The median is the middle value of the sorted data: half of the other data points lie above it and half lie below it.
#The center line above (currently orange) is the median.
ny_rates.median()
5.8
The top and bottom of the middle rectangle surrounding the median are the upper and lower quartiles. These are also known as the 75th percentile and the 25th percentile. They can also be thought of as the median of the upper half of the data and the median of the lower half of the data.
* 25% of the data falls between the minimum and the 25th percentile
* 25% of the data falls between the 25th percentile and the median
* 25% of the data falls between the median and the 75th percentile
* 25% of the data falls between the 75th percentile and the maximum
print('25th percentile:', ny_rates.quantile(q=.25))
print('75th percentile:', ny_rates.quantile(q=.75))
25th percentile: 4.6
75th percentile: 7.6
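To see the four 25% segments concretely, the following sketch computes the quartiles of a made-up, evenly spaced series and counts the share of points in each segment. The data and the `shares` dictionary are illustrative assumptions, not values from output.csv; on real data with ties, the shares will only be approximately 25%.

```python
import pandas as pd

# Made-up, evenly spaced "rates" from 0.1 to 10.0
rates = pd.Series(range(1, 101)) / 10

# quantile() accepts a list and returns the three cut points together
q1, median, q3 = rates.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range: the height of the box

# Share of observations falling in each of the four segments
shares = {
    'min to Q1':    (rates <= q1).mean(),
    'Q1 to median': ((rates > q1) & (rates <= median)).mean(),
    'median to Q3': ((rates > median) & (rates <= q3)).mean(),
    'Q3 to max':    (rates > q3).mean(),
}
print(q1, median, q3, iqr)
print(shares)
```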
The whiskers of the box-and-whisker plot can be specified in a couple of different ways.
Here are the notes from the docstring (reading them is also good practice for working with documentation):
whis : float, sequence, or string (default = 1.5)
As a float, determines the reach of the whiskers to the beyond the first and third quartiles. In other words, where IQR is the interquartile range (Q3-Q1), the upper whisker will extend to last datum less than Q3 + whis*IQR). Similarly, the lower whisker will extend to the first datum greater than Q1 - whis*IQR. Beyond the whiskers, data are considered outliers and are plotted as individual points. Set this to an unreasonably high value to force the whiskers to show the min and max values. Alternatively, set this to an ascending sequence of percentile (e.g., [5, 95]) to set the whiskers at specific percentiles of the data. Finally, whis can be the string 'range' to force the whiskers to the min and max of the data.
#Remember you can pull up the full docstring
plt.boxplot?
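The default rule described in the docstring (whiskers reach the most extreme data points within `whis*IQR` of the box) can be reproduced by hand. This is a minimal sketch on made-up rates, assuming the default `whis=1.5`; the variable names are illustrative.

```python
import pandas as pd

# Made-up rates for illustration; not taken from output.csv
rates = pd.Series([1.6, 3.8, 4.6, 5.2, 5.8, 6.4, 7.6, 9.1, 18.3])

whis = 1.5
q1, q3 = rates.quantile(0.25), rates.quantile(0.75)
iqr = q3 - q1

# Limits beyond which a point is treated as an outlier
lower_fence = q1 - whis * iqr
upper_fence = q3 + whis * iqr

# Whiskers stop at the most extreme data points still inside the fences
inside = rates[(rates >= lower_fence) & (rates <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Everything outside the fences would be drawn as an individual point
outliers = rates[(rates < lower_fence) | (rates > upper_fence)]

print('whiskers:', lower_whisker, upper_whisker)
print('outliers:', list(outliers))
```

In this toy series only 18.3 lies beyond the upper fence, so the upper whisker stops at 9.1 rather than at the maximum.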
plt.boxplot(list(ny_rates), whis=[5,95]) #Whiskers are now set to the 5th and 95th percentiles rather than the outlier metric
{'boxes': [<matplotlib.lines.Line2D at 0x1e82df69240>],
'caps': [<matplotlib.lines.Line2D at 0x1e82df62cc0>,
<matplotlib.lines.Line2D at 0x1e82df62eb8>],
'fliers': [<matplotlib.lines.Line2D at 0x1e82df9bf98>],
'means': [],
'medians': [<matplotlib.lines.Line2D at 0x1e82df9b748>],
'whiskers': [<matplotlib.lines.Line2D at 0x1e82df69be0>,
<matplotlib.lines.Line2D at 0x1e82df69e10>]}
plt.boxplot(list(ny_rates), whis=[1,99]) #Whiskers are now set to the 1st and 99th percentiles rather than the outlier metric
{'boxes': [<matplotlib.lines.Line2D at 0x1e82df24940>],
'caps': [<matplotlib.lines.Line2D at 0x1e82812bda0>,
<matplotlib.lines.Line2D at 0x1e82812bef0>],
'fliers': [<matplotlib.lines.Line2D at 0x1e828132be0>],
'means': [],
'medians': [<matplotlib.lines.Line2D at 0x1e828132588>],
'whiskers': [<matplotlib.lines.Line2D at 0x1e82df24b38>,
<matplotlib.lines.Line2D at 0x1e82812b748>]}
plt.boxplot(list(ny_rates), whis=[5,95], showfliers=False) #Do not display points above/below whiskers
{'boxes': [<matplotlib.lines.Line2D at 0x1e828c47d30>],
'caps': [<matplotlib.lines.Line2D at 0x1e828c557f0>,
<matplotlib.lines.Line2D at 0x1e828c55fd0>],
'fliers': [],
'means': [],
'medians': [<matplotlib.lines.Line2D at 0x1e828c5b278>],
'whiskers': [<matplotlib.lines.Line2D at 0x1e828c47f28>,
<matplotlib.lines.Line2D at 0x1e828c4ef60>]}
ny_rates.min()
1.6000000000000001
ny_rates.max()
18.300000000000001
ny_rates.quantile(q=.1)
3.8
ny_rates.quantile(q=.9)
9.1
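With percentile whiskers, the share of points drawn beyond the whiskers is fixed by the percentiles chosen, not by how spread out the data are. The sketch below checks this on made-up data; `share_outside` is a hypothetical helper written for this illustration, not a pandas or matplotlib function.

```python
import pandas as pd

def share_outside(values, low_pct, high_pct):
    """Fraction of values strictly outside the given percentile whiskers."""
    low = values.quantile(low_pct / 100)
    high = values.quantile(high_pct / 100)
    return ((values < low) | (values > high)).mean()

# Made-up data: the integers 1 through 200
toy = pd.Series(range(1, 201))

print(share_outside(toy, 5, 95))  # roughly 10% of points lie beyond [5, 95] whiskers
print(share_outside(toy, 1, 99))  # roughly 2% of points lie beyond [1, 99] whiskers
```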
ny = df[df.State == 'New York']
ny.head(2)
| | Year | Month | State | County | Rate |
|---|---|---|---|---|---|
| 2453 | 2015 | February | New York | Livingston County | 6.6 |
| 2454 | 2015 | February | New York | Wayne County | 6.9 |
ny_monthly = ny.groupby(['Year', 'Month'])['Rate'].mean()
ny_monthly.head(2)
Year Month
1990 April 5.470968
August 4.350000
Name: Rate, dtype: float64
ny_monthly.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1e82782f3c8>
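One thing to watch in this plot: `Month` is stored as text, so grouping by `['Year', 'Month']` orders the months alphabetically within each year (April, August, ...) rather than chronologically. A possible fix, sketched here on a made-up stand-in for `ny`, is to build a real date column first. The names `ny_toy` and `Date` are assumptions for illustration, and the `'%B'` format assumes English month names.

```python
import pandas as pd

# Made-up stand-in for the `ny` frame
ny_toy = pd.DataFrame({
    'Year':  [1990, 1990, 1990, 1990],
    'Month': ['April', 'August', 'January', 'January'],
    'Rate':  [5.4, 4.3, 5.0, 6.0],
})

# Combine year and month name into a timestamp (first day of the month)
ny_toy['Date'] = pd.to_datetime(
    ny_toy['Year'].astype(str) + ' ' + ny_toy['Month'],
    format='%Y %B',
)

# Grouping on the date column sorts the result chronologically
monthly_toy = ny_toy.groupby('Date')['Rate'].mean()
print(monthly_toy)
```

Calling `.plot()` on a series indexed this way should draw the months in calendar order.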
Use plt methods like .figure(), .title(), and .ylabel() to improve upon the simple time plot shown above.
# Your code here