This checkpoint covers many of the basic tasks you might need to do with Pandas! At the end of an hour, commit and push what you have. (Remember, you can always return to this notebook later for practice.)
# Run this import cell without changes
#data manipulation
import pandas as pd
#dataset
# NOTE: load_boston was removed in scikit-learn 1.2, so this import requires an earlier release
from sklearn.datasets import load_boston
# Run this cell without changes
boston = load_boston()
The variable `boston` is now a dictionary-like object (a scikit-learn `Bunch`) with several key-value pairs containing different aspects of the Boston Housing dataset.
# Your code here
#__SOLUTION__
boston.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
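The same dictionary-style inspection works on other bundled scikit-learn datasets. The sketch below uses `load_iris` purely as a stand-in (an assumption for illustration, not part of this checkpoint) to show that keys can be read either with brackets or as attributes.

```python
from sklearn.datasets import load_iris

# load_iris returns a Bunch: a dictionary that also allows attribute access
iris = load_iris()

# List the available keys, the same way boston.keys() is used above
keys = list(iris.keys())
print(keys)

# Bracket access and attribute access return the same underlying array
same_object = iris['data'] is iris.data
print(same_object)
print(iris['data'].shape)
```

Bracket access (`boston['data']`) is the style used throughout this checkpoint.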
# Your code here
#__SOLUTION__
print(boston['DESCR'])
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
Create a dataframe named `df_boston` with the data contained in the key `data`. Make the column names of `df_boston` the values from the key `feature_names`.
# Your code here
#__SOLUTION__
df_boston = pd.DataFrame(boston['data'], columns=boston['feature_names'])
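As a minimal sketch of the same construction, the toy example below builds a dataframe from a 2-D NumPy array and a separate array of column names. The array values and the name `df_toy` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for boston['data'] and boston['feature_names'] (values are made up)
data = np.array([[0.1, 18.0], [0.2, 0.0], [0.3, 12.5]])
feature_names = np.array(['CRIM', 'ZN'])

# Each row of the array becomes a row; `columns` supplies one label per array column
df_toy = pd.DataFrame(data, columns=feature_names)

print(df_toy.shape)
print(list(df_toy.columns))
```

The number of names passed to `columns` has to match the number of columns in the array.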
The key `target` contains the median value of owner-occupied homes, in $1000's, for each row of the dataset.
# Your code here
#__SOLUTION__
df_boston['MEDV'] = boston['target']
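Assigning an array to a new column pairs values with rows by position, so the lengths must agree. A small sketch with made-up values (the names `df_toy`, `target`, and `BAD` are hypothetical):

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({'RM': [6.5, 6.4, 7.1]})
target = np.array([24.0, 21.6, 34.7])

# Values are assigned by position, so the array length must equal the row count
df_toy['MEDV'] = target

# A length mismatch raises ValueError rather than silently padding
try:
    df_toy['BAD'] = np.array([1.0, 2.0])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(df_toy)
print(mismatch_raised)
```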
# Your code here
#__SOLUTION__
df_boston.head()
 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
# Your code here
#__SOLUTION__
df_boston.describe()
 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
Check the datatypes of all columns, and see how many nulls are in each column, using the `info` method.
# Your code here
#__SOLUTION__
df_boston.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null float64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null float64
9 TAX 506 non-null float64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 506 non-null float64
13 MEDV 506 non-null float64
dtypes: float64(14)
memory usage: 55.5 KB
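`info` reports non-null counts, so the number of nulls has to be inferred by comparing against the total row count. As a complementary sketch, `isna().sum()` gives the null count per column directly; the toy frame below, with one deliberately missing value, is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in AGE (values are made up)
df_toy = pd.DataFrame({
    'AGE': [65.2, np.nan, 61.1],
    'NOX': [0.538, 0.469, 0.469],
})

# isna() marks missing cells as True; summing counts them per column
null_counts = df_toy.isna().sum()

print(null_counts)
print(df_toy.dtypes)
```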
Select all values from the column that contains the weighted distances to five Boston employment centres.
Hint: the dataset description you printed in a cell above explains what each variable contains.
# Your code here
#__SOLUTION__
df_boston['DIS']
0 4.0900
1 4.9671
2 4.9671
3 6.0622
4 6.0622
...
501 2.4786
502 2.2875
503 2.1675
504 2.3889
505 2.5050
Name: DIS, Length: 506, dtype: float64
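Selecting with a single column label returns a Series, as in the output above, while selecting with a list of labels returns a DataFrame. A short sketch on made-up data:

```python
import pandas as pd

df_toy = pd.DataFrame({'DIS': [4.09, 4.9671], 'RAD': [1.0, 2.0]})

# Single label -> one-dimensional Series
as_series = df_toy['DIS']

# List of labels -> two-dimensional DataFrame, even with one column
as_frame = df_toy[['DIS']]

print(type(as_series).__name__)
print(type(as_frame).__name__)
```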
# Your code here
#__SOLUTION__
df_boston.loc[10:20, ['AGE', 'NOX', 'MEDV']]
 | AGE | NOX | MEDV |
---|---|---|---|
10 | 94.3 | 0.524 | 15.0 |
11 | 82.9 | 0.524 | 18.9 |
12 | 39.0 | 0.524 | 21.7 |
13 | 61.8 | 0.538 | 20.4 |
14 | 84.5 | 0.538 | 18.2 |
15 | 56.5 | 0.538 | 19.9 |
16 | 29.3 | 0.538 | 23.1 |
17 | 81.7 | 0.538 | 17.5 |
18 | 36.6 | 0.538 | 20.2 |
19 | 69.5 | 0.538 | 18.2 |
20 | 98.1 | 0.538 | 13.6 |
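The table above has 11 rows because `.loc` slices by label and includes both endpoints. The sketch below contrasts this with `.iloc`, which slices by position and excludes the stop value; the toy frame is made up for illustration.

```python
import pandas as pd

# Toy frame with a default 0..29 integer index (values are made up)
df_toy = pd.DataFrame({
    'AGE': list(range(30)),
    'NOX': list(range(30)),
    'MEDV': list(range(30)),
})

# Label-based: rows labelled 10 through 20, inclusive
by_label = df_toy.loc[10:20, ['AGE', 'MEDV']]

# Position-based: positions 10 up to, but not including, 20
by_position = df_toy.iloc[10:20]

print(len(by_label))
print(len(by_position))
```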
# Your code here
#__SOLUTION__
mask = (
    (df_boston['NOX'] > 0.7) &
    (df_boston['CRIM'] > 8)
)
df_boston[mask]
 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
356 | 8.98296 | 0.0 | 18.1 | 1.0 | 0.770 | 6.212 | 97.4 | 2.1222 | 24.0 | 666.0 | 20.2 | 377.73 | 17.60 | 17.8 |
419 | 11.81230 | 0.0 | 18.1 | 0.0 | 0.718 | 6.824 | 76.5 | 1.7940 | 24.0 | 666.0 | 20.2 | 48.45 | 22.74 | 8.4 |
420 | 11.08740 | 0.0 | 18.1 | 0.0 | 0.718 | 6.411 | 100.0 | 1.8589 | 24.0 | 666.0 | 20.2 | 318.75 | 15.02 | 16.7 |
434 | 13.91340 | 0.0 | 18.1 | 0.0 | 0.713 | 6.208 | 95.0 | 2.2222 | 24.0 | 666.0 | 20.2 | 100.63 | 15.17 | 11.7 |
435 | 11.16040 | 0.0 | 18.1 | 0.0 | 0.740 | 6.629 | 94.6 | 2.1247 | 24.0 | 666.0 | 20.2 | 109.85 | 23.27 | 13.4 |
436 | 14.42080 | 0.0 | 18.1 | 0.0 | 0.740 | 6.461 | 93.3 | 2.0026 | 24.0 | 666.0 | 20.2 | 27.49 | 18.05 | 9.6 |
437 | 15.17720 | 0.0 | 18.1 | 0.0 | 0.740 | 6.152 | 100.0 | 1.9142 | 24.0 | 666.0 | 20.2 | 9.32 | 26.45 | 8.7 |
438 | 13.67810 | 0.0 | 18.1 | 0.0 | 0.740 | 5.935 | 87.9 | 1.8206 | 24.0 | 666.0 | 20.2 | 68.95 | 34.02 | 8.4 |
439 | 9.39063 | 0.0 | 18.1 | 0.0 | 0.740 | 5.627 | 93.9 | 1.8172 | 24.0 | 666.0 | 20.2 | 396.90 | 22.88 | 12.8 |
440 | 22.05110 | 0.0 | 18.1 | 0.0 | 0.740 | 5.818 | 92.4 | 1.8662 | 24.0 | 666.0 | 20.2 | 391.45 | 22.11 | 10.5 |
441 | 9.72418 | 0.0 | 18.1 | 0.0 | 0.740 | 6.406 | 97.2 | 2.0651 | 24.0 | 666.0 | 20.2 | 385.96 | 19.52 | 17.1 |
443 | 9.96654 | 0.0 | 18.1 | 0.0 | 0.740 | 6.485 | 100.0 | 1.9784 | 24.0 | 666.0 | 20.2 | 386.73 | 18.85 | 15.4 |
444 | 12.80230 | 0.0 | 18.1 | 0.0 | 0.740 | 5.854 | 96.6 | 1.8956 | 24.0 | 666.0 | 20.2 | 240.52 | 23.79 | 10.8 |
445 | 10.67180 | 0.0 | 18.1 | 0.0 | 0.740 | 6.459 | 94.8 | 1.9879 | 24.0 | 666.0 | 20.2 | 43.06 | 23.98 | 11.8 |
447 | 9.92485 | 0.0 | 18.1 | 0.0 | 0.740 | 6.251 | 96.6 | 2.1980 | 24.0 | 666.0 | 20.2 | 388.52 | 16.44 | 12.6 |
448 | 9.32909 | 0.0 | 18.1 | 0.0 | 0.713 | 6.185 | 98.7 | 2.2616 | 24.0 | 666.0 | 20.2 | 396.90 | 18.13 | 14.1 |
453 | 8.24809 | 0.0 | 18.1 | 0.0 | 0.713 | 7.393 | 99.3 | 2.4527 | 24.0 | 666.0 | 20.2 | 375.87 | 16.74 | 17.8 |
454 | 9.51363 | 0.0 | 18.1 | 0.0 | 0.713 | 6.728 | 94.1 | 2.4961 | 24.0 | 666.0 | 20.2 | 6.68 | 18.71 | 14.9 |
457 | 8.20058 | 0.0 | 18.1 | 0.0 | 0.713 | 5.936 | 80.3 | 2.7792 | 24.0 | 666.0 | 20.2 | 3.50 | 16.94 | 13.5 |
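The filter above combines two boolean Series with `&`. Each comparison needs its own parentheses because `&` binds more tightly than `>`. The sketch below repeats the pattern on made-up data and shows `DataFrame.query` as one alternative spelling.

```python
import pandas as pd

# Toy data (values are made up)
df_toy = pd.DataFrame({
    'NOX': [0.5, 0.75, 0.8, 0.72],
    'CRIM': [9.0, 10.0, 1.0, 8.5],
})

# Element-wise AND of two boolean Series; parentheses are required
mask = (df_toy['NOX'] > 0.7) & (df_toy['CRIM'] > 8)
filtered = df_toy[mask]

# The same condition expressed as a query string
queried = df_toy.query('NOX > 0.7 and CRIM > 8')

print(filtered)
```

Both forms keep the original index labels of the rows that pass the filter.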
# Your code here
#__SOLUTION__
df_boston['MEDV*TAX'] = df_boston['MEDV']*df_boston['TAX']
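Multiplying two columns works element by element, row by row. Since a name like `MEDV*TAX` is not a valid Python identifier, the new column can only be reached with bracket notation. A small sketch with made-up values:

```python
import pandas as pd

df_toy = pd.DataFrame({'MEDV': [24.0, 21.5], 'TAX': [296.0, 242.0]})

# Row-wise product of the two columns, stored under a new label
df_toy['MEDV*TAX'] = df_toy['MEDV'] * df_toy['TAX']

# Bracket access is required because of the '*' in the column name
product = df_toy['MEDV*TAX'].tolist()
print(product)
```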
# Your code here
#__SOLUTION__
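# Mean MEDV for tracts that bound the Charles River (CHAS == 1)
val = df_boston.loc[df_boston['CHAS'] == 1, 'MEDV'].mean()
# MEDV is recorded in $1000's, so convert to dollars
val = val * 1000
val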
28439.999999999996
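The trailing digits in the value above come from floating-point arithmetic, so rounding before reporting is reasonable. As a hedged sketch on made-up data, `groupby` computes the mean for both values of `CHAS` at once, and `round` tidies the result. The names `df_toy`, `mean_by_chas`, and `river_mean_dollars` are hypothetical.

```python
import pandas as pd

# Toy data: two river tracts (CHAS == 1) and two non-river tracts (values are made up)
df_toy = pd.DataFrame({
    'CHAS': [1.0, 0.0, 1.0, 0.0],
    'MEDV': [20.0, 10.0, 30.0, 14.0],
})

# Mean MEDV for each value of CHAS
mean_by_chas = df_toy.groupby('CHAS')['MEDV'].mean()

# Convert from $1000's to dollars and round to cents
river_mean_dollars = round(mean_by_chas.loc[1.0] * 1000, 2)

print(mean_by_chas)
print(river_mean_dollars)
```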
# Your written answer here
#__SOLUTION__
'''The average median value of houses located along the Charles River is $28,440'''