Pandas 101

This checkpoint covers many of the basic tasks you might need to do with Pandas! At the end of an hour, commit and push what you have. Remember, you can always return to this notebook later for practice.

# Run this import cell without changes

#data manipulation
import pandas as pd

#dataset
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston

#__SOLUTION__
# Run this import cell without changes

#data manipulation
import pandas as pd

#dataset
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston

Loading in the Boston Housing Dataset

# Run this cell without changes
boston = load_boston()

#__SOLUTION__
boston = load_boston()

The variable `boston` is now a dictionary-like object (a scikit-learn `Bunch`) whose key-value pairs hold different parts of the Boston Housing dataset.

What are the keys to `boston`?

# Your code here

#__SOLUTION__
boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

Use the `print` function to print out the metadata for the dataset contained in the key `DESCR`.

# Your code here

#__SOLUTION__
print(boston['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Create a dataframe named `df_boston` from the data contained in the key `data`. Use the values from the key `feature_names` as the column names of `df_boston`.

# Your code here

#__SOLUTION__
    
df_boston = pd.DataFrame(boston['data'], columns=boston['feature_names'])
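
The same pattern can be tried on a tiny stand-in before running it on the real data. The sketch below is only an illustration: `toy` is a hypothetical plain dict with made-up numbers, not the object returned by `load_boston`, and `df_toy` is a name chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset object: a plain dict with the same
# two keys the exercise uses. The numbers are made up for illustration.
toy = {
    "data": np.array([[0.1, 6.5], [0.2, 6.1], [0.3, 7.0]]),
    "feature_names": ["CRIM", "RM"],
}

# Each row of the 2D array becomes a row of the dataframe;
# each feature name becomes a column label, in the same order.
df_toy = pd.DataFrame(toy["data"], columns=toy["feature_names"])

print(df_toy.shape)          # (3, 2)
print(list(df_toy.columns))  # ['CRIM', 'RM']
```

The number of names passed to `columns` has to match the number of columns in the array, otherwise pandas raises an error.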

The key `target` contains the median value of owner-occupied homes for each row, in thousands of dollars.

Add a column named "MEDV" to your dataframe that contains the median home values from `target`.

# Your code here

#__SOLUTION__

df_boston['MEDV'] = boston['target']

Data Exploration

Show the first 5 rows of the dataframe with the `head` method

# Your code here

#__SOLUTION__
df_boston.head()

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
0	0.00632	18.0	2.31	0.0	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.0	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.0	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.0	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.0	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2

Show the summary statistics of all columns with the `describe` method

# Your code here

#__SOLUTION__
df_boston.describe()

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
count	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000
mean	3.613524	11.363636	11.136779	0.069170	0.554695	6.284634	68.574901	3.795043	9.549407	408.237154	18.455534	356.674032	12.653063	22.532806
std	8.601545	23.322453	6.860353	0.253994	0.115878	0.702617	28.148861	2.105710	8.707259	168.537116	2.164946	91.294864	7.141062	9.197104
min	0.006320	0.000000	0.460000	0.000000	0.385000	3.561000	2.900000	1.129600	1.000000	187.000000	12.600000	0.320000	1.730000	5.000000
25%	0.082045	0.000000	5.190000	0.000000	0.449000	5.885500	45.025000	2.100175	4.000000	279.000000	17.400000	375.377500	6.950000	17.025000
50%	0.256510	0.000000	9.690000	0.000000	0.538000	6.208500	77.500000	3.207450	5.000000	330.000000	19.050000	391.440000	11.360000	21.200000
75%	3.677083	12.500000	18.100000	0.000000	0.624000	6.623500	94.075000	5.188425	24.000000	666.000000	20.200000	396.225000	16.955000	25.000000
max	88.976200	100.000000	27.740000	1.000000	0.871000	8.780000	100.000000	12.126500	24.000000	711.000000	22.000000	396.900000	37.970000	50.000000

Check the datatypes of all columns, and see how many nulls are in each column, using the `info` method

# Your code here

#__SOLUTION__

df_boston.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB
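
`info` reports non-null counts, so the number of nulls has to be read off by comparing each count with the number of entries. As a complementary sketch (not part of the original solution), nulls and datatypes can also be inspected directly. The dataframe `df_toy` below is made up for illustration and deliberately contains missing values, unlike the Boston data.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing values, for illustration only.
df_toy = pd.DataFrame({
    "RM": [6.5, np.nan, 7.0],
    "AGE": [65.2, 78.9, np.nan],
    "TAX": [296.0, 242.0, 242.0],
})

# isna() marks each missing cell as True; sum() counts the Trues per column.
null_counts = df_toy.isna().sum()
print(null_counts)

# dtypes lists the datatype of each column without the rest of the info() report.
print(df_toy.dtypes)
```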

Data Selection

Select all values from the column that contains the weighted distances to five Boston employment centres

Hint: the dataset description you printed in a cell above explains what each variable contains

# Your code here

#__SOLUTION__
df_boston['DIS']

0      4.0900
1      4.9671
2      4.9671
3      6.0622
4      6.0622
        ...  
501    2.4786
502    2.2875
503    2.1675
504    2.3889
505    2.5050
Name: DIS, Length: 506, dtype: float64

Select rows 10 through 20 (inclusive) from the AGE, NOX, and MEDV columns

# Your code here

#__SOLUTION__
df_boston.loc[10:20, ['AGE', 'NOX', 'MEDV']]

	AGE	NOX	MEDV
10	94.3	0.524	15.0
11	82.9	0.524	18.9
12	39.0	0.524	21.7
13	61.8	0.538	20.4
14	84.5	0.538	18.2
15	56.5	0.538	19.9
16	29.3	0.538	23.1
17	81.7	0.538	17.5
18	36.6	0.538	20.2
19	69.5	0.538	18.2
20	98.1	0.538	13.6
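
Note that the output has 11 rows: `loc` slices by label and includes both endpoints, while `iloc` slices by position and excludes the end, like ordinary Python slicing. A small sketch of the difference, using a made-up dataframe (`df_toy` is a name chosen for this example):

```python
import pandas as pd

# Made-up dataframe with the default integer index 0..29.
df_toy = pd.DataFrame({"AGE": range(30), "NOX": range(30)})

# Label-based: rows labelled 10 through 20, end label included.
by_label = df_toy.loc[10:20, ["AGE"]]

# Position-based: positions 10 through 19, end position excluded.
by_position = df_toy.iloc[10:20]

print(len(by_label))     # 11
print(len(by_position))  # 10
```

With the default integer index, labels and positions coincide, so the only visible difference is whether the last row is included.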

Select all rows where NOX is greater than 0.7 and CRIM is greater than 8

# Your code here

#__SOLUTION__
mask = (
    (df_boston['NOX']>.7) &
    (df_boston['CRIM']>8)
)
df_boston[mask]

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
356	8.98296	0.0	18.1	1.0	0.770	6.212	97.4	2.1222	24.0	666.0	20.2	377.73	17.60	17.8
419	11.81230	0.0	18.1	0.0	0.718	6.824	76.5	1.7940	24.0	666.0	20.2	48.45	22.74	8.4
420	11.08740	0.0	18.1	0.0	0.718	6.411	100.0	1.8589	24.0	666.0	20.2	318.75	15.02	16.7
434	13.91340	0.0	18.1	0.0	0.713	6.208	95.0	2.2222	24.0	666.0	20.2	100.63	15.17	11.7
435	11.16040	0.0	18.1	0.0	0.740	6.629	94.6	2.1247	24.0	666.0	20.2	109.85	23.27	13.4
436	14.42080	0.0	18.1	0.0	0.740	6.461	93.3	2.0026	24.0	666.0	20.2	27.49	18.05	9.6
437	15.17720	0.0	18.1	0.0	0.740	6.152	100.0	1.9142	24.0	666.0	20.2	9.32	26.45	8.7
438	13.67810	0.0	18.1	0.0	0.740	5.935	87.9	1.8206	24.0	666.0	20.2	68.95	34.02	8.4
439	9.39063	0.0	18.1	0.0	0.740	5.627	93.9	1.8172	24.0	666.0	20.2	396.90	22.88	12.8
440	22.05110	0.0	18.1	0.0	0.740	5.818	92.4	1.8662	24.0	666.0	20.2	391.45	22.11	10.5
441	9.72418	0.0	18.1	0.0	0.740	6.406	97.2	2.0651	24.0	666.0	20.2	385.96	19.52	17.1
443	9.96654	0.0	18.1	0.0	0.740	6.485	100.0	1.9784	24.0	666.0	20.2	386.73	18.85	15.4
444	12.80230	0.0	18.1	0.0	0.740	5.854	96.6	1.8956	24.0	666.0	20.2	240.52	23.79	10.8
445	10.67180	0.0	18.1	0.0	0.740	6.459	94.8	1.9879	24.0	666.0	20.2	43.06	23.98	11.8
447	9.92485	0.0	18.1	0.0	0.740	6.251	96.6	2.1980	24.0	666.0	20.2	388.52	16.44	12.6
448	9.32909	0.0	18.1	0.0	0.713	6.185	98.7	2.2616	24.0	666.0	20.2	396.90	18.13	14.1
453	8.24809	0.0	18.1	0.0	0.713	7.393	99.3	2.4527	24.0	666.0	20.2	375.87	16.74	17.8
454	9.51363	0.0	18.1	0.0	0.713	6.728	94.1	2.4961	24.0	666.0	20.2	6.68	18.71	14.9
457	8.20058	0.0	18.1	0.0	0.713	5.936	80.3	2.7792	24.0	666.0	20.2	3.50	16.94	13.5
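
The parentheses around each comparison in the mask matter: `&` binds more tightly than `>`, so leaving them out changes how Python groups the expression and typically raises an error. The sketch below repeats the mask pattern on made-up data and compares it with `DataFrame.query`, an alternative way to write the same filter that is not used in the original solution.

```python
import pandas as pd

# Made-up values, for illustration only.
df_toy = pd.DataFrame({
    "NOX": [0.50, 0.72, 0.74, 0.71],
    "CRIM": [9.0, 3.0, 12.0, 8.5],
})

# Boolean mask: each comparison is wrapped in parentheses, then combined with &.
mask = (df_toy["NOX"] > 0.7) & (df_toy["CRIM"] > 8)
via_mask = df_toy[mask]

# The same filter written as a query string.
via_query = df_toy.query("NOX > 0.7 and CRIM > 8")

print(via_mask.equals(via_query))  # True
```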

Data Manipulation

Add a column to the dataframe called "MEDV*TAX" which is the product of MEDV and TAX

# Your code here

#__SOLUTION__
df_boston['MEDV*TAX'] = df_boston['MEDV']*df_boston['TAX']
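
Because the new column name contains `*`, it can only be reached with bracket notation. A short sketch on made-up data (`df_toy` is a name chosen for this example):

```python
import pandas as pd

# Made-up values, for illustration only.
df_toy = pd.DataFrame({"MEDV": [24.0, 21.6], "TAX": [296.0, 242.0]})
df_toy["MEDV*TAX"] = df_toy["MEDV"] * df_toy["TAX"]

# Bracket access works for any column label.
product = df_toy["MEDV*TAX"]

# Attribute access would not: Python reads df_toy.MEDV*TAX as
# (df_toy.MEDV) * TAX, and TAX is not a defined variable here.
print(product.tolist())
```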

What is the average median value of houses located on the Charles River?

# Your code here

#__SOLUTION__

val = (
    df_boston
    [df_boston['CHAS']==1]
    ['MEDV']
    .mean()
)

# MEDV is recorded in $1000s, so convert to dollars
val = val*1000
val

28439.999999999996
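
The trailing digits in the result are ordinary floating-point rounding noise. As an alternative sketch (not the original solution), `groupby` computes the mean for both values of CHAS at once, and rounding tidies the result for display. The data below is made up, and `mean_by_chas` and `river_value` are names chosen for this example.

```python
import pandas as pd

# Made-up values, for illustration only.
df_toy = pd.DataFrame({
    "CHAS": [0.0, 1.0, 0.0, 1.0],
    "MEDV": [20.0, 30.0, 22.0, 26.0],
})

# Mean MEDV for each value of CHAS (0 = not on the river, 1 = on the river).
mean_by_chas = df_toy.groupby("CHAS")["MEDV"].mean()

# MEDV is in thousands of dollars; convert to dollars and round for display.
river_value = round(mean_by_chas.loc[1.0] * 1000, 2)
print(f"${river_value:,.2f}")  # $28,000.00
```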

Write a sentence that answers the above question

# Your written answer here

#__SOLUTION__


'''The average median value of houses located along the Charles River is about $28,440.'''
