
011121-pandas-practice-one's Introduction

Pandas 101

This checkpoint covers many of the basic tasks you might need to do with Pandas. At the end of an hour, commit and push what you have. Remember, you can always return to this notebook later for practice.

# Run this import cell without changes

# data manipulation
import pandas as pd

# dataset
# Note: load_boston was removed in scikit-learn 1.2; this import needs an older release
from sklearn.datasets import load_boston

Loading in the Boston Housing Dataset

# Run this cell without changes
boston = load_boston()

The variable boston is now a dictionary-like object (a scikit-learn Bunch) whose key-value pairs hold different parts of the Boston Housing dataset.

What are the keys of boston?

# Your code here
#__SOLUTION__
boston.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

Use print to display the dataset's metadata, which is stored under the key DESCR.

# Your code here
#__SOLUTION__
print(boston['DESCR'])
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Create a dataframe named "df_boston" from the data stored under the key data. Use the values stored under the key feature_names as the column names of df_boston.

# Your code here
#__SOLUTION__
    
df_boston = pd.DataFrame(boston['data'], columns=boston['feature_names'])

The key target contains the median home value for each row, in thousands of dollars.

Add a column named "MEDV" to your dataframe that contains these median values.

# Your code here
#__SOLUTION__

df_boston['MEDV'] = boston['target']
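
Because load_boston was removed in scikit-learn 1.2, the two steps above can also be practiced on any dictionary-like object that has the same keys. The following is a minimal sketch, assuming a hypothetical stand-in named `toy` with made-up values (not the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataset: same keys, made-up values
toy = {
    'data': np.array([[0.1, 6.5], [0.2, 7.1], [0.3, 5.9]]),
    'feature_names': ['CRIM', 'RM'],
    'target': np.array([24.0, 34.7, 18.2]),
}

# Build the dataframe from the 2-D array and label its columns
df_toy = pd.DataFrame(toy['data'], columns=toy['feature_names'])

# Attach the target as a new column; its length must match the number of rows
df_toy['MEDV'] = toy['target']

print(df_toy.shape)
print(list(df_toy.columns))
```

The pattern is the same as in the solution: a 2-D array becomes the body of the dataframe, and a 1-D array of equal length becomes one more column.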

Data Exploration

Show the first 5 rows of the dataframe with the head method

# Your code here
#__SOLUTION__
df_boston.head()
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33  36.2

Show the summary statistics of all columns with the describe method

# Your code here
#__SOLUTION__
df_boston.describe()
             CRIM          ZN       INDUS        CHAS         NOX          RM         AGE         DIS         RAD         TAX     PTRATIO           B       LSTAT        MEDV
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
mean     3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063   22.532806
std      8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062    9.197104
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000    5.000000
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000   17.025000
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000   21.200000
75%      3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000   25.000000
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000   50.000000

Check the datatypes of all columns, and see how many non-null values each column has, using the info method

# Your code here
#__SOLUTION__

df_boston.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB
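
The info method reports non-null counts, so the number of nulls has to be read off by comparing against the row count. As a rough sketch of a more direct check, the following counts nulls per column with isna().sum() on a hypothetical toy frame that contains one missing value (the Boston data has none):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value
df_toy = pd.DataFrame({'RM': [6.5, np.nan, 5.9], 'CHAS': [0.0, 1.0, 0.0]})

null_counts = df_toy.isna().sum()   # number of nulls in each column
print(null_counts)
print(df_toy.dtypes)                # datatype of each column
```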

Data Selection

Select all values from the column that contains the weighted distances to five Boston employment centres

Hint: the dataset description you printed in a cell above explains what each variable contains

# Your code here
#__SOLUTION__
df_boston['DIS']
0      4.0900
1      4.9671
2      4.9671
3      6.0622
4      6.0622
        ...  
501    2.4786
502    2.2875
503    2.1675
504    2.3889
505    2.5050
Name: DIS, Length: 506, dtype: float64
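
The output above is a pandas Series because a single column name was passed in the brackets. A small sketch of the difference between single and double brackets, using a hypothetical two-row frame with made-up values:

```python
import pandas as pd

# Hypothetical toy frame
df_toy = pd.DataFrame({'DIS': [4.09, 4.9671], 'AGE': [65.2, 78.9]})

as_series = df_toy['DIS']      # single name -> pandas Series
as_frame = df_toy[['DIS']]     # list of names -> one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```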

Select rows 10 through 20 (inclusive) from the AGE, NOX, and MEDV columns

# Your code here
#__SOLUTION__
df_boston.loc[10:20, ['AGE', 'NOX', 'MEDV']]
     AGE    NOX  MEDV
10  94.3  0.524  15.0
11  82.9  0.524  18.9
12  39.0  0.524  21.7
13  61.8  0.538  20.4
14  84.5  0.538  18.2
15  56.5  0.538  19.9
16  29.3  0.538  23.1
17  81.7  0.538  17.5
18  36.6  0.538  20.2
19  69.5  0.538  18.2
20  98.1  0.538  13.6
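
Note that the result has 11 rows: loc slices by label and includes the end label, unlike ordinary Python slicing. A short sketch of the difference between loc and iloc, assuming a hypothetical 30-row frame with a default integer index:

```python
import pandas as pd

# Hypothetical toy frame with a default RangeIndex 0..29
df_toy = pd.DataFrame({'AGE': range(30), 'NOX': range(30)})

by_label = df_toy.loc[10:20, ['AGE']]    # label slice: includes row 20
by_position = df_toy.iloc[10:20]         # positional slice: stops before position 20

print(len(by_label))
print(len(by_position))
```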

Select all rows where NOX is greater than 0.7 and CRIM is greater than 8

# Your code here
#__SOLUTION__
mask = (
    (df_boston['NOX'] > 0.7) &
    (df_boston['CRIM'] > 8)
)
df_boston[mask]
         CRIM   ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  PTRATIO       B  LSTAT  MEDV
356   8.98296  0.0   18.1   1.0  0.770  6.212   97.4  2.1222  24.0  666.0     20.2  377.73  17.60  17.8
419  11.81230  0.0   18.1   0.0  0.718  6.824   76.5  1.7940  24.0  666.0     20.2   48.45  22.74   8.4
420  11.08740  0.0   18.1   0.0  0.718  6.411  100.0  1.8589  24.0  666.0     20.2  318.75  15.02  16.7
434  13.91340  0.0   18.1   0.0  0.713  6.208   95.0  2.2222  24.0  666.0     20.2  100.63  15.17  11.7
435  11.16040  0.0   18.1   0.0  0.740  6.629   94.6  2.1247  24.0  666.0     20.2  109.85  23.27  13.4
436  14.42080  0.0   18.1   0.0  0.740  6.461   93.3  2.0026  24.0  666.0     20.2   27.49  18.05   9.6
437  15.17720  0.0   18.1   0.0  0.740  6.152  100.0  1.9142  24.0  666.0     20.2    9.32  26.45   8.7
438  13.67810  0.0   18.1   0.0  0.740  5.935   87.9  1.8206  24.0  666.0     20.2   68.95  34.02   8.4
439   9.39063  0.0   18.1   0.0  0.740  5.627   93.9  1.8172  24.0  666.0     20.2  396.90  22.88  12.8
440  22.05110  0.0   18.1   0.0  0.740  5.818   92.4  1.8662  24.0  666.0     20.2  391.45  22.11  10.5
441   9.72418  0.0   18.1   0.0  0.740  6.406   97.2  2.0651  24.0  666.0     20.2  385.96  19.52  17.1
443   9.96654  0.0   18.1   0.0  0.740  6.485  100.0  1.9784  24.0  666.0     20.2  386.73  18.85  15.4
444  12.80230  0.0   18.1   0.0  0.740  5.854   96.6  1.8956  24.0  666.0     20.2  240.52  23.79  10.8
445  10.67180  0.0   18.1   0.0  0.740  6.459   94.8  1.9879  24.0  666.0     20.2   43.06  23.98  11.8
447   9.92485  0.0   18.1   0.0  0.740  6.251   96.6  2.1980  24.0  666.0     20.2  388.52  16.44  12.6
448   9.32909  0.0   18.1   0.0  0.713  6.185   98.7  2.2616  24.0  666.0     20.2  396.90  18.13  14.1
453   8.24809  0.0   18.1   0.0  0.713  7.393   99.3  2.4527  24.0  666.0     20.2  375.87  16.74  17.8
454   9.51363  0.0   18.1   0.0  0.713  6.728   94.1  2.4961  24.0  666.0     20.2    6.68  18.71  14.9
457   8.20058  0.0   18.1   0.0  0.713  5.936   80.3  2.7792  24.0  666.0     20.2    3.50  16.94  13.5
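
The parentheses around each comparison in the mask matter, because `&` binds more tightly than `>` in Python. The sketch below repeats the filter on a hypothetical four-row frame with made-up values and shows one possible alternative, the query method, which expresses the same condition as a string:

```python
import pandas as pd

# Hypothetical toy frame with made-up values
df_toy = pd.DataFrame({
    'NOX': [0.74, 0.72, 0.50, 0.80],
    'CRIM': [9.5, 2.0, 12.0, 8.1],
})

# Each comparison is wrapped in parentheses before combining with &
mask = (df_toy['NOX'] > 0.7) & (df_toy['CRIM'] > 8)
with_mask = df_toy[mask]

# The same filter written as a query string
with_query = df_toy.query('NOX > 0.7 and CRIM > 8')

print(with_mask.index.tolist())
print(with_mask.equals(with_query))
```

Both forms keep only the rows where every condition holds; which one reads better is largely a matter of taste.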

Data Manipulation

Add a column to the dataframe called "MEDV*TAX" which is the product of MEDV and TAX

# Your code here
#__SOLUTION__
df_boston['MEDV*TAX'] = df_boston['MEDV'] * df_boston['TAX']

What is the average median value of houses located on the Charles River?

# Your code here
#__SOLUTION__

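# Average MEDV over the tracts that bound the Charles River (CHAS == 1)
val = (
    df_boston
    [df_boston['CHAS']==1]
    ['MEDV']
    .mean()
)

# MEDV is recorded in thousands of dollars, so convert to dollars
val = val*1000
val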
28439.999999999996
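
The trailing digits in the output above are floating-point residue, not extra precision. As a hedged alternative, the same kind of answer can be computed with groupby, which also returns the mean for tracts that are not on the river, and then rounded. The sketch uses a hypothetical four-row frame with made-up values:

```python
import pandas as pd

# Hypothetical toy frame with made-up values
df_toy = pd.DataFrame({
    'CHAS': [1.0, 0.0, 1.0, 0.0],
    'MEDV': [30.0, 20.0, 26.0, 22.0],   # in thousands of dollars
})

# Mean MEDV for each value of CHAS (0 = not on the river, 1 = on the river)
means = df_toy.groupby('CHAS')['MEDV'].mean()

# Convert from thousands of dollars to dollars and round to cents
on_river = round(means.loc[1.0] * 1000, 2)
print(on_river)
```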

Write a sentence that answers the above question

# Your written answer here
#__SOLUTION__


'''The average median value of houses located along the Charles River is about $28,440.'''
