Questions

all of the questions

Objectives

YWBAT (you will be able to)

Convert functions from 'def' format to 'lambda' format (n/a)
Learn pandas basics and how to read method chaining
Define the term API
Plot important aspects of a pandas dataframe using the pandas API
Create a pivot table in pandas (this will be done on learn.co)

What is pandas? Why do we use it?

In data science you work with many data structures and file formats. Examples:

dictionary
list
array
csv file
tuple
excel file
spreadsheets
html files
json files

To interact with these files, we could use plain string manipulation, which gets tedious quickly.

Or we can use pandas! Pandas can read almost all of these files directly. Compare the two approaches below.

file = open("demo.csv").read()
file

'column1, column2, column3\n0, 1, 2\n1, 2, 3\n3, 4, 5\n6, 7, 8\n9, 10, 11\n'

file_elements = file.replace("\n", ",").split(",")
file_elements

['column1',
 ' column2',
 ' column3',
 '0',
 ' 1',
 ' 2',
 '1',
 ' 2',
 ' 3',
 '3',
 ' 4',
 ' 5',
 '6',
 ' 7',
 ' 8',
 '9',
 ' 10',
 ' 11',
 '']

for index, value in enumerate(file_elements):
    if index%3==0:
        print(value)

column1
0
1
3
6
9

demo_df = pd.read_csv("demo.csv")
demo_df.head(3)

	column1	column2	column3
0	0	1	2
1	1	2	3
2	3	4	5
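
As a small sketch of why the pandas route is less fragile, the snippet below parses the same text from an in-memory string instead of a file. The `csv_text` variable and the use of `io.StringIO` are assumptions made so the example runs without `demo.csv`. Passing `skipinitialspace=True` strips the blank that follows each comma, so the column names come out as `column2` rather than ` column2`.

```python
import io
import pandas as pd

# Same contents as the demo.csv text shown above, held in a string
# (assumption: used here so no file on disk is needed).
csv_text = "column1, column2, column3\n0, 1, 2\n1, 2, 3\n3, 4, 5\n6, 7, 8\n9, 10, 11\n"

# skipinitialspace=True drops the space after each delimiter.
demo_df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(list(demo_df.columns))
print(demo_df["column1"].tolist())
```

Selecting a column by name replaces the `index % 3` bookkeeping from the string version.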

Activator

Send me the following in a private Zoom chat. Please indicate whether you're doing level 1 or level 2.

Convert this function to a lambda function

Level 1

def f1(x, y, z):
    s = x + y
    return z*s

Level 2

def f1(x, y, z):
    s = x + y
    z = 0.01 if z == 0 else z
    return z*s

Solution

Level 1

f1 = lambda x, y, z: z*(x + y)

Level 2

f1 = lambda x, y, z: z*(x + y) if z != 0 else 0.01*(x + y)

f1 = lambda x, y, z: (0.01 if z == 0 else z) * (x+y)
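
One way to check a conversion like this is to run both versions on the same inputs. The sketch below does that for the Level 2 function; the names `f1_def`, `f1_lambda`, and `cases` are made up for the comparison.

```python
def f1_def(x, y, z):
    s = x + y
    z = 0.01 if z == 0 else z
    return z * s

f1_lambda = lambda x, y, z: (0.01 if z == 0 else z) * (x + y)

# A few inputs, including z == 0, which triggers the 0.01 fallback.
cases = [(1, 2, 3), (1, 2, 0), (-4, 4, 5), (2.5, 0.5, 0)]
for args in cases:
    assert f1_def(*args) == f1_lambda(*args)

print("all cases match")
```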

import numpy as np
import pandas as pd

from collections import defaultdict
# note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston

import matplotlib.pyplot as plt
import seaborn as sns

boston = load_boston()

data = boston["data"] # call using dictionary notation
target = boston.target # y values
columns = list(boston.feature_names)

# the dot accesses attributes and methods of an object
data.shape, target.shape

# what does (506,) mean? NumPy stores target as a 1-D array (a vector) with 506 elements
# it is not a 506 x 1 matrix: there is no second axis, so the shape has a single entry

((506, 13), (506,))
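
To make the `(506,)` versus `(506, 1)` distinction concrete, here is a minimal sketch using a zero-filled array as a stand-in for `target` (the names `vector` and `column` are illustrative).

```python
import numpy as np

vector = np.zeros(506)           # one axis: shape (506,)
column = vector.reshape(-1, 1)   # two axes: shape (506, 1)

print(vector.shape, vector.ndim)
print(column.shape, column.ndim)
```

The data are the same in both; only the number of axes differs. Some libraries expect the two-axis form for features, which is when `reshape(-1, 1)` comes in handy.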

df = pd.DataFrame(data, columns=columns)
df.head()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33

# how do i create a column called target with those nice y values?
df["target"] = target
df.head()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2

How can we get to know our data?

df.describe()

df.info()

df.info()

# what is this telling us?
# number of non-null entries per column
# data types, in this case float64
# memory usage: 55.5 KB
# object type -> DataFrame, index type -> RangeIndex

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
target     506 non-null float64
dtypes: float64(14)
memory usage: 55.5 KB

df.describe()

# What is this telling us?
# .describe() gives summary statistics for each numeric column
# the 'shape' of the data: count, mean, and standard deviation
# plus the five-number summary: min, 25%, 50% (median), 75%, max

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target
count	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000
mean	3.613524	11.363636	11.136779	0.069170	0.554695	6.284634	68.574901	3.795043	9.549407	408.237154	18.455534	356.674032	12.653063	22.532806
std	8.601545	23.322453	6.860353	0.253994	0.115878	0.702617	28.148861	2.105710	8.707259	168.537116	2.164946	91.294864	7.141062	9.197104
min	0.006320	0.000000	0.460000	0.000000	0.385000	3.561000	2.900000	1.129600	1.000000	187.000000	12.600000	0.320000	1.730000	5.000000
25%	0.082045	0.000000	5.190000	0.000000	0.449000	5.885500	45.025000	2.100175	4.000000	279.000000	17.400000	375.377500	6.950000	17.025000
50%	0.256510	0.000000	9.690000	0.000000	0.538000	6.208500	77.500000	3.207450	5.000000	330.000000	19.050000	391.440000	11.360000	21.200000
75%	3.677083	12.500000	18.100000	0.000000	0.624000	6.623500	94.075000	5.188425	24.000000	666.000000	20.200000	396.225000	16.955000	25.000000
max	88.976200	100.000000	27.740000	1.000000	0.871000	8.780000	100.000000	12.126500	24.000000	711.000000	22.000000	396.900000	37.970000	50.000000
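
Each number in the `.describe()` table can also be computed on its own, which helps when you only need one statistic for one column. A minimal sketch on made-up data (the `toy` frame and its `rooms` column are assumptions, not part of the Boston data):

```python
import pandas as pd

toy = pd.DataFrame({"rooms": [4.0, 5.0, 6.0, 7.0, 8.0]})  # made-up values

summary = toy["rooms"].describe()
print(summary)

# The same numbers, one method call at a time.
print(toy["rooms"].mean())          # matches summary["mean"]
print(toy["rooms"].quantile(0.25))  # matches summary["25%"]
print(toy["rooms"].median())        # matches summary["50%"]
```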

df.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'target'],
      dtype='object')

for col in df.columns:
    plt.figure(figsize=(5, 3))
    plt.hist(df[col], bins=20)
    plt.title(col)
    plt.show()

# let's make some categorical data by creating a new column

# let's make a new column by making a list of the same shape

room_categories = []

for rm in df.RM:
    if rm < 6:
        room_categories.append('small')
    elif rm < 8:
        room_categories.append('medium')
    else:
        room_categories.append('large')

        
df['room_categories'] = room_categories

df.head()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target	room_categories
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0	medium
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6	medium
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7	medium
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4	medium
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2	medium
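
The loop above works, but pandas also has `pd.cut` for binning a numeric column. The sketch below is one possible vectorized alternative, not the lesson's method; it uses a handful of made-up RM-like values and checks that both approaches agree.

```python
import numpy as np
import pandas as pd

rooms = pd.Series([5.9, 6.0, 6.575, 7.99, 8.0, 8.78])  # made-up RM-like values

# Loop version, same rules as above.
looped = []
for rm in rooms:
    if rm < 6:
        looped.append("small")
    elif rm < 8:
        looped.append("medium")
    else:
        looped.append("large")

# Vectorized version: right=False makes each bin include its left edge,
# so 6.0 lands in "medium" and 8.0 in "large", matching the loop.
binned = pd.cut(
    rooms,
    bins=[-np.inf, 6, 8, np.inf],
    labels=["small", "medium", "large"],
    right=False,
)

print(binned.tolist())
```

Note that `pd.cut` returns a categorical column rather than plain strings, which also records the order small < medium < large.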

### let's count each room type

plt.bar(df["room_categories"].value_counts().index, df["room_categories"].value_counts().values)
plt.title("Bar Chart for Room Categories")
plt.xticks(rotation=75)
plt.show()

create a new column called 'room_age' that is the sum of the room and the age columns

df['room_age'] = df['RM'] + df['AGE']

df.head()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target	room_categories	room_age
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0	medium	71.775
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6	medium	85.321
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7	medium	68.285
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4	medium	52.798
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2	medium	61.347

let's find the statistics on the INDUS column

df[['INDUS']].describe()

	INDUS
count	506.000000
mean	11.136779
std	6.860353
min	0.460000
25%	5.190000
50%	9.690000
75%	18.100000
max	27.740000

let's slice some data

df["AGE"]>50

0       True
1       True
2       True
3      False
4       True
       ...  
501     True
502     True
503     True
504     True
505     True
Name: AGE, Length: 506, dtype: bool

# how do I get only the rows with AGE greater than 50?
# there are three ways to do it; I'm going to show you two and say which is the best to learn

# first way: boolean indexing with [] (it works, but it is not the best)
ages_50_plus_df = df[df["AGE"]>50]

ages_50_plus_df.head()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target	room_categories	room_age
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0	medium	71.775
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6	medium	85.321
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7	medium	68.285
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2	medium	61.347
5	0.02985	0.0	2.18	0.458	6.430	58.7	6.0622	3.0	222.0	18.7	394.12	5.21	28.7	medium	65.130

# second way: .loc, which is better and scales to selecting rows and columns together
ages_50_plus_df = df.loc[df["AGE"]>50]
ages_50_plus_df

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target	room_categories	room_age
0	0.00632	18.0	2.31	0.0	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0	medium	71.775
1	0.02731	0.0	7.07	0.0	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6	medium	85.321
2	0.02729	0.0	7.07	0.0	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7	medium	68.285
4	0.06905	0.0	2.18	0.0	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2	medium	61.347
5	0.02985	0.0	2.18	0.0	0.458	6.430	58.7	6.0622	3.0	222.0	18.7	394.12	5.21	28.7	medium	65.130
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
501	0.06263	0.0	11.93	0.0	0.573	6.593	69.1	2.4786	1.0	273.0	21.0	391.99	9.67	22.4	medium	75.693
502	0.04527	0.0	11.93	0.0	0.573	6.120	76.7	2.2875	1.0	273.0	21.0	396.90	9.08	20.6	medium	82.820
503	0.06076	0.0	11.93	0.0	0.573	6.976	91.0	2.1675	1.0	273.0	21.0	396.90	5.64	23.9	medium	97.976
504	0.10959	0.0	11.93	0.0	0.573	6.794	89.3	2.3889	1.0	273.0	21.0	393.45	6.48	22.0	medium	96.094
505	0.04741	0.0	11.93	0.0	0.573	6.030	80.8	2.5050	1.0	273.0	21.0	396.90	7.88	11.9	medium	86.830

359 rows × 16 columns

# now let's make a dataframe with ages greater than 50 and rooms greater than 6
df.loc[(df["AGE"]>50) & (df["RM"] > 6) ]

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target	room_categories	room_age
0	0.00632	18.0	2.31	0.0	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0	medium	71.775
1	0.02731	0.0	7.07	0.0	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6	medium	85.321
2	0.02729	0.0	7.07	0.0	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7	medium	68.285
4	0.06905	0.0	2.18	0.0	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2	medium	61.347
5	0.02985	0.0	2.18	0.0	0.458	6.430	58.7	6.0622	3.0	222.0	18.7	394.12	5.21	28.7	medium	65.130
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
501	0.06263	0.0	11.93	0.0	0.573	6.593	69.1	2.4786	1.0	273.0	21.0	391.99	9.67	22.4	medium	75.693
502	0.04527	0.0	11.93	0.0	0.573	6.120	76.7	2.2875	1.0	273.0	21.0	396.90	9.08	20.6	medium	82.820
503	0.06076	0.0	11.93	0.0	0.573	6.976	91.0	2.1675	1.0	273.0	21.0	396.90	5.64	23.9	medium	97.976
504	0.10959	0.0	11.93	0.0	0.573	6.794	89.3	2.3889	1.0	273.0	21.0	393.45	6.48	22.0	medium	96.094
505	0.04741	0.0	11.93	0.0	0.573	6.030	80.8	2.5050	1.0	273.0	21.0	396.90	7.88	11.9	medium	86.830

220 rows × 16 columns

# now let's make a dataframe with ages greater than 50 or rooms greater than 6
df_new = df.loc[(df["AGE"]>50) | (df["RM"] > 6) ]
df_new.head()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target	room_categories	room_age
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0	medium	71.775
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6	medium	85.321
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7	medium	68.285
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4	medium	52.798
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2	medium	61.347

df_new[["INDUS", "CHAS", "RAD"]]

	INDUS	CHAS	RAD
0	2.31	0.0	1.0
1	7.07	0.0	2.0
2	7.07	0.0	2.0
3	2.18	0.0	3.0
4	2.18	0.0	3.0
...	...	...	...
501	11.93	0.0	1.0
502	11.93	0.0	1.0
503	11.93	0.0	1.0
504	11.93	0.0	1.0
505	11.93	0.0	1.0

472 rows × 3 columns
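
The conditions inside `.loc` are boolean Series (masks), and `&`, `|`, and `~` combine them row by row. The sketch below shows this on four made-up rows; `toy`, `older`, and `bigger` are illustrative names. When the comparisons are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than `>` in Python.

```python
import pandas as pd

toy = pd.DataFrame({
    "AGE": [45.8, 65.2, 78.9, 30.0],
    "RM": [6.998, 6.575, 5.5, 5.0],
})  # made-up rows

older = toy["AGE"] > 50    # boolean Series, one value per row
bigger = toy["RM"] > 6

both = toy.loc[older & bigger]         # rows where both conditions hold
either = toy.loc[older | bigger]       # rows where at least one holds
neither = toy.loc[~(older | bigger)]   # ~ negates a mask

print(len(both), len(either), len(neither))
```

Naming the masks first, as done here, is optional but can make longer filters easier to read.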

# now let's make a dataframe with ages greater than 50 or rooms greater than 6 and let's only grab 
# the CRIM and LSTAT columns

df.loc[(df["AGE"]>50) | (df["RM"] > 6)][["CRIM", "LSTAT"]] # THIS ISN'T PREFERRED

	CRIM	LSTAT
0	0.00632	4.98
1	0.02731	9.14
2	0.02729	4.03
3	0.03237	2.94
4	0.06905	5.33
...	...	...
501	0.06263	9.67
502	0.04527	9.08
503	0.06076	5.64
504	0.10959	6.48
505	0.04741	7.88

472 rows × 2 columns

df.loc[(df["AGE"]>50) | (df["RM"] > 6), ['CRIM', 'LSTAT']]  # THIS IS THE PREFERRED WAY, USING LOC

	CRIM	LSTAT
0	0.00632	4.98
1	0.02731	9.14
2	0.02729	4.03
3	0.03237	2.94
4	0.06905	5.33
...	...	...
501	0.06263	9.67
502	0.04527	9.08
503	0.06076	5.64
504	0.10959	6.48
505	0.04741	7.88

472 rows × 2 columns
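
Both versions return the same rows here, so why prefer the single `.loc` call? One reason is that the same `[rows, columns]` form also works for assignment, while chaining two `[]` selections may end up writing to a temporary copy. A minimal sketch on made-up rows (the `toy` frame and the `is_old` column are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "AGE": [45.8, 65.2, 78.9],
    "CRIM": [0.03, 0.006, 0.027],
})  # made-up rows

# Read: rows and columns picked in a single .loc call.
subset = toy.loc[toy["AGE"] > 50, ["CRIM"]]

# Write: the same single call can assign into the original frame.
toy["is_old"] = False
toy.loc[toy["AGE"] > 50, "is_old"] = True

print(subset)
print(toy)
```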

# plotting a column of strings directly fails: pandas needs numeric data to plot
df["room_categories"].plot(kind='bar')

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-26-7d78820f768d> in <module>
----> 1 df["room_categories"].plot(kind='bar')


~/anaconda3/envs/flatiron-env/lib/python3.6/site-packages/pandas/plotting/_core.py in __call__(self, *args, **kwargs)
    792                     data.columns = label_name
    793 
--> 794         return plot_backend.plot(data, kind=kind, **kwargs)
    795 
    796     def line(self, x=None, y=None, **kwargs):


~/anaconda3/envs/flatiron-env/lib/python3.6/site-packages/pandas/plotting/_matplotlib/__init__.py in plot(data, kind, **kwargs)
     60             kwargs["ax"] = getattr(ax, "left_ax", ax)
     61     plot_obj = PLOT_CLASSES[kind](data, **kwargs)
---> 62     plot_obj.generate()
     63     plot_obj.draw()
     64     return plot_obj.result


~/anaconda3/envs/flatiron-env/lib/python3.6/site-packages/pandas/plotting/_matplotlib/core.py in generate(self)
    277     def generate(self):
    278         self._args_adjust()
--> 279         self._compute_plot_data()
    280         self._setup_subplots()
    281         self._make_plot()


~/anaconda3/envs/flatiron-env/lib/python3.6/site-packages/pandas/plotting/_matplotlib/core.py in _compute_plot_data(self)
    412         # no non-numeric frames or series allowed
    413         if is_empty:
--> 414             raise TypeError("no numeric data to plot")
    415 
    416         # GH25587: cast ExtensionArray of pandas (IntegerArray, etc.) to


TypeError: no numeric data to plot
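
The traceback says the column holds no numeric data: `room_categories` contains strings, and the pandas plotting API needs numbers. Counting the labels first gives it something to draw. The sketch below uses a short made-up Series in place of `df["room_categories"]`; the `Agg` backend line is an assumption so it runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import pandas as pd

categories = pd.Series(["medium", "small", "medium", "large", "medium"])  # made-up labels

counts = categories.value_counts()   # numeric counts indexed by label
ax = counts.plot(kind="bar", title="Bar Chart for Room Categories")

print(counts)
```

This is the pandas API version of the earlier `plt.bar` call on `value_counts().index` and `value_counts().values`.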

# pandas slicing
# get dataframe with rows where target < 30

# method 1
# df[df["target"] < 30]


# method 2
# df.loc[df["target"] < 30]

df[["AGE", "ZN"]]

	AGE	ZN
0	65.2	18.0
1	78.9	0.0
2	61.1	0.0
3	45.8	0.0
4	54.2	0.0
...	...	...
501	69.1	0.0
502	76.7	0.0
503	91.0	0.0
504	89.3	0.0
505	80.8	0.0

506 rows × 2 columns

# pandas slicing
# get dataframe with rows where target < 30 but only grab the AGE and ZN columns

# method 1
# df[df["target"] < 30][["AGE", "ZN"]]


# method 2
df.loc[df["target"] < 30, ["AGE", "ZN"]]

	AGE	ZN
0	65.2	18.0
1	78.9	0.0
5	58.7	0.0
6	66.6	12.5
7	96.1	12.5
...	...	...
501	69.1	0.0
502	76.7	0.0
503	91.0	0.0
504	89.3	0.0
505	80.8	0.0

422 rows × 2 columns

# pandas slicing on multiple conditions
# target < 30 and age > 80

# method 1
# df[(df["target"]<30) & (df["AGE"] > 80)]

# method 2
df.loc[(df["target"]<30) & (df["AGE"]>80)]

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target
7	0.14455	12.5	7.87	0.0	0.524	6.172	96.1	5.9505	5.0	311.0	15.2	396.90	19.15	27.1
8	0.21124	12.5	7.87	0.0	0.524	5.631	100.0	6.0821	5.0	311.0	15.2	386.63	29.93	16.5
9	0.17004	12.5	7.87	0.0	0.524	6.004	85.9	6.5921	5.0	311.0	15.2	386.71	17.10	18.9
10	0.22489	12.5	7.87	0.0	0.524	6.377	94.3	6.3467	5.0	311.0	15.2	392.52	20.45	15.0
11	0.11747	12.5	7.87	0.0	0.524	6.009	82.9	6.2267	5.0	311.0	15.2	396.90	13.27	18.9
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
491	0.10574	0.0	27.74	0.0	0.609	5.983	98.8	1.8681	4.0	711.0	20.1	390.11	18.07	13.6
492	0.11132	0.0	27.74	0.0	0.609	5.983	83.5	2.1099	4.0	711.0	20.1	396.90	13.35	20.1
503	0.06076	0.0	11.93	0.0	0.573	6.976	91.0	2.1675	1.0	273.0	21.0	396.90	5.64	23.9
504	0.10959	0.0	11.93	0.0	0.573	6.794	89.3	2.3889	1.0	273.0	21.0	393.45	6.48	22.0
505	0.04741	0.0	11.93	0.0	0.573	6.030	80.8	2.5050	1.0	273.0	21.0	396.90	7.88	11.9

215 rows × 14 columns

# pandas slicing on mult. conditions for specific columns
# target > 30 and age > 75 but only grab the target and age columns


# method 1
# df[(df["target"] > 30) & (df["AGE"] > 75)][["target", "AGE"]]

# method 2
# df.loc[(df["target"] > 30) & (df["AGE"] > 75), ["target", "AGE"]]

# method 3
# df[["AGE", "target"]][(df["target"]>30) & (df["AGE"]>75)]

# pandas slicing on mult. conditions for specific columns
# target > 30 and age > 75 but only grab the CRIM column


df.loc[(df["target"]>30) & (df["AGE"]>75), ["CRIM"]]

	CRIM
97	0.12083
157	1.22358
161	1.46336
162	1.83377
163	1.51902
166	2.01019
180	0.06588
182	0.09103
183	0.10008
223	0.61470
224	0.31533
225	0.52693
226	0.38214
227	0.41238
231	0.46296
257	0.61154
258	0.66351
259	0.65665
260	0.54011
261	0.53412
262	0.52014
263	0.82526
264	0.55007
266	0.78570
368	4.89822
369	5.66998
370	6.53876
371	9.23230
372	8.26725

# Plot a scatter matrix of your dataframe

pd.plotting.scatter_matrix(df, figsize=(20, 20), grid=True, hist_kwds={"bins": 20, "color":"purple"})
plt.show()

# Make a function that plots a specific column of the dataframe as a histogram
# Make the color of it purple by default, alpha value should be 0.8 by default
# Make a parameter to toggle a grid
# Make parameters for axis labels and the title of the histogram
# Make a parameter for number of bins and default it to 20
# call the function `plot_histogram`

def plot_histogram(df, column, bins=20, color="purple", alpha=0.8, grid=True, title=None, xlabel=None, ylabel="counts"):
    plt.figure(figsize=(8, 5))
    if grid:
        plt.grid(zorder=0)
    # alpha sets the transparency of the bars (0.8 by default, as the task asks)
    plt.hist(x=df[column], bins=bins, color=color, alpha=alpha, zorder=2)
    if title is None:
        title = f"Histogram for {column}"
    plt.title(title)
    if xlabel is None:
        xlabel = column
    plt.xlabel(xlabel.lower())
    plt.ylabel(ylabel)
    plt.show()

plot_histogram(df, "AGE")

# Plot a hexbin plot of AGE vs CRIM colored by target values

df.plot.hexbin(x='AGE', y='CRIM', C='target', gridsize=20, figsize=(8, 5))
plt.show()

# with pandas 2.0 or later, use df.corr(numeric_only=True) because room_categories holds strings
df.corr()

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target	room_age
CRIM	1.000000	-0.200469	0.406583	-0.055892	0.420972	-0.219247	0.352734	-0.379670	0.625505	0.582764	0.289946	-0.385064	0.455621	-0.388305	0.349253
ZN	-0.200469	1.000000	-0.533828	-0.042697	-0.516604	0.311991	-0.569537	0.664408	-0.311948	-0.314563	-0.391679	0.175520	-0.412995	0.360445	-0.564971
INDUS	0.406583	-0.533828	1.000000	0.062938	0.763651	-0.391676	0.644779	-0.708027	0.595129	0.720760	0.383248	-0.356977	0.603800	-0.483725	0.638643
CHAS	-0.055892	-0.042697	0.062938	1.000000	0.091203	0.091251	0.086518	-0.099176	-0.007368	-0.035587	-0.121515	0.048788	-0.053929	0.175260	0.089305
NOX	0.420972	-0.516604	0.763651	0.091203	1.000000	-0.302188	0.731470	-0.769230	0.611441	0.668023	0.188933	-0.380051	0.590879	-0.427321	0.728079
RM	-0.219247	0.311991	-0.391676	0.091251	-0.302188	1.000000	-0.240265	0.205246	-0.209847	-0.292048	-0.355501	0.128069	-0.613808	0.695360	-0.216539
AGE	0.352734	-0.569537	0.644779	0.086518	0.731470	-0.240265	1.000000	-0.747881	0.456022	0.506456	0.261515	-0.273534	0.602339	-0.376955	0.999703
DIS	-0.379670	0.664408	-0.708027	-0.099176	-0.769230	0.205246	-0.747881	1.000000	-0.494588	-0.534432	-0.232471	0.291512	-0.496996	0.249929	-0.747017
RAD	0.625505	-0.311948	0.595129	-0.007368	0.611441	-0.209847	0.456022	-0.494588	1.000000	0.910228	0.464741	-0.444413	0.488676	-0.381626	0.453370
TAX	0.582764	-0.314563	0.720760	-0.035587	0.668023	-0.292048	0.506456	-0.534432	0.910228	1.000000	0.460853	-0.441808	0.543993	-0.468536	0.502028
PTRATIO	0.289946	-0.391679	0.383248	-0.121515	0.188933	-0.355501	0.261515	-0.232471	0.464741	0.460853	1.000000	-0.177383	0.374044	-0.507787	0.254090
B	-0.385064	0.175520	-0.356977	0.048788	-0.380051	0.128069	-0.273534	0.291512	-0.444413	-0.441808	-0.177383	1.000000	-0.366087	0.333461	-0.271888
LSTAT	0.455621	-0.412995	0.603800	-0.053929	0.590879	-0.613808	0.602339	-0.496996	0.488676	0.543993	0.374044	-0.366087	1.000000	-0.737663	0.590384
target	-0.388305	0.360445	-0.483725	0.175260	-0.427321	0.695360	-0.376955	0.249929	-0.381626	-0.468536	-0.507787	0.333461	-0.737663	1.000000	-0.361660
room_age	0.349253	-0.564971	0.638643	0.089305	0.728079	-0.216539	0.999703	-0.747017	0.453370	0.502028	0.254090	-0.271888	0.590384	-0.361660	1.000000
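
A correlation table this size is hard to scan, so a common next step is to pull out one column and rank it. The sketch below does that on a tiny made-up frame (the `toy` columns are assumptions chosen so the correlations are exactly +1 and -1). It also selects the numeric columns explicitly before calling `.corr()`, which sidesteps differences between pandas versions in how text columns are handled.

```python
import pandas as pd

toy = pd.DataFrame({
    "rooms": [4.0, 5.0, 6.0, 7.0, 8.0],
    "price": [10.0, 12.0, 14.0, 16.0, 18.0],    # rises exactly with rooms
    "poverty": [30.0, 25.0, 20.0, 15.0, 10.0],  # falls exactly with rooms
    "label": ["a", "b", "c", "d", "e"],         # text column, excluded below
})  # made-up values

numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

# Correlation of every other numeric column with "price", strongest first.
ranked = corr["price"].drop("price").abs().sort_values(ascending=False)

print(corr.round(2))
print(ranked)
```

On the Boston frame the same pattern with `"target"` in place of `"price"` would list which features move most closely with the target.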

# Plot a correlation heatmap using the `.corr()` method and seaborn's heatmap
# Annotate your heatmap with 2 decimal places
# Use the 'Blues' color scheme for your heatmap

corr = df.corr()

plt.figure(figsize=(20, 20))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='Blues')
plt.show()

New Image

demo_df = pd.read_csv("demo.csv")
demo_df.head()

	column1	column2	column3	column4
0	0	1	2	'this'
1	1	2	3	'that'
2	3	4	5	'the other'
3	6	7	8	'the other other'
4	9	10	11	'ant man'

demo_df[" column4"].str.title().str.swapcase()

0                'tHIS'
1                'tHAT'
2           'tHE oTHER'
3     'tHE oTHER oTHER'
4             'aNT mAN'
5                'hULK'
6       'sCARLET wITCH'
Name:  column4, dtype: object
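
The line above is an example of method chaining: each call returns a new Series, and the next call runs on that result, so the chain reads left to right. The sketch below splits the chain into named steps on two made-up strings to show that both forms give the same answer (`names`, `step1`, and `step2` are illustrative).

```python
import pandas as pd

names = pd.Series(["this", "ant man"])  # made-up strings

step1 = names.str.title()       # "This", "Ant Man"
step2 = step1.str.swapcase()    # "tHIS", "aNT mAN"

# The chained form does both steps without the intermediate names.
chained = names.str.title().str.swapcase()

print(chained.tolist())
```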

## let's make a new column using a lambda function (note: x//1 floors the value, it does not round to the nearest integer)

df["rooms_rounded"] = df["RM"].apply(lambda x : x//1)

df["rooms_doubled_rounded"] = df["RM"].apply(lambda x : (2*x)//1)


df.head()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target	room_categories	room_age	rooms_rounded	rooms_doubled_rounded
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0	medium	71.775	6.0	13.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6	medium	85.321	6.0	12.0
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7	medium	68.285	7.0	14.0
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4	medium	52.798	6.0	13.0
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2	medium	61.347	7.0	14.0
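
For simple arithmetic like this, `.apply` with a lambda is not the only option: NumPy functions and Series methods work on the whole column at once. The sketch below compares the lambda with `np.floor` and with true rounding, using the first four RM values from the table above (the variable names are illustrative).

```python
import numpy as np
import pandas as pd

rooms = pd.Series([6.575, 6.421, 7.185, 6.998])  # RM values from the first rows above

via_apply = rooms.apply(lambda x: x // 1)   # the lambda from the lesson
via_floor = np.floor(rooms)                 # vectorized floor, same result
via_round = rooms.round()                   # rounds to the nearest integer instead

print(via_apply.tolist())
print(via_round.tolist())
```

The last line shows why the column name matters: 6.998 floors to 6.0 but rounds to 7.0.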

Assessment

What is the difference between a list object and a numpy.array object?
What is the benefit of using numpy vs. writing your own methods?
What is the index of a dataframe? What is a rule for the index? What are columns?
How do we find the mean of a specific column in a dataframe?
Plot a histogram, scatterplot, line plot, hexbin plot, and heatmap

erdosn / cv2-mod4-section04-pandas-dataviz-lesson
