YWBAT
Change functions from 'def' format to 'lambda' format (n/a)
Pandas basics and how to read method chaining
Define the word API
Plot important aspects of a pandas dataframe using the pandas api
Create a pivot table in pandas (this will be done on learn.co)
What is pandas? Why do we use it?
In data science, data comes in many structures and file formats.
Examples:
dictionary
list
array
csv file
tuple
excel file
spreadsheets
html files
json files
To interact with these files, we can either parse them ourselves using string manipulation or let a library do it.
That library is pandas! Pandas can interact with almost all of these files!
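As a rough illustration of that claim, the sketch below builds the same small table from a dictionary, from CSV text, and from JSON text. The sample values and variable names are made up for illustration, and `io.StringIO` stands in for files on disk.

```python
import io
import pandas as pd

# a dictionary of columns becomes a DataFrame directly
from_dict = pd.DataFrame({"column1": [0, 1], "column2": [1, 2]})

# CSV text; skipinitialspace drops the space that follows each comma
csv_text = "column1, column2\n0, 1\n1, 2\n"
from_csv = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# JSON text in a list-of-records layout
json_text = '[{"column1": 0, "column2": 1}, {"column1": 1, "column2": 2}]'
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

All three objects are DataFrames, so everything that follows (`.head()`, `.describe()`, `.loc`) works the same way regardless of where the data came from.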
file = open("demo.csv").read()
file
'column1, column2, column3\n0, 1, 2\n1, 2, 3\n3, 4, 5\n6, 7, 8\n9, 10, 11\n'
# turn every newline into a comma, then split on commas
file_elements = file.replace("\n", ",").split(",")
file_elements
['column1',
' column2',
' column3',
'0',
' 1',
' 2',
'1',
' 2',
' 3',
'3',
' 4',
' 5',
'6',
' 7',
' 8',
'9',
' 10',
' 11',
'']
# every third element belongs to the first column
for index, value in enumerate(file_elements):
    if index % 3 == 0:
        print(value)
import pandas as pd
demo_df = pd.read_csv("demo.csv")
demo_df.head(3)

| | column1 | column2 | column3 |
|---|---|---|---|
| 0 | 0 | 1 | 2 |
| 1 | 1 | 2 | 3 |
| 2 | 3 | 4 | 5 |

Send me your answer to the following in a private Zoom chat, and please indicate whether you're doing Level 1 or Level 2.
Convert this function to a lambda function
Level 1
def f1(x, y, z):
    s = x + y
    return z*s
Level 2
def f1(x, y, z):
    s = x + y
    z = 0.01 if z == 0 else z
    return z*s
Solution
Level 1
f1 = lambda x, y, z: z*(x + y)
Level 2
f1 = lambda x, y, z: z*(x + y) if z != 0 else 0.01*(x + y)
# or, equivalently
f1 = lambda x, y, z: (0.01 if z == 0 else z) * (x + y)
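One way to convince yourself that a lambda matches the original function is to run both on the same inputs. The sketch below does that for the Level 2 version; the names `f1_def` and `f1_lambda` and the test inputs are chosen here for illustration only.

```python
def f1_def(x, y, z):
    s = x + y
    z = 0.01 if z == 0 else z  # swap a zero multiplier for 0.01
    return z * s

f1_lambda = lambda x, y, z: (0.01 if z == 0 else z) * (x + y)

# compare the two on a few inputs, including the z == 0 case
test_inputs = [(1, 2, 3), (1, 2, 0), (-4, 4, 5)]
results = [(f1_def(*args), f1_lambda(*args)) for args in test_inputs]
for from_def, from_lambda in results:
    print(from_def, from_lambda)
```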
import numpy as np
import pandas as pd
from collections import defaultdict
# note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
import seaborn as sns
boston = load_boston()  # returns a dictionary-like object
data = boston["data"]  # call using dictionary notation
target = boston.target  # y values
columns = list(boston.feature_names)
# . calls methods and attributes of the object type
data.shape, target.shape
# what does (506,) mean? NumPy is storing target as a 1-D array (a vector), not a 2-D array (a matrix)
# you can think of it as 506 x 1, but a 1-D array has only one axis, so there is no second number to show
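The difference between a shape of `(n,)` and `(n, 1)` is easier to see on a tiny array. This is a small sketch with made-up values, not the Boston data:

```python
import numpy as np

vector = np.arange(6)           # 1-D array: a single axis of length 6
column = vector.reshape(-1, 1)  # 2-D array: 6 rows, 1 column
matrix = vector.reshape(2, 3)   # 2-D array: 2 rows, 3 columns

print(vector.shape, column.shape, matrix.shape)
# prints: (6,) (6, 1) (2, 3)
```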
df = pd.DataFrame(data, columns=columns)
df.head()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |

# how do I create a column called target with those nice y values?
df["target"] = target
df.head()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |

How can we get to know our data?
df.info()
# what is this telling us?
# number of non-null entries per column
# data types, in this case float64
# memory usage: 55.5 KB
# object type, in this case DataFrame
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
target 506 non-null float64
dtypes: float64(14)
memory usage: 55.5 KB
df.describe()
# What is this telling us?
# .describe() gives summary statistics for each numeric column
# the 'shape' of the data: count, mean and standard deviation
# plus the five-number summary (min, 25%, 50%, 75%, max) for each column

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |

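To see where the rows of `.describe()` come from, the sketch below recomputes the quartile rows by hand on a tiny made-up Series. The values and the names `prices`, `summary` and `by_hand` are illustrative assumptions, not part of the Boston data.

```python
import pandas as pd

prices = pd.Series([5.0, 17.0, 21.0, 25.0, 50.0], name="target")  # made-up values

summary = prices.describe()                              # count, mean, std, min, quartiles, max
by_hand = prices.quantile([0.0, 0.25, 0.5, 0.75, 1.0])   # the five-number summary

print(summary)
print(by_hand)
```

The `min`, `25%`, `50%`, `75%` and `max` rows of `summary` line up with the five quantiles in `by_hand`.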
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'target'],
dtype='object')
# plot a histogram for every column
for col in df.columns:
    plt.figure(figsize=(5, 3))
    plt.hist(df[col], bins=20)
    plt.title(col)
    plt.show()
# let's make some categorical data by creating a new column
# let's make a new column by making a list of the same shape
room_categories = []
for rm in df.RM:
    if rm < 6:
        room_categories.append('small')
    elif rm < 8:
        room_categories.append('medium')
    else:
        room_categories.append('large')
df['room_categories'] = room_categories
df.head()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | room_categories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 | medium |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 | medium |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 | medium |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 | medium |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 | medium |

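The loop above is easy to read, but pandas can also do the binning in one call. The sketch below is one possible alternative using `pd.cut` on a small made-up Series; the sample values and the variable names are assumptions for illustration. `right=False` makes each bin include its left edge, which matches the `rm < 6` and `rm < 8` tests in the loop.

```python
import numpy as np
import pandas as pd

rooms = pd.Series([5.9, 6.0, 6.575, 7.999, 8.0, 8.78], name="RM")  # made-up sample

# bins are [-inf, 6), [6, 8), [8, inf) because right=False
room_categories = pd.cut(
    rooms,
    bins=[-np.inf, 6, 8, np.inf],
    right=False,
    labels=["small", "medium", "large"],
)

print(room_categories.tolist())
# prints: ['small', 'medium', 'medium', 'medium', 'large', 'large']
```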
### let's count each room type
room_counts = df["room_categories"].value_counts()
plt.bar(room_counts.index, room_counts.values)
plt.title("Bar Chart for Room Categories")
plt.xticks(rotation=75)
plt.show()
Create a new column called 'room_age' that is the sum of the RM and AGE columns
df['room_age'] = df['RM'] + df['AGE']
df.head()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | room_categories | room_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 | medium | 71.775 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 | medium | 85.321 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 | medium | 68.285 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 | medium | 52.798 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 | medium | 61.347 |

Let's find the statistics on the INDUS column
df[["INDUS"]].describe()

| | INDUS |
|---|---|
| count | 506.000000 |
| mean | 11.136779 |
| std | 6.860353 |
| min | 0.460000 |
| 25% | 5.190000 |
| 50% | 9.690000 |
| 75% | 18.100000 |
| max | 27.740000 |

0 True
1 True
2 True
3 False
4 True
...
501 True
502 True
503 True
504 True
505 True
Name: AGE, Length: 506, dtype: bool
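A True/False Series like the one above is usually called a boolean mask, and it is the building block for every filter that follows. The sketch below shows the idea on a three-row frame; the values are copied from the first rows of the table, while the names `df_small`, `mask`, `older` and `n_older` are made up for illustration.

```python
import pandas as pd

df_small = pd.DataFrame({"AGE": [65.2, 78.9, 45.8], "RM": [6.575, 6.421, 6.998]})

mask = df_small["AGE"] > 50   # Series of True/False, one value per row
older = df_small[mask]        # keeps only the rows where the mask is True
n_older = int(mask.sum())     # True counts as 1, so the sum is the number of matches

print(mask.tolist())  # prints: [True, True, False]
print(n_older)        # prints: 2
```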
# how do I get only ages greater than 50?
# there are three ways to do it; I'm going to show you two and say which is the best to learn
# first way (it works reliably enough, but it's not the best)
ages_50_plus_df = df[df["AGE"] > 50]
ages_50_plus_df.head()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | room_categories | room_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 | medium | 71.775 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 | medium | 85.321 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 | medium | 68.285 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 | medium | 61.347 |
| 5 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.12 | 5.21 | 28.7 | medium | 65.130 |

# second way, which is way better and scalable
ages_50_plus_df = df.loc[df["AGE"] > 50]
ages_50_plus_df

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | room_categories | room_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 | medium | 71.775 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 | medium | 85.321 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 | medium | 68.285 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 | medium | 61.347 |
| 5 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.12 | 5.21 | 28.7 | medium | 65.130 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 | medium | 75.693 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 | medium | 82.820 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 | medium | 97.976 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 | medium | 96.094 |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 | medium | 86.830 |

359 rows × 16 columns
# now let's make a dataframe with ages greater than 50 and rooms greater than 6
df.loc[(df["AGE"] > 50) & (df["RM"] > 6)]

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | room_categories | room_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 | medium | 71.775 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 | medium | 85.321 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 | medium | 68.285 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 | medium | 61.347 |
| 5 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.12 | 5.21 | 28.7 | medium | 65.130 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 | medium | 75.693 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 | medium | 82.820 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 | medium | 97.976 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 | medium | 96.094 |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 | medium | 86.830 |

220 rows × 16 columns
# now let's make a dataframe with ages greater than 50 or rooms greater than 6
df_new = df.loc[(df["AGE"] > 50) | (df["RM"] > 6)]
df_new.head()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | room_categories | room_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 | medium | 71.775 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 | medium | 85.321 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 | medium | 68.285 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 | medium | 52.798 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 | medium | 61.347 |

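The two filters above combine masks with `&` (and) and `|` (or). Each condition needs its own parentheses because `&` and `|` bind more tightly than `>` in Python. The sketch below runs both on a four-row made-up frame and adds `~` (not) for completeness; the frame `homes` and its values are assumptions for illustration.

```python
import pandas as pd

homes = pd.DataFrame({"AGE": [65.2, 45.8, 30.0, 90.0], "RM": [6.5, 7.0, 5.0, 5.5]})

is_old = homes["AGE"] > 50
is_big = homes["RM"] > 6

both = homes.loc[is_old & is_big]        # rows where both conditions hold
either = homes.loc[is_old | is_big]      # rows where at least one holds
neither = homes.loc[~(is_old | is_big)]  # rows where neither holds

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
# prints: [0] [0, 1, 3] [2]
```

Naming the masks first, as done here, is optional, but it keeps long filters readable.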
df_new[["INDUS", "CHAS", "RAD"]]

| | INDUS | CHAS | RAD |
|---|---|---|---|
| 0 | 2.31 | 0.0 | 1.0 |
| 1 | 7.07 | 0.0 | 2.0 |
| 2 | 7.07 | 0.0 | 2.0 |
| 3 | 2.18 | 0.0 | 3.0 |
| 4 | 2.18 | 0.0 | 3.0 |
| ... | ... | ... | ... |
| 501 | 11.93 | 0.0 | 1.0 |
| 502 | 11.93 | 0.0 | 1.0 |
| 503 | 11.93 | 0.0 | 1.0 |
| 504 | 11.93 | 0.0 | 1.0 |
| 505 | 11.93 | 0.0 | 1.0 |

472 rows × 3 columns
# now let's make a dataframe with ages greater than 50 or rooms greater than 6 and let's only grab
# the CRIM and LSTAT columns
df.loc[(df["AGE"] > 50) | (df["RM"] > 6)][["CRIM", "LSTAT"]]  # THIS ISN'T PREFERRED

| | CRIM | LSTAT |
|---|---|---|
| 0 | 0.00632 | 4.98 |
| 1 | 0.02731 | 9.14 |
| 2 | 0.02729 | 4.03 |
| 3 | 0.03237 | 2.94 |
| 4 | 0.06905 | 5.33 |
| ... | ... | ... |
| 501 | 0.06263 | 9.67 |
| 502 | 0.04527 | 9.08 |
| 503 | 0.06076 | 5.64 |
| 504 | 0.10959 | 6.48 |
| 505 | 0.04741 | 7.88 |

472 rows × 2 columns
df.loc[(df["AGE"] > 50) | (df["RM"] > 6), ['CRIM', 'LSTAT']]  # THIS IS THE PREFERRED WAY, USING LOC

| | CRIM | LSTAT |
|---|---|---|
| 0 | 0.00632 | 4.98 |
| 1 | 0.02731 | 9.14 |
| 2 | 0.02729 | 4.03 |
| 3 | 0.03237 | 2.94 |
| 4 | 0.06905 | 5.33 |
| ... | ... | ... |
| 501 | 0.06263 | 9.67 |
| 502 | 0.04527 | 9.08 |
| 503 | 0.06076 | 5.64 |
| 504 | 0.10959 | 6.48 |
| 505 | 0.04741 | 7.88 |

472 rows × 2 columns
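One practical reason to prefer the single `.loc[rows, columns]` call is that the same expression also works for assignment. The sketch below is a minimal illustration on a made-up frame; `homes`, `old` and `subset` are hypothetical names.

```python
import pandas as pd

homes = pd.DataFrame({
    "AGE": [65.2, 45.8, 90.0],
    "CRIM": [0.1, 0.2, 0.3],
    "LSTAT": [5.0, 3.0, 9.0],
})
old = homes["AGE"] > 50

# select rows and columns in one step
subset = homes.loc[old, ["CRIM", "LSTAT"]]

# the same rows-and-columns form can write values back into the frame
homes.loc[old, "CRIM"] = 0.0

print(subset)
print(homes)
```

With the chained form `homes[old]["CRIM"] = 0.0`, pandas operates on an intermediate object, so the original frame may not be updated; the `.loc` form avoids that ambiguity.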
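# plotting a text column directly fails because pandas needs numeric data to plot
# count the categories first with .value_counts() and plot the counts instead
df["room_categories"].plot(kind='bar')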
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-7d78820f768d> in <module>
----> 1 df["room_categories"].plot(kind='bar')
~/anaconda3/envs/flatiron-env/lib/python3.6/site-packages/pandas/plotting/_core.py in __call__(self, *args, **kwargs)
792 data.columns = label_name
793
--> 794 return plot_backend.plot(data, kind=kind, **kwargs)
795
796 def line(self, x=None, y=None, **kwargs):
~/anaconda3/envs/flatiron-env/lib/python3.6/site-packages/pandas/plotting/_matplotlib/__init__.py in plot(data, kind, **kwargs)
60 kwargs["ax"] = getattr(ax, "left_ax", ax)
61 plot_obj = PLOT_CLASSES[kind](data, **kwargs)
---> 62 plot_obj.generate()
63 plot_obj.draw()
64 return plot_obj.result
~/anaconda3/envs/flatiron-env/lib/python3.6/site-packages/pandas/plotting/_matplotlib/core.py in generate(self)
277 def generate(self):
278 self._args_adjust()
--> 279 self._compute_plot_data()
280 self._setup_subplots()
281 self._make_plot()
~/anaconda3/envs/flatiron-env/lib/python3.6/site-packages/pandas/plotting/_matplotlib/core.py in _compute_plot_data(self)
412 # no non-numeric frames or series allowed
413 if is_empty:
--> 414 raise TypeError("no numeric data to plot")
415
416 # GH25587: cast ExtensionArray of pandas (IntegerArray, etc.) to
TypeError: no numeric data to plot
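The traceback above says there is no numeric data to plot, because the column holds text labels. A possible fix, sketched below on a made-up Series, is to count the labels first and plot the counts through the pandas plotting API. The `Agg` backend line is only there so the sketch runs without a display; in a notebook you would leave it out and call `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# made-up labels standing in for df["room_categories"]
categories = pd.Series(
    ["medium", "small", "medium", "large", "medium"], name="room_categories"
)

counts = categories.value_counts()  # numeric: one count per label
ax = counts.plot(kind="bar", title="Bar Chart for Room Categories")
ax.set_ylabel("count")
```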
# pandas slicing
# get dataframe with rows where target < 30
# method 1
# df[df["target"] < 30]
# method 2
# df.loc[df["target"] < 30]

| | AGE | ZN |
|---|---|---|
| 0 | 65.2 | 18.0 |
| 1 | 78.9 | 0.0 |
| 2 | 61.1 | 0.0 |
| 3 | 45.8 | 0.0 |
| 4 | 54.2 | 0.0 |
| ... | ... | ... |
| 501 | 69.1 | 0.0 |
| 502 | 76.7 | 0.0 |
| 503 | 91.0 | 0.0 |
| 504 | 89.3 | 0.0 |
| 505 | 80.8 | 0.0 |

506 rows × 2 columns
# pandas slicing
# get dataframe with rows where target < 30 but only grab the AGE and ZN columns
# method 1
# df[df["target"] < 30][["AGE", "ZN"]]
# method 2
# df.loc[df["target"] < 30, ["AGE", "ZN"]]

| | AGE | ZN |
|---|---|---|
| 0 | 65.2 | 18.0 |
| 1 | 78.9 | 0.0 |
| 5 | 58.7 | 0.0 |
| 6 | 66.6 | 12.5 |
| 7 | 96.1 | 12.5 |
| ... | ... | ... |
| 501 | 69.1 | 0.0 |
| 502 | 76.7 | 0.0 |
| 503 | 91.0 | 0.0 |
| 504 | 89.3 | 0.0 |
| 505 | 80.8 | 0.0 |

422 rows × 2 columns
# pandas slicing on multiple conditions
# target < 30 and age > 80
# method 1
# df[(df["target"]<30) & (df["AGE"] > 80)]
# method 2
df.loc[(df["target"] < 30) & (df["AGE"] > 80)]

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 0.14455 | 12.5 | 7.87 | 0.0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5.0 | 311.0 | 15.2 | 396.90 | 19.15 | 27.1 |
| 8 | 0.21124 | 12.5 | 7.87 | 0.0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5.0 | 311.0 | 15.2 | 386.63 | 29.93 | 16.5 |
| 9 | 0.17004 | 12.5 | 7.87 | 0.0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5.0 | 311.0 | 15.2 | 386.71 | 17.10 | 18.9 |
| 10 | 0.22489 | 12.5 | 7.87 | 0.0 | 0.524 | 6.377 | 94.3 | 6.3467 | 5.0 | 311.0 | 15.2 | 392.52 | 20.45 | 15.0 |
| 11 | 0.11747 | 12.5 | 7.87 | 0.0 | 0.524 | 6.009 | 82.9 | 6.2267 | 5.0 | 311.0 | 15.2 | 396.90 | 13.27 | 18.9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 491 | 0.10574 | 0.0 | 27.74 | 0.0 | 0.609 | 5.983 | 98.8 | 1.8681 | 4.0 | 711.0 | 20.1 | 390.11 | 18.07 | 13.6 |
| 492 | 0.11132 | 0.0 | 27.74 | 0.0 | 0.609 | 5.983 | 83.5 | 2.1099 | 4.0 | 711.0 | 20.1 | 396.90 | 13.35 | 20.1 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |

215 rows × 14 columns
# pandas slicing on mult. conditions for specific columns
# target > 30 and age > 75 but only grab the target and age columns
# method 1
# df[(df["target"] > 30) & (df["AGE"] > 75)][["target", "AGE"]]
# method 2
# df.loc[(df["target"] > 30) & (df["AGE"] > 75), ["target", "AGE"]]
# method 3
# df[["AGE", "target"]][(df["target"]>30) & (df["AGE"]>75)]
# pandas slicing on mult. conditions for specific columns
# target > 30 and age > 75 but only grab the CRIM
df.loc[(df["target"] > 30) & (df["AGE"] > 75), ["CRIM"]]

| | CRIM |
|---|---|
| 97 | 0.12083 |
| 157 | 1.22358 |
| 161 | 1.46336 |
| 162 | 1.83377 |
| 163 | 1.51902 |
| 166 | 2.01019 |
| 180 | 0.06588 |
| 182 | 0.09103 |
| 183 | 0.10008 |
| 223 | 0.61470 |
| 224 | 0.31533 |
| 225 | 0.52693 |
| 226 | 0.38214 |
| 227 | 0.41238 |
| 231 | 0.46296 |
| 257 | 0.61154 |
| 258 | 0.66351 |
| 259 | 0.65665 |
| 260 | 0.54011 |
| 261 | 0.53412 |
| 262 | 0.52014 |
| 263 | 0.82526 |
| 264 | 0.55007 |
| 266 | 0.78570 |
| 368 | 4.89822 |
| 369 | 5.66998 |
| 370 | 6.53876 |
| 371 | 9.23230 |
| 372 | 8.26725 |

# Plot a scatter matrix of your dataframe
pd.plotting.scatter_matrix(df, figsize=(20, 20), grid=True, hist_kwds={"bins": 20, "color": "purple"})
plt.show()
# Make a function that plots a specific column of the dataframe as a histogram
# Make the color of it purple by default, alpha value should be 0.8 by default
# Make a parameter to toggle a grid
# Make parameters for axis labels and the title of the histogram
# Make a parameter for number of bins and default it to 20
# call the function `plot_histogram`
def plot_histogram(df, column, bins=20, color="purple", alpha=0.8, grid=True, title=None, xlabel=None, ylabel="counts"):
    plt.figure(figsize=(8, 5))
    if grid:
        plt.grid(zorder=0)  # draw the grid behind the bars
    plt.hist(x=df[column], bins=bins, color=color, alpha=alpha, zorder=2)
    if title is None:
        title = f"Histogram for {column}"
    plt.title(title)
    if xlabel is None:
        xlabel = column
    plt.xlabel(xlabel.lower())
    plt.ylabel(ylabel)
    plt.show()
plot_histogram(df, "AGE")
# Plot a hexbin plot of AGE vs CRIM colored by target values
df.plot.hexbin(x='AGE', y='CRIM', C='target', gridsize=20, figsize=(8, 5))
plt.show()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | room_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRIM | 1.000000 | -0.200469 | 0.406583 | -0.055892 | 0.420972 | -0.219247 | 0.352734 | -0.379670 | 0.625505 | 0.582764 | 0.289946 | -0.385064 | 0.455621 | -0.388305 | 0.349253 |
| ZN | -0.200469 | 1.000000 | -0.533828 | -0.042697 | -0.516604 | 0.311991 | -0.569537 | 0.664408 | -0.311948 | -0.314563 | -0.391679 | 0.175520 | -0.412995 | 0.360445 | -0.564971 |
| INDUS | 0.406583 | -0.533828 | 1.000000 | 0.062938 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.595129 | 0.720760 | 0.383248 | -0.356977 | 0.603800 | -0.483725 | 0.638643 |
| CHAS | -0.055892 | -0.042697 | 0.062938 | 1.000000 | 0.091203 | 0.091251 | 0.086518 | -0.099176 | -0.007368 | -0.035587 | -0.121515 | 0.048788 | -0.053929 | 0.175260 | 0.089305 |
| NOX | 0.420972 | -0.516604 | 0.763651 | 0.091203 | 1.000000 | -0.302188 | 0.731470 | -0.769230 | 0.611441 | 0.668023 | 0.188933 | -0.380051 | 0.590879 | -0.427321 | 0.728079 |
| RM | -0.219247 | 0.311991 | -0.391676 | 0.091251 | -0.302188 | 1.000000 | -0.240265 | 0.205246 | -0.209847 | -0.292048 | -0.355501 | 0.128069 | -0.613808 | 0.695360 | -0.216539 |
| AGE | 0.352734 | -0.569537 | 0.644779 | 0.086518 | 0.731470 | -0.240265 | 1.000000 | -0.747881 | 0.456022 | 0.506456 | 0.261515 | -0.273534 | 0.602339 | -0.376955 | 0.999703 |
| DIS | -0.379670 | 0.664408 | -0.708027 | -0.099176 | -0.769230 | 0.205246 | -0.747881 | 1.000000 | -0.494588 | -0.534432 | -0.232471 | 0.291512 | -0.496996 | 0.249929 | -0.747017 |
| RAD | 0.625505 | -0.311948 | 0.595129 | -0.007368 | 0.611441 | -0.209847 | 0.456022 | -0.494588 | 1.000000 | 0.910228 | 0.464741 | -0.444413 | 0.488676 | -0.381626 | 0.453370 |
| TAX | 0.582764 | -0.314563 | 0.720760 | -0.035587 | 0.668023 | -0.292048 | 0.506456 | -0.534432 | 0.910228 | 1.000000 | 0.460853 | -0.441808 | 0.543993 | -0.468536 | 0.502028 |
| PTRATIO | 0.289946 | -0.391679 | 0.383248 | -0.121515 | 0.188933 | -0.355501 | 0.261515 | -0.232471 | 0.464741 | 0.460853 | 1.000000 | -0.177383 | 0.374044 | -0.507787 | 0.254090 |
| B | -0.385064 | 0.175520 | -0.356977 | 0.048788 | -0.380051 | 0.128069 | -0.273534 | 0.291512 | -0.444413 | -0.441808 | -0.177383 | 1.000000 | -0.366087 | 0.333461 | -0.271888 |
| LSTAT | 0.455621 | -0.412995 | 0.603800 | -0.053929 | 0.590879 | -0.613808 | 0.602339 | -0.496996 | 0.488676 | 0.543993 | 0.374044 | -0.366087 | 1.000000 | -0.737663 | 0.590384 |
| target | -0.388305 | 0.360445 | -0.483725 | 0.175260 | -0.427321 | 0.695360 | -0.376955 | 0.249929 | -0.381626 | -0.468536 | -0.507787 | 0.333461 | -0.737663 | 1.000000 | -0.361660 |
| room_age | 0.349253 | -0.564971 | 0.638643 | 0.089305 | 0.728079 | -0.216539 | 0.999703 | -0.747017 | 0.453370 | 0.502028 | 0.254090 | -0.271888 | 0.590384 | -0.361660 | 1.000000 |

# Plot a correlation heatmap using the `.corr()` method and seaborn's heatmap
# Annotate your heatmap using 2 decimal places
# Use the 'Blues' color scheme for your heatmap
corr = df.select_dtypes("number").corr()  # numeric columns only; room_categories is text
plt.figure(figsize=(20, 20))
sns.heatmap(corr, annot=True, fmt='0.2f', cmap="Blues")
plt.show()
demo_df = pd.read_csv("demo.csv")
demo_df.head()

| | column1 | column2 | column3 | column4 |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 'this' |
| 1 | 1 | 2 | 3 | 'that' |
| 2 | 3 | 4 | 5 | 'the other' |
| 3 | 6 | 7 | 8 | 'the other other' |
| 4 | 9 | 10 | 11 | 'ant man' |

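# the column name starts with a space, presumably because the header in demo.csv has a space after each comma
demo_df[" column4"].str.title().str.swapcase()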
0 'tHIS'
1 'tHAT'
2 'tHE oTHER'
3 'tHE oTHER oTHER'
4 'aNT mAN'
5 'hULK'
6 'sCARLET wITCH'
Name: column4, dtype: object
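A chain like `.str.title().str.swapcase()` reads left to right: each method returns a new Series, and the next method is called on that result. The sketch below splits the chain into named steps on a made-up Series to show that both forms give the same answer; `names`, `step1`, `step2` and `chained` are illustrative names.

```python
import pandas as pd

names = pd.Series(["ant man", "scarlet witch"])

# one step at a time
step1 = names.str.title()     # capitalize the first letter of each word
step2 = step1.str.swapcase()  # flip upper and lower case

# the same thing as a single chain
chained = names.str.title().str.swapcase()

print(step1.tolist())    # prints: ['Ant Man', 'Scarlet Witch']
print(chained.tolist())  # prints: ['aNT mAN', 'sCARLET wITCH']
```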
## let's make a new column using a lambda function
# x // 1 is floor division, so these values are rounded down
df["rooms_rounded"] = df["RM"].apply(lambda x: x // 1)
df["rooms_doubled_rounded"] = df["RM"].apply(lambda x: (2 * x) // 1)
df.head()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | room_categories | room_age | rooms_rounded | rooms_doubled_rounded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 | medium | 71.775 | 6.0 | 13.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 | medium | 85.321 | 6.0 | 12.0 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 | medium | 68.285 | 7.0 | 14.0 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 | medium | 52.798 | 6.0 | 13.0 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 | medium | 61.347 | 7.0 | 14.0 |

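`.apply` with a lambda runs the function once per value. For simple arithmetic like this, pandas and NumPy can usually do the same work on the whole column at once. The sketch below compares the two approaches on the first three RM values from the table; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

rm = pd.Series([6.575, 6.421, 7.185], name="RM")

with_apply = rm.apply(lambda x: x // 1)  # one Python call per value
vectorized = np.floor(rm)                # whole column at once, same result
doubled = (2 * rm) // 1                  # arithmetic operators work on the whole Series too

print(with_apply.tolist())  # prints: [6.0, 6.0, 7.0]
print(doubled.tolist())     # prints: [13.0, 12.0, 14.0]
```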
What is the difference between a list object and a numpy.array object?
What is the benefit of using numpy vs. writing your own methods?
What is the index of a dataframe? What is a rule for the index? What are columns?
How do we find the mean of a specific column in a dataframe?
Plot a histogram, scatterplot, lineplot, hexbin plot, and heatmap