
cv2-mod4-section04-pandas-dataviz-lesson's Introduction

Questions

  • all of the questions

Objectives

YWBAT (you will be able to)

  • Convert functions from 'def' format to 'lambda' format (n/a)
  • Understand pandas basics and how to read method chaining
  • Define the term API
  • Plot important aspects of a pandas dataframe using the pandas API
  • Create a pivot table in pandas (this will be done on learn.co)

What is pandas? Why do we use it?

In data science you work with many data structures and file formats. Examples:

  • dictionary
  • list
  • array
  • csv file
  • tuple
  • excel file
  • spreadsheets
  • html files
  • json files

To interact with these files, we could read them as raw text and use string manipulation, which gets tedious quickly.

Or we can use pandas! Pandas can read almost all of these formats directly.

file = open("demo.csv").read()
file
'column1, column2, column3\n0, 1, 2\n1, 2, 3\n3, 4, 5\n6, 7, 8\n9, 10, 11\n'
file_elements = file.replace("\n", ",").split(",")
file_elements
['column1',
 ' column2',
 ' column3',
 '0',
 ' 1',
 ' 2',
 '1',
 ' 2',
 ' 3',
 '3',
 ' 4',
 ' 5',
 '6',
 ' 7',
 ' 8',
 '9',
 ' 10',
 ' 11',
 '']
for index, value in enumerate(file_elements):
    if index%3==0:
        print(value)
column1
0
1
3
6
9
import pandas as pd  # imported here so this cell runs on its own

demo_df = pd.read_csv("demo.csv")
demo_df.head(3)
column1 column2 column3
0 0 1 2
1 1 2 3
2 3 4 5
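
To make the comparison concrete, here is a minimal sketch that pulls out the first column both ways. It assumes the same text as the demo.csv content printed above, loaded from an in-memory string with `io.StringIO` instead of the file. It also passes `skipinitialspace=True`, which the lesson does not use, so the spaces after each comma are dropped.

```python
import io

import pandas as pd

# same text as the demo.csv content printed above (in-memory stand-in for the file)
raw = "column1, column2, column3\n0, 1, 2\n1, 2, 3\n3, 4, 5\n6, 7, 8\n9, 10, 11\n"

# manual approach: split into lines, then fields, and take the first field of each row
rows = [line.split(",") for line in raw.strip().split("\n")]
first_column_manual = [int(fields[0]) for fields in rows[1:]]  # rows[0] is the header

# pandas approach: parse once, then select the column by name
demo = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
first_column_pandas = demo["column1"].tolist()

print(first_column_manual)
print(first_column_pandas)
```

Both lists hold the same values, but the pandas version also keeps the header names and works unchanged if the file gains more columns.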

Activator

Send me the following in a private Zoom chat, and please indicate whether you're doing level 1 or level 2.

Convert this function to a lambda function

Level 1

def f1(x, y, z):
    s = x + y
    return z*s

Level 2

def f1(x, y, z):
    s = x + y
    z = 0.01 if z == 0 else z
    return z*s
Solution

Level 1

f1 = lambda x, y, z: z*(x + y)

Level 2

f1 = lambda x, y, z: z*(x + y) if z != 0 else 0.01*(x + y)

f1 = lambda x, y, z: (0.01 if z == 0 else z) * (x+y)
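
A quick way to convince yourself that a lambda matches the original function is to run both on a few inputs. This is only a sketch: the names `f1_def` and `f1_lambda` and the test cases are made up for illustration.

```python
# Level 2 function in 'def' format
def f1_def(x, y, z):
    s = x + y
    z = 0.01 if z == 0 else z
    return z * s

# the same logic in 'lambda' format
f1_lambda = lambda x, y, z: (0.01 if z == 0 else z) * (x + y)

# compare both versions on a few made-up inputs, including z == 0
cases = [(1, 2, 3), (1, 2, 0), (-4, 4, 0), (2.5, 0.5, -1)]
for x, y, z in cases:
    assert f1_def(x, y, z) == f1_lambda(x, y, z)

results = [f1_lambda(*case) for case in cases]
print(results)
```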
import numpy as np
import pandas as pd

from collections import defaultdict
# note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston

import matplotlib.pyplot as plt
import seaborn as sns
boston = load_boston()
data = boston["data"] # call using dictionary notation
target = boston.target # y values
columns = list(boston.feature_names)
data.shape, target.shape
((506, 13), (506,))
# the dot (.) accesses attributes and methods of an object
data.shape, target.shape

# what does (506,) mean? NumPy stores target as a 1-D array (a vector) with 506 elements
# it is not a 506 x 1 matrix, which would have shape (506, 1)
((506, 13), (506,))
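
The difference between a vector and a one-column matrix is easier to see on a small array. A minimal sketch with a made-up array of 6 elements instead of 506:

```python
import numpy as np

vector = np.arange(6)           # 1-D array, shape (6,)
column = vector.reshape(-1, 1)  # 2-D array with a single column, shape (6, 1)

print(vector.shape, vector.ndim)
print(column.shape, column.ndim)
```

The trailing comma in `(6,)` is just how Python writes a tuple with one element; the array has one axis, not two.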
df = pd.DataFrame(data, columns=columns)
df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
# how do I create a column called target that holds the y values?
df["target"] = target
df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

How can we get to know our data?

df.describe()

df.info()
df.info()

# what is this telling us?
# number of non-null entries per column
# data types, in this case float64
# memory usage: 55.5 KB
# Object Type -> DataFrame,TimeSeries
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
target     506 non-null float64
dtypes: float64(14)
memory usage: 55.5 KB
df.describe()

# What is this telling us?
# .describe() returns summary statistics for each numeric column
# count, mean, and standard deviation
# the five-number summary: min, 25%, 50% (median), 75%, max
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000
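
To see where the numbers in `.describe()` come from, here is a small sketch on made-up values (not the Boston data). It checks that the `50%` row is the median and shows how one large value pulls the mean away from it, which is the same pattern as the CRIM column above (mean 3.61, median 0.26).

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up data with one large value

summary = values.describe()

# the '50%' row of describe() is the median
median_from_describe = summary["50%"]
median_direct = values.median()

print(summary)
print(median_from_describe, median_direct, summary["mean"])
```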
df.columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'target'],
      dtype='object')
for col in df.columns:
    plt.figure(figsize=(5, 3))
    plt.hist(df[col], bins=20)
    plt.title(col)
    plt.show()

[Figures: one histogram per column, 14 plots in total]

# let's make some categorical data by creating a new column
# build a list with one label per row, then assign it as the new column

room_categories = []

for rm in df.RM:
    if rm < 6:
        room_categories.append('small')
    elif rm < 8:
        room_categories.append('medium')
    else:
        room_categories.append('large')

        
df['room_categories'] = room_categories

df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target room_categories
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0 medium
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6 medium
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7 medium
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4 medium
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2 medium
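
The loop above works, but pandas can also do the binning in one call. This is a sketch of an alternative using `pd.cut`, not what the lesson ran; the `rooms` values are made up to sit on the boundaries.

```python
import numpy as np
import pandas as pd

rooms = pd.Series([5.9, 6.0, 7.99, 8.0])  # made-up RM values around the cutoffs

# right=False makes each bin include its left edge: [-inf, 6), [6, 8), [8, inf)
# this matches the loop: rm < 6 is small, rm < 8 is medium, everything else is large
categories = pd.cut(
    rooms,
    bins=[-np.inf, 6, 8, np.inf],
    labels=["small", "medium", "large"],
    right=False,
)

print(categories.astype(str).tolist())
```

The result is a categorical column, so pandas knows the labels have an order (small, then medium, then large).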
### let's count each room type

room_counts = df["room_categories"].value_counts()  # compute the counts once and reuse them
plt.bar(room_counts.index, room_counts.values)
plt.title("Bar Chart for Room Categories")
plt.xticks(rotation=75)
plt.show()

[Figure: bar chart of room category counts]

Create a new column called 'room_age' that is the sum of the RM (rooms) and AGE columns

df['room_age'] = df['RM'] + df['AGE']
df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target room_categories room_age
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0 medium 71.775
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6 medium 85.321
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7 medium 68.285
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4 medium 52.798
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2 medium 61.347

Let's find the statistics on the INDUS column

df[['INDUS']].describe()
INDUS
count 506.000000
mean 11.136779
std 6.860353
min 0.460000
25% 5.190000
50% 9.690000
75% 18.100000
max 27.740000

let's slice some data

df["AGE"]>50
0       True
1       True
2       True
3      False
4       True
       ...  
501     True
502     True
503     True
504     True
505     True
Name: AGE, Length: 506, dtype: bool
# how do I get only the rows where AGE is greater than 50?
# there are three ways to do it; I'm going to show you two and say which is the best to learn

# first way (fairly reliable, but not the best)
ages_50_plus_df = df[df["AGE"]>50]

ages_50_plus_df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target room_categories room_age
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0 medium 71.775
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6 medium 85.321
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7 medium 68.285
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2 medium 61.347
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394.12 5.21 28.7 medium 65.130
# second way, using .loc, which is better and more scalable
ages_50_plus_df = df.loc[df["AGE"]>50]
ages_50_plus_df
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target room_categories room_age
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0 medium 71.775
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6 medium 85.321
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7 medium 68.285
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2 medium 61.347
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394.12 5.21 28.7 medium 65.130
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67 22.4 medium 75.693
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08 20.6 medium 82.820
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64 23.9 medium 97.976
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48 22.0 medium 96.094
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9 medium 86.830

359 rows × 16 columns
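
Both ways rely on the same idea: the comparison produces a boolean Series (a mask), and pandas keeps the rows where the mask is True. A minimal sketch on a made-up three-row frame called `toy`:

```python
import pandas as pd

toy = pd.DataFrame({"AGE": [65.2, 45.8, 78.9], "RM": [6.575, 6.998, 6.421]})

mask = toy["AGE"] > 50       # boolean Series, one True/False per row
via_brackets = toy[mask]     # first way
via_loc = toy.loc[mask]      # second way

print(mask.tolist())
print(via_loc)
```

Note that the filtered rows keep their original index labels, which is why the outputs above skip from row 2 to row 4.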

# now let's make a dataframe with ages greater than 50 and rooms greater than 6
df.loc[(df["AGE"]>50) & (df["RM"] > 6) ]
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target room_categories room_age
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0 medium 71.775
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6 medium 85.321
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7 medium 68.285
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2 medium 61.347
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394.12 5.21 28.7 medium 65.130
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67 22.4 medium 75.693
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08 20.6 medium 82.820
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64 23.9 medium 97.976
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48 22.0 medium 96.094
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9 medium 86.830

220 rows × 16 columns

# now let's make a dataframe with ages greater than 50 or rooms greater than 6
df_new = df.loc[(df["AGE"]>50) | (df["RM"] > 6) ]
df_new.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target room_categories room_age
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0 medium 71.775
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6 medium 85.321
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7 medium 68.285
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4 medium 52.798
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2 medium 61.347
df_new[["INDUS", "CHAS", "RAD"]]
INDUS CHAS RAD
0 2.31 0.0 1.0
1 7.07 0.0 2.0
2 7.07 0.0 2.0
3 2.18 0.0 3.0
4 2.18 0.0 3.0
... ... ... ...
501 11.93 0.0 1.0
502 11.93 0.0 1.0
503 11.93 0.0 1.0
504 11.93 0.0 1.0
505 11.93 0.0 1.0

472 rows × 3 columns
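
When combining conditions, each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` in Python. Naming the masks first can make the logic easier to read. A sketch on made-up data (the frame `toy` and the mask names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"AGE": [65.2, 45.8, 30.0, 90.0], "RM": [6.5, 7.0, 5.0, 5.5]})

older = toy["AGE"] > 50
bigger = toy["RM"] > 6

both = toy.loc[older & bigger]        # and: both conditions hold
either = toy.loc[older | bigger]      # or: at least one condition holds
neither = toy.loc[~(older | bigger)]  # not: ~ flips the mask

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```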

# now let's make a dataframe with ages greater than 50 or rooms greater than 6 and let's only grab 
# the CRIM and LSTAT columns

df.loc[(df["AGE"]>50) | (df["RM"] > 6)][["CRIM", "LSTAT"]] # THIS ISN'T PREFERRED
CRIM LSTAT
0 0.00632 4.98
1 0.02731 9.14
2 0.02729 4.03
3 0.03237 2.94
4 0.06905 5.33
... ... ...
501 0.06263 9.67
502 0.04527 9.08
503 0.06076 5.64
504 0.10959 6.48
505 0.04741 7.88

472 rows × 2 columns

df.loc[(df["AGE"]>50) | (df["RM"] > 6), ['CRIM', 'LSTAT']]  # THIS IS THE PREFERRED WAY, USING LOC
CRIM LSTAT
0 0.00632 4.98
1 0.02731 9.14
2 0.02729 4.03
3 0.03237 2.94
4 0.06905 5.33
... ... ...
501 0.06263 9.67
502 0.04527 9.08
503 0.06076 5.64
504 0.10959 6.48
505 0.04741 7.88

472 rows × 2 columns
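
For reading data, the chained version and the `.loc` version return the same thing. The `.loc` form matters most when you want to assign: it selects rows and columns in a single call on the original dataframe, while chained indexing may act on a temporary copy and is known to trigger SettingWithCopyWarning in many pandas versions. A sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame(
    {"AGE": [65.2, 45.8, 78.9], "CRIM": [0.006, 0.032, 0.027], "LSTAT": [4.98, 2.94, 9.14]}
)
mask = toy["AGE"] > 50

# reading: two steps (rows, then columns) vs one step (rows and columns together)
chained = toy[mask][["CRIM", "LSTAT"]]
single = toy.loc[mask, ["CRIM", "LSTAT"]]
same_result = chained.equals(single)

# writing: one .loc call updates the original dataframe directly
toy.loc[mask, "LSTAT"] = 0.0

print(same_result)
print(toy)
```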

df["room_categories"].plot(kind='bar')  # raises TypeError: the column holds strings, not numbers
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-26-7d78820f768d> in <module>
----> 1 df["room_categories"].plot(kind='bar')


~/anaconda3/envs/flatiron-env/lib/python3.6/site-packages/pandas/plotting/_core.py in __call__(self, *args, **kwargs)
    792                     data.columns = label_name
    793 
--> 794         return plot_backend.plot(data, kind=kind, **kwargs)
    795 
    796     def line(self, x=None, y=None, **kwargs):


~/anaconda3/envs/flatiron-env/lib/python3.6/site-packages/pandas/plotting/_matplotlib/__init__.py in plot(data, kind, **kwargs)
     60             kwargs["ax"] = getattr(ax, "left_ax", ax)
     61     plot_obj = PLOT_CLASSES[kind](data, **kwargs)
---> 62     plot_obj.generate()
     63     plot_obj.draw()
     64     return plot_obj.result


~/anaconda3/envs/flatiron-env/lib/python3.6/site-packages/pandas/plotting/_matplotlib/core.py in generate(self)
    277     def generate(self):
    278         self._args_adjust()
--> 279         self._compute_plot_data()
    280         self._setup_subplots()
    281         self._make_plot()


~/anaconda3/envs/flatiron-env/lib/python3.6/site-packages/pandas/plotting/_matplotlib/core.py in _compute_plot_data(self)
    412         # no non-numeric frames or series allowed
    413         if is_empty:
--> 414             raise TypeError("no numeric data to plot")
    415 
    416         # GH25587: cast ExtensionArray of pandas (IntegerArray, etc.) to


TypeError: no numeric data to plot
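
The error above happens because the column contains labels, and a bar chart needs numbers for the bar heights. Counting the labels first gives pandas something numeric to plot. A sketch with a made-up Series standing in for `df["room_categories"]`:

```python
import pandas as pd

labels = pd.Series(
    ["medium", "small", "medium", "large", "medium", "small"], name="room_categories"
)

# value_counts() turns the labels into counts, sorted from most to least common
counts = labels.value_counts()
is_numeric = pd.api.types.is_numeric_dtype(counts)

print(counts)
print(is_numeric)
# counts.plot(kind="bar") now has numeric data to draw
```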
# pandas slicing
# get dataframe with rows where target < 30

# method 1
# df[df["target"] < 30]


# method 2
# df.loc[df["target"] < 30]
df[["AGE", "ZN"]]
AGE ZN
0 65.2 18.0
1 78.9 0.0
2 61.1 0.0
3 45.8 0.0
4 54.2 0.0
... ... ...
501 69.1 0.0
502 76.7 0.0
503 91.0 0.0
504 89.3 0.0
505 80.8 0.0

506 rows × 2 columns

# pandas slicing
# get dataframe with rows where target < 30 but only grab the AGE and ZN columns

# method 1
# df[df["target"] < 30][["AGE", "ZN"]]


# method 2
# df.loc[df["target"] < 30, ["AGE", "ZN"]]
AGE ZN
0 65.2 18.0
1 78.9 0.0
5 58.7 0.0
6 66.6 12.5
7 96.1 12.5
... ... ...
501 69.1 0.0
502 76.7 0.0
503 91.0 0.0
504 89.3 0.0
505 80.8 0.0

422 rows × 2 columns

# pandas slicing on multiple conditions
# target < 30 and age > 80

# method 1
# df[(df["target"]<30) & (df["AGE"] > 80)]

# method 2
df.loc[(df["target"]<30) & (df["AGE"]>80)]
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0 15.2 396.90 19.15 27.1
8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0 15.2 386.63 29.93 16.5
9 0.17004 12.5 7.87 0.0 0.524 6.004 85.9 6.5921 5.0 311.0 15.2 386.71 17.10 18.9
10 0.22489 12.5 7.87 0.0 0.524 6.377 94.3 6.3467 5.0 311.0 15.2 392.52 20.45 15.0
11 0.11747 12.5 7.87 0.0 0.524 6.009 82.9 6.2267 5.0 311.0 15.2 396.90 13.27 18.9
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
491 0.10574 0.0 27.74 0.0 0.609 5.983 98.8 1.8681 4.0 711.0 20.1 390.11 18.07 13.6
492 0.11132 0.0 27.74 0.0 0.609 5.983 83.5 2.1099 4.0 711.0 20.1 396.90 13.35 20.1
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9

215 rows × 14 columns

# pandas slicing on mult. conditions for specific columns
# target > 30 and age > 75 but only grab the target and age columns


# method 1
# df[(df["target"] > 30) & (df["AGE"] > 75)][["target", "AGE"]]

# method 2
# df.loc[(df["target"] > 30) & (df["AGE"] > 75), ["target", "AGE"]]

# method 3
# df[["AGE", "target"]][(df["target"]>30) & (df["AGE"]>75)]
# pandas slicing on mult. conditions for specific columns
# target > 30 and age > 75 but only grab the CRIM


df.loc[(df["target"]>30) & (df["AGE"]>75), ["CRIM"]]
CRIM
97 0.12083
157 1.22358
161 1.46336
162 1.83377
163 1.51902
166 2.01019
180 0.06588
182 0.09103
183 0.10008
223 0.61470
224 0.31533
225 0.52693
226 0.38214
227 0.41238
231 0.46296
257 0.61154
258 0.66351
259 0.65665
260 0.54011
261 0.53412
262 0.52014
263 0.82526
264 0.55007
266 0.78570
368 4.89822
369 5.66998
370 6.53876
371 9.23230
372 8.26725
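
Notice the list brackets in `["CRIM"]` above: passing a list of column names returns a DataFrame, even for one column, while passing a bare string returns a Series. A small sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"CRIM": [0.12, 1.22], "AGE": [80.0, 90.0]})
mask = toy["AGE"] > 75

as_frame = toy.loc[mask, ["CRIM"]]  # list of names -> DataFrame with one column
as_series = toy.loc[mask, "CRIM"]   # single name -> Series

print(type(as_frame).__name__, type(as_series).__name__)
```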
# Plot a scattermatrix of your dataframe

pd.plotting.scatter_matrix(df, figsize=(20, 20), grid=True, hist_kwds={"bins": 20, "color":"purple"})
plt.show()

[Figure: scatter matrix of the dataframe]

# Make a function that plots a specific column of the dataframe as a histogram
# Make the color of it purple by default, alpha value should be 0.8 by default
# Make a parameter to toggle a grid
# Make parameters for axis labels and the title of the histogram
# Make a parameter for number of bins and default it to 20
# call the function `plot_histogram`

def plot_histogram(df, column, bins=20, color="purple", alpha=0.8, grid=True, title=None, xlabel=None, ylabel="counts"):
    plt.figure(figsize=(8, 5))
    if grid:
        plt.grid(zorder=0)
    plt.hist(x=df[column], bins=bins, color=color, alpha=alpha, zorder=2)
    if title is None:
        title = f"Histogram for {column}"
    plt.title(title)
    if xlabel is None:
        xlabel = column
    plt.xlabel(xlabel.lower())
    plt.ylabel(ylabel)
    plt.show()
plot_histogram(df, "AGE")

[Figure: histogram of AGE]

# Plot a hexbin plot of AGE vs CRIM colored by target values

df.plot.hexbin(x='AGE', y='CRIM', C='target', gridsize=20, figsize=(8, 5))
plt.show()

[Figure: hexbin plot of AGE vs CRIM colored by target]

df.corr()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target room_age
CRIM 1.000000 -0.200469 0.406583 -0.055892 0.420972 -0.219247 0.352734 -0.379670 0.625505 0.582764 0.289946 -0.385064 0.455621 -0.388305 0.349253
ZN -0.200469 1.000000 -0.533828 -0.042697 -0.516604 0.311991 -0.569537 0.664408 -0.311948 -0.314563 -0.391679 0.175520 -0.412995 0.360445 -0.564971
INDUS 0.406583 -0.533828 1.000000 0.062938 0.763651 -0.391676 0.644779 -0.708027 0.595129 0.720760 0.383248 -0.356977 0.603800 -0.483725 0.638643
CHAS -0.055892 -0.042697 0.062938 1.000000 0.091203 0.091251 0.086518 -0.099176 -0.007368 -0.035587 -0.121515 0.048788 -0.053929 0.175260 0.089305
NOX 0.420972 -0.516604 0.763651 0.091203 1.000000 -0.302188 0.731470 -0.769230 0.611441 0.668023 0.188933 -0.380051 0.590879 -0.427321 0.728079
RM -0.219247 0.311991 -0.391676 0.091251 -0.302188 1.000000 -0.240265 0.205246 -0.209847 -0.292048 -0.355501 0.128069 -0.613808 0.695360 -0.216539
AGE 0.352734 -0.569537 0.644779 0.086518 0.731470 -0.240265 1.000000 -0.747881 0.456022 0.506456 0.261515 -0.273534 0.602339 -0.376955 0.999703
DIS -0.379670 0.664408 -0.708027 -0.099176 -0.769230 0.205246 -0.747881 1.000000 -0.494588 -0.534432 -0.232471 0.291512 -0.496996 0.249929 -0.747017
RAD 0.625505 -0.311948 0.595129 -0.007368 0.611441 -0.209847 0.456022 -0.494588 1.000000 0.910228 0.464741 -0.444413 0.488676 -0.381626 0.453370
TAX 0.582764 -0.314563 0.720760 -0.035587 0.668023 -0.292048 0.506456 -0.534432 0.910228 1.000000 0.460853 -0.441808 0.543993 -0.468536 0.502028
PTRATIO 0.289946 -0.391679 0.383248 -0.121515 0.188933 -0.355501 0.261515 -0.232471 0.464741 0.460853 1.000000 -0.177383 0.374044 -0.507787 0.254090
B -0.385064 0.175520 -0.356977 0.048788 -0.380051 0.128069 -0.273534 0.291512 -0.444413 -0.441808 -0.177383 1.000000 -0.366087 0.333461 -0.271888
LSTAT 0.455621 -0.412995 0.603800 -0.053929 0.590879 -0.613808 0.602339 -0.496996 0.488676 0.543993 0.374044 -0.366087 1.000000 -0.737663 0.590384
target -0.388305 0.360445 -0.483725 0.175260 -0.427321 0.695360 -0.376955 0.249929 -0.381626 -0.468536 -0.507787 0.333461 -0.737663 1.000000 -0.361660
room_age 0.349253 -0.564971 0.638643 0.089305 0.728079 -0.216539 0.999703 -0.747017 0.453370 0.502028 0.254090 -0.271888 0.590384 -0.361660 1.000000
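
One thing the table shows is that AGE and room_age are almost perfectly correlated (0.9997). That makes sense: room_age is RM + AGE, and AGE varies far more than RM, so the sum mostly tracks AGE. A sketch that reproduces the effect using the first five rows shown earlier (the frame name `toy` is made up):

```python
import numpy as np
import pandas as pd

# RM and AGE values from the first five rows of the dataframe above
toy = pd.DataFrame(
    {"RM": [6.575, 6.421, 7.185, 6.998, 7.147], "AGE": [65.2, 78.9, 61.1, 45.8, 54.2]}
)
toy["room_age"] = toy["RM"] + toy["AGE"]

corr = toy.corr()  # pairwise Pearson correlation between the numeric columns
r = corr.loc["AGE", "room_age"]

# cross-check the same number with NumPy
r_numpy = np.corrcoef(toy["AGE"], toy["room_age"])[0, 1]

print(corr)
print(r, r_numpy)
```

A derived column like this adds almost no new information, which is worth keeping in mind before feeding it to a model alongside AGE.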
# Plot a correlation heatmap using the `.corr()` method and seaborn's heatmap
# Annotate your heatmap using 2 decimal places
# Use the 'Blues' color scheme for your heatmap

corr = df.corr()  # on pandas 2.0 or later, use df.corr(numeric_only=True) to skip the string column

plt.figure(figsize=(20, 20))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='Blues')
plt.show()

[Figure: correlation heatmap]

New Image

demo_df = pd.read_csv("demo.csv")
demo_df.head()
column1 column2 column3 column4
0 0 1 2 'this'
1 1 2 3 'that'
2 3 4 5 'the other'
3 6 7 8 'the other other'
4 9 10 11 'ant man'
demo_df[" column4"].str.title().str.swapcase()
0                'tHIS'
1                'tHAT'
2           'tHE oTHER'
3     'tHE oTHER oTHER'
4             'aNT mAN'
5                'hULK'
6       'sCARLET wITCH'
Name:  column4, dtype: object
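
The column name above is `" column4"` with a leading space, and the values still carry their quote characters, both left over from how the CSV was written. A sketch of one way to clean this up, using a made-up two-row frame that mimics the output above:

```python
import pandas as pd

# made-up rows mimicking the column shown above
demo = pd.DataFrame({" column4": ["'this'", "'the other'"]})

# strip whitespace from every column name
demo.columns = demo.columns.str.strip()

# chain string methods: drop the surrounding quotes, then title-case
cleaned = demo["column4"].str.strip("'").str.title()

print(list(demo.columns))
print(cleaned.tolist())
```

Each `.str` call returns a new Series, which is why the methods can be chained one after another.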
## let's make a new column using a lambda function
# note: x//1 is floor division, so values are rounded down (6.575 becomes 6.0)

df["rooms_rounded"] = df["RM"].apply(lambda x : x//1)

df["rooms_doubled_rounded"] = df["RM"].apply(lambda x : (2*x)//1)


df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target room_categories room_age rooms_rounded rooms_doubled_rounded
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0 medium 71.775 6.0 13.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6 medium 85.321 6.0 12.0
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7 medium 68.285 7.0 14.0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4 medium 52.798 6.0 13.0
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2 medium 61.347 7.0 14.0
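
For simple arithmetic like this, `.apply` with a lambda is not required: pandas operators and NumPy functions work on the whole column at once. A sketch comparing the approaches on the first three RM values from the table above (the variable names are made up):

```python
import numpy as np
import pandas as pd

rm = pd.Series([6.575, 6.421, 7.185])  # first three RM values from the table above

with_apply = rm.apply(lambda x: x // 1)  # the lesson's approach, one value at a time
with_floordiv = rm // 1                  # same operation on the whole column
with_numpy = np.floor(rm)                # NumPy's floor, also on the whole column

doubled = np.floor(2 * rm)

print(with_apply.tolist(), with_floordiv.tolist(), with_numpy.tolist())
print(doubled.tolist())
```

`.apply` is still the right tool when the logic cannot be expressed with built-in column operations.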

Assessment

  • What is the difference between a list object and a numpy.array object?
  • What is the benefit of using numpy vs. writing your own methods?
  • What is the index of a dataframe? What is a rule for the index? What are columns?
  • How do we find the mean of a specific column in a dataframe?
  • Plot a histogram, scatterplot, line plot, hexbin plot, and heatmap

Contributors

erdos2n, erdosn
