Import Pandas

import pandas as pd

Load and Preview the Dataset

df = pd.read_csv('lego_sets.csv')
df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	ages	list_price	num_reviews	piece_count	play_star_rating	prod_desc	prod_id	prod_long_desc	review_difficulty	set_name	star_rating	theme_name	val_star_rating	country
0	6-12	29.99	2.0	277.0	4.0	Catapult into action and take back the eggs fr...	75823.0	Use the staircase catapult to launch Red into ...	Average	Bird Island Egg Heist	4.5	Angry Birds™	4.0	US
1	6-12	19.99	2.0	168.0	4.0	Launch a flying attack and rescue the eggs fro...	75822.0	Pilot Pig has taken off from Bird Island with ...	Easy	Piggy Plane Attack	5.0	Angry Birds™	4.0	US
2	6-12	12.99	11.0	74.0	4.3	Chase the piggy with lightning-fast Chuck and ...	75821.0	Pitch speedy bird Chuck against the Piggy Car....	Easy	Piggy Car Escape	4.3	Angry Birds™	4.1	US
3	12+	99.99	23.0	1032.0	3.6	Explore the architecture of the United States ...	21030.0	Discover the architectural secrets of the icon...	Average	United States Capitol Building	4.6	Architecture	4.3	US
4	12+	79.99	14.0	744.0	3.2	Recreate the Solomon R. Guggenheim Museum® wit...	21035.0	Discover the architectural secrets of Frank Ll...	Challenging	Solomon R. Guggenheim Museum®	4.6	Architecture	4.1	US

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12261 entries, 0 to 12260
Data columns (total 14 columns):
ages                 12261 non-null object
list_price           12261 non-null float64
num_reviews          10641 non-null float64
piece_count          12261 non-null float64
play_star_rating     10486 non-null float64
prod_desc            11884 non-null object
prod_id              12261 non-null float64
prod_long_desc       12261 non-null object
review_difficulty    10206 non-null object
set_name             12261 non-null object
star_rating          10641 non-null float64
theme_name           12258 non-null object
val_star_rating      10466 non-null float64
country              12261 non-null object
dtypes: float64(7), object(7)
memory usage: 1.3+ MB

Feature Engingeering

As we'll see later, we'll often want to create new features for our data sets in order to improve the performance of various machine learning algorithms. Let's practice this with a few examples.

Mean Price by Theme

Let's create a new column that lists the mean price for the theme that that particular lego set is from. This could prove useful for a regression algorithm that we'll be building later!

Here's a general outline: * Calculate average price per theme; use the groupby method, subset to price and calculate the mean * Create a dictionary of {theme : avg_price} * Make the new column; map the dictionary to the original theme_column and save the results to a new column

#Your code here

#Groupby theme_name and calculate average price
grouped = df.groupby('theme_name')['list_price'].mean()#Your code here

#Can be helpful to preview your intermediate transformations
grouped.head()

theme_name
Angry Birds™                  21.021100
Architecture                  65.082371
BOOST                        196.572316
Blue's Helicopter Pursuit     61.934648
BrickHeadz                    14.868018
Name: list_price, dtype: float64

#Your code here

#Create Dictionary
theme_price_dict = dict(grouped)
#Create new column with dictionary
df['Theme_Avg_Price'] = df.theme_name.map(theme_price_dict)
df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	ages	list_price	num_reviews	piece_count	play_star_rating	prod_desc	prod_id	prod_long_desc	review_difficulty	set_name	star_rating	theme_name	val_star_rating	country	Theme_Avg_Price
0	6-12	29.99	2.0	277.0	4.0	Catapult into action and take back the eggs fr...	75823.0	Use the staircase catapult to launch Red into ...	Average	Bird Island Egg Heist	4.5	Angry Birds™	4.0	US	21.021100
1	6-12	19.99	2.0	168.0	4.0	Launch a flying attack and rescue the eggs fro...	75822.0	Pilot Pig has taken off from Bird Island with ...	Easy	Piggy Plane Attack	5.0	Angry Birds™	4.0	US	21.021100
2	6-12	12.99	11.0	74.0	4.3	Chase the piggy with lightning-fast Chuck and ...	75821.0	Pitch speedy bird Chuck against the Piggy Car....	Easy	Piggy Car Escape	4.3	Angry Birds™	4.1	US	21.021100
3	12+	99.99	23.0	1032.0	3.6	Explore the architecture of the United States ...	21030.0	Discover the architectural secrets of the icon...	Average	United States Capitol Building	4.6	Architecture	4.3	US	65.082371
4	12+	79.99	14.0	744.0	3.2	Recreate the Solomon R. Guggenheim Museum® wit...	21035.0	Discover the architectural secrets of Frank Ll...	Challenging	Solomon R. Guggenheim Museum®	4.6	Architecture	4.1	US	65.082371

Extending Code with for Loops

Expand upon our previous example by writing a function that takes in a column to group by and a column to take the average of (in our previous example, theme_name and list_price) and creates a new column to our dataframe corresponding to the average value for the category to which that feature corresponds.

def avg_feat(cfeat, nfeat):
    new_col = '{}_Avg_{}'.format(cfeat, nfeat)
    grouped = df.groupby(cfeat)[nfeat].mean()
    df[new_col] = df[cfeat].map(dict(grouped))

Applying your function

Now write a for loop that iterates over several category columns and several numerical columns, and apply your above function to create a new column of the average values for the categorical feature.

df.columns

Index(['ages', 'list_price', 'num_reviews', 'piece_count', 'play_star_rating',
       'prod_desc', 'prod_id', 'prod_long_desc', 'review_difficulty',
       'set_name', 'star_rating', 'theme_name', 'val_star_rating', 'country',
       'Theme_Avg_Price'],
      dtype='object')

cat_feats = ['ages', 'review_difficulty', 'country']
num_feats = ['list_price', 'num_reviews', 'piece_count', 'play_star_rating', 'star_rating', 'val_star_rating']
for cfeat in cat_feats:
    for nfeat in num_feats:
        avg_feat(cfeat, nfeat)
print(df.columns)
df.head(2)

Index(['ages', 'list_price', 'num_reviews', 'piece_count', 'play_star_rating',
       'prod_desc', 'prod_id', 'prod_long_desc', 'review_difficulty',
       'set_name', 'star_rating', 'theme_name', 'val_star_rating', 'country',
       'Theme_Avg_Price', 'ages_Avg_list_price', 'ages_Avg_num_reviews',
       'ages_Avg_piece_count', 'ages_Avg_play_star_rating',
       'ages_Avg_star_rating', 'ages_Avg_val_star_rating',
       'review_difficulty_Avg_list_price', 'review_difficulty_Avg_num_reviews',
       'review_difficulty_Avg_piece_count',
       'review_difficulty_Avg_play_star_rating',
       'review_difficulty_Avg_star_rating',
       'review_difficulty_Avg_val_star_rating', 'country_Avg_list_price',
       'country_Avg_num_reviews', 'country_Avg_piece_count',
       'country_Avg_play_star_rating', 'country_Avg_star_rating',
       'country_Avg_val_star_rating', 'Has_Trademark', 'Has_Registered'],
      dtype='object')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	ages	list_price	num_reviews	piece_count	play_star_rating	prod_desc	prod_id	prod_long_desc	review_difficulty	set_name	...	review_difficulty_Avg_star_rating	review_difficulty_Avg_val_star_rating	country_Avg_list_price	country_Avg_num_reviews	country_Avg_piece_count	country_Avg_play_star_rating	country_Avg_star_rating	country_Avg_val_star_rating	Has_Trademark	Has_Registered
0	6-12	29.99	2.0	277.0	4.0	Catapult into action and take back the eggs fr...	75823.0	Use the staircase catapult to launch Red into ...	Average	Bird Island Egg Heist	...	4.529934	4.232244	47.252546	14.564673	421.816401	4.320282	4.507081	4.238505	True	False
1	6-12	19.99	2.0	168.0	4.0	Launch a flying attack and rescue the eggs fro...	75822.0	Pilot Pig has taken off from Bird Island with ...	Easy	Piggy Plane Attack	...	4.490274	4.235066	47.252546	14.564673	421.816401	4.320282	4.507081	4.238505	True	False

2 rows × 35 columns

Feature Engineering 2

Create a new column for whether or not the theme name contains a trademark (TM) designation.

df.theme_name.value_counts(normalize=True)[:5]

Star Wars™                   0.112335
DUPLO®                       0.095122
City                         0.089085
Juniors                      0.079785
THE LEGO® NINJAGO® MOVIE™    0.064937
Name: theme_name, dtype: float64

#Your code here
df['Has_Trademark'] = df.theme_name.str.contains('™')
df.Has_Trademark.value_counts(normalize=True)

False    0.721406
True     0.278594
Name: Has_Trademark, dtype: float64

Feature Engineering 3

Create a new column for whether or not the set name contains a registered (R) designation.

#Your code here
df['Has_Registered'] = df.theme_name.str.contains('®')
df.Has_Registered.value_counts(normalize=True)

False    0.744657
True     0.255343
Name: Has_Registered, dtype: float64

Dealing with Null Values

In future algorithms and applications, having null values can be problematic. Due to this, dealing with null values is a common problem in data science. Below are a few options at your disposal.

Subsetting the DataFrame

One option for dealing with null values is simply subseting your data to rows without missing values. You can subset a dataframe according to a criterion like this:

subset = df[df.theme_name=='Angry Birds™']
subset.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	ages	list_price	num_reviews	piece_count	play_star_rating	prod_desc	prod_id	prod_long_desc	review_difficulty	set_name	star_rating	theme_name	val_star_rating	country	Theme_Avg_Price
0	6-12	29.9900	2.0	277.0	4.0	Catapult into action and take back the eggs fr...	75823.0	Use the staircase catapult to launch Red into ...	Average	Bird Island Egg Heist	4.5	Angry Birds™	4.0	US	21.0211
1	6-12	19.9900	2.0	168.0	4.0	Launch a flying attack and rescue the eggs fro...	75822.0	Pilot Pig has taken off from Bird Island with ...	Easy	Piggy Plane Attack	5.0	Angry Birds™	4.0	US	21.0211
2	6-12	12.9900	11.0	74.0	4.3	Chase the piggy with lightning-fast Chuck and ...	75821.0	Pitch speedy bird Chuck against the Piggy Car....	Easy	Piggy Car Escape	4.3	Angry Birds™	4.1	US	21.0211
2528	6-12	31.1922	2.0	277.0	4.0	Catapult into action and take back the eggs fr...	75823.0	Use the staircase catapult to launch Red into ...	Average	Bird Island Egg Heist	4.5	Angry Birds™	4.0	CA	21.0211
2529	6-12	19.4922	2.0	168.0	4.0	Launch a flying attack and rescue the eggs fro...	75822.0	Pilot Pig has taken off from Bird Island with ...	Easy	Piggy Plane Attack	5.0	Angry Birds™	4.0	CA	21.0211

You can then chain the .isnull() method along with the ~ which negates an expression to remove null values. For example:

print(len(df))
populated = df[~df.theme_name.isnull()] #The tilde (~) negates the conditional, turning all True values False and vice versa
print(len(populated))

12261
12258

Removing Null Values

Practice subsetting the dataframe by removing all entries where the star_rating is not populated.

#Your code here
subset = df[~df.star_rating.isnull()]

Imputing Missing Values

Another option for corraling data with missing values is to impute an average (or other) value. For example, rather then dropping all rows where there is no star_rating, we could impute value such as the average star_rating for all sets, or the average star_rating for similar sets. You'll practice an initial example of that here.

Update the Star Rating Column

Update the star_rating column for those entries where there is no value. Do this by filling in the average value.

#Your code here
avg = df.star_rating.mean()
print(avg)
df.star_rating = df.star_rating.fillna(value=avg)

4.514134009961459

learn-co-students / ds-skills-cleaning-transforming-nyc-ds-091018 Goto Github PK

ds-skills-cleaning-transforming-nyc-ds-091018's Introduction

Import Pandas

Load and Preview the Dataset

Feature Engingeering

Mean Price by Theme

Extending Code with for Loops

Applying your function

Feature Engineering 2

Feature Engineering 3

Dealing with Null Values

Subsetting the DataFrame

Removing Null Values

Imputing Missing Values

Update the Star Rating Column

ds-skills-cleaning-transforming-nyc-ds-091018's People

Contributors

Watchers

Forkers

Recommend Projects

Recommend Topics

Recommend Org