import pandas as pd
df = pd.read_csv('lego_sets.csv')
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
ages | list_price | num_reviews | piece_count | play_star_rating | prod_desc | prod_id | prod_long_desc | review_difficulty | set_name | star_rating | theme_name | val_star_rating | country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6-12 | 29.99 | 2.0 | 277.0 | 4.0 | Catapult into action and take back the eggs fr... | 75823.0 | Use the staircase catapult to launch Red into ... | Average | Bird Island Egg Heist | 4.5 | Angry Birds™ | 4.0 | US |
1 | 6-12 | 19.99 | 2.0 | 168.0 | 4.0 | Launch a flying attack and rescue the eggs fro... | 75822.0 | Pilot Pig has taken off from Bird Island with ... | Easy | Piggy Plane Attack | 5.0 | Angry Birds™ | 4.0 | US |
2 | 6-12 | 12.99 | 11.0 | 74.0 | 4.3 | Chase the piggy with lightning-fast Chuck and ... | 75821.0 | Pitch speedy bird Chuck against the Piggy Car.... | Easy | Piggy Car Escape | 4.3 | Angry Birds™ | 4.1 | US |
3 | 12+ | 99.99 | 23.0 | 1032.0 | 3.6 | Explore the architecture of the United States ... | 21030.0 | Discover the architectural secrets of the icon... | Average | United States Capitol Building | 4.6 | Architecture | 4.3 | US |
4 | 12+ | 79.99 | 14.0 | 744.0 | 3.2 | Recreate the Solomon R. Guggenheim Museum® wit... | 21035.0 | Discover the architectural secrets of Frank Ll... | Challenging | Solomon R. Guggenheim Museum® | 4.6 | Architecture | 4.1 | US |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12261 entries, 0 to 12260
Data columns (total 14 columns):
ages 12261 non-null object
list_price 12261 non-null float64
num_reviews 10641 non-null float64
piece_count 12261 non-null float64
play_star_rating 10486 non-null float64
prod_desc 11884 non-null object
prod_id 12261 non-null float64
prod_long_desc 12261 non-null object
review_difficulty 10206 non-null object
set_name 12261 non-null object
star_rating 10641 non-null float64
theme_name 12258 non-null object
val_star_rating 10466 non-null float64
country 12261 non-null object
dtypes: float64(7), object(7)
memory usage: 1.3+ MB
As we'll see later, we'll often want to create new features for our data sets in order to improve the performance of various machine learning algorithms. Let's practice this with a few examples.
Let's create a new column that lists the mean price for the theme that that particular lego set is from. This could prove useful for a regression algorithm that we'll be building later!
Here's a general outline:
* Calculate average price per theme; use the groupby method, subset to price and calculate the mean
* Create a dictionary of {theme : avg_price}
* Make the new column; map the dictionary to the original theme_column and save the results to a new column
#Your code here
#Groupby theme_name and calculate average price
grouped = df.groupby('theme_name')['list_price'].mean()#Your code here
#Can be helpful to preview your intermediate transformations
grouped.head()
theme_name
Angry Birds™ 21.021100
Architecture 65.082371
BOOST 196.572316
Blue's Helicopter Pursuit 61.934648
BrickHeadz 14.868018
Name: list_price, dtype: float64
#Your code here
#Create Dictionary
theme_price_dict = dict(grouped)
#Create new column with dictionary
df['Theme_Avg_Price'] = df.theme_name.map(theme_price_dict)
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
ages | list_price | num_reviews | piece_count | play_star_rating | prod_desc | prod_id | prod_long_desc | review_difficulty | set_name | star_rating | theme_name | val_star_rating | country | Theme_Avg_Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6-12 | 29.99 | 2.0 | 277.0 | 4.0 | Catapult into action and take back the eggs fr... | 75823.0 | Use the staircase catapult to launch Red into ... | Average | Bird Island Egg Heist | 4.5 | Angry Birds™ | 4.0 | US | 21.021100 |
1 | 6-12 | 19.99 | 2.0 | 168.0 | 4.0 | Launch a flying attack and rescue the eggs fro... | 75822.0 | Pilot Pig has taken off from Bird Island with ... | Easy | Piggy Plane Attack | 5.0 | Angry Birds™ | 4.0 | US | 21.021100 |
2 | 6-12 | 12.99 | 11.0 | 74.0 | 4.3 | Chase the piggy with lightning-fast Chuck and ... | 75821.0 | Pitch speedy bird Chuck against the Piggy Car.... | Easy | Piggy Car Escape | 4.3 | Angry Birds™ | 4.1 | US | 21.021100 |
3 | 12+ | 99.99 | 23.0 | 1032.0 | 3.6 | Explore the architecture of the United States ... | 21030.0 | Discover the architectural secrets of the icon... | Average | United States Capitol Building | 4.6 | Architecture | 4.3 | US | 65.082371 |
4 | 12+ | 79.99 | 14.0 | 744.0 | 3.2 | Recreate the Solomon R. Guggenheim Museum® wit... | 21035.0 | Discover the architectural secrets of Frank Ll... | Challenging | Solomon R. Guggenheim Museum® | 4.6 | Architecture | 4.1 | US | 65.082371 |
Expand upon our previous example by writing a function that takes in a column to group by and a column to take the average of (in our previous example, theme_name and list_price) and creates a new column to our dataframe corresponding to the average value for the category to which that feature corresponds.
def avg_feat(cfeat, nfeat):
new_col = '{}_Avg_{}'.format(cfeat, nfeat)
grouped = df.groupby(cfeat)[nfeat].mean()
df[new_col] = df[cfeat].map(dict(grouped))
Now write a for loop that iterates over several category columns and several numerical columns, and apply your above function to create a new column of the average values for the categorical feature.
df.columns
Index(['ages', 'list_price', 'num_reviews', 'piece_count', 'play_star_rating',
'prod_desc', 'prod_id', 'prod_long_desc', 'review_difficulty',
'set_name', 'star_rating', 'theme_name', 'val_star_rating', 'country',
'Theme_Avg_Price'],
dtype='object')
cat_feats = ['ages', 'review_difficulty', 'country']
num_feats = ['list_price', 'num_reviews', 'piece_count', 'play_star_rating', 'star_rating', 'val_star_rating']
for cfeat in cat_feats:
for nfeat in num_feats:
avg_feat(cfeat, nfeat)
print(df.columns)
df.head(2)
Index(['ages', 'list_price', 'num_reviews', 'piece_count', 'play_star_rating',
'prod_desc', 'prod_id', 'prod_long_desc', 'review_difficulty',
'set_name', 'star_rating', 'theme_name', 'val_star_rating', 'country',
'Theme_Avg_Price', 'ages_Avg_list_price', 'ages_Avg_num_reviews',
'ages_Avg_piece_count', 'ages_Avg_play_star_rating',
'ages_Avg_star_rating', 'ages_Avg_val_star_rating',
'review_difficulty_Avg_list_price', 'review_difficulty_Avg_num_reviews',
'review_difficulty_Avg_piece_count',
'review_difficulty_Avg_play_star_rating',
'review_difficulty_Avg_star_rating',
'review_difficulty_Avg_val_star_rating', 'country_Avg_list_price',
'country_Avg_num_reviews', 'country_Avg_piece_count',
'country_Avg_play_star_rating', 'country_Avg_star_rating',
'country_Avg_val_star_rating', 'Has_Trademark', 'Has_Registered'],
dtype='object')
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
ages | list_price | num_reviews | piece_count | play_star_rating | prod_desc | prod_id | prod_long_desc | review_difficulty | set_name | ... | review_difficulty_Avg_star_rating | review_difficulty_Avg_val_star_rating | country_Avg_list_price | country_Avg_num_reviews | country_Avg_piece_count | country_Avg_play_star_rating | country_Avg_star_rating | country_Avg_val_star_rating | Has_Trademark | Has_Registered | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6-12 | 29.99 | 2.0 | 277.0 | 4.0 | Catapult into action and take back the eggs fr... | 75823.0 | Use the staircase catapult to launch Red into ... | Average | Bird Island Egg Heist | ... | 4.529934 | 4.232244 | 47.252546 | 14.564673 | 421.816401 | 4.320282 | 4.507081 | 4.238505 | True | False |
1 | 6-12 | 19.99 | 2.0 | 168.0 | 4.0 | Launch a flying attack and rescue the eggs fro... | 75822.0 | Pilot Pig has taken off from Bird Island with ... | Easy | Piggy Plane Attack | ... | 4.490274 | 4.235066 | 47.252546 | 14.564673 | 421.816401 | 4.320282 | 4.507081 | 4.238505 | True | False |
2 rows × 35 columns
Create a new column for whether or not the theme name contains a trademark (TM) designation.
df.theme_name.value_counts(normalize=True)[:5]
Star Wars™ 0.112335
DUPLO® 0.095122
City 0.089085
Juniors 0.079785
THE LEGO® NINJAGO® MOVIE™ 0.064937
Name: theme_name, dtype: float64
#Your code here
df['Has_Trademark'] = df.theme_name.str.contains('™')
df.Has_Trademark.value_counts(normalize=True)
False 0.721406
True 0.278594
Name: Has_Trademark, dtype: float64
Create a new column for whether or not the set name contains a registered (R) designation.
#Your code here
df['Has_Registered'] = df.theme_name.str.contains('®')
df.Has_Registered.value_counts(normalize=True)
False 0.744657
True 0.255343
Name: Has_Registered, dtype: float64
In future algorithms and applications, having null values can be problematic. Due to this, dealing with null values is a common problem in data science. Below are a few options at your disposal.
One option for dealing with null values is simply subseting your data to rows without missing values. You can subset a dataframe according to a criterion like this:
subset = df[df.theme_name=='Angry Birds™']
subset.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
ages | list_price | num_reviews | piece_count | play_star_rating | prod_desc | prod_id | prod_long_desc | review_difficulty | set_name | star_rating | theme_name | val_star_rating | country | Theme_Avg_Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6-12 | 29.9900 | 2.0 | 277.0 | 4.0 | Catapult into action and take back the eggs fr... | 75823.0 | Use the staircase catapult to launch Red into ... | Average | Bird Island Egg Heist | 4.5 | Angry Birds™ | 4.0 | US | 21.0211 |
1 | 6-12 | 19.9900 | 2.0 | 168.0 | 4.0 | Launch a flying attack and rescue the eggs fro... | 75822.0 | Pilot Pig has taken off from Bird Island with ... | Easy | Piggy Plane Attack | 5.0 | Angry Birds™ | 4.0 | US | 21.0211 |
2 | 6-12 | 12.9900 | 11.0 | 74.0 | 4.3 | Chase the piggy with lightning-fast Chuck and ... | 75821.0 | Pitch speedy bird Chuck against the Piggy Car.... | Easy | Piggy Car Escape | 4.3 | Angry Birds™ | 4.1 | US | 21.0211 |
2528 | 6-12 | 31.1922 | 2.0 | 277.0 | 4.0 | Catapult into action and take back the eggs fr... | 75823.0 | Use the staircase catapult to launch Red into ... | Average | Bird Island Egg Heist | 4.5 | Angry Birds™ | 4.0 | CA | 21.0211 |
2529 | 6-12 | 19.4922 | 2.0 | 168.0 | 4.0 | Launch a flying attack and rescue the eggs fro... | 75822.0 | Pilot Pig has taken off from Bird Island with ... | Easy | Piggy Plane Attack | 5.0 | Angry Birds™ | 4.0 | CA | 21.0211 |
You can then chain the .isnull()
method along with the ~
which negates an expression to remove null values. For example:
print(len(df))
populated = df[~df.theme_name.isnull()] #The tilde (~) negates the conditional, turning all True values False and vice versa
print(len(populated))
12261
12258
Practice subsetting the dataframe by removing all entries where the star_rating is not populated.
#Your code here
subset = df[~df.star_rating.isnull()]
Another option for corraling data with missing values is to impute an average (or other) value. For example, rather then dropping all rows where there is no star_rating, we could impute value such as the average star_rating for all sets, or the average star_rating for similar sets. You'll practice an initial example of that here.
Update the star_rating column for those entries where there is no value. Do this by filling in the average value.
#Your code here
avg = df.star_rating.mean()
print(avg)
df.star_rating = df.star_rating.fillna(value=avg)
4.514134009961459