
ds-skills-cleaning-transforming-nyc-ds-091018's Introduction

Import Pandas

import pandas as pd

Load and Preview the Dataset

df = pd.read_csv('lego_sets.csv')
df.head()
ages list_price num_reviews piece_count play_star_rating prod_desc prod_id prod_long_desc review_difficulty set_name star_rating theme_name val_star_rating country
0 6-12 29.99 2.0 277.0 4.0 Catapult into action and take back the eggs fr... 75823.0 Use the staircase catapult to launch Red into ... Average Bird Island Egg Heist 4.5 Angry Birds™ 4.0 US
1 6-12 19.99 2.0 168.0 4.0 Launch a flying attack and rescue the eggs fro... 75822.0 Pilot Pig has taken off from Bird Island with ... Easy Piggy Plane Attack 5.0 Angry Birds™ 4.0 US
2 6-12 12.99 11.0 74.0 4.3 Chase the piggy with lightning-fast Chuck and ... 75821.0 Pitch speedy bird Chuck against the Piggy Car.... Easy Piggy Car Escape 4.3 Angry Birds™ 4.1 US
3 12+ 99.99 23.0 1032.0 3.6 Explore the architecture of the United States ... 21030.0 Discover the architectural secrets of the icon... Average United States Capitol Building 4.6 Architecture 4.3 US
4 12+ 79.99 14.0 744.0 3.2 Recreate the Solomon R. Guggenheim Museum® wit... 21035.0 Discover the architectural secrets of Frank Ll... Challenging Solomon R. Guggenheim Museum® 4.6 Architecture 4.1 US
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12261 entries, 0 to 12260
Data columns (total 14 columns):
ages                 12261 non-null object
list_price           12261 non-null float64
num_reviews          10641 non-null float64
piece_count          12261 non-null float64
play_star_rating     10486 non-null float64
prod_desc            11884 non-null object
prod_id              12261 non-null float64
prod_long_desc       12261 non-null object
review_difficulty    10206 non-null object
set_name             12261 non-null object
star_rating          10641 non-null float64
theme_name           12258 non-null object
val_star_rating      10466 non-null float64
country              12261 non-null object
dtypes: float64(7), object(7)
memory usage: 1.3+ MB

Feature Engineering

As we'll see later, we'll often want to create new features for our data sets in order to improve the performance of various machine learning algorithms. Let's practice this with a few examples.

Mean Price by Theme

Let's create a new column that lists, for each LEGO set, the mean price of the theme it belongs to. This could prove useful for a regression algorithm that we'll be building later!

Here's a general outline:

* Calculate the average price per theme: use the groupby method, subset to list_price, and calculate the mean
* Create a dictionary of {theme : avg_price}
* Make the new column: map the dictionary onto the original theme_name column and save the results to a new column

#Your code here

#Group by theme_name and calculate the average list price
grouped = df.groupby('theme_name')['list_price'].mean()

#Can be helpful to preview your intermediate transformations
grouped.head()
theme_name
Angry Birds™                  21.021100
Architecture                  65.082371
BOOST                        196.572316
Blue's Helicopter Pursuit     61.934648
BrickHeadz                    14.868018
Name: list_price, dtype: float64
#Your code here

#Create Dictionary
theme_price_dict = dict(grouped)
#Create new column with dictionary
df['Theme_Avg_Price'] = df.theme_name.map(theme_price_dict)
df.head()
ages list_price num_reviews piece_count play_star_rating prod_desc prod_id prod_long_desc review_difficulty set_name star_rating theme_name val_star_rating country Theme_Avg_Price
0 6-12 29.99 2.0 277.0 4.0 Catapult into action and take back the eggs fr... 75823.0 Use the staircase catapult to launch Red into ... Average Bird Island Egg Heist 4.5 Angry Birds™ 4.0 US 21.021100
1 6-12 19.99 2.0 168.0 4.0 Launch a flying attack and rescue the eggs fro... 75822.0 Pilot Pig has taken off from Bird Island with ... Easy Piggy Plane Attack 5.0 Angry Birds™ 4.0 US 21.021100
2 6-12 12.99 11.0 74.0 4.3 Chase the piggy with lightning-fast Chuck and ... 75821.0 Pitch speedy bird Chuck against the Piggy Car.... Easy Piggy Car Escape 4.3 Angry Birds™ 4.1 US 21.021100
3 12+ 99.99 23.0 1032.0 3.6 Explore the architecture of the United States ... 21030.0 Discover the architectural secrets of the icon... Average United States Capitol Building 4.6 Architecture 4.3 US 65.082371
4 12+ 79.99 14.0 744.0 3.2 Recreate the Solomon R. Guggenheim Museum® wit... 21035.0 Discover the architectural secrets of Frank Ll... Challenging Solomon R. Guggenheim Museum® 4.6 Architecture 4.1 US 65.082371
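
As an alternative to the dictionary-and-map steps above, the same kind of column can be produced in one step with groupby(...).transform('mean'). The sketch below is only illustrative: the toy frame and its values are hypothetical stand-ins for lego_sets.csv, not rows from the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for lego_sets.csv
toy = pd.DataFrame({
    "theme_name": ["Architecture", "Architecture", "City", "City"],
    "list_price": [100.0, 80.0, 10.0, 30.0],
})

# transform('mean') returns one value per row, aligned to the original index,
# so the group means can be assigned directly without a lookup dictionary
toy["Theme_Avg_Price"] = toy.groupby("theme_name")["list_price"].transform("mean")

print(toy)
```

Both approaches should give the same result on rows with a populated theme_name; the dictionary version makes the intermediate lookup table easier to inspect, which can help while learning.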

Extending Code with for Loops

Expand on the previous example by writing a function that takes a column to group by and a column to average (theme_name and list_price in the previous example) and adds a new column to the dataframe holding, for each row, the average of that value within the row's category.

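def avg_feat(cfeat, nfeat):
    #Adds a column to the global df in place; returns nothing
    new_col = '{}_Avg_{}'.format(cfeat, nfeat) #e.g. 'ages_Avg_list_price'
    grouped = df.groupby(cfeat)[nfeat].mean() #Mean of nfeat for each category in cfeat
    df[new_col] = df[cfeat].map(dict(grouped)) #Look up each row's category mean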
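
The function above modifies the notebook's global df. A variant that takes the dataframe as an argument and returns a new one can be easier to reuse and test. This is a sketch under that assumption: the name add_group_mean and the toy data are hypothetical and not part of the original exercise.

```python
import pandas as pd

def add_group_mean(frame, cfeat, nfeat):
    """Return a copy of frame with a '<cfeat>_Avg_<nfeat>' column added."""
    out = frame.copy()
    new_col = "{}_Avg_{}".format(cfeat, nfeat)
    # Mean of the numeric column within each category
    means = out.groupby(cfeat)[nfeat].mean()
    # Series.map accepts a Series and looks values up by its index
    out[new_col] = out[cfeat].map(means)
    return out

# Hypothetical toy data
toy = pd.DataFrame({
    "ages": ["6-12", "6-12", "12+"],
    "piece_count": [100.0, 300.0, 1000.0],
})
result = add_group_mean(toy, "ages", "piece_count")
print(result)
```

Because the input is copied, the original frame is left unchanged, at the cost of extra memory on large datasets.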

Applying your function

Now write a for loop that iterates over several categorical columns and several numerical columns, applying the function above to each pair to create a new column of category averages.

df.columns
Index(['ages', 'list_price', 'num_reviews', 'piece_count', 'play_star_rating',
       'prod_desc', 'prod_id', 'prod_long_desc', 'review_difficulty',
       'set_name', 'star_rating', 'theme_name', 'val_star_rating', 'country',
       'Theme_Avg_Price'],
      dtype='object')
cat_feats = ['ages', 'review_difficulty', 'country']
num_feats = ['list_price', 'num_reviews', 'piece_count', 'play_star_rating', 'star_rating', 'val_star_rating']
for cfeat in cat_feats:
    for nfeat in num_feats:
        avg_feat(cfeat, nfeat)
print(df.columns)
df.head(2)
Index(['ages', 'list_price', 'num_reviews', 'piece_count', 'play_star_rating',
       'prod_desc', 'prod_id', 'prod_long_desc', 'review_difficulty',
       'set_name', 'star_rating', 'theme_name', 'val_star_rating', 'country',
       'Theme_Avg_Price', 'ages_Avg_list_price', 'ages_Avg_num_reviews',
       'ages_Avg_piece_count', 'ages_Avg_play_star_rating',
       'ages_Avg_star_rating', 'ages_Avg_val_star_rating',
       'review_difficulty_Avg_list_price', 'review_difficulty_Avg_num_reviews',
       'review_difficulty_Avg_piece_count',
       'review_difficulty_Avg_play_star_rating',
       'review_difficulty_Avg_star_rating',
       'review_difficulty_Avg_val_star_rating', 'country_Avg_list_price',
       'country_Avg_num_reviews', 'country_Avg_piece_count',
       'country_Avg_play_star_rating', 'country_Avg_star_rating',
       'country_Avg_val_star_rating', 'Has_Trademark', 'Has_Registered'],
      dtype='object')
ages list_price num_reviews piece_count play_star_rating prod_desc prod_id prod_long_desc review_difficulty set_name ... review_difficulty_Avg_star_rating review_difficulty_Avg_val_star_rating country_Avg_list_price country_Avg_num_reviews country_Avg_piece_count country_Avg_play_star_rating country_Avg_star_rating country_Avg_val_star_rating Has_Trademark Has_Registered
0 6-12 29.99 2.0 277.0 4.0 Catapult into action and take back the eggs fr... 75823.0 Use the staircase catapult to launch Red into ... Average Bird Island Egg Heist ... 4.529934 4.232244 47.252546 14.564673 421.816401 4.320282 4.507081 4.238505 True False
1 6-12 19.99 2.0 168.0 4.0 Launch a flying attack and rescue the eggs fro... 75822.0 Pilot Pig has taken off from Bird Island with ... Easy Piggy Plane Attack ... 4.490274 4.235066 47.252546 14.564673 421.816401 4.320282 4.507081 4.238505 True False

2 rows × 35 columns

Feature Engineering 2

Create a new column for whether or not the theme name contains a trademark (TM) designation.

df.theme_name.value_counts(normalize=True)[:5]
Star Wars™                   0.112335
DUPLO®                       0.095122
City                         0.089085
Juniors                      0.079785
THE LEGO® NINJAGO® MOVIE™    0.064937
Name: theme_name, dtype: float64
#Your code here
df['Has_Trademark'] = df.theme_name.str.contains('™')
df.Has_Trademark.value_counts(normalize=True)
False    0.721406
True     0.278594
Name: Has_Trademark, dtype: float64
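
One detail worth checking: df.info() above reports 12258 non-null theme_name values out of 12261, so a few themes are missing. How str.contains treats those rows can depend on the pandas version and column dtype, and value_counts skips missing results by default. The sketch below, on hypothetical toy data, uses the na parameter to make the choice explicit.

```python
import pandas as pd

# Hypothetical toy data with one missing theme name
themes = pd.Series(["Star Wars™", "City", None, "DUPLO®"], name="theme_name")

# Without na=..., the result for the missing row may itself be missing
raw = themes.str.contains("™")

# na=False treats a missing theme as "no trademark";
# regex=False matches the symbol as a literal string
has_trademark = themes.str.contains("™", na=False, regex=False)

print(has_trademark)
```

Whether a missing theme should count as False or stay missing is a modeling decision; na=False is just one reasonable default.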

Feature Engineering 3

Create a new column for whether or not the theme name contains a registered (®) designation.

#Your code here
df['Has_Registered'] = df.theme_name.str.contains('®')
df.Has_Registered.value_counts(normalize=True)
False    0.744657
True     0.255343
Name: Has_Registered, dtype: float64

Dealing with Null Values

Null values can cause problems in the algorithms and applications we'll use later, so dealing with them is a common task in data science. Below are a few options at your disposal.

Subsetting the DataFrame

One option for dealing with null values is simply subsetting your data to rows without missing values. You can subset a dataframe according to a criterion like this:

subset = df[df.theme_name=='Angry Birds™']
subset.head()
ages list_price num_reviews piece_count play_star_rating prod_desc prod_id prod_long_desc review_difficulty set_name star_rating theme_name val_star_rating country Theme_Avg_Price
0 6-12 29.9900 2.0 277.0 4.0 Catapult into action and take back the eggs fr... 75823.0 Use the staircase catapult to launch Red into ... Average Bird Island Egg Heist 4.5 Angry Birds™ 4.0 US 21.0211
1 6-12 19.9900 2.0 168.0 4.0 Launch a flying attack and rescue the eggs fro... 75822.0 Pilot Pig has taken off from Bird Island with ... Easy Piggy Plane Attack 5.0 Angry Birds™ 4.0 US 21.0211
2 6-12 12.9900 11.0 74.0 4.3 Chase the piggy with lightning-fast Chuck and ... 75821.0 Pitch speedy bird Chuck against the Piggy Car.... Easy Piggy Car Escape 4.3 Angry Birds™ 4.1 US 21.0211
2528 6-12 31.1922 2.0 277.0 4.0 Catapult into action and take back the eggs fr... 75823.0 Use the staircase catapult to launch Red into ... Average Bird Island Egg Heist 4.5 Angry Birds™ 4.0 CA 21.0211
2529 6-12 19.4922 2.0 168.0 4.0 Launch a flying attack and rescue the eggs fro... 75822.0 Pilot Pig has taken off from Bird Island with ... Easy Piggy Plane Attack 5.0 Angry Birds™ 4.0 CA 21.0211

You can then combine the .isnull() method with the ~ operator, which negates a boolean expression, to remove null values. For example:

print(len(df))
populated = df[~df.theme_name.isnull()] #The tilde (~) negates the conditional, turning all True values False and vice versa
print(len(populated))
12261
12258
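
pandas also offers more direct spellings of the same filter. The sketch below compares three equivalent forms on a small hypothetical frame (the data is made up for illustration).

```python
import pandas as pd

# Hypothetical toy data with one missing theme name
toy = pd.DataFrame({
    "theme_name": ["City", None, "Juniors"],
    "star_rating": [4.5, 4.0, None],
})

negated = toy[~toy.theme_name.isnull()]        # negate the null mask
not_null = toy[toy.theme_name.notna()]         # ask for non-null rows directly
dropped = toy.dropna(subset=["theme_name"])    # drop rows null in the listed columns

print(len(toy), len(negated), len(not_null), len(dropped))
```

All three keep the original index labels, so the surviving rows can still be matched back to the full dataframe.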

Removing Null Values

Practice subsetting the dataframe by removing all entries where the star_rating is not populated.

#Your code here
subset = df[~df.star_rating.isnull()]

Imputing Missing Values

Another option for corralling data with missing values is to impute an average (or other) value. For example, rather than dropping all rows where there is no star_rating, we could impute a value such as the average star_rating for all sets, or the average star_rating for similar sets. You'll practice an initial example of that here.

Update the Star Rating Column

Update the star_rating column for those entries where there is no value. Do this by filling in the average value.

#Your code here
avg = df.star_rating.mean()
print(avg)
df.star_rating = df.star_rating.fillna(value=avg)
4.514134009961459
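
The "similar sets" idea mentioned earlier can be sketched by imputing the average within each theme instead of the overall average. This is an illustration on hypothetical toy data, and treating theme_name as the measure of similarity is an assumption rather than part of the original exercise.

```python
import pandas as pd

# Hypothetical toy data: two themes, each with one missing rating
toy = pd.DataFrame({
    "theme_name": ["City", "City", "City", "Architecture", "Architecture"],
    "star_rating": [4.0, 5.0, None, 3.0, None],
})

# Per-row mean rating of the row's theme (missing ratings are ignored by mean)
theme_mean = toy.groupby("theme_name")["star_rating"].transform("mean")

# Fill each gap with its theme's mean
toy["star_rating_imputed"] = toy["star_rating"].fillna(theme_mean)

# Fall back to the overall mean for any theme that has no ratings at all
toy["star_rating_imputed"] = toy["star_rating_imputed"].fillna(toy["star_rating"].mean())

print(toy)
```

Writing to a separate column keeps the original ratings available for comparison; overwriting star_rating, as the exercise does, works the same way.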
