Coder Social home page Coder Social logo

dsc-enterprise-bain-pivot-tables-pandas-lab's Introduction

Pivot Tables with Pandas - Lab

Introduction

In this lab, we'll learn how to make use of our newfound knowledge of pivot tables to work with real-world data. We'll start by exploring

Objectives

You will be able to:

  • Understand and explain what a multi-level hierarchical index is
  • Understand, explain the difference and use df.pivot and pd.pivot_table
  • Switch between “long” and “wide” in a DataFrame using stack() and unstack()
  • Transform “wide” to “long” DataFrames using melt

Getting Started

Import Pandas and Matplotlib.pyplot Using Standard Aliases

In the cell below:

  • Import pandas and set the standard alias
  • Import matplotlib.pyplot and set the standard alias
  • Run the ipython magic command to display matplotlib graphs inline within the notebook

Load the Data

The data for this activity is stored in a file called 'causes_of_death.tsv' which is a somewhat morbid dataset from the center for disease control. Note that the file extension .tsv indicates that this data is formatted slightly differently then the standard .csv, the difference being that it has 'tab seperated values' instead of 'comma seperated values'. As such, pass in the optional parameter delimiter='\t' into the pd.read_csv() method.

df = None

Now, display the head of the DataFrame to ensure everything loaded correctly.

Our data is currently in Wide format. We can tidy this up by converting it to Long format by using groupby statements to aggregate our data into a much neater, more readable format.

Groupby Aggregations

Complete the following groupby statements.

1) Groupby State and Gender. Sum the values.

# Your code here

2) Groupby State and Gender and Race. Find the average values.

# Your code here

3) Groupby Gender and Race. Find the minimum values.

# Your code here

4) Create a bar chart of the total number of deaths by state.

  • Sort your columns in order (ascending or descending are both acceptable).
  • Also make sure to include a title, axes labels and have your graph be an appropriate size.

NOTE: In order to do this, slice the Deaths column after the .groupby() method, but before the sum() method. You can even chain the .plot() call on after the sum() call and do this all on one line, excluding the labeling of the graph!

#Your code here

Inspecting our Data

Let's go one step further and get a print-out of the data type of each column.

In the cell below, get the .info() of our DataFrame, and note the data type that each column is currently stored as.

Let's look at some samples from the Population column to see if the current encoding seems appropriate for the data it contains.

In the cell below, display the population values for the first 5 rows in the DataFrame.

Just to be extra sure, let's check the value counts to see how many times each unique value shows up in the dataset. We'll only look at the top 5.

In the cell below, print out the top 5 value_counts() of the population column of the DataFrame.

Clearly, this data should be stored as a numeric type, not a categorical type.

5 a) Reformat the Population Column as an Integer

As stands, not all values will be able to be reformated as integers. Most of the cells in the the Population column contain integer values, but the entire column is currently encoded in string format because some cells contain the string "Not Applicable".

We need to remove these rows before we can cast the Population column to an Integer data type.

In the cell below:

  • Slice the rows of df where the Population column is equal to 'Not Applicable'.
  • Use to_drop.index to drop the offending rows from df. Be sure to set the axis=0, and inplace=True
  • Cast the Population column to an integer data type using the .astype() function, with the single parameter int64 passed in.
  • Print the Population column's dtype attribute to confirm it is now stored in int64 format.

NOTE: .astype() returns a copy of the column, so make sure you set the Population column equal to what this method returns--don't just call it!

#Your code here
to_drop = None

5 b) Complete the Bar Chart

Now that we've reformatted our data, let's create a bar chart of the of the Mean Population by State.

#Your code here

Below we will investigate how we can combine the pivot method along with the groupby method to combine some cool stacked bar charts!

Using Aggregate Functions

In the cell below:

  • Group df by 'State' and 'Gender', and then slice both 'Deaths' and 'Population' from it. Chain the .agg() method to return the mean, min, max, and standard deviation these sliced columns.

NOTE: This only requires one line of code.

By now, you've probably caught on that the code required to do this follows this pattern: ([things to group by])[columns to slice].agg([aggregates to return])

Then, display the head of this of this new DataFrame.

# A sample groupby similar to above. 
grouped = None

Note how Pandas denotes a multi-hierarchical index in the DataFrame above.

Let's inspect how a multi-hierarchical index is actually stored.

In the cell below, display the index attribute of this DataFrame.

A two-dimensional array denotes the multiple levels, with each possible combination being a row in our grouped DataFrame.

Let's reset the index, and then see how it changes.

In the cell below, call the DataFrame's reset_index() function. Then, display the head of the DataFrame.

# First, reset the index. Notice the subtle difference; State and Gender are now columns rather then the index.
grouped = None

Note how the way the index is displayed has changed. The index columns that made up the multi-hierarchical index before are now stored as columns of data, with each row given a more traditional numerical index.

Let's confirm this by reexamining the index attribute of grouped in the cell below.

However, look again at the displayed DataFrame--specifically, the columns. Resetting the index has caused the DataFrame to use a mutlti-indexed structure for the columns.

In the cell below, examine the columns attribute of grouped to confirm this.

#Notice that this causes columns to be MultiIndexed!

Column Levels

Since we're working with miulti-hierarchical indices, we can examine the indices available at each level.

In the cell below, use the get_level_values method contained within the DataFrame's columns object to get the values for the outermost layer of the index.

Now, get the level values for the inner layer of the index.

Flattening the DataFrame

We can also flatten the DataFrame from a multi-hierarchical index to more traditional one-dimensional index. We do this by creating each unique combination possible of every level of the multi-hierarchical index. Since this is a complex task, you do not need to write it--but take some time to examine the code in the cell below and see if you can understand how it works!

#We could also flatten these:
cols0 = grouped.columns.get_level_values(0)
cols1 = grouped.columns.get_level_values(1)
grouped.columns = [col0 + '_' + col1 if col1 != '' else col0 for col0, col1 in list(zip(cols0, cols1))]
#The list comprehension above is more complicated then what we need but creates a nicer formatting and
#demonstrates using a conditional within a list comprehension.
#This simpler version works but has some tail underscores where col1 is blank:
#grouped.columns = [col0 + '_' + col1 for col0, col1 in list(zip(cols0, cols1))]
grouped.columns

Now that we've flattened the DataFrame, let's inspect a couple rows to see what it looks like.

In the cell below, inspect the head of the grouped DataFrame.

Using Pivots

Now, we'll gain some practice using the DataFrame class's built-in .pivot() method.

In the cell below, call the DataFrame's pivot method with the following parameters:

  • index = 'State'
  • columns = 'Gender'
  • values = 'Deaths_mean'

Then, display the head of our new pivot DataFrame to see what it looks like.

# Now it's time to pivot!
pivot = None

Great! We've just created a pivot table.

Let's reset the index and see how it changes our pivot table.

In the cell below, reset the index of the pivot object as we did previously. Then, display the head of the object to see if we can detect any changes.

# Again, notice the subtle difference of reseting the index:
pivot = None

Visualizing Data With Pivot Tables

Now, we'll make use of our newly created pivot table to quickly create some visualizations of our data.

In the cell below, call pivot.plot() with the following parameters:

  • kind = 'barh'
  • figsize = (15,8)
# Now let's make a sweet bar chart!!

Notice the Y-axis is currently just a list of numbers. That's because when we reset the index, it defaulted to assigning integers as the index for the DataFrame. Let's set the index back to 'State', and then recreate the visualization.

In the cell below:

  • Use the pivot object's set_index() method and set the index to 'State'. Then, chain this with a .plot() call to recreate the visualization using the code we used in the cell above.

All the code in this cell should be done in a single line. Just call the methods--do not rebind pivot to be equal to this line of code.

#Where's the states?! Notice the y-axis is just a list of numbers.
#This is populated by the DataFrame's index.
#When we used the .reset_index() method, we created a new numbered index to name each row. 
#Let's fix that by making state the index again.

Now, that we've created a visualization with the states as the y-axis, let's print out the head of the pivot object again.

# Also notice that if we call the DataFrame pivot again, state is not it's index.
#The above method returned a DataFrame with State as index and we plotted it,
#but it did not update the DataFrame itself.

Note that the index has not changed. That's because the code we wrote when we set the index to the 'State' column returns a copy of the DataFrame object with the index set to 'State'--by default, it does not mutate original pivot object.

If we want to do that, we'll need to capture the new object returned by updating the contents of the pivot variable.

In the cell below, set the index of pivot to 'State'. Then, recreate the bar plot using this new object.

#If we wanted to more permanently change the index we would set it first and then plot:
pivot = None

Again, let's check the head of the DataFrame to confirm that the index structure has changed.

Finally, let's stack these bar charts to see how that looks.

In the cell below, recreate the visualization we did in the cell above, but this time, also pass in stacked=True as a parameter.

# Lastly, let's stack each of these bars for each state.
#Notice we don't have to worry about index here, because we've already set it above.

Stacking and Unstacking DataFrames

Now, let's get some practice stacking and unstacking DataFrames.

Stacking

In the cell below, let's display the head of grouped to remind ourselves of the format we left it in.

As we can see above, grouped is currently in a flattened format, with no hierarchical structure to it's indices.

In the cell below, call the grouped DataFrame's .stack() method.

As we can see, the stack() method has stacked our DataFrame from a flattened format into one with a multi-hierarchical index! This is an easy, quick way to aggregate our data.

Unstacking

Now, we'll explore unstacking with the pivot DataFrame, which is already stacked into a pivot table.

In the cell below, set unstack pivot using the object's .unstack() method. Then, display the object to see how it has changed.

pivot = None

Note that it has unstacked the multi-hierarchical structure of the pivot DataFrame by one level. Let's call it one more time and display the results!

In the cell below, set pivot equal to pivot.unstack() again, and then display the pivot object to see how things have changed.

pivot = None

After calling unstack a second time, we can see that pivot has a flattened structure, since it has been completely unstacked!

Summary

In this lab, we learned how to:

  • Use groupby to stack and slice data conditionally
  • Use aggregate functions in combination with groupby statements
  • Create pivot tables with pandas
  • Leverage pivot tables and groupby statements to create quick visualizations
  • stack and unstack DataFrames

dsc-enterprise-bain-pivot-tables-pandas-lab's People

Contributors

loredirick avatar mathymitchell avatar mike-kane avatar peterbell avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.