ajcr / 100-pandas-puzzles

100 data puzzles for pandas, ranging from short and simple to super tricky (60% complete)

License: MIT License

Jupyter Notebook 100.00%
pandas numpy python data-analysis

100-pandas-puzzles's Introduction

100 pandas puzzles

Inspired by 100 Numpy exercises, here are 100* short puzzles for testing your knowledge of pandas' power.

Since pandas is a large library with many different specialist features and functions, these exercises focus mainly on the fundamentals of manipulating data (indexing, grouping, aggregating, cleaning), making use of the core DataFrame and Series objects. Many of the exercises here are straightforward in that the solutions require no more than a few lines of code (in pandas or NumPy - don't go using pure Python!). Choosing the right methods and following best practices is the underlying goal.

The exercises are loosely divided into sections. Each section has a difficulty rating; these ratings are subjective, of course, but should be seen as a rough guide to how elaborate the required solution needs to be.

Good luck solving the puzzles!

* the list of puzzles is not yet complete! Pull requests or suggestions for additional exercises, corrections and improvements are welcomed.

Overview of puzzles

Section Name | Description | Difficulty
Importing pandas | Getting started and checking your pandas setup | Easy
DataFrame basics | A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames | Easy
DataFrames: beyond the basics | Slightly trickier: you may need to combine two or more methods to get the right answer | Medium
DataFrames: harder problems | These might require a bit of thinking outside the box... | Hard
Series and DatetimeIndex | Exercises for creating and manipulating Series with datetime data | Easy/Medium
Cleaning Data | Making a DataFrame easier to work with | Easy/Medium
Using MultiIndexes | Go beyond flat DataFrames with additional index levels | Medium
Minesweeper | Generate the numbers for safe squares in a Minesweeper grid | Hard
Plotting | Explore pandas' part of plotting functionality to see trends in data | Medium

Setting up

To tackle the puzzles on your own computer, you'll need a Python 3 environment with the dependencies (namely pandas) installed.

One way to do this is as follows. I'm using a bash shell; the procedure on Mac OS should be essentially the same. I'm not sure about Windows.

  1. Check you have Python 3 installed by printing the version of Python:
python -V
  2. Clone the puzzle repository using Git:
git clone https://github.com/ajcr/100-pandas-puzzles.git
  3. Install the dependencies (caution: if you don't want to modify any Python modules in your active environment, consider using a virtual environment instead):
python -m pip install -r 100-pandas-puzzles/requirements.txt
  4. Launch a Jupyter notebook server:
jupyter notebook --notebook-dir=100-pandas-puzzles

You should be able to see the notebooks and launch them in your web browser.
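
As a quick sanity check once the dependencies are installed, something like the following could be run in a notebook cell or at a Python prompt. This is only a sketch: the helper name `environment_summary` is made up for illustration and is not part of the repository.

```python
import sys

import numpy as np
import pandas as pd


def environment_summary():
    """Collect the interpreter and library versions (hypothetical helper)."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "pandas": pd.__version__,
        "numpy": np.__version__,
    }


summary = environment_summary()
print(summary)
```

If the imports fail, the dependencies were most likely installed into a different Python environment from the one running the notebook.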

Contributors

This repository has benefitted from numerous contributors, with those who have sent puzzles and fixes listed in CONTRIBUTORS.

Thanks to everyone who has raised an issue too.

Other links

If you feel like reading up on pandas before starting, the official documentation is useful and very extensive. Good places to get a broader overview of pandas are:

There are many other excellent resources and books that are easily searchable and purchasable.

100-pandas-puzzles's Issues

Sorting not needed in solution to question 27.

  27. A DataFrame has a column of groups 'grps' and a column of numbers 'vals'. For each group, find the sum of the three greatest values.

The solution starts by sorting the 'vals' column - this is not needed. The nlargest method selects the three greatest values irrespective of the order of the elements.

Suggestion: delete the sorting; the second line of code provides the solution on its own.
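
A small check of this claim, as a sketch only. The sample values below are assumed for illustration; any DataFrame with 'grps' and 'vals' columns would do.

```python
import pandas as pd

# Illustrative sample data (assumed, not necessarily the puzzle's own).
df = pd.DataFrame({
    'grps': list('aaabbcaabcccbbc'),
    'vals': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87],
})

# Top three values per group, taken straight from the unsorted column.
top3_unsorted = df.groupby('grps')['vals'].nlargest(3)

# The same selection after sorting the frame first.
top3_sorted = (
    df.sort_values('vals', ascending=False)
      .groupby('grps')['vals']
      .nlargest(3)
)

# Sum within each group (the group key is the first index level).
sums_unsorted = top3_unsorted.groupby(level='grps').sum()
sums_sorted = top3_sorted.groupby(level='grps').sum()

print(sums_unsorted)
print(sums_unsorted.equals(sums_sorted))
```

Both routes should give the same per-group sums, which is why the sort can be dropped.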

Correction Request for solution to Q53:

The following code raises an error with pandas 1.5.3:

df['adjacent'] = (counts - mine_grid).ravel('F')

The error reports that a pandas DataFrame has no method named ravel.

How about correcting it to:

df['adjacent'] = (counts - mine_grid).values.ravel('F')
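
To illustrate why the extra step helps, here is a minimal sketch. The two small frames are hypothetical stand-ins for the puzzle's `counts` and `mine_grid`, assumed here to be DataFrames of the same shape.

```python
import pandas as pd

# Hypothetical stand-ins for the intermediate objects in the Q53 solution.
counts = pd.DataFrame([[1, 2], [3, 4], [5, 6]])
mine_grid = pd.DataFrame([[1, 0], [0, 1], [1, 0]])

# Subtracting two DataFrames gives another DataFrame, which has no .ravel().
diff = counts - mine_grid

# Convert to a NumPy array first, then flatten in column-major ('F') order.
# diff.values.ravel('F') behaves the same way.
flat = diff.to_numpy().ravel('F')
print(flat)
```

Column-major order walks down each column in turn, so the first column's values come first in the flattened result.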

NaN problem with question 21

Thanks for the project.
When I work through question 21 using pandas 1.2.4, I need to call fillna first:

df['age'] = df['age'].fillna(0)
df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')

Correction for Q16 (partA)

  16. Append a new row 'k' to df with your choice of values for each column.
df.loc['k'] = [5.5, 'dog', 'no', 2] 

I think it would be better to add the new row according to the column order, as follows:

df.loc['k'] = ['dog', 5.5, 2, 'no']

since the data looks like:

animal age visits priority
cat 2.5 1 yes
cat 3.0 3 yes
snake 0.5 2 no
dog NaN 3 yes
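
One way to avoid depending on column order at all is to assign a labelled Series, which pandas aligns with the column names. This is a sketch under assumptions: the index labels 'a' to 'd' and the four rows are made up to match the excerpt above.

```python
import numpy as np
import pandas as pd

# Assumed sample frame, modelled on the rows shown above.
df = pd.DataFrame(
    {
        'animal': ['cat', 'cat', 'snake', 'dog'],
        'age': [2.5, 3.0, 0.5, np.nan],
        'visits': [1, 3, 2, 3],
        'priority': ['yes', 'yes', 'no', 'yes'],
    },
    index=['a', 'b', 'c', 'd'],
)

# The keys are deliberately out of order: the Series is matched to the
# columns by label, not by position.
df.loc['k'] = pd.Series(
    {'age': 5.5, 'animal': 'dog', 'priority': 'no', 'visits': 2}
)
print(df)
```

With a plain list, the values are taken positionally, so the list has to follow the column order of df exactly.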

The second solution to Q29 does not work properly.

The solution below (Q29-2) outputs the wrong answer when the input DataFrame's column starts with zero.

x = (df['X'] != 0).cumsum()
y = x != x.shift()
df['Y'] = y.groupby((y != y.shift()).cumsum()).cumsum()

In this code, the Series y should be True where the value of X is not zero and False otherwise.
However, the first value of y is True in every case.

e.g.

df1 = pd.DataFrame({'X': [0, 2, 0, 3]})
df2 = pd.DataFrame({'X': [1, 2, 0, 3]})

x = (df1['X'] != 0).cumsum()
y = x != x.shift()
print(y[0])

x = (df2['X'] != 0).cumsum()
y = x != x.shift()
print(y[0])

outputs

True
True

This bug can be fixed by replacing the first two lines with y = df['X'] != 0

Here's the code to compare the results of solution 1, solution 2 and the modified solution 2.

import pandas as pd
import numpy as np
df = pd.DataFrame({'X': [0, 2, 0, 3, 4, 2, 5, 0, 3, 4]})

def solution1(df):
    izero = np.r_[-1, np.flatnonzero(df['X'] == 0)]  # indices of zeros (Series.nonzero() was removed in pandas 1.0)
    idx = np.arange(len(df))
    return pd.Series(idx - izero[np.searchsorted(izero - 1, idx) - 1])

def solution2(df):
    x = (df['X'] != 0).cumsum()
    y = x != x.shift()
    return y.groupby((y != y.shift()).cumsum()).cumsum()

def solution2_modified(df):
    y = df['X'] != 0
    return y.groupby((y != y.shift()).cumsum()).cumsum()

check_df = pd.concat([df, solution1(df), solution2(df), solution2_modified(df)], axis=1)
check_df.columns = ['input_df', 'solution1', 'solution2', 'solution2_modified']
display(check_df)
input_df solution1 solution2 solution2_modified
0 0 1 0
2 1 2 1
0 0 0 0
3 1 1 1
4 2 2 2
2 3 3 3
5 4 4 4
0 0 0 0
3 1 1 1
4 2 2 2

I executed this code with Python 3.6.7 & pandas 0.24.0.

Join forces?

Hi Alex,

In my quest to learn pandas better, I created a series of exercises in a different form from yours.
I would like to know whether you might be interested in contributing to my repo in any way, or whether I can use your exercises.

Thanks

Solution to Question 27 is no longer supported.

I was able to get the correct result with the following:

df.groupby('grps')['vals'].nlargest(3).groupby('grps').sum()

but I'm sure there's a more elegant way to do it than by using the groupby method twice in one line.
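
One possible alternative with a single groupby is to take the top three values and sum them inside `apply`. This is a sketch on made-up data, not the notebook's official solution.

```python
import pandas as pd

# Small illustrative frame (values assumed for the example).
df = pd.DataFrame({
    'grps': ['a', 'a', 'a', 'a', 'b', 'b'],
    'vals': [1, 5, 3, 4, 10, 20],
})

# For each group, keep the three largest values and add them up.
totals = df.groupby('grps')['vals'].apply(lambda s: s.nlargest(3).sum())
print(totals)
```

This reads more directly, although the two-step version may be faster on large frames because it avoids calling a Python function for every group.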

Please add a license file

It would be nice if these exercises would have a license, so one knows under which conditions one can make use of them.

I don't have any particular license in mind myself, and of course that's not my call to make, though in the name of reducing license proliferation I would suggest using the same license as pandas itself uses: https://github.com/pandas-dev/pandas/blob/master/LICENSE .
