
pydata-book's Introduction

Python for Data Analysis, 3rd Edition

Materials and IPython notebooks for "Python for Data Analysis, 3rd Edition" by Wes McKinney, published by O'Reilly Media. Book content including updates and errata fixes can be found for free on my website.

Buy the book on Amazon

Follow Wes on Twitter: Twitter Follow

2nd Edition Readers

If you are reading the 2nd Edition (published in 2017), please find the reorganized book materials on the 2nd-edition branch.

1st Edition Readers

If you are reading the 1st Edition (published in 2012), please find the reorganized book materials on the 1st-edition branch.

IPython Notebooks:

License

Code

The code in this repository, including all code samples in the notebooks listed above, is released under the MIT license. Read more at the Open Source Initiative.


pydata-book's Issues

Python for Data Analysis directory names need to change.

MovieLens 1M Data Set
Page 23 - the directory names

unames = ['user_id', 'gender', 'occupation', 'zip']
users = pd.read_table('movielens/users.dat', sep='::', header=None, names=unames)

users.dat needs to change to ratings.dat:
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('movielens/users.dat', sep='::', header=None, names=rnames)

users.dat needs to change to movies.dat:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movielens/users.dat', sep='::', header=None, names=mnames)

I believe I have the 2013 printing, so my apologies if you have already fixed this. It's a minor thing, but it might be annoying for beginners.

great book! thank you so much!

crosstab not working because the 'size' variable is being treated as a number

Hi Wes,

I switched from SPSS to Pandas and one of the few things I miss about SPSS is the way it did crosstabs.

I have been trying to follow along in the notebook for chapter 8 on visualization and there is one point where you make a crosstab with the tips data frame with the code:

party_counts = pd.crosstab(tips.day, tips.size) and you get this nice crosstab of the number of cases by day and party size:

In [69]: party_counts = pd.crosstab(tips.day, tips.size)
In [70]: party_counts
Out[70]:
size  1   2   3   4  5  6
day
Fri   1  16   1   1  0  0
Sat   2  53  18  13  1  0
Sun   0  39  15  18  3  1
Thur  1  48   4   5  1  3

McKinney, Wes (2012-10-08). Python for Data Analysis (Kindle Locations 5544-5547). O'Reilly Media. Kindle Edition.

I have tried to fix this by making size into a categorical variable, but it has not worked. The days are displayed correctly, but the party-size variable gets collapsed into a single column labeled 1708.

I can try to copy it into this text box:

pd.crosstab(tips.day, tips.size, margins=True)
Out[14]:
col_0 1708 All
day
Fri 19 19
Sat 87 87
Sun 76 76
Thur 62 62
All 244 244
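A possible explanation for the single 1708 column: `tips.size` resolves to the DataFrame's built-in `size` attribute (total element count, 244 rows × 7 columns = 1708) rather than the `size` column. Bracket indexing selects the actual column; a sketch with a toy frame standing in for the tips data:

```python
import pandas as pd

# Toy stand-in for the tips dataset (hypothetical values)
tips = pd.DataFrame({'day': ['Fri', 'Sat', 'Sat'], 'size': [2, 3, 2]})

# tips.size is the DataFrame's element count, not the column;
# bracket indexing always means the column:
party_counts = pd.crosstab(tips['day'], tips['size'])
```

With the real dataset, `pd.crosstab(tips.day, tips['size'])` should reproduce the table from the book.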

Here is the link to my github account.

https://github.com/michaelreinhard/pydata-book/blob/master/ch08.ipynb

I just can't figure out what I am doing wrong. I feel like I am doing just what you do in your code and I am working from your notebook from github. Any advice or suggestions you could make for where to look for the answer or to investigate further would be greatly appreciated. Thanks,

Michael Reinhard

Ch2: outdated pivot_table arguments

In input 116:
older ==> mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
new ==> mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

Add one line code

@wesm Dear Wes, you should add the following code on Page 46 , Chapter 3, before In [542].

from numpy.random import randn

Otherwise the following code can't run.

xrange versus range on pg 118

The Random Walks example of pages 117-118 uses xrange instead of range. This is probably left from the previous edition that was oriented to Python 2.
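For reference, the Python 3 form of the pure-Python walk only needs range swapped in (variable names follow the book's example, from memory):

```python
import random

position = 0
walk = [position]
steps = 1000
for _ in range(steps):  # xrange in the Python 2 edition; range in Python 3
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)
```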

ch7 7.3 String Manipulation

Vectorized String Functions in pandas

In [174]: matches = data.str.match(pattern, flags=re.IGNORECASE)
In [175]: matches
Out[175]:
Dave True
Rob True
Steve True
Wes NaN
dtype: object

Is the str.match output correct? In the past it was:
Dave (dave, google, com)
Rob (rob, gmail, com)
Steve (steve, gmail, com)
Wes NaN
dtype: object

Ch 2, p. 24 error with code for operating_system

Wes,

Can you clarify where the problem is with the error I got when typing in the equation for operating_system on p. 24?

Thanks

LEB

operating_system = np.where(cframe['a'].str.contains('Windows'),'Windows','Not Windows')

returned the following error:

AttributeError                            Traceback (most recent call last)
/Users/lauralea/Documents/pydata-book-master/<ipython-input> in <module>()
----> 1 operating_system = np.where(cframe['a'].str.contains('Windows'),'Windows','Not Windows')

AttributeError: 'Series' object has no attribute 'str'
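Not the author, but one possibility is a pandas version that predates the `.str` accessor, in which case upgrading pandas is the real fix. A version-independent workaround, sketched with a toy frame standing in for cframe:

```python
import numpy as np
import pandas as pd

# Toy stand-in for cframe; the real data comes from the usagov records
cframe = pd.DataFrame({'a': ['Mozilla/5.0 (Windows NT 6.1)',
                             'Mozilla/4.0 (Linux)', None]})

# A plain comprehension over the column sidesteps the .str accessor entirely;
# str(x) also keeps missing values from raising:
operating_system = np.where(['Windows' in str(x) for x in cframe['a']],
                            'Windows', 'Not Windows')
```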

How to read CSV data from variable into PD DataFrame

Hello Wes,

Love your book, but I'm stumped by not being able to just read a CSV "variable" into a PD dataframe.

In my case, I get the IRIS data (for starting point of the project) POSTed into a Flask web service. (Later I'll use my own CSV data.)

I then extract the text from the Flask's POST request.values. All good. But at that point, I can't figure out how to get the pandas dataframe like your pd.read_csv does.

I'm assuming reading a variable into a PD DataFrame shouldn't be difficult, since it seems like an obvious use case.


I tried wrapping with io.StringIO(csv_data), then following up with read_csv on that variable, but that doesn't work either.

I'm writing since I can't find a page in your book where it describes how to simply read a variable into a PD dataframe.

Note: I also tried things like ...

data = pd.DataFrame(csv_data, columns=['....'])

but got nothing but errors (for example, "constructor not called correctly!")

--

I guess my problem is that your (excellent) book shows how to create DataFrames from entered data, and it shows how to create a DataFrame using files, but not (that I can find) how to create a DataFrame using csv data...in a variable.

I am hoping for a simple method to call that can infer the columns and names and create the DataFrame for me, from a variable, without me needing to know a lot about Pandas (just to read the dataset, anyway).

I'm relatively new to Python (a few months) so I'm probably missing something extremely obvious...
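For the record, `read_csv` accepts any file-like object, so wrapping the string in `io.StringIO` does normally work; a minimal sketch (with made-up iris-style text) worth comparing against:

```python
import io
import pandas as pd

# Hypothetical CSV text, e.g. extracted from Flask's request.values
csv_data = "sepal_length,sepal_width,species\n5.1,3.5,setosa\n4.9,3.0,setosa\n"

# StringIO makes the string file-like; read_csv then infers columns as usual
df = pd.read_csv(io.StringIO(csv_data))
```

If this pattern still fails, the csv_data string itself (encoding, stray bytes) is the likely culprit.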

Thanks very much,

Solve some question based on the same data set

  1. Top ten most viewed movies with their movie names (ascending or descending order)
  2. Top twenty rated movies (condition: the movie should be rated/viewed by at least 40 users)
  3. Top twenty rated movies (calculated in the previous step) with number of views in the following age groups
     (age groups: 1. young (<20 years), 2. young adult (20-40 years), 3. adult (>40 years))
  4. Top ten critics (users who have given very low ratings; condition: the users should have rated at least 40 movies)

False claim about np.fabs()

Hello,

In Table 4-3, you say to

Use fabs as a faster alternative for non-complex-valued data

However, I don't think fabs is faster in any case. Averaged runs show it about 15% slower than abs() for ints, and 1% slower for floats.
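The claim is easy to benchmark; a sketch (timings vary by machine and array size, so no particular outcome is asserted):

```python
import timeit
import numpy as np

# Compare np.abs and np.fabs on the same float array
arr = np.random.standard_normal(100_000)
t_abs = timeit.timeit(lambda: np.abs(arr), number=50)
t_fabs = timeit.timeit(lambda: np.fabs(arr), number=50)
```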

Join to translation English to Indonesian

Hello, I am a translator and I would like to translate the book from English into Indonesian, because the majority of Indonesian people do not understand English. I would be honored to be trusted with this.

Ch02_file_corrupt?

Hi, I downloaded the data for the book on 2/2/17 and 2/5/17. The usagov... file used on p. 14 has changed quite a bit. Here is the opening snippet of what appears to be the "good" file:

["{ "a": "Mozilla\/5.0

and the latest download (downloaded 2/5/17)

{ "a": "Mozilla/5.0

As you can see the bracket at the beginning is missing as is a good number of slashes (which continues further on in the file) . My IDE returns the following error

Traceback (most recent call last):
  File "C:\...Ch_2_import.py", line 7, in <module>
    dict_1 = json.loads(lines)
  File "C:...\AppData\Local\Continuum\Anaconda3\lib\json\__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "C:....\AppData\Local\Continuum\Anaconda3\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:....\AppData\Local\Continuum\Anaconda3\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

A json viewer suggests the more recently downloaded file is not a json formatted file. This makes the opening bits of code on p.14 tough for a neophyte like myself. Am I missing something really obvious? Is there a simple fix when you run into something like this? Here is my code if this helps
(note that I changed the filenme to shorten it). Using Anaconda distribution of Python 3.6 with PyCharm.

import json

path = 'C:...\Python\pydata_book\ch02\test2.txt'

with open(path) as file_object:
    lines = file_object.read()
dict_1 = json.loads(lines)

#num_rec = len(dict_1)
#print(num_rec)

Many thanks for your help - especially all of your previous contributions from which I've benefited.

Some question about demo in p253

I typed in the p. 253 demo. When I ran it, it showed me the error below (Python 2.7 and IPython); the SyntaxError points at tbilrate.
Could you please advise how to resolve it?

In [9]: data = macro[['cpi', 'm1, 'tbilrate', 'unemp']]
  File "<ipython-input>", line 1
    data = macro[['cpi', 'm1, 'tbilrate', 'unemp']]

SyntaxError: invalid syntax
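The error looks like an unclosed string literal: 'm1 is missing its closing quote, which is why the parser chokes on tbilrate (the next token after the stray quote). A sketch of the corrected line, with a stand-in frame since the chapter's macro data isn't loaded here:

```python
import pandas as pd

# Stand-in for the chapter's macro DataFrame (hypothetical values)
macro = pd.DataFrame({'cpi': [1.0], 'm1': [2.0],
                      'tbilrate': [3.0], 'unemp': [4.0]})

# Closing quote added after 'm1':
data = macro[['cpi', 'm1', 'tbilrate', 'unemp']]
```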

Many Thanks

Delete line index # 495 - something is wrong with this record

Hi,

Something is wrong with record 495 from ch02/usagov_bitly_data2012-03-16-1331923249.txt.

It causes error when running these statements:

import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]

Fixed version:

import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
temp = [line for line in open(path)]
temp.pop(495)          # delete problematic record
records = [json.loads(line) for line in temp]

Problem with parsing Movie Lens data using code in book

Hi,

I am working through the Ch02 material - and have a problem with the initial reading of the movie lens data. I am running the initial code as in the book:

import pandas as pd
import os
encoding = 'latin1'


upath = os.path.expanduser('pydata-book-master/ch02/movielens/users.dat')
rpath = os.path.expanduser('pydata-book-master/ch02/movielens/ratings.dat')
mpath = os.path.expanduser('pydata-book-master/ch02/movielens/movies.dat')

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
mnames = ['movie_id', 'title', 'genres']

(with paths amended to work for where I have the files)

but when I run the line:

users = pd.read_csv(upath, sep='::', header=None, names=unames, encoding=encoding)

I get the message:

/Users/Chris/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  if __name__ == '__main__':

I have tried switching to a Python2 kernel - but get an equivalent message.

What is the root of this issue? as far as I can interpret this it is having a problem with specifying the multicharacter '::' as the data separator. But I don't really understand how to correct this. How should I fix it to avoid similar issues with this and future code in the book?
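For reference, this is a warning rather than an error: multi-character separators force pandas onto its Python parsing engine, and passing engine='python' explicitly just silences the notice. A sketch with inline sample rows standing in for users.dat:

```python
import io
import pandas as pd

# Two sample rows in the users.dat format (hypothetical values)
raw = "1::F::1::10::48067\n2::M::56::16::70072\n"
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# engine='python' opts into the engine that supports multi-character
# separators, which is the one pandas was falling back to anyway:
users = pd.read_csv(io.StringIO(raw), sep='::', header=None,
                    names=unames, engine='python')
```

The parsed result is the same with or without the keyword; only the warning goes away.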

Many thanks

CAN'T FIND EPD_FREE-7.3-1.DMG

I followed the website instructions but still can't find the epd_free-7.3-1.dmg install package on Enthought.com. What is EPD used for? Can anyone share a download site, or, failing that, suggest a substitute I can use to continue with pydata-book?

Time series chapter: strftime and strptime don't use the same format codes

In the part discussing converting datetime objects from strings, you say that strptime uses the same format codes as strftime, but that's not quite right:

from datetime import datetime

value = '2011-01-03'
stamp = datetime.strptime(value, '%Y-%m-%d')  # works
datetime.strptime(value, '%F')  # ValueError: 'F' is a bad directive in format '%F'
datetime.strftime(stamp, '%F')  # works

pydata-book/ch02.ipynb file appears to be corrupted

Unable to open the pydata-book/ch02.ipynb file.

Clicking the ch02.ipynb link on GitHub produces an error message (screenshot omitted), and loading the notebook into IPython Notebook from the downloaded zip file produces a second error (screenshot omitted).

Any suggestions? Thanks.

Clarification of NumPy Boolean Indexing Copy Behavior

Page 100 of the 2nd edition describes boolean indexing of NumPy arrays. The following remark is made in the middle of the page.

Selecting data from an array by boolean indexing always creates a copy of the data

But then the next two examples demonstrate assignments to a boolean-indexed array. If boolean indexing created copies, wouldn't the original array (data in the examples) be unchanged? Wouldn't the assignment apply to the copy rather than the original?

I verified that indeed, if I perform the following:

mask = data < 0
data2 = data[mask]
data2[:] = 8

Then data is left unchanged. So data2 does indeed contain a copy and assignment to it does not affect data. In this sense, the statement is correct. But when data[mask] appears on the lefthand side (i.e. as an lvalue), no copy is created. It is assigned in-place.

The upshot of this issue is that a small clarification might be in order for that statement.
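A minimal sketch of the two behaviors on a small toy array:

```python
import numpy as np

data = np.array([-1.0, 2.0, -3.0])
mask = data < 0

copy = data[mask]   # boolean indexing on the right-hand side returns a copy
copy[:] = 8         # mutating the copy leaves data unchanged

data[mask] = 0      # on the left-hand side it is an in-place assignment
# data is now [0., 2., 0.]
```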

XML and HTML: Web scraping

Hi, Wes. In Chapter 6, there's one part that's kind of outdated:

XML and HTML, Web scraping
NB. The Yahoo! Finance API has changed and this example no longer works

Would you mind revising that part and presenting a v2.0?

mta_perf path

At the beginning of the 'Parsing XML with lxml.objectify' section the path referring to the mta performance data is written as 'examples/mta_perf/Performance_MNR.xml' instead of 'datasets/mta_perf/Performance_MNR.xml'.

Ch9: datetime object needs to be converted to ordinal number

Just wanted to report a small issue I had with the code posted in ch9 under the heading Annotations and Drawing on a Subplot

from datetime import datetime

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

data = pd.read_csv('examples/spx.csv', index_col=0, parse_dates=True)
spx = data['SPX']

spx.plot(ax=ax, style='k-')

crisis_data = [
    (datetime(2007, 10, 11), 'Peak of bull market'),
    (datetime(2008, 3, 12), 'Bear Stearns Fails'),
    (datetime(2008, 9, 15), 'Lehman Bankruptcy')
]

for date, label in crisis_data:
    ax.annotate(label, xy=(date, spx.asof(date) + 75),
                xytext=(date, spx.asof(date) + 225),
                arrowprops=dict(facecolor='black', headwidth=4, width=2,
                                headlength=4),
                horizontalalignment='left', verticalalignment='top')

# Zoom in on 2007-2010
ax.set_xlim(['1/1/2007', '1/1/2011'])
ax.set_ylim([600, 1800])

ax.set_title('Important dates in the 2008-2009 financial crisis')

This section didn't work for me; I had to convert the date in the for loop from a datetime object to an ordinal number.

The updated for loop looks like this:

for date, label in crisis_data:
    ax.annotate(label, xy=(date.toordinal(), spx.asof(date) + 75),
                xytext=(date.toordinal(), spx.asof(date) + 225),
                arrowprops=dict(facecolor='black', headwidth=4, width=2,
                                headlength=4),
                horizontalalignment='left', verticalalignment='top')

Ch 05: Series.order has been deprecated

Dear pydata-book maintainer,

In Sorting and ranking section of Chapter 5, Series.order function is used:

obj = Series([4, 7, -3, 2])
obj.order()
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.order()

Since the order function has been deprecated, maybe we should change it to sort_values, like below?

obj = Series([4, 7, -3, 2])
obj.sort_values() # Series.order has been deprecated since v0.17
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values() # Series.order has been deprecated since v0.17

Best regards,
Jing Qin

Ch2: Movielens data -- Book path, doesn't match repo path

In the digital version of the book, location 816 of 13309, Ch. 2, Movie Lens 1M Data Set has the following code:

import pandas as pd 

unames = ['user_id', 'gender', 'age', 'occupation', 'zip'] 
users = pd.read_table('ml-1m/users.dat', sep ='::', header = None, names = unames) 

rnames = ['user_id', 'movie_id', 'rating', 'timestamp'] 
ratings = pd.read_table('ml-1m/ratings.dat', sep ='::', header = None, names = rnames) 

mnames = ['movie_id', 'title', 'genres'] 
movies = pd.read_table('ml-1m/movies.dat', sep ='::', header = None, names = mnames)

This path type doesn't work w/ current repo: ml-1m/users.dat
It should be: 'ch02/movielens/users.dat'

CH08, column name conflicted with Dataframe's property

There was one column named size in the file ch08/tips.csv.

tips = pd.read_csv("tips.csv")
# The former one returned the property while I expected the column
tips.size
tips["size"]

So I'd like to know: is there a reliable way to avoid this kind of conflict?
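One dependable convention is to always use bracket indexing for columns; attribute access is only a convenience and loses whenever a column name collides with an existing DataFrame attribute or method (size, shape, count, ...). A toy sketch:

```python
import pandas as pd

# Toy frame with a column that shadows DataFrame.size
tips = pd.DataFrame({'total_bill': [16.99, 10.34], 'size': [2, 3]})

col = tips['size']    # bracket access always means the column
n_elems = tips.size   # attribute access hits DataFrame.size: 2 rows * 2 cols
```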

Trying to load Yahoo finance data

Hi,

I am trying to load the Yahoo finance data from the .pkl files to use in place of the 'get_data_yahoo example in the Correlation and Covariance section of Ch05.

But I'm struggling to work out how to unpack the data to use in the example - do you have sample code of how to do this?
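Assuming the .pkl files are ordinary pickled pandas objects, `pd.read_pickle` should unpack them directly; a round-trip sketch with made-up prices, since the repo files aren't available here:

```python
import os
import tempfile
import pandas as pd

# Round-trip demo: the same read_pickle call would unpack the repo's .pkl files
prices = pd.DataFrame({'AAPL': [100.0, 101.5], 'MSFT': [50.0, 49.5]})
path = os.path.join(tempfile.mkdtemp(), 'prices.pkl')
prices.to_pickle(path)

price = pd.read_pickle(path)
returns = price.pct_change()  # as in the Correlation and Covariance section
```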

thanks

ch11.ipynb Small issue with the to_index

Under Decile and Quantile analysis the to_index function finds the max of a return index and replaces with 1. I believe it should be looking for the first non NaN and replacing the previous entry with a 1.

def to_index(rets):
    index = (1 + rets).cumprod()
    first_loc = max(index.index.get_loc(index.idxmax()) - 1, 0)
    index.values[first_loc] = 1
    return index

change to:

first_loc = max(index.index.get_loc(index.notnull().argmax()) - 1,0)

Seems to work. There was a proposed change in the O'Reilly errata, but that solution doesn't count back one entry, so it effectively ignores the first day's return. Hope this helps; I am new to Python. Great book!

Typo in book 2nd ed. (Ch 3, p58, slicing)

The example for slicing uses expression
seq=[7, 2, 3, 7, 5, 6, 0, 1]
for seq[-6:-2] the book says that the result should be [6, 3, 5, 6], but it is actually [3, 7, 5, 6].
Just to let you know about that issue.

Large file limit issue when cloning/pushing repo back to GitHub

I just cloned this repo so I can work on it, which involves accessing it from multiple computers, all updated from a personal repo I use to keep everything synced. There is a file, https://github.com/pydata/pydata-book/blob/master/ch09/P00000001-ALL.csv, that is too large to push back to GitHub, so I'm wondering if you're using the new large file support, which is not yet accessible to everyone without invites. I just put in a request for access, but in the meantime I'm hoping for other suggestions.

Thanks!

What would everyone like to see in the 2nd edition?

I've started working on the revised 2nd Edition of Python for Data Analysis. The agenda / table of contents is not set in stone, though!

Any comments on the existing content or requests for new content would be welcome here. I can't make any promises, but since I know how useful the book has been for many people the last 3.5 years, I would like to make sure the 2nd edition is just as useful (if not more so!) in the following 3.5 years (which will put us all the way to 2020, if you can believe it).

Thank you all in advance for the support.

ch6: XML and HTML, Web scraping

Hello,

I ran into a problem running the code from this section in a Jupyter notebook: it kept running and never produced results (screenshot omitted).

Is there something wrong with my code, and how can I fix it?

Looking forward to your reply.

Can't get it to work

I'm new to your book and trying like crazy, but I seem to be missing a lot somewhere, somehow.
I hope you can help me.
I have tried to work through the usa.gov data from bit.ly, and I always get "no such file exists" errors. The same goes for the movielens data; no matter how I try, I get the same errors.
Am I supposed to download those files separately?
I've combed the net looking for better code, but I figured you could point me in the right direction or tell me what I'm doing wrong.

Please help. I would hate for this book to have been a terrible waste!

Ch2 p18 JSON error

Still having trouble using JSON in Python 3

I have added the 'rb' to open()
records = [json.loads(line) for line in open(path,'rb')]

Now getting error in json.loads

TypeError: the JSON object must be str, not 'bytes'
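A sketch of a fix, assuming the file is the usual newline-delimited JSON: open it in text mode (the default) so each line is a str that json.loads accepts, rather than the bytes produced by 'rb'. A stand-in file is created here since the real one isn't available:

```python
import json
import os
import tempfile

# Stand-in for the usagov file: one JSON object per line
path = os.path.join(tempfile.mkdtemp(), 'records.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('{"a": "Mozilla/5.0"}\n{"a": "Opera"}\n')

# Text mode (with an explicit encoding) yields str lines, not bytes:
with open(path, encoding='utf-8') as f:
    records = [json.loads(line) for line in f]
```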

Any help appreciated
Thanks

Typo in Handling Missing Data

The second paragraph of Handling Missing Data (page 189 in my PDF version) has a typo. Change

 for a lot of usres

to

 for a lot of users.

Add README with nbviewer links
