brandon-rhodes / pycon-pandas-tutorial

PyCon 2015 Pandas tutorial materials

License: MIT License


pycon-pandas-tutorial's Introduction

Welcome to Brandon’s Pandas Tutorial

The first instance of this tutorial was delivered at PyCon 2015 in Montréal, but I hope that many other people will be able to benefit from it over the next few years — both on occasions on which I myself get to deliver it, and also when other instructors are able to do so.

If you want to follow along with the tutorial at home, here is the YouTube recording of the 3-hour tutorial at PyCon itself:

https://www.youtube.com/watch?v=5JnMutdy6Fw

To make it useful to as many people as possible, I hereby release it under the MIT license (see the accompanying LICENSE.txt file) and I have tried to make sure that this repository contains all of the scripts needed to download and set up the data set that we used.

Quick Start

If you have both conda and git on your system (otherwise, read the next section for more detailed instructions):

$ conda install --yes jupyter matplotlib pandas
$ git clone https://github.com/brandon-rhodes/pycon-pandas-tutorial.git
$ cd pycon-pandas-tutorial
$ build/BUILD.sh
$ jupyter notebook

Detailed Instructions

You will need Pandas, the IPython Notebook, and Matplotlib installed before you can successfully run the tutorial notebooks. The Anaconda Distribution is a great way to get up and running quickly without having to install them each separately — running the conda command shown above will install all three.

Note that having git is not necessary for getting the materials. Simply click the “Download ZIP” button over on the right-hand side of this repository’s front page at the following link, and its files will be delivered to you as a ZIP archive:

https://github.com/brandon-rhodes/pycon-pandas-tutorial

Once you have unpacked the ZIP file, download the following four IMDB data files and place them in the tutorial’s build directory:

ftp://ftp.fu-berlin.de/misc/movies/database/frozendata/actors.list.gz
ftp://ftp.fu-berlin.de/misc/movies/database/frozendata/actresses.list.gz
ftp://ftp.fu-berlin.de/misc/movies/database/frozendata/genres.list.gz
ftp://ftp.fu-berlin.de/misc/movies/database/frozendata/release-dates.list.gz

If the above links don’t work for you, try these alternate sources of the same files:

ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/frozendata/actors.list.gz
ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/frozendata/actresses.list.gz
ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/frozendata/genres.list.gz
ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/frozendata/release-dates.list.gz
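
The manual download steps above can also be scripted. The following is only a sketch, not part of the tutorial's own tooling: the helper name download_with_fallback and its fetch parameter are made up for illustration, and it assumes the two mirror directories listed above are still reachable over FTP.

```python
import os
import urllib.request

FILENAMES = [
    "actors.list.gz",
    "actresses.list.gz",
    "genres.list.gz",
    "release-dates.list.gz",
]

# The two mirror directories listed in the instructions above.
MIRRORS = [
    "ftp://ftp.fu-berlin.de/misc/movies/database/frozendata/",
    "ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/frozendata/",
]


def download_with_fallback(filename, dest_dir="build", mirrors=MIRRORS,
                           fetch=urllib.request.urlretrieve):
    """Hypothetical helper: try each mirror in turn, return the URL that worked."""
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, filename)
    errors = []
    for base in mirrors:
        url = base + filename
        try:
            fetch(url, target)
            return url
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            errors.append("%s: %s" % (url, exc))
    raise OSError("all mirrors failed:\n" + "\n".join(errors))
```

Calling download_with_fallback(name) for each entry in FILENAMES would place the four files in the build directory; the fetch parameter exists only so the function can be exercised without network access.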

To convert these into the CSV files that the tutorial needs, run the BUILD.py script with either Python 2 or Python 3. It will create the three CSV files in the data directory that you need to run all of the tutorial examples. It should take about 5 minutes to run on a fast modern machine:

$ python build/BUILD.py

You can then start the notebook server and begin looking at the notebooks (older installations spell the command ipython notebook):

$ jupyter notebook

I hope that the recording and the exercises in this repository prove useful if you are interested in learning more about Python and its data analysis capabilities!

— Brandon Rhodes


pycon-pandas-tutorial's Issues

Exercise 2 - Plotting n values over career questions


# Plot the n-values of the roles that Judi Dench has played over her career.
c = cast
c = c[c.name == 'Judi Dench'].sort_values('year')
c = c[c.n.notnull()]
c.plot(x='year', y='n', kind='scatter')

As far as plotting is concerned, do we really need sort_values() and notnull() here? Null values are ignored in plotting and the x series is automatically sorted, as per my understanding. We can keep these solutions, but having a comment to that effect would add clarity.

On cheatsheet is rename command correct?

Line 99 - df.rename({'a': 'y', 'b': 'z'})

I'm wondering whether it should be df.rename(columns={'a': 'y', 'b': 'z'}).

At the least, this was the pattern I needed to change my column names on pandas 0.16.0.
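
A small check of the two spellings, using a made-up two-column frame: when rename() is given a bare mapping it is applied to the row labels, so the column names only change when the mapping is passed through the columns keyword.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Bare mapping: applied to the index (row labels), so the columns are unchanged.
by_default = df.rename({"a": "y", "b": "z"})

# Keyword form: applied to the column labels.
by_columns = df.rename(columns={"a": "y", "b": "z"})

print(list(by_default.columns))
print(list(by_columns.columns))
```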

Leading couple

Hey,
Great tutorial!
Finished all the exercises and am playing around some more with the data.
Trying to figure out credits for the leading couple (n=1 and n=2), to see their actor/actress distribution across the years. Solved it this way:

c = cast
c = c[c.n <= 2]
m = c.merge(c, on=['title', 'year'])
m = m[(m.n_y != m.n_x) & (m.n_x==1.0)]
quad = m.groupby(['year','type_x', 'type_y']).size().unstack().unstack().fillna(0)
quad['totals'] = quad['actor', 'actor'] + quad['actor', 'actress'] + quad['actress', 'actor'] + quad['actress', 'actress']
quad['actor', 'actor'] = quad['actor', 'actor'] / quad['totals']
quad['actor', 'actress'] = quad['actor', 'actress'] / quad['totals']
quad['actress', 'actor'] = quad['actress', 'actor'] / quad['totals']
quad['actress', 'actress'] = quad['actress', 'actress'] / quad['totals']
quad.drop(['totals'], axis=1).plot(ylim=(0,1), xlim=(1915,2019))
quad.totals.sum()

The sum of totals here is significantly larger than this amount:

len(cast[(cast.n <= 2)].groupby('title'))

I'm struggling to figure out why. Would love some guidance :)

Thanks!
Dan
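
One possible contributor to the mismatch, offered as a guess rather than a diagnosis of the real data: groupby('title') counts distinct titles, so films that share a title across years collapse into one group, while the merge is keyed on both title and year. The miniature cast table below is invented purely to show the effect.

```python
import pandas as pd

# Hypothetical miniature of the cast table: two films share one title.
cast = pd.DataFrame({
    "title": ["Hamlet", "Hamlet", "Hamlet", "Hamlet"],
    "year": [1948, 1948, 1996, 1996],
    "name": ["A", "B", "C", "D"],
    "type": ["actor", "actress", "actor", "actress"],
    "n": [1.0, 2.0, 1.0, 2.0],
})

c = cast[cast.n <= 2]
m = c.merge(c, on=["title", "year"])
pairs = m[(m.n_y != m.n_x) & (m.n_x == 1.0)]

by_title = len(c.groupby("title"))                 # distinct titles
by_title_year = len(c.groupby(["title", "year"]))  # distinct films

print(len(pairs), by_title, by_title_year)
```

Here the merge yields one leading pair per film, which matches the count grouped by title and year but exceeds the count grouped by title alone. Repeated n values within one film in the real data would inflate the merge further, though that is not shown here.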

Is this tutorial up to date?

I just see that it was created in 2015.

genres.list.gz is not parsed by BUILD.py

When running BUILD.py with all 4 data files in the build directory, I've noticed that only 3 csv files are generated under /data
cast.csv
release_dates.csv
titles.csv

Looking at BUILD.py, genres.csv is never written. No error message is displayed.

See the attached screenshot showing BUILD.py running.

missing data

Where are the titles.csv and cast.csv data files?

cannot run BUILD.py

Hi Brandon,

I am dying to start following your video, but I cannot run this file properly. This is the error message. Please help!!

Thank you!

David

Can't convert the data

This is the error that I get:

C:\Users\user\Desktop\pycon-pandas-tutorial-master\build>python build.py
Traceback (most recent call last):
  File "build.py", line 221, in <module>
    main()
  File "build.py", line 12, in main
    os.chdir(os.path.dirname(__file__))
OSError: [WinError 123] The filename, directory name, or volume label syntax is
incorrect: ''
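
The traceback shows os.chdir() receiving an empty string, which is what os.path.dirname() returns when the script is started by bare filename from inside its own directory. A minimal sketch of one workaround follows; script_directory is a name made up here and is not taken from BUILD.py.

```python
import os


def script_directory(script_path):
    """Return an absolute directory for script_path, even for a bare filename."""
    return os.path.dirname(os.path.abspath(script_path))


# A bare filename has no directory component, hence the empty string.
print(repr(os.path.dirname("build.py")))
print(script_directory("build.py"))
```

Inside a script, os.chdir(script_directory(__file__)) would then work regardless of how the script was invoked. Running python build/BUILD.py from the repository root, as the README shows, avoids the empty string as well.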

Exercises-3 Solution | Use groupby() to determine how many roles are listed for each of the Pink Panther movies.

Hey Brandon. Great tutorial, and you're an excellent teacher.

The solution provided for the above-mentioned exercise is:

c = cast
c = c[c.title == 'The Pink Panther']
c = c.sort_values('n').groupby(['year'])[['n']].max()
c

result:

        n
year
1963   15
2006   50

However, in an effort to include all Pink Panther movies, I think it should be:

c = cast
c = c[c.title.str.contains("Pink Panther")]
c.groupby(['title', 'year']).n.max()

result:

title                           year
Curse of the Pink Panther       1983    63.0
Revenge of the Pink Panther     1978    57.0
Son of the Pink Panther         1993    43.0
The Pink Panther                1963    15.0
                                2006    50.0
The Pink Panther 2              2009    36.0
The Pink Panther Strikes Again  1976    60.0
The Return of the Pink Panther  1975    27.0
Trail of the Pink Panther       1982    32.0
Name: n, dtype: float64

Overall, I think it depends on the question's phrasing. If the question is focused only on "The Pink Panther", the "T" in "The" needs to be capitalized in the question. Thanks again.

I also get an error when running python BUILD.py

Reading "genres.list.gz" to find interesting movies
Found 235702 titles
Writing "titles.csv"
Finished writing "titles.csv"
Reading release dates from "release-dates.list.gz"
Finished writing "release_dates.csv"
Reading 'actors.list.gz'
Traceback (most recent call last):
  File "build/BUILD.py", line 224, in <module>
    main()
  File "build/BUILD.py", line 119, in main
    for line in lines:
  File "/usr/lib/python2.7/gzip.py", line 464, in readline
    c = self.read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 268, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 315, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 354, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0xcf2f2e89 != 0xf6d69ce7L
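
A CRC failure while reading a .gz file usually points at a damaged or incomplete download rather than at the script. Before re-running the build, each file can be checked on its own by decompressing it to the end. This is a sketch written for Python 3, and gzip_is_intact is a made-up helper name.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole file, discarding output; False if the stream is damaged."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers CRC and header problems, EOFError a truncated file.
        return False
    return True
```

If gzip_is_intact("build/actors.list.gz") comes back False, downloading that file again (in binary mode, if an FTP client is used) is the first thing to try.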

creating Cast csv file

Hi,
I am trying to create the CSV files for the exercises.
As far as I understand, titles.csv was created perfectly, but I have some issues with the cast file.
This is the error I get while building the CSV files.
Can you please help me solve this error?
Thanks

(base) PS D:\Python scripts\pycon-pandas-tutorial> python build/BUILD.py
Reading "genres.list.gz" to find interesting movies
Found 226013 titles
Writing "titles.csv"
Finished writing "titles.csv"
Reading release dates from "release-dates.list.gz"
Finished writing "release_dates.csv"
Reading 'actors.list.gz'
Traceback (most recent call last):
  File "build/BUILD.py", line 229, in <module>
  File "build/BUILD.py", line 176, in main
  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x99' in position 38: character maps to <undefined>
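
The cp1252.py frame in this traceback suggests the output file was opened with the Windows default encoding. A minimal sketch of writing the same kind of row with an explicit encoding follows; the row contents and file name are invented, and whether BUILD.py should be changed this way is an assumption.

```python
import csv
import os
import tempfile

row = ("Some Title\x99", 1999)  # '\x99' is the character named in the traceback

# Without an explicit encoding, open() falls back to the platform default
# (cp1252 on many Windows setups), which cannot encode this character.
try:
    row[0].encode("cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 fails:", exc.reason)

# Naming the encoding makes the write independent of the platform default.
path = os.path.join(tempfile.mkdtemp(), "cast_sample.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerow(row)

with open(path, encoding="utf-8", newline="") as f:
    round_trip = next(csv.reader(f))
print(round_trip)
```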

Exercise 6 Title Casing the 'title' column in cleanup

Working through Exercise 6, title casing the 'title' column in the final dataframe (or in the sales2 dataframe) is useful to catch the difference between 'Pining For The Fisheries of Yore' in sales1 and 'Pining for the Fisheries of Yore' in sales2:

# title case 'title' column
df['title'] = df['title'].str.title()
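
A quick check of the suggestion on the two spellings quoted above, using one-element Series as stand-ins for the real sales1 and sales2 columns:

```python
import pandas as pd

sales1 = pd.Series(["Pining For The Fisheries of Yore"])
sales2 = pd.Series(["Pining for the Fisheries of Yore"])

# str.title() capitalizes every word, including "of", so both normalize alike.
normalized1 = sales1.str.title()
normalized2 = sales2.str.title()

print(normalized1[0])
print((normalized1 == normalized2).all())
```

One caveat: title casing also capitalizes the letter after an apostrophe, so it is better suited to matching rows than to producing display text.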

Handling UTF-8 Encoding for cast.csv

Hello,

I'm just getting started working through the exercises after watching the PyCon talk on YouTube. I'm having trouble reading in the 'cast.csv' file. In particular, I can see that the names with accented characters confuse my dataframe, and come up as question marks. See below:

In the above screenshot I try to read in the csv by adding encoding = 'utf-8' to the .read_csv function's parameters. Clearly this is not working, but I have been unable to implement the fix on my own after searching the web for almost an hour; I should note the first thing I tried was using the commands detailed on the cheat sheet (import sys; reload(sys); sys.setdefaultencoding('utf-8')), but these don't seem to be compatible in python3 and internet communities say to avoid this set of commands.

I would appreciate any help on this issue - looking to continue working on the exercises. Thanks!
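
One possible explanation, stated as an assumption: a BUILD.py line quoted in another report in this list appears to decode text as ASCII with replacement and then turn the replacement character into '?'. If the build did that, the question marks are stored in the CSV itself, and no encoding argument to read_csv can bring the accents back. The sample bytes below are invented for illustration.

```python
raw = "Montréal".encode("latin-1")  # hypothetical bytes as they might sit in a .list file

# Decode as ASCII, substituting U+FFFD for undecodable bytes, then swap in '?'.
lossy = raw.decode("ascii", "replace").replace("\ufffd", "?")
print(lossy)

# The original letter is gone: re-encoding and decoding only reproduces the '?'.
print(lossy.encode("utf-8").decode("utf-8") == lossy)
```

Opening cast.csv in a text editor and searching for a name that should contain an accent would confirm whether this is what happened.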

please help me with the build.py issue

= RESTART: /Users/haribezaldein/pycon-pandas-tutorial-master/build/BUILD.py =
Traceback (most recent call last):
  File "/Users/haribezaldein/pycon-pandas-tutorial-master/build/BUILD.py", line 226, in <module>
    main()
  File "/Users/haribezaldein/pycon-pandas-tutorial-master/build/BUILD.py", line 22, in main
    lines = iter(gzip.open('genres.list.gz'))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'genres.list.gz'
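
This error means the script could not find genres.list.gz where it looked. According to the README, the four .list.gz files belong in the build directory. A small pre-flight check, sketched here with a made-up helper name, lists whichever of them are missing:

```python
import os

REQUIRED = [
    "actors.list.gz",
    "actresses.list.gz",
    "genres.list.gz",
    "release-dates.list.gz",
]


def missing_inputs(build_dir="build"):
    """Return the required .list.gz files that are not present in build_dir."""
    return [name for name in REQUIRED
            if not os.path.isfile(os.path.join(build_dir, name))]
```

An empty list from missing_inputs() means all four inputs are in place; anything else names the files still to be downloaded.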

BUILD.py - TypeError with option 'wb'

On Windows with Python 3.5, I get the following error with 'wb' option for the open file statement. As advised, with just the 'w' option, blank lines are added to the file. Below is the error message.

\build\BUILD.py", line 58, in main
output.writerow(('title', 'year'))
TypeError: a bytes-like object is required, not 'str'
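
On Python 3 the csv module writes text, so the file has to be opened in text mode; passing newline='' then hands line endings to the csv writer, which is what prevents the extra blank lines on Windows. A minimal sketch with an invented file name and sample row:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "titles_sample.csv")

# Text mode avoids the TypeError from 'wb'; newline='' avoids the blank rows
# that plain 'w' produces on Windows.
with open(path, "w", newline="", encoding="utf-8") as f:
    output = csv.writer(f)
    output.writerow(("title", "year"))
    output.writerow(("Hamlet", 1948))

with open(path, "rb") as f:
    written = f.read()
print(written)
```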

Data not downloadable

I can't find data at ftp://ftp.fu-berlin.de/pub/temporaryaccess/misc/movies/database/actors.list.gz
I think the data has been removed.
I also checked http://www.imdb.com/interfaces.

Be more specific with English in questions

For example, the question 'How many movies have the title "Hamlet"?'
If you assume this means title EXACTLY equal to "Hamlet" you get
(titles['title'] == 'Hamlet').sum() RETURNS 18
but
this could also mean, any movies with the word Hamlet in its title, which is more complicated as
(titles['title'].str.contains('Hamlet', case=False)).sum() RETURNS 54

NaN values in cast.title prevent c.loc[] from being fast after .sort_index()

Hi Brandon, this is a wonderful tutorial, thanks so much for making it.

The issue I encountered is that running c = cast.set_index(['title']).sort_index() didn't speed up the subsequent search with c.loc['Sleuth']. (This relates to the one-minute segment of the tutorial on YouTube starting at 1:08:54 https://youtu.be/5JnMutdy6Fw?t=4134)

I think the problem is that my version of the cast dataframe has six movies with NaN as the title. When these NaN values get into the index, sorting the index doesn't speed up c.loc['Sleuth']. At least I think this is true based on testing I did with randomly generated dataframes with and without NaN in the index.

I fixed it by making a copy of the cast dataframe without those six movies (the movies with NaN titles), followed by setting and sorting the title index, like this:

c = cast[cast.title.notnull()]
c = c.set_index(['title']).sort_index()

Running c.loc['Sleuth'] on this new NaN-free dataframe is very fast, as expected.

It's possible that I made a mistake when downloading the original data and running the build to make cast. Either way, I thought I should mention this in case someone else has the same issue.

Solutions-1.ipynb doesn't work

Hi Brandon, and thanks for your wonderful tutorial.

There seems to be an issue with Solutions-1.ipynb: it doesn't open and gives the following error:

"Error loading notebook
Unreadable Notebook: D:\Data Science\pandas-brendon\Solutions-1.ipynb NotJSONError('Notebook does not appear to be JSON: '{\n "cells": [\n {\n "cell_type": "c...',)"

The other notebooks work like a charm though.

I noticed you made a commit to this specific notebook recently. Could this be the reason?
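
A notebook file is plain JSON, so the standard json module can report where parsing stops. This is only a diagnostic sketch with a made-up helper name; it says nothing about what actually went wrong in the file from this report.

```python
import json


def locate_json_error(path):
    """Return None if path parses as JSON, else a (line, column, message) tuple."""
    with open(path, encoding="utf-8") as f:
        try:
            json.load(f)
        except json.JSONDecodeError as exc:
            return (exc.lineno, exc.colno, exc.msg)
    return None
```

Calling locate_json_error("Solutions-1.ipynb") and opening the file at the reported line in a text editor would show what the parser tripped over.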

Thanks again

downloaded data corrupted

Whoops. Not corrupted, just very different from the original data used in the PyCon demonstration.

(I would delete this issue, but don't seem to have that option).

error running python BUILD.py

python BUILD.py
Reading "genres.list.gz" to find interesting movies
Found 234315 titles
Writing "titles.csv"
Traceback (most recent call last):
  File "BUILD.py", line 224, in <module>
    main()
  File "BUILD.py", line 58, in main
    output.writerow(('title', 'year'))
TypeError: a bytes-like object is required, not 'str'

Can we remove redundant exercises?

Hi Brandon,

I appreciate the time and effort you put into coming up with so many exercises for us to practice. However, it would be beneficial if the exercises were not repetitive. For example, of these 2 questions, only 1 needs to be kept.

How many movies were made in the year 1950?

len(titles[titles.year == 1950])

How many movies were made in the year 1960?

len(titles[titles.year == 1960])

Every exercise should contain something new so that my time as a student practising is well spent. When there are repetitions, I feel that I should stop doing more exercises, but there are some valuable exercises at the bottom which test different things, so I would NOT want to miss out on them.

Guidance for completing the ground-up learning

Hi,
From your experience, is there a book or course you would recommend that could significantly help me build on the foundation I have gained from your lecture and exercise materials? Or should I just work through problems and learn from the library documentation?
Is there any advice from your own learning experience that would help me reach the kind of easy, organized understanding you presented in your lecture, which I rarely manage to get?
Thanks,

Can't get data for the tutorial

@brandon-rhodes Great tutorial, thanks!

Problem:
When I run BUILD.sh, the error below happens. How can I get imdbpy2sql.py?

/BUILD.sh: line 8: imdbpy2sql.py: command not found
chmod: *.csv: No such file or directory
rm: *.db: No such file or directory

dataframe sort() doesn't exist any more

(I have consistently touted this as one of the best tutorials in any topic in programming that I have read. I keep coming back to this every time I have been away from Pandas for a while and have to refresh my memory. Thanks for ever for doing this.)

Perhaps an errata would be helpful to correct for the API changes that have happened so far.

For one, I found out that dataframe .sort() doesn't exist any more and .sort_values() should be used instead.
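
For reference, the replacement looks like this in current pandas, shown on a small invented titles frame:

```python
import pandas as pd

titles = pd.DataFrame({
    "title": ["Hamlet", "Macbeth", "Othello"],
    "year": [1996, 1948, 1965],
})

# Where older notebooks call titles.sort('year'), use sort_values().
by_year = titles.sort_values("year")

# Sorting by row labels has its own method.
by_index = by_year.sort_index()

print(by_year.year.tolist())
```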

NameError: name 'titles' is not defined

Hi All

I have downloaded and unzipped the necessary zip file from:

"pycon-pandas-tutorial-master.zip"
https://github.com/brandon-rhodes/pycon-pandas-tutorial

I receive the following error when I try to execute the len(titles) code for the following question:

How many movies are listed in the titles dataframe?
In [1]:

len(titles)

NameError                                 Traceback (most recent call last)
<ipython-input-1-26ef20dbfe5c> in <module>()
----> 1 len(titles)

**NameError: name 'titles' is not defined**

Is there another file required?
Also where is the "Titles.csv" file?

Kind Regards
Hiten

Solutions-1: "how many people have played ..." questions miss `.name.unique()`

The answer in the solution counts how many movies with a character with that name there are in our cast dataframe, not how many actors have played a character: if an actor plays the same character in two different movies, they should be counted once, not twice.

The answer should be:

len(cast[cast.character == "Ophelia"].name.unique())
# gives 100

instead of

c = cast
c = c[c.character == 'Ophelia']
len(c)
# gives 102

It only impacts Ophelia and The Stranger, as far as I see. (And Sidney Poitier and Judi Dench in the following questions about their roles, if by role we mean character [I don't know if we should].)
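
The difference between counting roles and counting performers can be seen on a tiny invented table in which one performer plays the same character in two films. Series.nunique() is a shorter spelling of len(...unique()).

```python
import pandas as pd

# Hypothetical rows: Performer A plays Ophelia in two different films.
cast = pd.DataFrame({
    "title": ["Film X", "Film Y", "Film Z"],
    "year": [1948, 1996, 2018],
    "name": ["Performer A", "Performer B", "Performer A"],
    "character": ["Ophelia", "Ophelia", "Ophelia"],
})

c = cast[cast.character == "Ophelia"]
roles = len(c)               # one per row
performers = c.name.nunique()  # distinct names only

print(roles, performers)
```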

FileNotFoundError: File b'data/titles.csv' does not exist

Brandon.
First, Thank you! for creating this great lesson and video for our collective benefit.

I'm a newbie, so please assume little to no knowledge on my part.

I was able to get the first two cells to run without error (importing MPL and activating the CSS to make the dataframes more readable), but as soon as I do anything else, I start getting these long, daunting error messages (please see below for what I get when I attempt to run the third cell).

I can't even go to the bottom, add a few extra cells and get the 'titles' df to reveal itself in the 'first20/last20' format. When I go to the bottom, type 'titles' , hit , I get the following error message:

NameError Traceback (most recent call last)
in ()
----> 1 titles

NameError: name 'titles' is not defined

Could you please tell me what I've done wrong? I would like to follow this lesson and practice the exercises.

Example of error message on third cell (referenced above):

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-7-e0ddc450ec2e> in <module>()
----> 1 titles = pd.DataFrame.from_csv('data/titles.csv', index_col=None, encoding = 'utf-8')
      2 titles.head()

C:\Users\Patrick\Anaconda3\lib\site-packages\pandas\core\frame.py in from_csv(cls, path, header, sep, index_col, parse_dates, encoding, tupleize_cols, infer_datetime_format)
   1259                           parse_dates=parse_dates, index_col=index_col,
   1260                           encoding=encoding, tupleize_cols=tupleize_cols,
-> 1261                           infer_datetime_format=infer_datetime_format)
   1262 
   1263     def to_sparse(self, fill_value=None, kind='block'):

C:\Users\Patrick\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    653                     skip_blank_lines=skip_blank_lines)
    654 
--> 655         return _read(filepath_or_buffer, kwds)
    656 
    657     parser_f.__name__ = name

C:\Users\Patrick\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    403 
    404     # Create the parser.
--> 405     parser = TextFileReader(filepath_or_buffer, **kwds)
    406 
    407     if chunksize or iterator:

C:\Users\Patrick\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    760             self.options['has_index_names'] = kwds['has_index_names']
    761 
--> 762         self._make_engine(self.engine)
    763 
    764     def close(self):

C:\Users\Patrick\Anaconda3\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
    964     def _make_engine(self, engine='c'):
    965         if engine == 'c':
--> 966             self._engine = CParserWrapper(self.f, **self.options)
    967         else:
    968             if engine == 'python':

C:\Users\Patrick\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1580         kwds['allow_leading_cols'] = self.index_col is not False
   1581 
-> 1582         self._reader = parsers.TextReader(src, **kwds)
   1583 
   1584         # XXX

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas\_libs\parsers.c:4209)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source (pandas\_libs\parsers.c:8873)()

FileNotFoundError: File b'data/titles.csv' does not exist
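
The last line of this traceback is the important one: data/titles.csv does not exist relative to the notebook's working directory, which usually means the build step from the README has not produced the CSV files yet. The sketch below uses a made-up helper, load_table, to make that failure easier to read. It calls pd.read_csv because DataFrame.from_csv has been removed from recent pandas releases.

```python
import os

import pandas as pd


def load_table(name, data_dir="data"):
    """Read data_dir/name.csv, with a pointed message if the file is absent."""
    path = os.path.join(data_dir, name + ".csv")
    if not os.path.exists(path):
        raise FileNotFoundError(
            "%s not found (working directory: %s); run build/BUILD.py first"
            % (path, os.getcwd()))
    return pd.read_csv(path, index_col=None, encoding="utf-8")
```

With the CSV files in place, titles = load_table("titles") defines the name that the later cells expect.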

I found some strange characters in the cast.csv. Is this normal?

See the name column. Not sure if this is correct or not.

Data file missing

The data file is missing.

Problem no. 3, 5, 6 and 7 in Exercise-4

The solutions for problems 3, 5, 6 and 7 in Exercise-4 appear to be missing plot data for the years with no roles for actresses (viz. 1900, 1905, 1907, 1909).
This can be verified by plotting a subset (using head() with plot()).
It can be fixed (see below) by using "fillna(0)" while unstacking the series into a DataFrame.
Surprisingly, the area plot (kind = 'area'), used in problem no. 4, is not affected by NaNs.

# Plot the number of actor roles each year
# and the number of actress roles each year
# over the history of film.

c = cast 
c = c.groupby(['year', 'type']).size()
#c = c.unstack('type')              # Causing missing data in plot for NaNs
c = c.unstack('type').fillna(0)     # No missing data 
c.plot()                            # Verify by c.head(10).plot()




# Plot the difference between the number of actor roles each year
# and the number of actress roles each year over the history of film.

c = cast
c = c.groupby(['year', 'type']).size()
#c = c.unstack('type')            # Missing data 
c = c.unstack('type').fillna(0)   # No missing data  
(c.actor - c.actress).plot()



# Plot the fraction of roles that have been 'actor' roles
# each year in the history of film.
c = cast
c = c.groupby(['year', 'type']).size()
#c = c.unstack('type')            # Missing data 
c = c.unstack('type').fillna(0)   # No missing data
c1 = c.head(100)
(c1.actor / (c1.actor + c1.actress)).plot(ylim=[0,1])



# Plot the fraction of supporting (n=2) roles
# that have been 'actor' roles
# each year in the history of film.

c = cast
c = c[c.n == 2]
c = c.groupby(['year', 'type']).size()
#c = c.unstack('type')            # Missing data 
c = c.unstack('type').fillna(0) # No missing data
(c.actor / (c.actor + c.actress)).plot(ylim=[0,1])
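
The effect described in this report can be reproduced without the full data set. In the invented table below, 1900 has actor roles only, so unstacking leaves a NaN that propagates through the arithmetic. As an alternative to chaining fillna(0), unstack() accepts a fill_value argument.

```python
import pandas as pd

# Hypothetical rows: 1900 has actor roles only.
cast = pd.DataFrame({
    "year": [1900, 1900, 1901, 1901],
    "type": ["actor", "actor", "actor", "actress"],
})

counts = cast.groupby(["year", "type"]).size()
with_gaps = counts.unstack("type")             # 1900/actress is NaN
filled = counts.unstack("type", fill_value=0)  # same table, zeros instead

gap_diff = with_gaps.actor - with_gaps.actress   # NaN for 1900
full_diff = filled.actor - filled.actress        # defined for every year

print(gap_diff.tolist())
print(full_diff.tolist())
```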

BUILD.py file doesn't work

I tried to run the BUILD.py file and got this error:

Any idea why?

Exercise 1 - Treasure Island movies question

Hi Brandon,

Thanks for an excellent tutorial.

Your answer to q "List all of the "Treasure Island" movies from earliest to most recent." is
titles[titles.title == 'Treasure Island'].sort_values('year')

There are a few Treasure Island movies (e.g. Treasure Island (II) ) which will not make it through this filter.

This could be addressed by changing the answer to the following:
titles[titles.title.str.startswith('Treasure Island')].sort_values('year')

Or perhaps the question is a little ambiguous.

All the best,

Richard

"FTP" protocol problem

Hello, can you provide a more feasible way to download the data sets? Thanks.

In README.md instructions, switch to purely conda install ipython ..?

Is it necessary to install ipython-notebook instead of just using ipython in the initial setup instructions? Was thinking maybe we could update readme to install ipython with notebook built-in.

Final Problem Exercise-4

# Build a plot with a line for each rank n=1 through n=3,
# where the line shows what fraction of that rank's roles
# were 'actor' roles for each year in the history of film.
Your solution below appears to be missing data for n = NaN when plotted:
c = cast
c = c[c.n <= 3]
c = c.groupby(['year', 'type', 'n']).size()
c = c.unstack('type')
r = c.actor / (c.actor + c.actress)
r = r.unstack('n')
r.plot(ylim=[0,1])

I believe that if you don't do fillna(0) on the c df "n" data that you lose fraction in the plot from NaN.

c = cast
c = c[c.n <= 3]
c = c.groupby(['year','type','n']).size()
c = c.unstack('type').fillna(0)
f = c.actor/(c.actor+c.actress)
f = f.unstack('n')
f.plot(ylim=[0,1])

ftp data sites do not work

Hi, the data files do not successfully download from the ftp sites. Is there some trick or setting that helps this to happen?

Error converting lists into csv files

Got the following output running build.py:

Reading "genres.list.gz" to find interesting movies
Found 216537 titles
Writing "titles.csv"
Finished writing "titles.csv"
Reading release dates from "release-dates.list.gz"
Finished writing "release_dates.csv"
Reading 'actors.list.gz'
Reading 'actresses.list.gz'
Traceback (most recent call last):
  File ".\BUILD.py", line 221, in <module>
    main()
  File ".\BUILD.py", line 132, in main
    if not_a_real_movie(fields[1]):
IndexError: list index out of range

Titles.csv file?

Hi! I found this tutorial on YouTube and it's been very helpful. I tried doing some of the exercises but I wasn't able to find the titles.csv file. I'm not sure if I missed a link for it during the video or if there's a file in the repository that I'm missing. Thank you!

BUILD.py doesn't work

BUILD.py doesn't work, and I've wasted at least an hour trying to convert your gz files into CSV files.

Why can't you just make the CSV files directly available in your GitHub repository and not make us go through this whole rigmarole?

Thanks,

problem with BUILD.py running !!

Hi Brandon,
Thank you very much for your great lecture. I am really dying to practice the code, but this strange error message keeps coming up. Please help!!!
Thanks,

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
C:\Users\Ahmed\AppData\Local\Temp\Rar$DIa0.595\BUILD.py in <module>()
    221 
    222 if __name__ == '__main__':
--> 223      main()

C:\Users\Ahmed\AppData\Local\Temp\Rar$DIa0.595\BUILD.py in main()
    116         assert b'----' in next(lines)
    117 
--> 118         for line in lines:
    119             if line.startswith(b'----------------------'):
    120                 break

C:\Users\Ahmed\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.7.4.3348.win-x86_64\lib\gzip.pyc in readline(self, size)
    462         bufs = []
    463         while size != 0:
--> 464             c = self.read(readsize)
    465             i = c.find('\n')
    466 

C:\Users\Ahmed\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.7.4.3348.win-x86_64\lib\gzip.pyc in read(self, size)
    266             try:
    267                 while size > self.extrasize:
--> 268                     self._read(readsize)
    269                     readsize = min(self.max_read_chunk, readsize * 2)
    270             except EOFError:

C:\Users\Ahmed\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.7.4.3348.win-x86_64\lib\gzip.pyc in _read(self, size)
    313         if buf == "":
    314             uncompress = self.decompress.flush()
--> 315             self._read_eof()
    316             self._add_read_data( uncompress )
    317             raise EOFError, 'Reached EOF'

C:\Users\Ahmed\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.7.4.3348.win-x86_64\lib\gzip.pyc in _read_eof(self)
    352         if crc32 != self.crc:
    353             raise IOError("CRC check failed %s != %s" % (hex(crc32),
--> 354                                                          hex(self.crc)))
    355         elif isize != (self.size & 0xffffffffL):
    356             raise IOError, "Incorrect length of data produced"

IOError: CRC check failed 0xad043ea0L != 0x8a9d3edL

Error converting to csv files

Hello,

I keep getting this error when I try to convert the data in the build file into csv files:

 File "Build.py", line 217
    return s.decode<"ascii", "replace">.replace<u'\ufffd', u'?'>
SyntaxError: invalid syntax

I have tried deleting all of the zip files for the course and re-downloading them, starting from scratch. My friend was able to get everything working on his computer without issue and so we cannot figure out what the problem is. Can you offer any advice or fixes?

Thank you!

Names off in cast.csv

Not sure what I'm doing wrong here. The build seemed to go fine and I added the encoding tag, but the names are looking funky. If anyone has time to send me some advice, it would be much appreciated! Cheers!

"untrusted notebook pycon-pandas-tutorial/Exercises-1.ipynb"

Hi,

The Exercises-1.ipynb notebook can't connect to the Notebook server.
Any ideas?

David-Laxers-MacBook-Pro:~ davidlaxer$ ipython notebook --profile=nbserver
[I 19:08:00.252 NotebookApp] Using MathJax from CDN: https://cdn.mathjax.org/mathjax/latest/MathJax.js
[I 19:08:00.317 NotebookApp] Serving notebooks from local directory: /Users/davidlaxer
[I 19:08:00.317 NotebookApp] 0 active kernels
[I 19:08:00.317 NotebookApp] The IPython Notebook is running at: https://127.0.0.1:9999/
[I 19:08:00.317 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 19:08:25.637 NotebookApp] Writing notebook-signing key to /Users/davidlaxer/.ipython/profile_nbserver/security/notebook_secret
[W 19:08:25.645 NotebookApp] Notebook pycon-pandas-tutorial/Exercises-1.ipynb is not trusted
[I 19:08:27.843 NotebookApp] Kernel started: d34339cc-4141-49f9-aa53-1ba050f2dc5d
[I 19:10:27.956 NotebookApp] Saving file at /pycon-pandas-tutorial/Exercises-1.ipynb
[W 19:10:27.960 NotebookApp] Saving untrusted notebook pycon-pandas-tutorial/Exercises-1.ipynb

Where to download data?

I can't find the instructions to download the data: titles.csv and cast.csv - any hints?

Error when running BUILD.py

Hi!

I'm having problems with this, I tried checking the other issues but it doesn't seem to be the same problem

Thanks!
