econpy / google-ngrams


Python scripts for retrieving CSV data from the Google Ngram Viewer and plotting it in XKCD style. The Python script for retrieving ngram data was originally modified from the script at www.culturomics.org.

License: MIT License

google-ngrams's Introduction

About

Here you'll find a basic Python script to retrieve the data behind the trajectories plotted on the Google Ngram Viewer. A Python script that creates XKCD style plots from the ngram CSV data is also included, making it simple to create some awesome looking plots!

Dependencies

Usage

Simply type the same query you would type at the Google Ngram Viewer and retrieve the data in csv format.

Quick Gotchas

  • By default, the data is printed on screen and saved to a file in the working directory.
  • Add the -plot option to your query and an XKCD style plot will be saved in the working directory as well.
  • Searches are case-sensitive by default. To perform case-insensitive searches, pass the -caseInsensitive option to your query. The result will be the sum of all common formats of the query (lowercase, uppercase, titlecase, etc).
  • The syntax for modifier and wildcard searches has been slightly modified in order to make the script work as a command line tool. See below for more information on these minor changes.
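The "sum of all common formats" idea behind -caseInsensitive can be illustrated with a small sketch. This is not the code used by getngrams.py or by the Ngram Viewer; the helper names `case_variants` and `sum_series` are hypothetical, the set of variants is an assumption, and the counts are made up.

```python
def case_variants(term):
    """Return the distinct capitalisation variants of a term, in a stable order.

    Assumption: only the original, lowercase, uppercase and titlecase forms
    are considered; the real service may use a different set.
    """
    candidates = [term, term.lower(), term.upper(), term.title()]
    seen = []
    for candidate in candidates:
        if candidate not in seen:
            seen.append(candidate)
    return seen


def sum_series(series_list):
    """Add several equal-length series element by element."""
    return [sum(values) for values in zip(*series_list)]


variants = case_variants("internet")

# Made-up yearly values for each variant, for illustration only.
made_up = {"internet": [1, 2], "INTERNET": [0, 1], "Internet": [3, 4]}
combined = sum_series([made_up[v] for v in variants])
```

Here `combined` plays the role of the single summed column that a case-insensitive query returns.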

Options

  • corpus [default: eng_2012] This will run the query in CORPUS. Possible values are listed under Possible Corpora below.
  • startYear [default: 1800]
  • endYear [default: 2000]
  • smoothing [default: 3] Smoothing parameter (integer)
  • caseInsensitive Return case-insensitive results
  • plot Return an XKCD style plot as a .png file
  • alldata Return every column of available data**
  • nosave Results will not be saved to file
  • noprint Results will not be printed on screen
  • help Prints this screen

** This can be used with inflection, wildcard, and case-insensitive searches (otherwise it does nothing), where one column is the sum of some of the other columns (labeled with a column name ending in "(All)", or with an asterisk for wildcard searches). In the Google Ngram Viewer, the columns whose sum makes up this column are viewable by right clicking on the ngram plot. In the getngrams.py script, these columns are dropped by default, but you can keep them by adding -alldata to your query.

Examples

There are tons of examples below that demonstrate all kinds of available queries.

Basic Examples

Here are some basic example uses of getngrams.py:

python getngrams.py Albert Einstein, Charles Darwin
python getngrams.py aluminum, copper, steel -noprint
python getngrams.py Pearl Harbor, Watergate -corpus=eng_2009
python getngrams.py bells and whistles -startYear=1900 -endYear=2001 -smoothing=2
python getngrams.py internet --startYear=1980 --endYear=2000 --corpus=eng_2012 -caseInsensitive
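The examples above mix query words with options written as -name=value or --name=value. As a rough illustration of how such a command line could be split into a query string and parameters, here is a minimal sketch. It is not the parsing code from getngrams.py; the function name `parse_args` is hypothetical, and only the defaults and option names are taken from the Options list.

```python
# Defaults and flag names as documented in the Options list.
DEFAULTS = {"corpus": "eng_2012", "startYear": 1800, "endYear": 2000, "smoothing": 3}
FLAGS = {"caseInsensitive", "plot", "alldata", "nosave", "noprint", "help"}


def parse_args(argv):
    """Split command-line words into (query, params, flags).

    Words starting with one or two dashes are treated as options;
    everything else is joined back into the query string.
    """
    params = dict(DEFAULTS)
    flags = set()
    words = []
    for arg in argv:
        if arg.startswith("-") and len(arg) > 1:
            name, _, value = arg.lstrip("-").partition("=")
            if name in DEFAULTS:
                # Keep the type of the documented default (int or str).
                is_int = isinstance(DEFAULTS[name], int)
                params[name] = int(value) if is_int else value
            elif name in FLAGS:
                flags.add(name)
            else:
                raise ValueError(f"unknown option: {arg}")
        else:
            words.append(arg)
    return " ".join(words), params, flags


query, params, flags = parse_args(
    "bells and whistles -startYear=1900 -endYear=2001 -smoothing=2".split()
)
```

With this input, `query` is the phrase "bells and whistles" and `params` holds the three overridden values plus the default corpus.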

More Complicated Examples

Wildcard Searches

As in the full Google Ngram Viewer, you can also perform wildcard searches using getngrams.py.

When doing a wildcard search, use the ? character instead of the * character. Using an asterisk will cause the getngrams.py script to fail because your shell will expand the asterisk before Python has a chance to see it.

python getngrams.py United ? --startYear=1850 --endYear=2000 -alldata
python getngrams.py University of ?
python getngrams.py University of ?, ? State University -alldata

Modifier Searches

Modifier searches let you see how often one word modifies another word. The usual syntax for a modifier search uses the => operator. For example, running the query dessert=>tasty would match all instances where the word tasty was used to modify the word dessert.

Modifier searches can be done using getngrams.py, but you must replace the => operator with the @ character.

python getngrams.py car@fast -startYear=1900 -endYear=2000
python getngrams.py car@fast -startYear=1900 -endYear=2000 -alldata
python getngrams.py drink@?_NOUN -startYear=1900 -endYear=2000 -alldata

For more in-depth documentation on wildcard and modifier searches, take a look at the About Ngram Viewer page.
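The two substitutions described above (? in place of *, and @ in place of =>) amount to a simple mapping back to the Ngram Viewer syntax. The sketch below only illustrates that mapping; the function name `to_viewer_syntax` is hypothetical and this is not necessarily how getngrams.py implements it.

```python
def to_viewer_syntax(query):
    """Translate the command-line friendly syntax to Ngram Viewer syntax.

    '@' stands in for the '=>' modifier operator and '?' for the '*' wildcard.
    Assumption: neither character is ever meant literally in a query.
    """
    return query.replace("@", "=>").replace("?", "*")


wildcard = to_viewer_syntax("University of ?, ? State University")
modifier = to_viewer_syntax("drink@?_NOUN")
```

A plain replacement like this would also rewrite a literal question mark or at sign inside a query, which is one reason to treat it as an illustration only.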

Other Examples

python getngrams.py book ? hotel, book_INF a hotel --startYear=1920 --endYear=2000 -alldata
python getngrams.py read ?_DET book
python getngrams.py _DET_ bright_ADJ rainbow
python getngrams.py _START_ President ?_NOUN
python getngrams.py _ROOT_@will

Possible Corpora

eng_2012, eng_2009, eng_us_2012, eng_us_2009, eng_gb_2012, eng_gb_2009, chi_sim_2012, chi_sim_2009, fre_2012,
fre_2009, ger_2012, ger_2009, spa_2012, spa_2009, rus_2012, rus_2009, heb_2012, heb_2009, ita_2012,
eng_fiction_2012, eng_fiction_2009, eng_1m_2009

Plotting

There are 2 easy ways to create your own plots using a CSV file produced by running a query with getngrams.py. To demonstrate the 2 methods, we'll run the following query:

python getngrams.py railroad,radio,television,internet -startYear=1900 -endYear=2000 -caseInsensitive

Plotting w/ xkcd.py

The first way to create a plot is to use the supplied xkcd.py script to generate awesome XKCD style charts. There are two ways to use the script:

  1. Add the -plot option to your command when running getngrams.py:
python getngrams.py railroad,radio,television,internet -startYear=1900 -endYear=2000 -plot -caseInsensitive
  2. You can also use xkcd.py directly by passing the CSV file as an argument:
python xkcd.py railroad_radio_television_internet-eng_2012-1900-2000-3-caseInsensitive.csv
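The CSV file name in the command above appears to encode the query and its parameters. The following sketch reproduces that pattern as inferred from the file names shown in this document; the function name `csv_filename` is hypothetical, and the handling of spaces around commas is an assumption rather than documented behaviour.

```python
def csv_filename(query, corpus="eng_2012", start_year=1800, end_year=2000,
                 smoothing=3, case_insensitive=False):
    """Build a file name following the pattern seen in the examples:
    terms joined by '_', then corpus, years, smoothing and case mode.
    """
    # Assumption: whitespace around each comma-separated term is dropped.
    terms = [term.strip() for term in query.split(",")]
    case = "caseInsensitive" if case_insensitive else "caseSensitive"
    return "{}-{}-{}-{}-{}-{}.csv".format(
        "_".join(terms), corpus, start_year, end_year, smoothing, case
    )


name = csv_filename("railroad,radio,television,internet",
                    start_year=1900, case_insensitive=True)
```

How the real script treats wildcards, tags, or spaces inside a term when naming the file is not covered here.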

Both methods produce the same chart:

Plotting w/ Pandas

Another way to plot data from an ngram CSV file is to read the file into a pandas DataFrame object and call the .plot() method on it. Here we do that, but also convert the data to percentages first and add a title to the plot:

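import matplotlib.pyplot as plt
from pandas import read_csv

# Load the CSV produced by getngrams.py; the first column (the year) becomes the index.
df = read_csv('railroad_radio_television_internet-eng_2012-1900-2000-3-caseInsensitive.csv',
              index_col=0,
              parse_dates=True)
# Convert the raw frequencies to percentages.
df = df * 100
df.plot(title='Railroad, Radio, Television, and Internet')
# Display the figure when running as a plain script.
plt.show()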

License

MIT License

Moreover, PLEASE do respect the terms of service of the Google Ngram Viewer while using this code. This code is meant to help viewers retrieve data behind a few queries, not bang at Google's servers with thousands of queries. The complete dataset can be freely downloaded here. This code is not a Google product and is not endorsed by Google in any way.

With this in mind... happy plotting!

google-ngrams's People

Contributors

econpy, timguoqk

google-ngrams's Issues

License

Thanks for the really nice project. Just wanted to say that you should really put an explicit license on this. Without a license, it's copyrighted. If you don't want to put a license on it, then you should say explicitly that it's in the public domain rather than saying "no license" (which is an implicit statement of copyright).

Back-end update

It looks like Google changed the back-end for returning n-gram data, and now the results are generated in a different format. I had to replace a line in getngrams.py to fix it.

Original (line 29):

res = re.findall('var data = (.*?);\\n', req.text)

New (line 29):

res = re.findall('ngrams.data = (.*?);\\n', req.text)
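One way to stay tolerant of both page formats is to accept either variable name in a single pattern. The sketch below assumes the page embeds the data as a JSON array on one line, as the two regular expressions above suggest; the function name `extract_ngram_data`, the sample strings, and the "ngram" and "timeseries" field names are illustrative assumptions, not taken from the actual response.

```python
import json
import re

# Accept both the old ("var data") and the new ("ngrams.data") assignment.
DATA_PATTERN = re.compile(r"(?:var data|ngrams\.data) = (.*?);\n")


def extract_ngram_data(page_text):
    """Return the embedded JSON payload, or an empty list if none is found."""
    match = DATA_PATTERN.search(page_text)
    if match is None:
        return []
    return json.loads(match.group(1))


# Made-up page fragments for illustration only.
old_page = 'var data = [{"ngram": "radio", "timeseries": [0.1, 0.2]}];\n'
new_page = 'ngrams.data = [{"ngram": "radio", "timeseries": [0.1, 0.2]}];\n'

old_result = extract_ngram_data(old_page)
new_result = extract_ngram_data(new_page)
```

Returning an empty list when nothing matches makes a later format change easier to notice than an index error on an empty `re.findall` result.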

Graphing error

Great module. Any idea, however, why it is throwing the following error when I try to graph anything?

Traceback (most recent call last):
  File "xkcd.py", line 86, in <module>
    plotXKCD(sys.argv[1])
  File "xkcd.py", line 21, in plotXKCD
    plt.xkcd(scale=2, randomness=2.75)
AttributeError: 'module' object has no attribute 'xkcd'

-caseInsensitive doesn't seem to work for n-grams where n>1

This seems to be an issue with n-grams containing "I". All n-grams containing "i" (in lower case) return a frequency value of 0 even when -caseInsensitive is added to the command. Is it that it only takes into account all occurrences of the n-gram in any case (lower, upper or title case) if the provided form (case) of the n-gram exists in the corpora?

caseInsensitive mode has problems

python google-ngrams/getngrams.py Abenaki,Apache -startYear=1950 -endYear=2000

works fine

python google-ngrams/getngrams.py Abenaki,Apache -startYear=1950 -endYear=2000 --caseInsensitive

fails to return results for Abenaki.

Is this broken?

When I run it it does claim to have saved a csv file, but the csv-file is empty. Trying -corpus=eng_2019 also throws an error. Can somebody check whether this works for them? Thanks.

Fails with queries with spaces after `,`

Queries written as first term, second term fail to return all the data. From inspecting with a python debugger, it seems like the issue is in the part of the code that prunes excess columns: it searches for what effectively is the string " second term" in a list of column names that will contain "second term" without the initial space.
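The mismatch described here goes away if each term is stripped of surrounding whitespace before it is compared with the column names. This is a sketch of that idea, not a patch against getngrams.py; the function names `split_query` and `select_columns` are hypothetical.

```python
def split_query(query):
    """Split a comma-separated query into terms without surrounding spaces."""
    return [term.strip() for term in query.split(",") if term.strip()]


def select_columns(column_names, query):
    """Keep only the columns whose name matches one of the query terms."""
    wanted = set(split_query(query))
    return [name for name in column_names if name in wanted]


columns = ["first term", "second term", "unrelated"]
kept = select_columns(columns, "first term, second term")
```

Without the `strip()` call the second term would be looked up as " second term" and silently dropped.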

No graph opening

Hi, great tool you've made here.

When I try to plot a graph with the -plot arg, nothing comes up, and the program runs to completion.

my input is this--- python3 getngrams.py einstein, darwin - plot

I'm working on a mac if that makes a difference. Any idea why this would be happening?

That is when I run it from the command line. I also tried python3 xkcd.py einstein_darwin-eng_2012-1800-2000-3-caseSensitive.csv after that but nothing from there as well.

When I call your script from Blender (I'm trying to interface my simulation code to make a graph), it gives me some errors that look like this:

Data saved to Einstein_Darwin-eng_2012-1800-2000-3-caseSensitive.csv
Unable to revert mtime: /Library/Fonts
Unable to revert mtime: /Library/Fonts/Microsoft
Traceback (most recent call last):
File "google-ngrams/xkcd.py", line 86, in <module>
plotXKCD(sys.argv[1])
File "google-ngrams/xkcd.py", line 82, in plotXKCD
fig.savefig(ngramCSVfile.replace('.csv', '.png'), dpi=190)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/figure.py", line 1421, in savefig
self.canvas.print_figure(*args, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/backend_bases.py", line 2220, in print_figure
**kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/backends/backend_agg.py", line 505, in print_png
FigureCanvasAgg.draw(self)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/backends/backend_agg.py", line 451, in draw
self.figure.draw(self.renderer)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/artist.py", line 55, in draw_wrapper
draw(artist, renderer, *args, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/figure.py", line 1034, in draw
func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/artist.py", line 55, in draw_wrapper
draw(artist, renderer, *args, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 2086, in draw
a.draw(renderer)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/artist.py", line 55, in draw_wrapper
draw(artist, renderer, *args, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axis.py", line 1093, in draw
renderer)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axis.py", line 1042, in _get_tick_bboxes
extent = tick.label1.get_window_extent(renderer)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/text.py", line 754, in get_window_extent
bbox, info, descent = self._get_layout(self._renderer)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/text.py", line 320, in _get_layout
ismath=False)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/text.py", line 312, in get_text_width_height_descent
*kl, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/backend_bases.py", line 584, in get_text_width_height_descent
font = self._text2path._get_font(prop)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/textpath.py", line 52, in _get_font
font = FT2Font(str(fname))
RuntimeError: Could not open facefile Humor-Sans.ttf; Cannot_Open_Resource

so i'm assuming it's something with the font? Have you seen this before?

Unable to process German, Chinese and Hebrew in case-insensitive mode

This can be reproduced with this query:

python getngrams.py פמיניזם:heb_2012, 女性主义:chi_sim_2012 --startYear= --endYear=2008 -caseInsensitive -smoothing=1

The problem is that these languages only return one case, so there is no (All) column, and the data is thrown away in the -AllData routine.
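A possible workaround is to fall back to the plain column whenever no summed column exists. The sketch below shows only that selection logic; the function name `pick_column` is hypothetical, the exact "(All)" label spacing is an assumption, and the sample values are made up.

```python
def pick_column(columns, term):
    """Choose the column to keep for a term in a case-insensitive search.

    Prefer the summed "<term> (All)" column; if the corpus returned only
    one form of the term, fall back to that single column instead.
    """
    summed = term + " (All)"
    if summed in columns:
        return summed
    if term in columns:
        return term
    return None


# Made-up results: one term with a summed column, one without.
results = {
    "internet (All)": [0.5, 0.6],
    "internet": [0.1, 0.2],
    "女性主义": [0.3, 0.4],
}
english = pick_column(results, "internet")
chinese = pick_column(results, "女性主义")
```

Returning `None` for a term with no column at all leaves it to the caller to decide whether to warn or to skip the term.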

Some queries crash the plotting logic

One example of running the script (taken from the README) is:
python getngrams.py _START_ President ?_NOUN
This works fine and produces a CSV but if you try to alter the command to also output a plot:
python getngrams.py _START_ President ?_NOUN -plot
the xkcd.py module will throw an exception:

Traceback (most recent call last):
  File "xkcd.py", line 84, in <module>
    plotXKCD(sys.argv[1])
  File "xkcd.py", line 47, in plotXKCD
    for label in legend.get_texts():
AttributeError: 'NoneType' object has no attribute 'get_texts'

Make it a module?

This script is pretty useful to me and I can think of people that might want to use it.
Have you thought of putting it up on PyPI?
