econpy / google-ngrams

Python scripts for retrieving CSV data from the Google Ngram Viewer and plotting it in XKCD style. The Python script for retrieving ngram data was originally modified from the script at www.culturomics.org.

License: MIT License

Python 100.00%

google-ngrams's Introduction

About

Here you'll find a basic Python script that retrieves the data behind the trajectories plotted on the Google Ngram Viewer. A Python script that creates XKCD style plots from the ngram CSV data is also included, making it simple to create some awesome looking plots!

Dependencies

Usage

Simply type the same query you would type at the Google Ngram Viewer and retrieve the data in CSV format.

Quick Gotchas

By default, the data is printed on screen and saved to a file in the working directory.
Add the -plot option to your query and an XKCD style plot will be saved in the working directory as well.
Searches are case-sensitive by default. To perform case-insensitive searches, pass the -caseInsensitive option to your query. The result will be the sum of all common formats of the query (lowercase, uppercase, titlecase, etc).
The syntax for modifier and wildcard searches has been slightly modified in order to make the script work as a command line tool. See below for more information on these minor changes.
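
The first gotcha mentions a file saved in the working directory. Judging only from the file names that appear elsewhere in this README (for example railroad_radio_television_internet-eng_2012-1900-2000-3-caseInsensitive.csv), the name seems to encode the query terms, corpus, year range, smoothing, and case mode. The sketch below reproduces that pattern. The function guess_csv_name is hypothetical and not part of getngrams.py, and how the script handles multi-word terms or special characters is not shown here, so the sketch does not try to.

```python
def guess_csv_name(query, corpus="eng_2012", start_year=1800, end_year=2000,
                   smoothing=3, case_insensitive=False):
    """Guess the CSV file name for a query.

    Hypothetical helper: the pattern is inferred from example file names,
    not taken from the getngrams.py source.
    """
    # Split on commas and drop surrounding spaces from each term.
    terms = [term.strip() for term in query.split(",")]
    case_tag = "caseInsensitive" if case_insensitive else "caseSensitive"
    return "{}-{}-{}-{}-{}-{}.csv".format(
        "_".join(terms), corpus, start_year, end_year, smoothing, case_tag)


print(guess_csv_name("railroad,radio,television,internet",
                     start_year=1900, case_insensitive=True))
# railroad_radio_television_internet-eng_2012-1900-2000-3-caseInsensitive.csv
```

The defaults mirror the ones listed under Options, so a call with only the query corresponds to a run with no extra flags.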

Options

corpus [default: eng_2012] Runs the query in CORPUS. Possible values are listed under Possible Corpora below.
startYear [default: 1800]
endYear [default: 2000]
smoothing [default: 3] Smoothing parameter (integer)
caseInsensitive Return case-insensitive results
plot Return an XKCD style plot as a .png file
alldata Return every column of available data**
nosave Results will not be saved to file
noprint Results will not be printed on screen
help Prints this screen

** This can be used with inflection, wildcard, and case-insensitive searches (otherwise it does nothing), where one column is the sum of some of the other columns (labeled with a column name ending in "(All)" or an asterisk for wildcard searches). In the Google Ngram Viewer, the columns whose sum makes up this column are viewable by right clicking on the ngram plot. In the getngrams.py script, these columns are dropped by default, but you can keep them by adding -alldata to your query.
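
To make the default column dropping concrete, here is a minimal sketch for the case-insensitive case only. It assumes the sum column is named like "internet (All)" and that its components differ from it only in letter case. The function drop_case_variants and the sample column names (including "year") are illustrative assumptions, not the actual getngrams.py code. Wildcard and inflection searches would need a different matching rule.

```python
ALL_SUFFIX = " (All)"


def drop_case_variants(columns):
    """Keep '(All)' sum columns and drop the case variants they sum over.

    Hypothetical helper; columns without a matching '(All)' column are kept.
    """
    # Lowercased base terms that have a sum column, e.g. {"internet"}.
    summed = {c[:-len(ALL_SUFFIX)].lower()
              for c in columns if c.endswith(ALL_SUFFIX)}
    return [c for c in columns
            if c.endswith(ALL_SUFFIX) or c.lower() not in summed]


cols = ["year", "internet (All)", "internet", "Internet", "INTERNET", "radio"]
print(drop_case_variants(cols))
# ['year', 'internet (All)', 'radio']
```

Passing -alldata corresponds to skipping this step and keeping every column.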

Examples

There are tons of examples below that demonstrate all kinds of available queries.

Basic Examples

Here are some basic example uses of getngrams.py:

python getngrams.py Albert Einstein, Charles Darwin
python getngrams.py aluminum, copper, steel -noprint
python getngrams.py Pearl Harbor, Watergate -corpus=eng_2009
python getngrams.py bells and whistles -startYear=1900 -endYear=2001 -smoothing=2
python getngrams.py internet --startYear=1980 --endYear=2000 --corpus=eng_2012 -caseInsensitive

More Complicated Examples

Wildcard Searches

As in the full Google Ngram Viewer, you can also perform wildcard searches using getngrams.py.

When doing a wildcard search, use the ? character instead of the * character. Using an asterisk will cause the getngrams.py script to fail because your shell will expand the asterisk before Python has a chance to see it.

python getngrams.py United ? --startYear=1850 --endYear=2000 -alldata
python getngrams.py University of ?
python getngrams.py University of ?, ? State University -alldata

Modifier Searches

Modifier searches let you see how often one word modifies another word. The usual syntax for a modifier search uses the => operator. For example, running the query dessert=>tasty would match all instances where the word tasty was used to modify the word dessert.

Modifier searches can be done using getngrams.py, but you must replace the => operator with the @ character.

python getngrams.py car@fast -startYear=1900 -endYear=2000
python getngrams.py car@fast -startYear=1900 -endYear=2000 -alldata
python getngrams.py drink@?_NOUN -startYear=1900 -endYear=2000 -alldata

For more in-depth documentation on wildcard and modifier searches, take a look at the About Ngram Viewer page.
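
The two command-line substitutions described above (? for * and @ for =>) can be sketched as a plain character replacement. This README does not show how getngrams.py performs the conversion internally, so to_viewer_query is a hypothetical helper that only illustrates the mapping back to Ngram Viewer syntax.

```python
def to_viewer_query(cli_query):
    """Translate command-line syntax to Ngram Viewer syntax.

    Hypothetical helper: assumes a direct substitution of '?' -> '*'
    and '@' -> '=>', as described in the text.
    """
    return cli_query.replace("?", "*").replace("@", "=>")


print(to_viewer_query("drink@?_NOUN"))     # drink=>*_NOUN
print(to_viewer_query("University of ?"))  # University of *
```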

Other Examples

python getngrams.py book ? hotel, book_INF a hotel --startYear=1920 --endYear=2000 -alldata
python getngrams.py read ?_DET book
python getngrams.py _DET_ bright_ADJ rainbow
python getngrams.py _START_ President ?_NOUN
python getngrams.py _ROOT_@will

Possible Corpora

eng_2012, eng_2009, eng_us_2012, eng_us_2009, eng_gb_2012, eng_gb_2009, chi_sim_2012, chi_sim_2009, fre_2012,
fre_2009, ger_2012, ger_2009, spa_2012, spa_2009, rus_2012, rus_2009, heb_2012, heb_2009, ita_2012,
eng_fiction_2012, eng_fiction_2009, eng_1m_2009

Plotting

There are two easy ways to create your own plots using a CSV file produced by running a query with getngrams.py. To demonstrate both methods, we'll run the following query:

python getngrams.py railroad,radio,television,internet -startYear=1900 -endYear=2000 -caseInsensitive

Plotting w/ xkcd.py

The first way to create a plot is to use the supplied xkcd.py script to generate awesome XKCD style charts. There are two ways to use the script:

Add the -plot option to your command when running getngrams.py:

python getngrams.py railroad,radio,television,internet -startYear=1900 -endYear=2000 -plot -caseInsensitive

You can also use xkcd.py directly by passing the CSV file as an argument:

python xkcd.py railroad_radio_television_internet-eng_2012-1900-2000-3-caseInsensitive.csv

Both methods produce the same chart:

Plotting w/ Pandas

Another way to plot data from an ngram CSV file is to read the file into a pandas DataFrame object and call its .plot() method. Here we do that, but also convert the data to percentages first and add a title to the plot:

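import matplotlib.pyplot as plt
from pandas import read_csv

# Load the CSV written by getngrams.py, using the first column as the index.
df = read_csv('railroad_radio_television_internet-eng_2012-1900-2000-3-caseInsensitive.csv',
              index_col=0,
              parse_dates=True)
# Convert every column from fractions to percentages.
df = df * 100
df.plot(title='Railroad, Radio, Television, and Internet')
# Display the figure when running as a script.
plt.show()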

License

MIT License

Moreover, PLEASE do respect the terms of service of the Google Ngram Viewer while using this code. This code is meant to help viewers retrieve data behind a few queries, not bang at Google's servers with thousands of queries. The complete dataset can be freely downloaded here. This code is not a Google product and is not endorsed by Google in any way.

With this in mind... happy plotting!


google-ngrams's Issues

My issue is that this is awesome, that is all.

Thank you for making this, so easy and simple to use.

The number of results returned by wildcard search

Hi dear author,

I notice that I can only get the top 10 results when performing a wildcard search (e.g., drink@?_NOUN). Is there any way to exceed this limit?

License

Thanks for the really nice project. Just wanted to say that you should really put an explicit license on this. Without a license, it's copyrighted. If you don't want to put a license on it, then you should say explicitly that it's in the public domain rather than saying "no license" (which is an implicit statement of copyright).

Back-end update

It looks like Google changed the back-end for returning n-gram data, and now the results are generated in a different format. I had to replace a line in getngrams.py to fix it.

Original (line 29):

res = re.findall('var data = (.*?);\\n', req.text)

New (line 29):

res = re.findall('ngrams.data = (.*?);\\n', req.text)
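
A pattern that accepts either variable name would keep working across both back-end formats. The sketch below is one way to do that. It assumes the assigned value is JSON-parseable, which this thread does not confirm, and the sample page text and the helper extract_data are made up for illustration.

```python
import json
import re

# Accept the old ("var data") and new ("ngrams.data") assignment forms.
DATA_PATTERN = re.compile(r"(?:var data|ngrams\.data) = (.*?);\n")


def extract_data(page_text):
    """Return the parsed data payload, or an empty list if none is found.

    Hypothetical helper; assumes the payload is valid JSON.
    """
    match = DATA_PATTERN.search(page_text)
    return json.loads(match.group(1)) if match else []


# Made-up page fragments in the two formats.
old_page = 'var data = [{"ngram": "radio", "timeseries": [0.1, 0.2]}];\n'
new_page = 'ngrams.data = [{"ngram": "radio", "timeseries": [0.1, 0.2]}];\n'
assert extract_data(old_page) == extract_data(new_page)
```

Escaping the dot in ngrams\.data keeps the pattern from matching unrelated text such as "ngramsXdata".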

Graphing error

Great module. Any idea, however, why it is throwing the following error when I try to graph anything?

Traceback (most recent call last):
File "xkcd.py", line 86, in <module>
plotXKCD(sys.argv[1])
File "xkcd.py", line 21, in plotXKCD
plt.xkcd(scale=2, randomness=2.75)
AttributeError: 'module' object has no attribute 'xkcd'

-caseInsensitive doesn't seem to work for n-grams where n>1

This seems to be an issue with n-grams containing "I". All n-grams containing "i" (in lower case) return a frequency value of 0 even when -caseInsensitive is added to the command. Is it that the script counts occurrences of the n-gram in any case (lower, upper, or title case) only if the provided form of the n-gram exists in the corpora?

caseInsensitive mode has problems

python google-ngrams/getngrams.py Abenaki,Apache -startYear=1950 -endYear=2000

works fine

python google-ngrams/getngrams.py Abenaki,Apache -startYear=1950 -endYear=2000 --caseInsensitive

fails to return results for Abenaki.

Is this broken?

When I run it it does claim to have saved a csv file, but the csv-file is empty. Trying -corpus=eng_2019 also throws an error. Can somebody check whether this works for them? Thanks.

Fails with queries with spaces after `,`

Queries written as first term, second term fail to return all the data. From inspecting with a python debugger, it seems like the issue is in the part of the code that prunes excess columns: it searches for what effectively is the string " second term" in a list of column names that will contain "second term" without the initial space.
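
One possible fix, sketched under the assumption that the query reaches the script as a single comma-separated string, is to strip whitespace from each term before comparing it with column names. split_terms is a hypothetical helper, not a function from getngrams.py.

```python
def split_terms(query):
    """Split a comma-separated query into terms without surrounding spaces.

    Hypothetical helper; empty terms (e.g. from a trailing comma) are dropped.
    """
    return [term.strip() for term in query.split(",") if term.strip()]


print(split_terms("first term, second term"))
# ['first term', 'second term']
```

With the terms normalized this way, a lookup for "second term" would match the column name regardless of how the query was spaced.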

No graph opening

Hi, great tool you've made here.

When I try to plot a graph with the -plot arg, nothing comes up, and the program runs to completion.

my input is this--- python3 getngrams.py einstein, darwin - plot

I'm working on a mac if that makes a difference. Any idea why this would be happening?

That is when I run it from the command line. I also tried python3 xkcd.py einstein_darwin-eng_2012-1800-2000-3-caseSensitive.csv after that but nothing from there as well.

When I call your script from Blender (I'm trying to interface my simulation code to make a graph)

It gives me some errors that look like this

Data saved to Einstein_Darwin-eng_2012-1800-2000-3-caseSensitive.csv
Unable to revert mtime: /Library/Fonts
Unable to revert mtime: /Library/Fonts/Microsoft
Traceback (most recent call last):
File "google-ngrams/xkcd.py", line 86, in <module>
plotXKCD(sys.argv[1])
File "google-ngrams/xkcd.py", line 82, in plotXKCD
fig.savefig(ngramCSVfile.replace('.csv', '.png'), dpi=190)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/figure.py", line 1421, in savefig
self.canvas.print_figure(*args, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/backend_bases.py", line 2220, in print_figure
**kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/backends/backend_agg.py", line 505, in print_png
FigureCanvasAgg.draw(self)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/backends/backend_agg.py", line 451, in draw
self.figure.draw(self.renderer)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/artist.py", line 55, in draw_wrapper
draw(artist, renderer, *args, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/figure.py", line 1034, in draw
func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/artist.py", line 55, in draw_wrapper
draw(artist, renderer, *args, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 2086, in draw
a.draw(renderer)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/artist.py", line 55, in draw_wrapper
draw(artist, renderer, *args, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axis.py", line 1093, in draw
renderer)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axis.py", line 1042, in _get_tick_bboxes
extent = tick.label1.get_window_extent(renderer)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/text.py", line 754, in get_window_extent
bbox, info, descent = self._get_layout(self._renderer)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/text.py", line 320, in _get_layout
ismath=False)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/text.py", line 312, in get_text_width_height_descent
*kl, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/backend_bases.py", line 584, in get_text_width_height_descent
font = self._text2path._get_font(prop)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/textpath.py", line 52, in _get_font
font = FT2Font(str(fname))
RuntimeError: Could not open facefile Humor-Sans.ttf; Cannot_Open_Resource

so i'm assuming it's something with the font? Have you seen this before?

Unable to process German, Chinese and Hebrew in case-insensitive mode

This can be reproduced with this query:

python getngrams.py פמיניזם:heb_2012, 女性主义:chi_sim_2012 --startYear= --endYear=2008 -caseInsensitive -smoothing=1

The problem is that these languages only return one case, so there is no (All) column, and the data is thrown away in the -AllData routine.

Some queries crash the plotting logic

One example of running the script (taken from the README) is:
python getngrams.py _START_ President ?_NOUN
This works fine and produces a CSV but if you try to alter the command to also output a plot:
python getngrams.py _START_ President ?_NOUN -plot
the xkcd.py module will throw an exception:

Traceback (most recent call last):
  File "xkcd.py", line 84, in <module>
    plotXKCD(sys.argv[1])
  File "xkcd.py", line 47, in plotXKCD
    for label in legend.get_texts():
AttributeError: 'NoneType' object has no attribute 'get_texts'
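
The traceback shows that legend is None when the loop runs. Why no legend exists for this query is not visible here, but a guard along these lines would at least avoid the crash. legend_texts is a hypothetical helper, and the only matplotlib call it relies on is Legend.get_texts(), the same one used in the failing line.

```python
def legend_texts(legend):
    """Return the legend's text objects, or an empty list if there is no legend.

    Hypothetical helper: 'legend' is whatever xkcd.py obtained for the axes,
    which the traceback shows can be None.
    """
    return [] if legend is None else legend.get_texts()


# With no legend, the styling loop simply has nothing to iterate over.
for label in legend_texts(None):
    pass
```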

Make it a module?

This script is pretty useful to me and I can think of people that might want to use it.
Have you thought of putting it up on PyPI?

Returns different values from the Google Ngram Viewer

The values that can be read from the graph at https://books.google.com/ngrams differ from the values reported back by the script. From inspecting with a debugger, it seems like the website values correspond to the (All) option that gets removed before returning the data.
