seancarmody / ngramr
R package to query the Google Ngram Viewer
License: Other
The Google site is not working as at 00:00 2020-09-20 UTC
install_github uses RCurl, which does not abide by the Internet2 proxy setting. Instructions need to be added for an alternative approach using install_local.
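A possible workaround, sketched here untested: fetch the source tarball with download.file() (which respects the standard proxy environment variables) and then install from the local file, so RCurl is never involved. The tarball URL assumes the default branch archive layout on GitHub.

```r
# Untested sketch: download the source archive first, then install it
# locally so the RCurl-based download path inside install_github is avoided.
download.file("https://github.com/seancarmody/ngramr/archive/master.tar.gz",
              destfile = "ngramr.tar.gz")
devtools::install_local("ngramr.tar.gz")
```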
Hi,
Perhaps Google has changed its site again?
ng <- ngram(c("hacker", "programmer"), year_start = 1950)
Please check Google's Ngram Viewer site is up.
Timeout was reached: [books.google.com] Send failure: Connection was aborted
ng <- ngram(c("hacker", "programmer"), year_start = 1950)
Error parsing ngram data, please contact package maintainer.
Here's the original error message:
subscript out of bounds
Error occurred in the following code:
stringr::str_split(grep("drawD3Chart", years, value = TRUE), ",")[[1]]
ngramr
is now completely broken! Google has switched to https and has changed the data format. Parsing must be completely rebuilt.
Hello, my name is Joemari. I'm submitting this issue to report what I think is a bug in the ngramr package. I was using ngramr to plot frequencies of words I could potentially use as stimuli in my independent study. However, when I ran the program in R, an error popped up: 'subscript out of bounds', or the frequency plots would just read 'NULL' in the environment. I've even copied and pasted the examples you've given in this repository and it still yields the same errors. Here is a screenshot:
The current structure of ngram.R allows a multi-corpus search of a single word, but doesn't quite cope with a combined search like the given "test:eng_2012, испытание:rus_2012". It also needlessly makes a separate HTTP request for each phrase string when it could pass several through at once (in the case of a single global corpus set).
I tried to at least parse the correct corpora when creating the data.frame, along the lines of
corpus_parsed <- regmatches(phrases, regexpr("(?<=:).*", phrases, perl=TRUE))
but didn't get much further. Maybe you can come up with something more robust than I did 👍
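A minimal sketch of the parsing idea, splitting each phrase into its word and corpus parts (this assumes every phrase follows the word:corpus form):

```r
phrases <- c("test:eng_2012", "испытание:rus_2012")
# Everything after the colon is the corpus tag...
corpus_parsed <- regmatches(phrases, regexpr("(?<=:).*", phrases, perl = TRUE))
# ...and everything before it is the phrase itself.
words <- sub(":.*$", "", phrases)
```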
I just posted the following over at your blog , but will repeat here in case you don't check that frequently (I am not online regularly myself):
I just ran install.packages("ngramr") and tried to execute the example you gave at Github, i.e.,
ng <- ngram(c("hacker", "programmer"), year_start = 1950)
R complains that
Error: 'relocate' is not an exported object from 'namespace:dplyr'
I have dplyr and went ahead and tried
library(dplyr)
and re-execute the example line, but got the same error.
I have just started using R so may be missing some obvious solution about making the desired object "relocate" available. However, off the top of my head, it seems "relocate" would need a dplyr:: prefix or the like.
Looking forward to seeing this work! I have been doing Ngram and other linguistic work lately (whew---the Google Million raw datasets are really bad: full of foreign unicode, frequency counts for all-caps versions of the same lowercase words, a real pain to use for any serious work).
Am running R version 3.6.1 (2019-07-05) on Ubuntu Linux 18.04 on a Dell Precision laptop. I have all the packages listed in your August 24, 2020 document:
Imports httr, rlang, RCurl, dplyr, cli, tibble, tidyr, rjson, stringr,
ggplot2, scales, xml2, textutils, lifecycle
Thanks,
Dalton
At the moment smoothing is set to 3 by default, consistent with the default on the Google Ngram Viewer page. While that works well with lines, now that ggram accepts arbitrary geoms, this smoothing default is not so good with step or point. Should the default be changed to 0? Should the default be 3 for line and 0 for other geoms?
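The second compromise above could be expressed with a small helper like the following (a hypothetical sketch, not part of the package):

```r
# Hypothetical default: keep Google's smoothing of 3 for line geoms,
# but fall back to 0 for step, point and anything else.
default_smoothing <- function(geom) {
  if (identical(geom, "line")) 3 else 0
}
```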
Hi, and thanks for creating this package! I'm having trouble getting it to work for me, and I hope you can offer some advice.
I've got it to the point where it downloads and plots the ngram data for me, but the plot really doesn't resemble the equivalent (?) plot I'm getting from Google. The Google graph is here:
I've tried to reproduce it with the following code, borrowed / modified from Daisung Jang's tutorial at https://daisungjang.com/tutorial/Ngram_tutorial.html:
library(ngramr)
data <- as.data.frame(matrix(ncol=1, nrow=109))
data$V1 <- seq(from=1900, to=2008)
names(data)[names(data)=="V1"] <- "Year"
search_terms <- c("international order", "international institutions", "international regimes")
for(i in 1:length(search_terms)){
# Get each search term and store those in objects
term <- search_terms[i]
# Search for the term in the English 2019 corpus, from 1900 to 2008,
# then store the output in a data frame
temp <- ngram(term, year_start = 1900, corpus="eng_2019", smoothing = 2)
# Merge the ngram data with the data frame created in step 1, matching by year
data <- merge(data, temp[,c("Year", "Frequency")], by ="Year", all.x=TRUE)
# Rename the added column after the search term
colname <- paste(term, sep="")
names(data)[names(data)=="Frequency"] <- colname
}
data_long <- reshape(data,
varying = c("international order", "international institutions", "international regimes"),
v.names = "Frequency",
timevar = "search_term",
times = c("international order", "international institutions", "international regimes"),
direction = "long")
library(ggplot2)
p <- ggplot(data_long, aes(x=Year, y=Frequency, group=search_term))
p + geom_line(aes(colour = search_term))
As you'll see, the trends look very different. In the Google version, there's a surge in the use of the phrase "international institutions" after WWII; in the R version, there's a nearly identical surge, but in the use of a different phrase, "international order." That term then more or less flatlines in the Google version but continues to climb in the R version. The curve for "international regimes" is approximately right, but not exactly, and it maps to about the same y-axis scale as it does on the Google version, while the others appear to be on very different scales.
All in all, there are enough similarities to make me suspect that I'm more or less on the right track, but the overall pictures are dramatically different. I've tried varying all the ngramr parameters that I can find, but no combination I've tried produces a graph that looks like Google's. Any help appreciated, and apologies in advance if this is a me problem.
Hi there. I've encountered the following error when making a standard query.
> library(ngramr)
>
> ng <- ngrami("human capital", year_start = 1800, smoothing = 0)
Error parsing ngram data, please contact package maintainer.
Here's the original error message:
subscript out of bounds
Error occurred in the following code:
stringr::str_split(grep("drawD3Chart", years, value = TRUE), ",")[[1]]
Any idea how to resolve this?
Thanks for maintaining such a nice package!
I have submitted it. Fingers crossed!
It looks as though a recent change from the RCurl package to the curl package can affect internet access behind a proxy.
As an example, Google returns the following message
Note: the Ngram Viewer treats quotation marks literally.
Replaced door-handle with door - handle to match how we processed the books.
Replaced host's with host 's to match how we processed the books.
with this query.
Hi, your great package seems not to be working. Many thanks for maintaining this package.
ng <- ngram(c("hacker", "programmer"), year_start = 1950)
Error parsing ngram data, please contact package maintainer.
Here's the original error message:
subscript out of bounds
Error occurred in the following code:
stringr::str_split(grep("drawD3Chart", years, value = TRUE), ",")[[1]]
Changing to tidyr: this code needs to be refactored.
Getting a little ahead of things since Yosemite is still in beta, but ... calls to ngram() fail when running on Yosemite with the following error:
> library(ngramr)
> ngram("programmer")
Error in function (type, msg, asError = TRUE) :
SSL: certificate verification failed (result: 5)
Not a showstopper since I still have 10.9 systems, of course, but thought you might want to know. Thanks for all your work on the package!
Here's information about my setup. Short version: R 3.1.1, RStudio 0.98.1028, Mac OS 10.10, ngramr 1.4.3. Full output of version
:
platform x86_64-apple-darwin13.1.0
arch x86_64
os darwin13.1.0
system x86_64, darwin13.1.0
status
major 3
minor 1.1
year 2014
month 07
day 10
svn rev 66115
language R
version.string R version 3.1.1 (2014-07-10)
nickname Sock it to Me
Thank you so much for providing this package! I am looking for a way to download the frequency of many words at the same time and to get information about their position in the sentences where they are used. For instance, I would like to be able to compare the frequency of "apple" being used as a subject or as an object. Is there a possibility to adjust your code to do so and, further, to also proceed with more calculations than just plotting the frequencies?
All the best and thank you in advance!
Just installed the package today, and ran into this error. Is this to do with my installation, or perhaps Google have changed something on their end?
Error in fromJSON(sub(".*=", "", html[data_line])) : CHAR() can only be applied to a 'CHARSXP', not a 'NULL'
Traceback:
8. fromJSON(sub(".*=", "", html[data_line]))
7. ngram_parse(html)
6. ngram_fetch(phrase, corpus_n, case_ins, ...)
5. ngram_single(phrases, corpus = corp, year_start = year_start, year_end = year_end, smoothing = smoothing, tag = tag, case_ins)
4. FUN(X[[i]], ...)
3. lapply(corpus, function(corp) ngram_single(phrases, corpus = corp, year_start = year_start, year_end = year_end, smoothing = smoothing, tag = tag, case_ins))
2. ngram(phrases, ...)
2. ngram(phrases, ...)
I get this error while running my code:
Warning: NAs introduced by coercion
Error in stringr::str_split(grep("drawD3Chart", json, value = TRUE), ",")[[1]] :
subscript out of bounds
All my phrases have plots in the Ngram Viewer. My code was working a week ago; now it just sends me errors.
Hi,
In 1.5.x, the syntax below worked just fine for me
ging <- paste0(ip, " *")
ng <- ngramr::ngram(ging, year_start = 1950)
where ip is a string so that ging could be (say)
"a bachelor's degree in *".
If I go to Google & the ngram viewer, this string
will return results.
But from R, with 1.7.x of the package, I'm getting
the warning
The characters +, -, *, / require parentheses to be interpreted as a composition.
Have since tried
"(a bachelor's degree in) *"
"((a bachelor's degree in) *)"
& others, but without success.
In the new google website, a string like this
(The United States is + The United States has) / The United States
is replaced with
((The United States is + The United States has) / The United States)
However, the former will result in an error in ngramr. This condition needs to be trapped and the outer parentheses added.
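A rough sketch of how the condition might be trapped (the helper name is made up, and the "already wrapped" test below is naive; the real fix belongs in the package's query-building code):

```r
# If a phrase uses a composition operator but is not already fully
# parenthesised, wrap it in outer parentheses before sending the query.
# "-" is only treated as an operator when surrounded by spaces, so
# hyphenated phrases are left alone.
wrap_composition <- function(phrase) {
  has_operator    <- grepl("[+*/]|(\\s-\\s)", phrase)
  already_wrapped <- grepl("^\\(.*\\)$", phrase)
  if (has_operator && !already_wrapped) paste0("(", phrase, ")") else phrase
}
```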
Axis labels and gridlines are mis-aligned.
This doesn't seem right - data needs to be corrected.
With the change to using RCurl to access the Google SSL pages, ngramr no longer works behind a proxy. Options need to be added to configure the proxy.
Thanks for the README update @briatte. Where/how are you hosting the images?
I had added a require(scales) line in ggram, as suggested in an earlier issue raised by @briatte, but that did not pass R CMD check. Is it really necessary? Is there another way to deal with the issue?
The Google Ngram Viewer allows some fancy phrases such as "fancy=>pants". In the Javascript this comes back as ''fancy\u003D\u003Epants" and so I need to convert this Unicode encoding back to ASCII. I am stumped.
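One base-R approach, sketched here on the assumption that the escapes arrive as literal backslash-u sequences in the scraped Javascript: find each \uXXXX escape and substitute the character it encodes.

```r
# Replace every literal \uXXXX escape in a string with the character
# whose code point is XXXX (hex).
unescape_unicode <- function(x) {
  m <- gregexpr("\\\\u[0-9a-fA-F]{4}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    if (length(esc) == 0) return(esc)
    # Drop the leading "\u", parse the hex digits, convert to characters
    intToUtf8(strtoi(substring(esc, 3), 16L), multiple = TRUE)
  })
  x
}
escaped <- "fancy\\u003D\\u003Epants"
decoded <- unescape_unicode(escaped)
```

If an extra dependency is acceptable, stringi::stri_unescape_unicode() does the same job in one call (and stringr already depends on stringi).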
As well as a corpus argument, it would be good to be able to specify the edition (2009 or 2012) and the language (e.g. eng) separately.
Currently reads
require(devtools)
install_github("ngramr", "seancarmody")
require(ngramr)
should read
require(devtools)
install_github("seancarmody/ngramr")
require(ngramr)
Hi there,
first of all, thank you so much for this library!! I discovered it recently and it really made my month!!!! :-)
I am opening this issue because I am encountering a connection error when trying to run ngram() within a loop over a large number of ngrams.
I have more than 600 ngrams to go through, and a 'for' loop would work perfectly for me, except that the connection to the books.google.com page seems to be lost after about 75 ngrams.
I thought you may have a hint for me to try and fix this issue...
Any clue would be greatly appreciated!
Thanks again!
Here is the error message I get
Error in open.connection(x, "rb") :
cannot open the connection to 'https://books.google.com/ngrams/graph?content=lenteur+s%27oppose+%C3%A0+la+rapidit%C3%A9&corpus=30&year_start=1800&year_end=2019&smoothing=0&case_insensitive=on'
Here is some reproducible code:
Ngram_data <- structure(list(id_overall = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100), Xgram_pattern = c("écolier marche sur la banquise",
"métal sert à faire des bouchons", "canaille utilise sa carapace",
"chiffre se multiplie par la largeur", "dicton est un proverbe",
"éléphant utilise sa trompe", "entaille est une coupure", "estomac permet la digestion",
"seau permet un bon arrosage", "châle est mis pour le carnaval",
"mouches envahissent ce camping", "gratin est garni de jambon",
"entretien donne suite à une embauche", "formule est utilisée en mathématiques",
"cigale demande à la fourmi", "habitudes entrainent une routine",
"blason est un aigle", "main remplace la rame", "terreau se mélange au thym",
"papyrus est une copie", "batterie remplace les piles", "beurre est du gras",
"biche donne naissance à un faon", "bougie s'écoule de la cire",
"disgrâce n'est pas une qualité", "matériel mesure la vitesse",
"compère emmène sa maîtresse", "benne ramasse les ordures",
"gruyère est rangé avec les fromages", "brevet est obtenu à la fin du collège",
"tour est construite par une araignée", "torchons sont rangés avec les serviettes",
"enceinte regorge de lions", "garçon indique la note", "matière est utilisée avec la règle",
"sourd est également muet", "soutien est une aide", "tribunal est le lieu de travail des juges",
"caoutchouc sert à fermer le pot", "viande est du boeuf", "préposés ordonnent aux enfants",
"tache provient du soleil", "évènement a été provoqué par une voiture",
"fichier est constitué de plusieurs pièces", "cendrier contient des cigarettes",
"cernes traduisent un manque de sommeil", "nid a été construit par un oiseau",
"tarot se pratique avec des cartes", "voisine lit mon avenir",
"album contient des photos", "coffret a été ouvert à la mort",
"dessin représente son amour", "minuteur sonne le matin", "voisin n'est pas un ami",
"songe est un rêve", "plombage est posé sur la dent", "épouse demande à son mari",
"oncle se promène avec ma tante", "virtuoses éclairent la nuit",
"bague entoure le doigt", "cageot regorge de raisin", "mouvement se fait dans la piscine",
"bête mange des carottes", "vermine a mangé la noisette", "flageolet est un haricot",
"framboise ressemble à une fraise", "gigot provient d'un agneau",
"goéland ressemble à une mouette", "loisir se joue avec des baguettes",
"chimère utilise ses tentacules", "village est construit par des esquimaux",
"barque permet le sauvetage", "informaticien répare des ordinateurs",
"innocent s'oppose à un coupable", "hamac permet de faire une sieste",
"lenteur s'oppose à la rapidité", "personne est une star",
"insecte est un ver", "homme utilise la magie", "planche permet une nage",
"caleçon remplace le slip", "canards se baignent dans la mare",
"challenge est un défi", "chouette diffère du hibou", "truc est un couteau",
"plaintes témoignent de sa tristesse", "politiciens marquent les limites",
"patron indique aux employés", "bombe entraine une explosion",
"arc est utilisé avec des flèches", "congrès précède le vendredi",
"firme est une entreprise", "bracelet est fabriqué avec du sucre",
"place accueille les fous", "salon est un bar", "tétine est saisie par le bébé",
"rosier produit des roses", "beau retrouve sa belle", "terrain est un parc",
"carré diffère d'un rond"), Xgram_dependencies = c("banquise=>écolier",
"bouchons=>métal", "carapace=>canaille", "largeur=>chiffre",
"proverbe=>dicton", "trompe=>éléphant", "coupure=>entaille",
"digestion=>estomac", "arrosage=>seau", "carnaval=>châle", "camping=>mouches",
"jambon=>gratin", "embauche=>entretien", "mathématiques=>formule",
"fourmi=>cigale", "routine=>habitudes", "aigle=>blason", "rame=>main",
"thym=>terreau", "copie=>papyrus", "piles=>batterie", "gras=>beurre",
"faon=>biche", "cire=>bougie", "qualité=>disgrâce", "vitesse=>matériel",
"maîtresse=>compère", "ordures=>benne", "fromages=>gruyère",
"collège=>brevet", "araignée=>tour", "serviettes=>torchons",
"lions=>enceinte", "note=>garçon", "règle=>matière", "muet=>sourd",
"aide=>soutien", "juges=>tribunal", "pot=>caoutchouc", "boeuf=>viande",
"enfants=>préposés", "soleil=>tache", "voiture=>évènement",
"pièces=>fichier", "cigarettes=>cendrier", "sommeil=>cernes",
"oiseau=>nid", "cartes=>tarot", "avenir=>voisine", "photos=>album",
"mort=>coffret", "amour=>dessin", "matin=>minuteur", "ami=>voisin",
"rêve=>songe", "dent=>plombage", "mari=>épouse", "tante=>oncle",
"nuit=>virtuoses", "doigt=>bague", "raisin=>cageot", "piscine=>mouvement",
"carottes=>bête", "noisette=>vermine", "haricot=>flageolet",
"fraise=>framboise", "agneau=>gigot", "mouette=>goéland", "baguettes=>loisir",
"tentacules=>chimère", "esquimaux=>village", "sauvetage=>barque",
"ordinateurs=>informaticien", "coupable=>innocent", "sieste=>hamac",
"rapidité=>lenteur", "star=>personne", "ver=>insecte", "magie=>homme",
"nage=>planche", "slip=>caleçon", "mare=>canards", "défi=>challenge",
"hibou=>chouette", "couteau=>truc", "tristesse=>plaintes", "limites=>politiciens",
"employés=>patron", "explosion=>bombe", "flèches=>arc", "vendredi=>congrès",
"entreprise=>firme", "sucre=>bracelet", "fous=>place", "bar=>salon",
"bébé=>tétine", "roses=>rosier", "belle=>beau", "parc=>terrain",
"rond=>carré")), row.names = c(NA, -100L), class = c("tbl_df",
"tbl", "data.frame"))
# extract and store Google Ngram frequencies for each Ngram_pattern
for (i in 1:nrow(Ngram_data))
{
# launch Google Ngram query and extract the results that are displayed with viewer
google_ngram_tmp <- ngram(Ngram_data$Xgram_pattern[i],
corpus = "fre_2019",
year_start = 1800,
year_end = 2019,
smoothing = 0,
case_ins = TRUE,
aggregate = TRUE)
closeAllConnections()
# returns the Ngram index at each iteration to keep track of the computation
print (paste("Ngram index", i, sep = ": ") )
# print(showConnections(all = FALSE))
}
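One thing worth trying against the loop above (an untested sketch; the pause and retry values are guesses): wait between requests and retry a failed query a few times before giving up, so one dropped connection doesn't kill the whole run.

```r
# Hypothetical retry wrapper around ngram(): back off and retry a failed
# request a few times before moving on to the next pattern.
fetch_with_retry <- function(pattern, tries = 3, pause = 5) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(
      ngram(pattern, corpus = "fre_2019", year_start = 1800,
            year_end = 2019, smoothing = 0, case_ins = TRUE,
            aggregate = TRUE),
      error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(pause * attempt)  # wait a little longer after each failure
  }
  NULL  # give up on this pattern after 'tries' attempts
}
```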
Hello,
Great package and has been very helpful already! One minor bug, when using ngram(..., case_ins = TRUE) the correct data is extracted however the automatic output still states "Case-sensitive: TRUE".
This may be intentional, since the frequencies are indeed case-sensitive: it just extracts both the frequency for lowercase and the frequency for uppercase. I just thought I'd mention it because I'm not sure what this part of the output would be used for otherwise. Thanks for your contribution to the R community!
Blessings,
David
I'm trying to use the 'counts' parameter to derive the non-fiction frequencies from the eng_2019 and eng_fiction_2019 data. My assumption was that the eng_fiction_2019 count would always be less than or equal to the eng_2019 count. This does not appear to be the case in all instances.
I'm also assuming that count/frequency is the total and that the differences between the counts and totals allows me to calculate the non-fiction frequency.
Have I got this wrong?
My test case is (html + HTML).
Cheers,
Andrew
I have tried install.packages() and install_local().
All return an error that this is not a macOS binary package:
Error in install.packages : file ‘/var/folders/dq/kg1jg0jj7pldyf4k_30pqgg80000gn/T//Rtmp3rwJw3/downloaded_packages/cli_3.4.1.tgz’ is not a macOS binary package
I am using R 4.1
This simple call from the README seems to now systematically return a NULL:
ng <- ngram(c("hacker", "programmer"), year_start = 1950)
I have tried the current dev version 1.9.2 as well as 1.9.1 and 1.9.0, with the same result.
It seems to me that the query gets formed properly; I was able to land on the right page using the query that was generated. Pulling the html data does not generate an error, and the html itself appears to hold the data we want to extract. It's when the html data is fed into the function ngram_fetch_data that there apparently is an issue. I don't know html/xml well enough to identify the problem, but I presume that Google must have changed something.
Perhaps you could add this function to the other ggplot2 helpers in the GGally package.
It would be interesting if ggram
could get an ignore.case = TRUE
option that would call ngrami
instead of ngram
. I am not sure I understand the code well enough to implement this (the smoothing
and aggregate
arguments are mysterious to me).
When I run these commands:
library(ngramr)
ng <- ngram(c("hacker", "programmer"), year_start = 1950)
I get this message:
Error in stringr::str_split(grep("drawD3Chart", json, value = TRUE), ",")[[1]] :
subscript out of bounds
In addition: Warning message:
In ngram_fetch_data(html) : NAs introduced by coercion