seancarmody / ngramr
R package to query the Google Ngram Viewer
License: Other
The Google site is not working as at 00:00 2020-09-20 UTC
install_github uses RCurl, which does not abide by the Internet2 proxy setting. Instructions need to be added for an alternative approach using install_local.
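A possible workaround, sketched here untested: fetch the source tarball with download.file() (which respects the standard proxy environment variables) and then install from the local file, so RCurl is never involved. The tarball URL assumes the default branch archive layout on GitHub.

```r
# Untested sketch: download the source archive first, then install it
# locally so the RCurl-based download path inside install_github is avoided.
download.file("https://github.com/seancarmody/ngramr/archive/master.tar.gz",
              destfile = "ngramr.tar.gz")
devtools::install_local("ngramr.tar.gz")
```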
Hi,
Perhaps Google has changed its site again?
ng <- ngram(c("hacker", "programmer"), year_start = 1950)
Please check Google's Ngram Viewer site is up.
Timeout was reached: [books.google.com] Send failure: Connection was aborted
ng <- ngram(c("hacker", "programmer"), year_start = 1950)
Error parsing ngram data, please contact package maintainer.
Here's the original error message:
subscript out of bounds
Error occurred in the following code:
stringr::str_split(grep("drawD3Chart", years, value = TRUE), ",")[[1]]
ngramr
is now completely broken! Google has switched to https and has changed the data format. Parsing must be completely rebuilt.
Hello, my name is Joemari. I'm submitting this issue to report what I think is a bug in the ngramr package. I was using ngramr to plot frequencies of words I could potentially use as stimuli in my independent study. However, when I ran the program in R, an error popped up: 'subscript out of bounds', or the frequency plots would just read 'NULL' in the environment. I've even copied and pasted the examples you've given in this repository and it still yields the same errors. Here is a screenshot:
The current structure of ngram.R allows a multi-corpus search of a single word, but doesn't quite cope with a combined search like the given "test:eng_2012, испытание:rus_2012". It also needlessly makes a separate HTTP request for each phrase string when it could pass several through at once (in the case of a single global corpus set).
I tried to at least parse the correct corpora when creating the data.frame, along the lines of
corpus_parsed <- regmatches(phrases, regexpr("(?<=:).*", phrases, perl=TRUE))
but didn't get much further. Maybe you can come up with something more robust than I did 👍
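A minimal sketch of the parsing idea, splitting each phrase into its word and corpus parts (this assumes every phrase follows the word:corpus form):

```r
phrases <- c("test:eng_2012", "испытание:rus_2012")
# Everything after the colon is the corpus tag...
corpus_parsed <- regmatches(phrases, regexpr("(?<=:).*", phrases, perl = TRUE))
# ...and everything before it is the phrase itself.
words <- sub(":.*$", "", phrases)
```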
I just posted the following over at your blog , but will repeat here in case you don't check that frequently (I am not online regularly myself):
I just ran install.packages("ngramr") and tried to execute the example you gave at Github, i.e.,
ng <- ngram(c("hacker", "programmer"), year_start = 1950)
R complains that
Error: 'relocate' is not an exported object from 'namespace:dplyr'
I have dplyr and went ahead and tried
library(dplyr)
and re-execute the example line, but got the same error.
I have just started using R so may be missing some obvious solution about making the desired object "relocate" available. However, off the top of my head, it seems "relocate" would need a dplyr:: prefix or the like.
Looking forward to seeing this work! I have been doing Ngram and other linguistic work lately (whew---the Google Million raw datasets are really bad: full of foreign unicode, frequency counts for all-caps versions of the same lowercase words, a real pain to use for any serious work).
Am running R version 3.6.1 (2019-07-05) on Ubuntu Linux 18.04 on a Dell Precision laptop. I have all the packages listed in your August 24, 2020 document:
Imports httr, rlang, RCurl, dplyr, cli, tibble, tidyr, rjson, stringr,
ggplot2, scales, xml2, textutils, lifecycle
Thanks,
Dalton
At the moment smoothing is set to 3 by default, consistent with the default on the Google Ngram Viewer page. While that works well with lines, now that ggram accepts arbitrary geoms, this smoothing default is not so good with step or point. Should the default be changed to 0? Should the default be 3 for line and 0 for other geoms?
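The second compromise above could be expressed with a small helper like the following (a hypothetical sketch, not part of the package):

```r
# Hypothetical default: keep Google's smoothing of 3 for line geoms,
# but fall back to 0 for step, point and anything else.
default_smoothing <- function(geom) {
  if (identical(geom, "line")) 3 else 0
}
```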
Hi, and thanks for creating this package! I'm having trouble getting it to work for me, and I hope you can offer some advice.
I've got it to the point where it downloads and plots the ngram data for me, but the plot really doesn't resemble the equivalent (?) plot I'm getting from Google. The Google graph is here:
I've tried to reproduce it with the following code, borrowed / modified from Daisung Jang's tutorial at https://daisungjang.com/tutorial/Ngram_tutorial.html:
library(ngramr)
data <- as.data.frame(matrix(ncol=1, nrow=109))
data$V1 <- seq(from=1900, to=2008)
names(data)[names(data)=="V1"] <- "Year"
search_terms <- c("international order", "international institutions", "international regimes")
for(i in 1:length(search_terms)){
# Get each search term and store those in objects
term <- search_terms[i]
# Search for the term in the English 2019 corpus, from 1900 to 2008,
# then store the output in a data frame
temp <- ngram(term, year_start = 1900, corpus="eng_2019", smoothing = 2)
# Merge the ngram data with the data frame created in step 1, matching by year
data <- merge(data, temp[,c("Year", "Frequency")], by ="Year", all.x=TRUE)
# Rename the added column after the search term
colname <- paste(term, sep="")
names(data)[names(data)=="Frequency"] <- colname
}
data_long <- reshape(data,
varying = c("international order", "international institutions", "international regimes"),
v.names = "Frequency",
timevar = "search_term",
times = c("international order", "international institutions", "international regimes"),
direction = "long")
library(ggplot2)
p <- ggplot(data_long, aes(x=Year, y=Frequency, group=search_term))
p + geom_line(aes(colour = search_term))
As you'll see, the trends look very different. In the Google version, there's a surge in the use of the phrase "international institutions" after WWII; in the R version, there's a nearly identical surge, but in the use of a different phrase, "international order." That term then more or less flatlines in the Google version but continues to climb in the R version. The curve for "international regimes" is approximately right, but not exactly, and it maps to about the same y-axis scale as it does on the Google version, while the others appear to be on very different scales.
All in all, there are enough similarities to make me suspect that I'm more or less on the right track, but the overall pictures are dramatically different. I've tried varying all the ngramr parameters that I can find, but no combination I've tried produces a graph that looks like Google's. Any help appreciated, and apologies in advance if this is a me problem.
Hi there. I've encountered the following error when making a standard query.
> library(ngramr)
>
> ng <- ngrami("human capital", year_start = 1800, smoothing = 0)
Error parsing ngram data, please contact package maintainer.
Here's the original error message:
subscript out of bounds
Error occurred in the following code:
stringr::str_split(grep("drawD3Chart", years, value = TRUE), ",")[[1]]
Any idea how to resolve this?
Thanks for maintaining such a nice package!
I have submitted it. Fingers crossed!
It looks as though a recent change from the RCurl package to the curl package can affect internet access behind a proxy.
As an example, Google returns the following message
Note: the Ngram Viewer treats quotation marks literally.
Replaced door-handle with door - handle to match how we processed the books.
Replaced host's with host 's to match how we processed the books.
with this query.
Hi, your great package seems not to be working. Many thanks for maintaining this package.
ng <- ngram(c("hacker", "programmer"), year_start = 1950)
Error parsing ngram data, please contact package maintainer.
Here's the original error message:
subscript out of bounds
Error occurred in the following code:
stringr::str_split(grep("drawD3Chart", years, value = TRUE), ",")[[1]]
Changing to tidyr: this code needs to be refactored.
Getting a little ahead of things since Yosemite is still in beta, but ... calls to ngram() fail when running on Yosemite with the following error:
> library(ngramr)
> ngram("programmer")
Error in function (type, msg, asError = TRUE) :
SSL: certificate verification failed (result: 5)
Not a showstopper since I still have 10.9 systems, of course, but thought you might want to know. Thanks for all your work on the package!
Here's information about my setup. Short version: R 3.1.1, RStudio 0.98.1028, Mac OS 10.10, ngramr 1.4.3. Full output of version
:
platform x86_64-apple-darwin13.1.0
arch x86_64
os darwin13.1.0
system x86_64, darwin13.1.0
status
major 3
minor 1.1
year 2014
month 07
day 10
svn rev 66115
language R
version.string R version 3.1.1 (2014-07-10)
nickname Sock it to Me
Thank you so much for providing this package! I am looking for a way to download the frequency of many words at the same time and to get information about their position in the sentences where they are used. For instance, I would like to be able to compare the frequency of "apple" being used as a subject or as an object. Is there a possibility to adjust your code to do so and, further, to also proceed with more calculations than just plotting the frequencies?
All the best and thank you in advance!
Just installed the package today, and ran into this error. Is this to do with my installation, or perhaps Google have changed something on their end?
Error in fromJSON(sub(".*=", "", html[data_line])) : CHAR() can only be applied to a 'CHARSXP', not a 'NULL'
Traceback:
8. fromJSON(sub(".*=", "", html[data_line]))
7. ngram_parse(html)
6. ngram_fetch(phrase, corpus_n, case_ins, ...)
5. ngram_single(phrases, corpus = corp, year_start = year_start, year_end = year_end, smoothing = smoothing, tag = tag, case_ins)
4. FUN(X[[i]], ...)
3. lapply(corpus, function(corp) ngram_single(phrases, corpus = corp, year_start = year_start, year_end = year_end, smoothing = smoothing, tag = tag, case_ins))
2. ngram(phrases, ...)
2. ngram(phrases, ...)
I get this error while running my code:
Warning: NAs introduced by coercion
Error in stringr::str_split(grep("drawD3Chart", json, value = TRUE), ",")[[1]] :
subscript out of bounds
All my phrases have plots in the Ngram Viewer. My code was working a week ago; now it just sends me errors.
Hi,
In 1.5.x, the syntax below worked just fine for me
ging <- paste0(ip, " *")
ng <- ngramr::ngram(ging, year_start = 1950)
where ip is a string so that ging could be (say)
"a bachelor's degree in *".
If I go to Google & the ngram viewer, this string
will return results.
But from R, with 1.7.x of the package, I'm getting
the warning
The characters +, -, *, / require parentheses to be interpreted as a composition.
Have since tried
"(a bachelor's degree in) *"
"((a bachelor's degree in) *)"
& others, but without success.
In the new google website, a string like this
(The United States is + The United States has) / The United States
is replaced with
((The United States is + The United States has) / The United States)
However, the former will result in an error in ngramr. This condition needs to be trapped and the outer parentheses added.
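A rough sketch of how the condition might be trapped (the helper name is made up, and the "already wrapped" test below is naive; the real fix belongs in the package's query-building code):

```r
# If a phrase uses a composition operator but is not already fully
# parenthesised, wrap it in outer parentheses before sending the query.
# "-" is only treated as an operator when surrounded by spaces, so
# hyphenated phrases are left alone.
wrap_composition <- function(phrase) {
  has_operator    <- grepl("[+*/]|(\\s-\\s)", phrase)
  already_wrapped <- grepl("^\\(.*\\)$", phrase)
  if (has_operator && !already_wrapped) paste0("(", phrase, ")") else phrase
}
```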
Axis labels and gridlines are mis-aligned.
This doesn't seem right - data needs to be corrected.
With the change to using RCurl to access the Google SSL pages, ngramr no longer works behind a proxy. Options need to be added to configure the proxy.
Thanks for the README update @briatte. Where/how are you hosting the images?
I had added a require(scales) line in ggram, as suggested in an earlier issue raised by @briatte, but that did not pass R CMD check. Is it really necessary? Is there another way to deal with the issue?
The Google Ngram Viewer allows some fancy phrases such as "fancy=>pants". In the Javascript this comes back as ''fancy\u003D\u003Epants" and so I need to convert this Unicode encoding back to ASCII. I am stumped.
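One base-R approach, sketched here on the assumption that the escapes arrive as literal backslash-u sequences in the scraped Javascript: find each \uXXXX escape and substitute the character it encodes.

```r
# Replace every literal \uXXXX escape in a string with the character
# whose code point is XXXX (hex).
unescape_unicode <- function(x) {
  m <- gregexpr("\\\\u[0-9a-fA-F]{4}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    if (length(esc) == 0) return(esc)
    # Drop the leading "\u", parse the hex digits, convert to characters
    intToUtf8(strtoi(substring(esc, 3), 16L), multiple = TRUE)
  })
  x
}
escaped <- "fancy\\u003D\\u003Epants"
decoded <- unescape_unicode(escaped)
```

If an extra dependency is acceptable, stringi::stri_unescape_unicode() does the same job in one call (and stringr already depends on stringi).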
As well as a corpus argument, it would be good to be able to specify the edition (2009 or 2012) and the language (e.g. eng) separately.
Currently reads
require(devtools)
install_github("ngramr", "seancarmody")
require(ngramr)
should read
require(devtools)
install_github("seancarmody/ngramr")
require(ngramr)
Hi there,
first of all, thank you so much for this library!! I discovered it recently and it really made my month!!!! :-)
I am opening this issue because I am encountering a connection error when trying to run ngram() within a loop over a large number of ngrams.
I have more than 600 ngrams to go through, and a 'for' loop would work perfectly for me, except that the connection to the books.google.com page seems to be lost after about 75 ngrams.
I thought you may have a hint for me to try and fix this issue...
Any clue would be greatly appreciated!
Thanks again!
Here is the error message I get
Error in open.connection(x, "rb") :
cannot open the connection to 'https://books.google.com/ngrams/graph?content=lenteur+s%27oppose+%C3%A0+la+rapidit%C3%A9&corpus=30&year_start=1800&year_end=2019&smoothing=0&case_insensitive=on'
Here is some reproducible code:
Ngram_data <- structure(list(id_overall = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100), Xgram_pattern = c("écolier marche sur la banquise",
"métal sert à faire des bouchons", "canaille utilise sa carapace",
"chiffre se multiplie par la largeur", "dicton est un proverbe",
"éléphant utilise sa trompe", "entaille est une coupure", "estomac permet la digestion",
"seau permet un bon arrosage", "châle est mis pour le carnaval",
"mouches envahissent ce camping", "gratin est garni de jambon",
"entretien donne suite à une embauche", "formule est utilisée en mathématiques",
"cigale demande à la fourmi", "habitudes entrainent une routine",
"blason est un aigle", "main remplace la rame", "terreau se mélange au thym",
"papyrus est une copie", "batterie remplace les piles", "beurre est du gras",
"biche donne naissance à un faon", "bougie s'écoule de la cire",
"disgrâce n'est pas une qualité", "matériel mesure la vitesse",
"compère emmène sa maîtresse", "benne ramasse les ordures",
"gruyère est rangé avec les fromages", "brevet est obtenu à la fin du collège",
"tour est construite par une araignée", "torchons sont rangés avec les serviettes",
"enceinte regorge de lions", "garçon indique la note", "matière est utilisée avec la règle",
"sourd est également muet", "soutien est une aide", "tribunal est le lieu de travail des juges",
"caoutchouc sert à fermer le pot", "viande est du boeuf", "préposés ordonnent aux enfants",
"tache provient du soleil", "évènement a été provoqué par une voiture",
"fichier est constitué de plusieurs pièces", "cendrier contient des cigarettes",
"cernes traduisent un manque de sommeil", "nid a été construit par un oiseau",
"tarot se pratique avec des cartes", "voisine lit mon avenir",
"album contient des photos", "coffret a été ouvert à la mort",
"dessin représente son amour", "minuteur sonne le matin", "voisin n'est pas un ami",
"songe est un rêve", "plombage est posé sur la dent", "épouse demande à son mari",
"oncle se promène avec ma tante", "virtuoses éclairent la nuit",
"bague entoure le doigt", "cageot regorge de raisin", "mouvement se fait dans la piscine",
"bête mange des carottes", "vermine a mangé la noisette", "flageolet est un haricot",
"framboise ressemble à une fraise", "gigot provient d'un agneau",
"goéland ressemble à une mouette", "loisir se joue avec des baguettes",
"chimère utilise ses tentacules", "village est construit par des esquimaux",
"barque permet le sauvetage", "informaticien répare des ordinateurs",
"innocent s'oppose à un coupable", "hamac permet de faire une sieste",
"lenteur s'oppose à la rapidité", "personne est une star",
"insecte est un ver", "homme utilise la magie", "planche permet une nage",
"caleçon remplace le slip", "canards se baignent dans la mare",
"challenge est un défi", "chouette diffère du hibou", "truc est un couteau",
"plaintes témoignent de sa tristesse", "politiciens marquent les limites",
"patron indique aux employés", "bombe entraine une explosion",
"arc est utilisé avec des flèches", "congrès précède le vendredi",
"firme est une entreprise", "bracelet est fabriqué avec du sucre",
"place accueille les fous", "salon est un bar", "tétine est saisie par le bébé",
"rosier produit des roses", "beau retrouve sa belle", "terrain est un parc",
"carré diffère d'un rond"), Xgram_dependencies = c("banquise=>écolier",
"bouchons=>métal", "carapace=>canaille", "largeur=>chiffre",
"proverbe=>dicton", "trompe=>éléphant", "coupure=>entaille",
"digestion=>estomac", "arrosage=>seau", "carnaval=>châle", "camping=>mouches",
"jambon=>gratin", "embauche=>entretien", "mathématiques=>formule",
"fourmi=>cigale", "routine=>habitudes", "aigle=>blason", "rame=>main",
"thym=>terreau", "copie=>papyrus", "piles=>batterie", "gras=>beurre",
"faon=>biche", "cire=>bougie", "qualité=>disgrâce", "vitesse=>matériel",
"maîtresse=>compère", "ordures=>benne", "fromages=>gruyère",
"collège=>brevet", "araignée=>tour", "serviettes=>torchons",
"lions=>enceinte", "note=>garçon", "règle=>matière", "muet=>sourd",
"aide=>soutien", "juges=>tribunal", "pot=>caoutchouc", "boeuf=>viande",
"enfants=>préposés", "soleil=>tache", "voiture=>évènement",
"pièces=>fichier", "cigarettes=>cendrier", "sommeil=>cernes",
"oiseau=>nid", "cartes=>tarot", "avenir=>voisine", "photos=>album",
"mort=>coffret", "amour=>dessin", "matin=>minuteur", "ami=>voisin",
"rêve=>songe", "dent=>plombage", "mari=>épouse", "tante=>oncle",
"nuit=>virtuoses", "doigt=>bague", "raisin=>cageot", "piscine=>mouvement",
"carottes=>bête", "noisette=>vermine", "haricot=>flageolet",
"fraise=>framboise", "agneau=>gigot", "mouette=>goéland", "baguettes=>loisir",
"tentacules=>chimère", "esquimaux=>village", "sauvetage=>barque",
"ordinateurs=>informaticien", "coupable=>innocent", "sieste=>hamac",
"rapidité=>lenteur", "star=>personne", "ver=>insecte", "magie=>homme",
"nage=>planche", "slip=>caleçon", "mare=>canards", "défi=>challenge",
"hibou=>chouette", "couteau=>truc", "tristesse=>plaintes", "limites=>politiciens",
"employés=>patron", "explosion=>bombe", "flèches=>arc", "vendredi=>congrès",
"entreprise=>firme", "sucre=>bracelet", "fous=>place", "bar=>salon",
"bébé=>tétine", "roses=>rosier", "belle=>beau", "parc=>terrain",
"rond=>carré")), row.names = c(NA, -100L), class = c("tbl_df",
"tbl", "data.frame"))
# extract and store Google Ngram frequencies for each Ngram_pattern
for (i in 1:nrow(Ngram_data))
{
# launch Google Ngram query and extract the results that are displayed with viewer
google_ngram_tmp <- ngram(Ngram_data$Xgram_pattern[i],
corpus = "fre_2019",
year_start = 1800,
year_end = 2019,
smoothing = 0,
case_ins = TRUE,
aggregate = TRUE)
closeAllConnections()
# returns the Ngram index at each iteration to keep track of the computation
print (paste("Ngram index", i, sep = ": ") )
# print(showConnections(all = FALSE))
}
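One thing worth trying against the loop above (an untested sketch; the pause and retry values are guesses): wait between requests and retry a failed query a few times before giving up, so one dropped connection doesn't kill the whole run.

```r
# Hypothetical retry wrapper around ngram(): back off and retry a failed
# request a few times before moving on to the next pattern.
fetch_with_retry <- function(pattern, tries = 3, pause = 5) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(
      ngram(pattern, corpus = "fre_2019", year_start = 1800,
            year_end = 2019, smoothing = 0, case_ins = TRUE,
            aggregate = TRUE),
      error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(pause * attempt)  # wait a little longer after each failure
  }
  NULL  # give up on this pattern after 'tries' attempts
}
```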
Hello,
Great package and has been very helpful already! One minor bug, when using ngram(..., case_ins = TRUE) the correct data is extracted however the automatic output still states "Case-sensitive: TRUE".
This may be intentional, since the frequencies are indeed case-sensitive: it just extracts both the frequency for lowercase and the frequency for uppercase. I just thought I'd mention it because I'm not sure what this part of the output would be used for otherwise. Thanks for your contribution to the R community!
Blessings,
David
I'm trying to use the 'counts' parameter to derive the non-fiction frequencies from the eng_2019 and eng_fiction_2019 data. My assumption was that the eng_fiction_2019 count would always be less than or equal to the eng_2019 count. This does not appear to be the case in all instances.
I'm also assuming that count/frequency is the total and that the differences between the counts and totals allows me to calculate the non-fiction frequency.
Have I got this wrong?
My test case is (html + HTML).
Cheers,
Andrew
I have tried install.packages() and install_local().
All return an error that this is not a macOS binary package:
Error in install.packages : file ‘/var/folders/dq/kg1jg0jj7pldyf4k_30pqgg80000gn/T//Rtmp3rwJw3/downloaded_packages/cli_3.4.1.tgz’ is not a macOS binary package
I am using R 4.1
This simple call from the README seems to now systematically return a NULL:
ng <- ngram(c("hacker", "programmer"), year_start = 1950)
I have tried the current dev version 1.9.2 as well as 1.9.1 and 1.9.0, with the same result.
It seems to me that the query gets formed properly; I was able to land on the right page using the query that was generated. Pulling the html data does not generate an error, and the html itself appears to hold the data we want to extract. It's when the html data is fed into the function ngram_fetch_data that there apparently is an issue. I don't know html/xml well enough to identify the problem, but I presume that Google must have changed something.
Perhaps you could add this function to the other ggplot2 helpers in the GGally package.
It would be interesting if ggram
could get an ignore.case = TRUE
option that would call ngrami
instead of ngram
. I am not sure I understand the code well enough to implement this (the smoothing
and aggregate
arguments are mysterious to me).
When I run these commands:
library(ngramr)
ng <- ngram(c("hacker", "programmer"), year_start = 1950)
I get this message:
Error in stringr::str_split(grep("drawD3Chart", json, value = TRUE), ",")[[1]] :
subscript out of bounds
In addition: Warning message:
In ngram_fetch_data(html) : NAs introduced by coercion