ropensci-archive / crminer
:no_entry: ARCHIVED :no_entry: Fetch 'Scholary' Full Text from 'Crossref'
License: Other
For crm_links, we need to change http to https in links from Pensoft; the Crossref API returns http links.
Note: having SSL certificate issues with requests to these links though, may be temporary.
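A minimal sketch of the intended rewrite (upgrade_pensoft is a hypothetical helper name, not an existing crminer function; anchoring the pattern avoids corrupting links that are already https):

```r
# Hypothetical sketch: upgrade Pensoft links from http to https.
# The anchored pattern only touches the scheme, never the rest of the URL.
upgrade_pensoft <- function(url) {
  if (grepl("pensoft", url, fixed = TRUE)) {
    url <- sub("^http://", "https://", url)
  }
  url
}

upgrade_pensoft("http://phytokeys.pensoft.net/article/123/download/pdf/")
#> [1] "https://phytokeys.pensoft.net/article/123/download/pdf/"
```

Non-Pensoft links pass through unchanged, so the rewrite can run on every link crm_links returns.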
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.8.5 rcrossref_1.0.0.91 crminer_0.3.2.91
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 plyr_1.8.6 pillar_1.4.3 compiler_3.6.3 later_1.0.0
[6] pdftools_2.3 tools_3.6.3 digest_0.6.25 jsonlite_1.6.1 lifecycle_0.2.0
[11] tibble_3.0.1 pkgconfig_2.0.3 rlang_0.4.5 bibtex_0.4.2.2 shiny_1.4.0.2
[16] rstudioapi_0.11 curl_4.3 crul_0.9.0 fastmap_1.0.1 xml2_1.3.2
[21] stringr_1.4.0 vctrs_0.2.4 htmlwidgets_1.5.1 askpass_1.1 rappdirs_0.3.1
[26] triebeard_0.3.0 DT_0.13 tidyselect_1.0.0 httpcode_0.3.0 glue_1.4.0
[31] qpdf_1.1 R6_2.4.1 tidyr_1.0.2 purrr_0.3.4 hoardr_0.5.2
[36] magrittr_1.5 urltools_1.7.3 promises_1.1.0 ellipsis_0.3.0 htmltools_0.4.0
[41] assertthat_0.2.1 mime_0.9 xtable_1.8-4 httpuv_1.5.2 stringi_1.4.6
[46] miniUI_0.1.1.1 crayon_1.3.4
In the following example, it seems that I cannot extract the content from the PDF for some Wiley articles.
> doi <- "10.1111/1477-9552.12353"
> l <- crm_links(doi)
$pdf
<url> https://onlinelibrary.wiley.com/doi/pdf/10.1111/1477-9552.12353
$xml
<url> https://onlinelibrary.wiley.com/doi/full-xml/10.1111/1477-9552.12353
$unspecified
<url> https://onlinelibrary.wiley.com/doi/pdf/10.1111/1477-9552.12353
> crm_pdf(l, overwrite_unspecified = T)
Downloading pdf...
Extracting text from pdf...
PDF error: May not be a PDF file (continuing anyway)
PDF error (6): Illegal character <21> in hex string
PDF error (8): Illegal character <4f> in hex string
.........
PDF error (588): Illegal character <22> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.
My understanding of this error message is that the download completed and the crm_text() (crm_pdf()) function then moved on to extract the content and encountered an error there. I have tried both the pdf and unspecified links (they are essentially the same link) and both give the same error message.
Just for some further tests, I copied and pasted the link into a web browser and downloaded the PDF. I then ran
pdftools::pdf_text("~/Downloads/1477-9552.12353.pdf")
This gave me the correct results.
Right now the return value varies by input and by what happens internally; it should always return a list.
Inside of cr_auth, we need a fix: if a DOI's owner is a different publisher than the one the article for that DOI lives at, we run into a problem because cr_auth only handles specific publishers.
e.g. 10.1111/j.1468-0297.1997.tb00019.x - the crossref member is Cambridge, but the article for that DOI lives at Wiley
e.g., this would be especially helpful with Wiley, where they give unspecified as the content-type
related to ropensci/rcrossref#72
test out
when you don't have access to a paper, at least for some papers, Elsevier gives a 200 response, returns the first page only, and includes the header:
< X-ELS-Status: WARNING - Response limited to first page because requestor not entitled to resource
any thoughts @mark-fangzhou-xie ?
via #41
Transition away from using click through keys.
Maybe add a warning/message telling users that the click-through keys will be useless in 2021 and beyond, but ideally we need to be able to say what they can use instead.
In 2021, remove support for keys, and then error if a user uses a click through key
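A sketch of that phased deprecation (check_clickthrough_key is a hypothetical helper name, and the year parameter exists only to make the behavior explicit and testable; the real package would read the current date):

```r
# Hypothetical sketch: warn about click-through keys before 2021,
# error from 2021 onward, and do nothing if no key is supplied.
check_clickthrough_key <- function(key,
                                   year = as.integer(format(Sys.Date(), "%Y"))) {
  if (is.null(key) || !nzchar(key)) return(invisible(NULL))
  if (year >= 2021L) {
    stop("Crossref click-through keys are no longer supported", call. = FALSE)
  }
  warning("Click-through keys will stop working in 2021", call. = FALSE)
}
```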
an example we have in the docs
dois_crminer_ccby3[40]
#> "10.1088/1742-6596/689/1/012019"
links <- crm_links(dois_crminer_ccby3[40], "all")
# crm_text(links, 'pdf')
The example of calling crm_text is commented out because it 404s, which is fine, but the file is not deleted on exit as it should be. The URL it gives is http://stacks.iop.org/1742-6596/689/i=1/a=012019/pdf, which causes a problem in the internal function make_file_path(), resulting in a filename of pdf.
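The bad filename falls out of taking the last path segment of the URL: the IOP link ends in a bare /pdf, so basename() returns just "pdf". A sketch of a fallback that builds the filename from the DOI instead (make_file_path2 is a hypothetical name; the real make_file_path() internals may differ):

```r
iop_url <- "http://stacks.iop.org/1742-6596/689/i=1/a=012019/pdf"
basename(iop_url)
#> [1] "pdf"

# Hypothetical fallback: when the URL's last path segment is just the
# extension, derive the filename from the DOI instead.
make_file_path2 <- function(url, doi, type = "pdf") {
  nm <- basename(url)
  if (nm == type || !nzchar(tools::file_path_sans_ext(nm))) {
    nm <- paste0(gsub("[^A-Za-z0-9._-]", "_", doi), ".", type)
  }
  nm
}

make_file_path2(iop_url, "10.1088/1742-6596/689/1/012019")
#> [1] "10.1088_1742-6596_689_1_012019.pdf"
```

URLs whose last segment is a real filename (e.g. .../4-1-1b.pdf) are left alone.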
just implemented for pdfs for now
Hi @sckott,
Thank you for the great package! I was wondering if there is a way to:
If not, would you recommend using crm_cache$list() to keep the link between the reference information and the downloaded filenames?
Thanks!
Julien
Should ideally handle a URL as a string or the output of crm_links, but it doesn't really handle a URL as a string right now. Use S3 methods for this, which will also catch other classes being passed in and fail with a good message.
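A minimal sketch of the S3 approach (crm_fetch is a hypothetical generic, not an existing crminer function; tdmurl is the class crm_links() output carries):

```r
# Hypothetical sketch: dispatch on input class so both a plain URL string
# and crm_links() output are accepted, with a clear error for anything else.
crm_fetch <- function(x, ...) UseMethod("crm_fetch")

crm_fetch.character <- function(x, ...) {
  list(url = x)  # wrap a bare URL into a common shape
}

crm_fetch.tdmurl <- function(x, ...) {
  list(url = unclass(x)[[1]])  # unwrap the URL stored by crm_links()
}

crm_fetch.default <- function(x, ...) {
  stop("crm_fetch requires a URL string or crm_links() output, not ",
       class(x)[1], call. = FALSE)
}
```

The default method is what produces the "good message" for unsupported classes.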
I have this type of error when trying to get full-text data.
this is my command:
crm_text(crm_links('10.1016/s1090-9516(97)90007-9'))
And this is my error:
Error in curl::curl_fetch_memory(x$url$url, handle = x$url$handle) :
Protocol "httpss" not supported or disabled in libcurl
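The "httpss" protocol in the error suggests an unanchored http-to-https substitution being applied to a URL that is already https. A sketch reproducing the corruption and the anchored fix (the Elsevier URL here is illustrative, not necessarily the actual link for this DOI):

```r
url <- "https://linkinghub.elsevier.com/retrieve/pii/S1090951697900079"

# An unanchored replacement corrupts an already-https URL:
sub("http", "https", url)
#> [1] "httpss://linkinghub.elsevier.com/retrieve/pii/S1090951697900079"

# Anchoring the pattern leaves https URLs alone:
sub("^http://", "https://", url)
#> [1] "https://linkinghub.elsevier.com/retrieve/pii/S1090951697900079"
```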
This is my session info:
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] crminer_0.2.0 lubridate_1.7.4 XML_3.99-0.3 svMisc_1.1.0 rplos_0.8.6
[6] aRxiv_0.5.19 rcrossref_0.9.2 fulltext_1.4.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 pdftools_2.3 utf8_1.1.4 assertthat_0.2.1
[5] digest_0.6.25 mime_0.9 R6_2.4.1 plyr_1.8.5
[9] httr_1.4.1 ggplot2_3.2.1 pillar_1.4.3 rlang_0.4.4
[13] lazyeval_0.2.2 curl_4.3 rstudioapi_0.11 miniUI_0.1.1.1
[17] whisker_0.4 rentrez_1.2.2 DT_0.12 qpdf_1.1
[21] urltools_1.7.3 stringr_1.4.0 htmlwidgets_1.5.1 triebeard_0.3.0
[25] munsell_0.5.0 shiny_1.4.0 compiler_3.6.1 httpuv_1.5.2
[29] pkgconfig_2.0.3 askpass_1.1 htmltools_0.4.0 tidyselect_1.0.0
[33] tibble_2.1.3 solrium_1.1.4 httpcode_0.2.0 microdemic_0.5.0
[37] fansi_0.4.1 crayon_1.3.4 dplyr_0.8.4 hoardr_0.5.2
[41] later_1.0.0 rappdirs_0.3.1 crul_0.9.0 grid_3.6.1
[45] jsonlite_1.6.1 xtable_1.8-4 gtable_0.3.0 lifecycle_0.1.0
[49] magrittr_1.5 storr_1.2.1 scales_1.1.0 bibtex_0.4.2.2
[53] cli_2.0.2 stringi_1.4.6 reshape2_1.4.3 fauxpas_0.2.0
[57] promises_1.1.0 xml2_1.2.2 vctrs_0.2.3 tools_3.6.1
[61] glue_1.3.1 purrr_0.3.3 fastmap_1.0.1 colorspace_1.4-1
This does work:
l <- crm_links("10.2903/j.efsa.2016.4556",type="all")
crm_pdf(l)
while this not:
l <- crm_links("10.2903/j.efsa.2014.3550",type="all")
crm_pdf(l)
The root cause of the error is that the Crossref API returns content-type 'unspecified' for the second case.
"message" : {
"link" : [
{
"URL" : "https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.2903%2Fj.efsa.2014.3550",
"content-version" : "vor",
"content-type" : "unspecified",
"intended-application" : "text-mining"
}
],
I can get it working by manually overriding the content type, like this:
l <- setNames(l, "pdf")
attr(l, "type") <- "pdf"
text <- crm_text(l, type = "pdf")
but this is of course a hack.
After investigating the code, I think there is an inconsistency between crm_links(), which returns a mime-type 'unspecified', and the crm_text() method, which cannot handle 'unspecified'. I think crm_text() should be changed to handle 'unspecified' and just plainly download the file and write it to disk, the same way a download with curl does. Using curl, the mime-type 'unspecified' does not stop me from downloading.
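One way to implement this (a sketch only; sniff_type is a hypothetical helper, not part of crminer) is to download 'unspecified' links unconditionally and then infer the real type from the file's magic bytes, since PDF files always begin with "%PDF-":

```r
# Hypothetical sketch: infer the content type of a downloaded file
# from its first bytes rather than trusting Crossref's 'unspecified'.
sniff_type <- function(path) {
  magic <- rawToChar(readBin(path, "raw", n = 5L))
  if (identical(magic, "%PDF-")) return("pdf")
  if (startsWith(magic, "<"))    return("xml")
  "unknown"
}
```

crm_text() could then route the file to the PDF or XML extraction path based on the sniffed type.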
I will take a look at this and send a PR if I find a solution.
Thank you for the library - it's really great.
Does it allow the download of research papers from Sage Journals?
According to the Sage Journals Text and Data Mining policy, it is recommended to download articles through the CrossRef Text and Data Mining API.
I was therefore hoping to pass the DOI for papers from SAGE Journals, but unsure if/how to insert or obtain an authentication token?
Kind Regards
skip and/or use vcr for http request calls
fix in webmockr and vcr coming in their next versions, wait for those next versions to be up
Use case from email.
A user gave examples of DOIs for a journal they have access to: they can access the PDFs in the browser, but via API calls they cannot access the full text. The non-accessible-via-API DOIs appear to all be in the range 1993-2003. Here's 5 example DOIs for this scenario.
The PDFs for these DOIs do exist, but as far as I can tell there's no way to figure out the URLs for those PDFs.
crminer_0.3.5.91
> library(crminer)
> link <- crm_links("10.1086/250113")
> link
$unspecified
<url> http://www.journals.uchicago.edu/doi/pdf/10.1086/250113
> ft <- crm_text(link, "pdf", overwrite_unspecified = T)
using cached file: /Users/xiefangzhou/Library/Caches/R/crminer/250113.pdf
date created (size, mb): 2020-06-12 22:59:56 (0)
Extracting text from pdf...
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.
Sorry for posting this, as this is clearly similar to #41 here and others, but this time it happens for U Chicago Press. The full-text link can be copied and pasted to a web browser and opened as a PDF file.
use case in fulltext
where it'd be nice to be able to do so.
Hello,
I'm trying to download a list of all journal articles from the Wiley journal 'Global Ecology & Biogeography' for some subsequent text mining analysis. My current aim is to download the PDFs for the DOIs. I believe I've managed to get a list of DOIs using the following code:
library(rcrossref)
library(crminer)
library(purrr)
library(dplyr)
geb_papers <- cr_works(filter = c(issn = '1466-822X'))
n_pages <- ceiling(geb_papers$meta$total_results / geb_papers$meta$items_per_page)
geb_dois <- map_dfr(1:n_pages, function(x) {
  # get the start record for the page
  strt <- (x - 1) * 20 + 1
  dois <- cr_works(filter = c(issn = '1466-8238'), offset = strt)$data %>%
    filter(type == "journal-article")
  return(dois)
})
dois <- geb_dois$doi
So far, so good (I think). The next stage is to then use crm_links() on the list of DOIs like so:
links<-lapply(dois, crm_links, type="pdf")
If I query the object links, I notice that there is no PDF link for the first item in the list. If we trace back the DOI for this article we get:
dois[[1]]
[1] "10.1111/geb.12139"
Stick that DOI into a browser and you'll be taken to the article here. So there's something there and, in theory, should be downloadable, correct?
Any ideas what may be wrong? The above code should be reproducible, so I was hoping some kind person would be able to spare 5 minutes to run it on a subset of the data (perhaps the first 10 entries to speed things up) to see whether they hit the same issue. Out of the first 20 elements in the list links, I have 9 articles missing PDF links, which is quite a high percentage.
To confirm, I have an auth token for Crossref in .Renviron, so I don't think it's an authentication issue. This is also backed up by the fact that the second element in links has a PDF link and can be downloaded using crm_pdf(links[[2]]).
Thanks in advance!
Simon
Probably related to many recent changes in crm_text and crm_pdf, but I haven't been able to sort out what's going on. Seems fine when commenting out the vcr usage though, so maybe something to do with file caching/writing to disk.
Using the Click-Through Service
Some publishers will require you to use the CrossRef click-through service. This allows you to agree to supplementary licenses. For more information see the Click-Through Service documentation. When you use the click-through service you will be given a token. You should supply this as a header when you make the query to full-text. Here is an example request using a click-through service token:
curl -H "CR-TDM-Client-Token: hZqJDbcbKSSRgRG_PJxSBAx" -k "https://annalsofpsychoceramics.labs.crossref.org/fulltext/515151" -D - -L -O
Not sure what I was originally trying to say here; I guess just make sure to support click-through as well as possible.
This package needs more documentation! Help out the community by contributing a vignette. If you don't know what a vignette is, check out http://r-pkgs.had.co.nz/vignettes.html for an introduction.
If you aren't sure how to contribute on github checkout https://github.com/ropensci/crminer/blob/master/.github/CONTRIBUTING.md
Keep in mind our code of conduct https://github.com/ropensci/crminer/blob/master/CONDUCT.md
in attributes, in case it's needed downstream
"The user shouldn't be able to pass in unspecified, but some URLs passed in as a result of crm_links will have unspecified. The type parameter is meant to say which type you want if there are many options (e.g., xml if there's plain and xml), or to override the unspecified type (e.g., you know the link is for pdf, so put type = "pdf" AND overwrite_unspecified = TRUE)."
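The rule described in that quote could be sketched as follows (pick_link is a hypothetical helper; the real selection logic inside crm_text() may differ):

```r
# Hypothetical sketch: pick the requested link type if present; fall back
# to an 'unspecified' link only when the caller explicitly opts in.
pick_link <- function(links, type, overwrite_unspecified = FALSE) {
  if (!is.null(links[[type]])) return(links[[type]])
  if (overwrite_unspecified && !is.null(links$unspecified)) {
    return(links$unspecified)
  }
  stop("no links for type ", type, call. = FALSE)
}

l <- list(unspecified = "https://example.com/doi/pdf/10.1086/250113")
pick_link(l, "pdf", overwrite_unspecified = TRUE)
#> [1] "https://example.com/doi/pdf/10.1086/250113"
```

Without overwrite_unspecified = TRUE, the same call errors with "no links for type pdf", matching the behavior reported elsewhere in these issues.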
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3
Sorry for my repeated posting of issues. This time I am working on journals from Oxford University Press.
> l <- crm_links("10.1093/icc/4.1.1-a")
> l
$unspecified
<url> http://academic.oup.com/icc/article-pdf/4/1/1/6768751/4-1-1b.pdf
> crm_text(l, "pdf", overwrite_unspecified = T)
Downloading pdf...
Error in curl::curl_fetch_disk(x$url$url, x$disk, handle = x$url$handle) :
Recv failure: Operation timed out
I can confirm that I can open this link in a browser, but calling the crm_text() function throws a timeout error. I tried to use curl -o in the terminal but got the same timeout error.
I then tried to run an RSelenium browser and fetch that full-text link. It displayed the article (in PDF) properly in the automated chromedriver.
library(RSelenium)
browser <- remoteDriver(port = 5556, browserName="chrome")
browser$open()
browser$navigate( as.character(l$unspecified))
I think that their server has some JavaScript testing, and curl-based HTTP requests will fail. (I am not very familiar with this in R, but I guess it is the same as with the Python "requests" package, which cannot deal with dynamically rendered elements.) I believe the current workaround would be to use RSelenium, download the PDF, and then extract plain text from it.
I wonder if there are better methods to deal with this without using Selenium?
DOI <- c("10.1007/S10531-017-1376-Y","10.1002/ECS2.1309","10.1614/IPSM-D-14-00048.1","10.1890/14-0922.1","10.1093/AOBPLA/PLU081","10.1007/S10530-014-0705-2","10.2111/REM-D-13-00140.1")
links <- sapply(DOI, crminer::crm_links)
Above is a list of DOIs, some of which are from Wiley. crminer will generate links from the Wiley DOIs, but the links labeled $pdf are invalid. In some cases crminer generates valid links labeled unspecified, but in some cases it doesn't, and I can't figure out enough of a pattern to exploit that usefully.
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] crminer_0.3.3.93
loaded via a namespace (and not attached):
[1] hoardr_0.5.2 compiler_3.6.3 R6_2.4.1 tools_3.6.3 httpcode_0.3.0 curl_4.3
[7] rappdirs_0.3.1 Rcpp_1.0.4.6 urltools_1.7.3 pdftools_2.3 triebeard_0.3.0 crul_0.9.0
[13] qpdf_1.1 jsonlite_1.6.1 digest_0.6.25 askpass_1.1
> doi <- "10.1017/s0081305200012255"
> link <- crm_links(doi)
> crm_text(link)
Error in crm_text.list(link) : no links for type xml
> link
$unspecified
<url> https://www.cambridge.org/core/services/aop-cambridge-core/content/view/S0081305200012255
> crm_text(link, "pdf", overwrite_unspecified = T)
Downloading pdf...
Extracting text from pdf...
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.
and possibly others, e.g.,
url <- "https://api.elsevier.com/content/article/PII:S0370269310012608?httpAccept=text/plain"
crm_plain(url)
#> [1] "<service-error><status><statusCode>INVALID_INPUT</statusCode><statusText>View parameter specified in request is not valid</statusText></status></service-error>"
I'm guessing it's due to missing attributes (probably doi and crossref member in particular); see:
link <- crm_links("10.1016/j.physletb.2010.10.049", "plain")
z <- as_tdmurl(url, "plain")
str(link)
#> List of 1
#> $ plain: chr "https://api.elsevier.com/content/article/PII:S0370269310012608?httpAccept=text/plain"
#> - attr(*, "class")= chr "tdmurl"
#> - attr(*, "type")= chr "plain"
#> - attr(*, "doi")= chr "10.1016/j.physletb.2010.10.049"
#> - attr(*, "member")= chr "78"
#> - attr(*, "intended_application")= chr "text-mining"
str(z)
#> List of 1
#> $ plain: chr "https://api.elsevier.com/content/article/PII:S0370269310012608?httpAccept=text/plain"
#> - attr(*, "class")= chr "tdmurl"
#> - attr(*, "type")= chr "plain"