ropensci-archive / crminer
:no_entry: ARCHIVED :no_entry: Fetch 'Scholary' Full Text from 'Crossref'
License: Other
For crm_links, we need to change http to https in links from Pensoft; the Crossref API returns http links.
Note: having SSL certificate issues with requests to these links though, may be temporary.
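A minimal sketch of the intended rewrite (upgrade_pensoft is a hypothetical helper name, not an existing crminer function; anchoring the pattern avoids corrupting links that are already https):

```r
# Hypothetical sketch: upgrade Pensoft links from http to https.
# The anchored pattern only touches the scheme, never the rest of the URL.
upgrade_pensoft <- function(url) {
  if (grepl("pensoft", url, fixed = TRUE)) {
    url <- sub("^http://", "https://", url)
  }
  url
}

upgrade_pensoft("http://phytokeys.pensoft.net/article/123/download/pdf/")
#> [1] "https://phytokeys.pensoft.net/article/123/download/pdf/"
```

Non-Pensoft links pass through unchanged, so the rewrite can run on every link crm_links returns.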
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.8.5 rcrossref_1.0.0.91 crminer_0.3.2.91
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 plyr_1.8.6 pillar_1.4.3 compiler_3.6.3 later_1.0.0
[6] pdftools_2.3 tools_3.6.3 digest_0.6.25 jsonlite_1.6.1 lifecycle_0.2.0
[11] tibble_3.0.1 pkgconfig_2.0.3 rlang_0.4.5 bibtex_0.4.2.2 shiny_1.4.0.2
[16] rstudioapi_0.11 curl_4.3 crul_0.9.0 fastmap_1.0.1 xml2_1.3.2
[21] stringr_1.4.0 vctrs_0.2.4 htmlwidgets_1.5.1 askpass_1.1 rappdirs_0.3.1
[26] triebeard_0.3.0 DT_0.13 tidyselect_1.0.0 httpcode_0.3.0 glue_1.4.0
[31] qpdf_1.1 R6_2.4.1 tidyr_1.0.2 purrr_0.3.4 hoardr_0.5.2
[36] magrittr_1.5 urltools_1.7.3 promises_1.1.0 ellipsis_0.3.0 htmltools_0.4.0
[41] assertthat_0.2.1 mime_0.9 xtable_1.8-4 httpuv_1.5.2 stringi_1.4.6
[46] miniUI_0.1.1.1 crayon_1.3.4
In the following example, it seems that I cannot extract the content from the PDF for some Wiley articles.
> doi <- "10.1111/1477-9552.12353"
> l <- crm_links(doi)
$pdf
<url> https://onlinelibrary.wiley.com/doi/pdf/10.1111/1477-9552.12353
$xml
<url> https://onlinelibrary.wiley.com/doi/full-xml/10.1111/1477-9552.12353
$unspecified
<url> https://onlinelibrary.wiley.com/doi/pdf/10.1111/1477-9552.12353
> crm_pdf(l, overwrite_unspecified = T)
Downloading pdf...
Extracting text from pdf...
PDF error: May not be a PDF file (continuing anyway)
PDF error (6): Illegal character <21> in hex string
PDF error (8): Illegal character <4f> in hex string
.........
PDF error (588): Illegal character <22> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.
My understanding of this error message is that the download completed and the crm_text() (crm_pdf()) function then moved on to extract the content and encountered an error there. I have tried both the pdf and unspecified links (they are essentially the same link) and both give the same error message.
Just for some further tests, I copied and pasted the link into a web browser and downloaded the PDF. I then ran
pdftools::pdf_text("~/Downloads/1477-9552.12353.pdf")
This gave me the correct results.
Right now the return value varies by input and by what happens internally; it should always return a list.
Inside of cr_auth, we need a fix: if a DOI's owner is a different publisher than the one the article for that DOI lives at, we run into a problem because cr_auth only handles specific publishers.
e.g. 10.1111/j.1468-0297.1997.tb00019.x - the crossref member is Cambridge, but the article for that DOI lives at Wiley
e.g., this would be especially helpful with Wiley, where they give unspecified as the content-type
related to ropensci/rcrossref#72
test out
when you don't have access to a paper, at least for some papers, Elsevier gives a 200 response, returns the first page only, and includes the header:
< X-ELS-Status: WARNING - Response limited to first page because requestor not entitled to resource
any thoughts @mark-fangzhou-xie ?
via #41
Transition away from using click through keys.
Maybe add a warning/message telling users that the click-through keys will be useless in 2021 and beyond, but ideally we need to be able to say what they can use instead.
In 2021, remove support for keys, and then error if a user uses a click through key
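A sketch of that phased deprecation (check_clickthrough_key is a hypothetical helper name, and the year parameter exists only to make the behavior explicit and testable; the real package would read the current date):

```r
# Hypothetical sketch: warn about click-through keys before 2021,
# error from 2021 onward, and do nothing if no key is supplied.
check_clickthrough_key <- function(key,
                                   year = as.integer(format(Sys.Date(), "%Y"))) {
  if (is.null(key) || !nzchar(key)) return(invisible(NULL))
  if (year >= 2021L) {
    stop("Crossref click-through keys are no longer supported", call. = FALSE)
  }
  warning("Click-through keys will stop working in 2021", call. = FALSE)
}
```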
an example we have in the docs
dois_crminer_ccby3[40]
#> "10.1088/1742-6596/689/1/012019"
links <- crm_links(dois_crminer_ccby3[40], "all")
# crm_text(links, 'pdf')
The example of calling crm_text is commented out because it 404s, which is fine, but the file is not deleted on exit as it should be. The URL it gives is http://stacks.iop.org/1742-6596/689/i=1/a=012019/pdf, which causes a problem in the internal function make_file_path(), resulting in a filename of pdf.
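The bad filename falls out of taking the last path segment of the URL: the IOP link ends in a bare /pdf, so basename() returns just "pdf". A sketch of a fallback that builds the filename from the DOI instead (make_file_path2 is a hypothetical name; the real make_file_path() internals may differ):

```r
iop_url <- "http://stacks.iop.org/1742-6596/689/i=1/a=012019/pdf"
basename(iop_url)
#> [1] "pdf"

# Hypothetical fallback: when the URL's last path segment is just the
# extension, derive the filename from the DOI instead.
make_file_path2 <- function(url, doi, type = "pdf") {
  nm <- basename(url)
  if (nm == type || !nzchar(tools::file_path_sans_ext(nm))) {
    nm <- paste0(gsub("[^A-Za-z0-9._-]", "_", doi), ".", type)
  }
  nm
}

make_file_path2(iop_url, "10.1088/1742-6596/689/1/012019")
#> [1] "10.1088_1742-6596_689_1_012019.pdf"
```

URLs whose last segment is a real filename (e.g. .../4-1-1b.pdf) are left alone.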
just implemented for pdfs for now
Hi @sckott,
Thank you for the great package! I was wondering if there is a way to:
If not, would you recommend using crm_cache$list() to keep the link between the reference information and the downloaded filenames?
Thanks!
Julien
Should ideally handle a URL as a string or the output of crm_links, but it doesn't really handle a URL as a string right now. Use S3 methods for this, which will also catch other classes being passed in and fail with a good message.
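A minimal sketch of the S3 approach (crm_fetch is a hypothetical generic, not an existing crminer function; tdmurl is the class crm_links() output carries):

```r
# Hypothetical sketch: dispatch on input class so both a plain URL string
# and crm_links() output are accepted, with a clear error for anything else.
crm_fetch <- function(x, ...) UseMethod("crm_fetch")

crm_fetch.character <- function(x, ...) {
  list(url = x)  # wrap a bare URL into a common shape
}

crm_fetch.tdmurl <- function(x, ...) {
  list(url = unclass(x)[[1]])  # unwrap the URL stored by crm_links()
}

crm_fetch.default <- function(x, ...) {
  stop("crm_fetch requires a URL string or crm_links() output, not ",
       class(x)[1], call. = FALSE)
}
```

The default method is what produces the "good message" for unsupported classes.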
I have this type of error when trying to get full-text data.
this is my command:
crm_text(crm_links('10.1016/s1090-9516(97)90007-9'))
And this is my error:
Error in curl::curl_fetch_memory(x$url$url, handle = x$url$handle) :
Protocol "httpss" not supported or disabled in libcurl
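The "httpss" protocol in the error suggests an unanchored http-to-https substitution being applied to a URL that is already https. A sketch reproducing the corruption and the anchored fix (the Elsevier URL here is illustrative, not necessarily the actual link for this DOI):

```r
url <- "https://linkinghub.elsevier.com/retrieve/pii/S1090951697900079"

# An unanchored replacement corrupts an already-https URL:
sub("http", "https", url)
#> [1] "httpss://linkinghub.elsevier.com/retrieve/pii/S1090951697900079"

# Anchoring the pattern leaves https URLs alone:
sub("^http://", "https://", url)
#> [1] "https://linkinghub.elsevier.com/retrieve/pii/S1090951697900079"
```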
This is my session info:
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] crminer_0.2.0 lubridate_1.7.4 XML_3.99-0.3 svMisc_1.1.0 rplos_0.8.6
[6] aRxiv_0.5.19 rcrossref_0.9.2 fulltext_1.4.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 pdftools_2.3 utf8_1.1.4 assertthat_0.2.1
[5] digest_0.6.25 mime_0.9 R6_2.4.1 plyr_1.8.5
[9] httr_1.4.1 ggplot2_3.2.1 pillar_1.4.3 rlang_0.4.4
[13] lazyeval_0.2.2 curl_4.3 rstudioapi_0.11 miniUI_0.1.1.1
[17] whisker_0.4 rentrez_1.2.2 DT_0.12 qpdf_1.1
[21] urltools_1.7.3 stringr_1.4.0 htmlwidgets_1.5.1 triebeard_0.3.0
[25] munsell_0.5.0 shiny_1.4.0 compiler_3.6.1 httpuv_1.5.2
[29] pkgconfig_2.0.3 askpass_1.1 htmltools_0.4.0 tidyselect_1.0.0
[33] tibble_2.1.3 solrium_1.1.4 httpcode_0.2.0 microdemic_0.5.0
[37] fansi_0.4.1 crayon_1.3.4 dplyr_0.8.4 hoardr_0.5.2
[41] later_1.0.0 rappdirs_0.3.1 crul_0.9.0 grid_3.6.1
[45] jsonlite_1.6.1 xtable_1.8-4 gtable_0.3.0 lifecycle_0.1.0
[49] magrittr_1.5 storr_1.2.1 scales_1.1.0 bibtex_0.4.2.2
[53] cli_2.0.2 stringi_1.4.6 reshape2_1.4.3 fauxpas_0.2.0
[57] promises_1.1.0 xml2_1.2.2 vctrs_0.2.3 tools_3.6.1
[61] glue_1.3.1 purrr_0.3.3 fastmap_1.0.1 colorspace_1.4-1
This does work:
l <- crm_links("10.2903/j.efsa.2016.4556",type="all")
crm_pdf(l)
while this not:
l <- crm_links("10.2903/j.efsa.2014.3550",type="all")
crm_pdf(l)
The root cause of the error is that the Crossref API returns content-type 'unspecified' for the second case.
"message" : {
"link" : [
{
"URL" : "https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.2903%2Fj.efsa.2014.3550",
"content-version" : "vor",
"content-type" : "unspecified",
"intended-application" : "text-mining"
}
],
I can get it working by manually overriding the content type, like this:
l <- setNames(l, "pdf")
attr(l, "type") <- "pdf"
text <- crm_text(l, type = "pdf")
but this is of course a hack.
After investigating the code, I think there is an inconsistency between crm_links(), which returns a mime-type 'unspecified', and the crm_text() method, which cannot handle 'unspecified'. I think crm_text() should be changed to handle 'unspecified' and just plainly download the file and write it to disk, the same way a download with curl does. Using curl, the mime-type 'unspecified' does not stop me from downloading.
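One way to implement this (a sketch only; sniff_type is a hypothetical helper, not part of crminer) is to download 'unspecified' links unconditionally and then infer the real type from the file's magic bytes, since PDF files always begin with "%PDF-":

```r
# Hypothetical sketch: infer the content type of a downloaded file
# from its first bytes rather than trusting Crossref's 'unspecified'.
sniff_type <- function(path) {
  magic <- rawToChar(readBin(path, "raw", n = 5L))
  if (identical(magic, "%PDF-")) return("pdf")
  if (startsWith(magic, "<"))    return("xml")
  "unknown"
}
```

crm_text() could then route the file to the PDF or XML extraction path based on the sniffed type.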
I will take a look at this and send a PR if I find a solution.
Thank you for the library - it's really great.
Does it allow the download of research papers from Sage Journals?
According to the Sage Journals Text and Data Mining policy, it is recommended to download articles through the CrossRef Text and Data Mining API.
I was therefore hoping to pass the DOI for papers from SAGE Journals, but unsure if/how to insert or obtain an authentication token?
Kind Regards
skip and/or use vcr for http request calls
fix in webmockr and vcr coming in their next versions, wait for those next versions to be up
Use case from email.
A user gave examples of DOIs for a journal they have access to: they can access the PDFs in the browser, but via API calls they cannot access the full text. The non-accessible-via-API DOIs appear to all be in the range 1993-2003. Here's 5 example DOIs for this scenario.
The PDFs for these DOIs do exist, but as far as I can tell there's no way to figure out the URLs for those PDFs.
crminer_0.3.5.91
> library(crminer)
> link <- crm_links("10.1086/250113")
> link
$unspecified
<url> http://www.journals.uchicago.edu/doi/pdf/10.1086/250113
> ft <- crm_text(link, "pdf", overwrite_unspecified = T)
using cached file: /Users/xiefangzhou/Library/Caches/R/crminer/250113.pdf
date created (size, mb): 2020-06-12 22:59:56 (0)
Extracting text from pdf...
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.
Sorry for posting this, as this is clearly similar to #41 here and others, but this time it happens for U Chicago Press. The full-text link can be copied and pasted to a web browser and opened as a PDF file.
use case in fulltext
where it'd be nice to be able to do so.
Hello,
I'm trying to download a list of all journal articles from the Wiley journal 'Global Ecology & Biogeography' for some subsequent text mining analysis. My current aim is to download the PDFs for the DOIs. I believe I've managed to get a list of DOIs using the following code:
library(rcrossref)
library(crminer)
library(purrr)
library(dplyr)
geb_papers <- cr_works(filter = c(issn = '1466-822X'))
n_pages <- ceiling(geb_papers$meta$total_results / geb_papers$meta$items_per_page)
geb_dois <- map_dfr(1:n_pages, function(x) {
  # get the start record for the page
  strt <- (x - 1) * 20 + 1
  dois <- cr_works(filter = c(issn = '1466-8238'), offset = strt)$data %>%
    filter(type == "journal-article")
  return(dois)
})
dois <- geb_dois$doi
So far, so good (I think). The next stage is to then use crm_links() on the list of DOIs like so:
links<-lapply(dois, crm_links, type="pdf")
If I query the object links, I notice that there is no PDF link for the first item in the list. If we trace back the DOI for this article we get:
dois[[1]]
[1] "10.1111/geb.12139"
Stick that DOI into a browser and you'll be taken to the article here. So there's something there and, in theory, should be downloadable, correct?
Any ideas what may be wrong? The above code should be reproducible, so I was hoping some kind person would be able to spare 5 minutes to run it on a subset of the data (perhaps the first 10 entries to speed things up) to see whether they hit the same issue. Out of the first 20 elements in the list links, I have 9 articles missing PDF links, which is quite a high percentage.
To confirm, I have an auth token for Crossref in .Renviron, so I don't think it's an authentication issue. This is also backed up by the fact that the second element in links has a PDF link and can be downloaded using crm_pdf(links[[2]]).
Thanks in advance!
Simon
Probably related to many recent changes in crm_text and crm_pdf, but I haven't been able to sort out what's going on. Seems fine when commenting out the vcr usage though, so maybe something to do with file caching/writing to disk.
Using the Click-Through Service
Some publishers will require you to use the CrossRef click-through service. This allows you to agree to supplementary licenses. For more information see the Click-Through Service documentation. When you use the click-through service you will be given a token. You should supply this as a header when you make the query to full-text. Here is an example request using a click-through service token:
curl -H "CR-TDM-Client-Token: hZqJDbcbKSSRgRG_PJxSBAx" -k "https://annalsofpsychoceramics.labs.crossref.org/fulltext/515151" -D - -L -O
Not sure what I was originally trying to say here; I guess just make sure to support click-through as well as possible.
This package needs more documentation! Help out the community by contributing a vignette. If you don't know what a vignette is, check out http://r-pkgs.had.co.nz/vignettes.html for an introduction.
If you aren't sure how to contribute on github checkout https://github.com/ropensci/crminer/blob/master/.github/CONTRIBUTING.md
Keep in mind our code of conduct https://github.com/ropensci/crminer/blob/master/CONDUCT.md
in attributes, in case it's needed downstream
"The user shouldn't be able to pass in unspecified, but some URLs passed in as a result of crm_links will have unspecified. The type parameter is meant to say which type you want if there are many options (e.g., xml if there's plain and xml), or to override the unspecified type (e.g., you know the link is for pdf, so put type = "pdf" AND overwrite_unspecified = TRUE)."
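The rule described in that quote could be sketched as follows (pick_link is a hypothetical helper; the real selection logic inside crm_text() may differ):

```r
# Hypothetical sketch: pick the requested link type if present; fall back
# to an 'unspecified' link only when the caller explicitly opts in.
pick_link <- function(links, type, overwrite_unspecified = FALSE) {
  if (!is.null(links[[type]])) return(links[[type]])
  if (overwrite_unspecified && !is.null(links$unspecified)) {
    return(links$unspecified)
  }
  stop("no links for type ", type, call. = FALSE)
}

l <- list(unspecified = "https://example.com/doi/pdf/10.1086/250113")
pick_link(l, "pdf", overwrite_unspecified = TRUE)
#> [1] "https://example.com/doi/pdf/10.1086/250113"
```

Without overwrite_unspecified = TRUE, the same call errors with "no links for type pdf", matching the behavior reported elsewhere in these issues.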
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3
Sorry for my repeated posting of issues. This time I am working on journals from Oxford University Press.
> l <- crm_links("10.1093/icc/4.1.1-a")
> l
$unspecified
<url> http://academic.oup.com/icc/article-pdf/4/1/1/6768751/4-1-1b.pdf
> crm_text(l, "pdf", overwrite_unspecified = T)
Downloading pdf...
Error in curl::curl_fetch_disk(x$url$url, x$disk, handle = x$url$handle) :
Recv failure: Operation timed out
I can confirm that I can open this link in a browser, but calling the crm_text() function throws a timeout error. I tried to use curl -o in the terminal but got the same timeout error.
I then tried to run an RSelenium browser and fetch that full-text link. It displayed the article (in PDF) properly in the automated chromedriver.
library(RSelenium)
browser <- remoteDriver(port = 5556, browserName="chrome")
browser$open()
browser$navigate( as.character(l$unspecified))
I think that their server has some JavaScript testing, and curl-based HTTP requests will fail. (I am not very familiar with this in R, but I guess it is the same as with the Python "requests" package, which cannot deal with dynamically rendered elements.) I believe the current workaround would be to use RSelenium, download the PDF, and then extract plain text from it.
I wonder if there are better methods to deal with this without using Selenium?
DOI <- c("10.1007/S10531-017-1376-Y","10.1002/ECS2.1309","10.1614/IPSM-D-14-00048.1","10.1890/14-0922.1","10.1093/AOBPLA/PLU081","10.1007/S10530-014-0705-2","10.2111/REM-D-13-00140.1")
links <- sapply(DOI, crminer::crm_links)
Above is a list of DOIs, some of which are from Wiley. crminer will generate links from the Wiley DOIs, but the links labeled $pdf are invalid. In some cases crminer generates valid links labeled unspecified, but in some cases it doesn't, and I can't figure out enough of a pattern to exploit that usefully.
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] crminer_0.3.3.93
loaded via a namespace (and not attached):
[1] hoardr_0.5.2 compiler_3.6.3 R6_2.4.1 tools_3.6.3 httpcode_0.3.0 curl_4.3
[7] rappdirs_0.3.1 Rcpp_1.0.4.6 urltools_1.7.3 pdftools_2.3 triebeard_0.3.0 crul_0.9.0
[13] qpdf_1.1 jsonlite_1.6.1 digest_0.6.25 askpass_1.1
> doi <- "10.1017/s0081305200012255"
> link <- crm_links(doi)
> crm_text(link)
Error in crm_text.list(link) : no links for type xml
> link
$unspecified
<url> https://www.cambridge.org/core/services/aop-cambridge-core/content/view/S0081305200012255
> crm_text(link, "pdf", overwrite_unspecified = T)
Downloading pdf...
Extracting text from pdf...
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.
and possibly others, e.g.,
url <- "https://api.elsevier.com/content/article/PII:S0370269310012608?httpAccept=text/plain"
crm_plain(url)
#> [1] "<service-error><status><statusCode>INVALID_INPUT</statusCode><statusText>View parameter specified in request is not valid</statusText></status></service-error>"
I'm guessing it's due to missing attributes (probably doi and crossref member in particular); see:
link <- crm_links("10.1016/j.physletb.2010.10.049", "plain")
z <- as_tdmurl(url, "plain")
str(link)
#> List of 1
#> $ plain: chr "https://api.elsevier.com/content/article/PII:S0370269310012608?httpAccept=text/plain"
#> - attr(*, "class")= chr "tdmurl"
#> - attr(*, "type")= chr "plain"
#> - attr(*, "doi")= chr "10.1016/j.physletb.2010.10.049"
#> - attr(*, "member")= chr "78"
#> - attr(*, "intended_application")= chr "text-mining"
str(z)
#> List of 1
#> $ plain: chr "https://api.elsevier.com/content/article/PII:S0370269310012608?httpAccept=text/plain"
#> - attr(*, "class")= chr "tdmurl"
#> - attr(*, "type")= chr "plain"