crminer's People

Contributors

behrica, coatless, graceli8, maelle, njahn82, salim-b, sckott


crminer's Issues

pensoft - http->https

For crm_links, we need to change http to https in links from Pensoft; the Crossref API returns http links.

Note: requests to these links are currently hitting SSL certificate issues, though; that may be temporary.
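The fix itself is a simple scheme rewrite on the links crm_links returns. A minimal sketch (the helper name is hypothetical, not part of the package):

```r
# Hypothetical helper: force https on Pensoft links that the
# Crossref API hands back with an http scheme.
fix_pensoft_scheme <- function(url) {
  sub("^http://", "https://", url)
}

fix_pensoft_scheme("http://journals.pensoft.net/article/1234.pdf")
# "https://journals.pensoft.net/article/1234.pdf"
```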

wiley full text issue

Session Info
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.8.5        rcrossref_1.0.0.91 crminer_0.3.2.91  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6      plyr_1.8.6        pillar_1.4.3      compiler_3.6.3    later_1.0.0      
 [6] pdftools_2.3      tools_3.6.3       digest_0.6.25     jsonlite_1.6.1    lifecycle_0.2.0  
[11] tibble_3.0.1      pkgconfig_2.0.3   rlang_0.4.5       bibtex_0.4.2.2    shiny_1.4.0.2    
[16] rstudioapi_0.11   curl_4.3          crul_0.9.0        fastmap_1.0.1     xml2_1.3.2       
[21] stringr_1.4.0     vctrs_0.2.4       htmlwidgets_1.5.1 askpass_1.1       rappdirs_0.3.1   
[26] triebeard_0.3.0   DT_0.13           tidyselect_1.0.0  httpcode_0.3.0    glue_1.4.0       
[31] qpdf_1.1          R6_2.4.1          tidyr_1.0.2       purrr_0.3.4       hoardr_0.5.2     
[36] magrittr_1.5      urltools_1.7.3    promises_1.1.0    ellipsis_0.3.0    htmltools_0.4.0  
[41] assertthat_0.2.1  mime_0.9          xtable_1.8-4      httpuv_1.5.2      stringi_1.4.6    
[46] miniUI_0.1.1.1    crayon_1.3.4

In the following example, it seems that I cannot extract the content from the PDF for some Wiley articles.

> doi <- "10.1111/1477-9552.12353"
> l <- crm_links(doi)
$pdf
<url> https://onlinelibrary.wiley.com/doi/pdf/10.1111/1477-9552.12353

$xml
<url> https://onlinelibrary.wiley.com/doi/full-xml/10.1111/1477-9552.12353

$unspecified
<url> https://onlinelibrary.wiley.com/doi/pdf/10.1111/1477-9552.12353

> crm_pdf(l, overwrite_unspecified = T)
Downloading pdf...
Extracting text from pdf...
PDF error: May not be a PDF file (continuing anyway)
PDF error (6): Illegal character <21> in hex string
PDF error (8): Illegal character <4f> in hex string
.........
PDF error (588): Illegal character <22> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.

My understanding of this error message is that the download completed and the crm_text() (crm_pdf()) function moved on to extract the content, encountering an error there. I have tried both the pdf and unspecified links (they are essentially the same URL) and both give the same error message.

As a further test, I copied and pasted the link into a web browser and downloaded the PDF. Then I ran

pdftools::pdf_text("~/Downloads/1477-9552.12353.pdf")

This gave me the correct results.

crm_text/etc fails for some Wiley DOIs

Inside cr_auth we need a fix: if the owner of a DOI is a different publisher than the one hosting the article for that DOI, we run into a problem, because cr_auth only handles specific publishers.

e.g., for 10.1111/j.1468-0297.1997.tb00019.x the Crossref member is Cambridge, but the article for that DOI lives at Wiley

elsevier first page thing

When you don't have access to a paper, Elsevier (at least for some papers) gives a 200 response, returns only the first page, and sets the header:

< X-ELS-Status: WARNING - Response limited to first page because requestor not entitled to resource
  • crminer should look for this header and warn the user about this
  • should we keep the first page and treat the request as successful, or detect the header, fail out, and remove the file containing the first page?
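The detection side of the first bullet could be as simple as inspecting the response headers for that status string. A sketch, with a hypothetical helper name and the header taken from the issue above:

```r
# Sketch: warn when Elsevier's X-ELS-Status header says the response
# was truncated to the first page (helper name is illustrative).
warn_if_first_page_only <- function(headers) {
  status <- headers[["x-els-status"]]
  if (!is.null(status) &&
      grepl("limited to first page", status, ignore.case = TRUE)) {
    warning("Elsevier returned only the first page: ",
            "requestor not entitled to the full resource")
    return(TRUE)
  }
  FALSE
}

hdrs <- list(`x-els-status` =
  "WARNING - Response limited to first page because requestor not entitled to resource")
warn_if_first_page_only(hdrs)  # TRUE, with a warning
```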

any thoughts @mark-fangzhou-xie ?

Crossref TDM click through service going away end of 2020

Transition away from using click through keys.

Maybe add a warning/message telling users that click-through keys will be useless in 2021 and beyond; ideally we should also be able to say what they can use instead.

In 2021, remove support for the keys, and error if a user supplies a click-through key.
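The phased warn-then-error behavior described above could be sketched like this (the function name and cutoff handling are illustrative, not the package's actual code):

```r
# Hypothetical sketch: warn before the 2021 cutoff, error after it.
check_clickthrough_key <- function(key, as_of = Sys.Date()) {
  if (is.null(key)) return(invisible(NULL))
  if (as_of >= as.Date("2021-01-01")) {
    stop("Crossref click-through keys are no longer supported ",
         "(the service ended at the end of 2020)")
  }
  warning("Crossref click-through keys stop working in 2021; ",
          "check publisher TDM documentation for alternatives")
}
```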

make_file_path internal fxn: handle weird urls

an example we have in the docs

dois_crminer_ccby3[40]
#> "10.1088/1742-6596/689/1/012019"
links <- crm_links(dois_crminer_ccby3[40], "all")
# crm_text(links, 'pdf')

The example call to crm_text is commented out because it 404s, which is fine, but the file is not deleted on exit as it should be.

The URL it gives is http://stacks.iop.org/1742-6596/689/i=1/a=012019/pdf, which causes a problem in the internal fxn make_file_path() and results in a filename of just pdf.
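One way to harden the filename derivation is to fall back to a DOI-derived name when the URL's last path segment is just the content type. A sketch, assuming a hypothetical helper (not the real make_file_path):

```r
# Hypothetical sketch: derive a sane filename, falling back to the
# DOI when the URL ends in a bare type segment like ".../pdf".
make_file_name <- function(url, doi, type = "pdf") {
  base <- basename(url)
  if (base == type || !grepl("\\.", base)) {
    base <- paste0(gsub("[^A-Za-z0-9]", "_", doi), ".", type)
  }
  base
}

make_file_name("http://stacks.iop.org/1742-6596/689/i=1/a=012019/pdf",
               "10.1088/1742-6596/689/1/012019")
# "10_1088_1742_6596_689_1_012019.pdf"
```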

crm_text - Specifying absolute path and file names for PDFs

Hi @sckott,

Thank you for the great package! I was wondering if there is a way to:

  • save the downloaded pdfs outside the cache location
  • rename the PDFs

If not, would you recommend using crm_cache$list() to keep the link between the reference information and the downloaded filenames?

Thanks!
Julien

tidy up crm_text/etc. fxns

Should ideally handle a URL as a string or the output of crm_links, but it doesn't really handle a URL string right now.

Use S3 methods for this; that will also prevent easily passing in other classes, with a good error message.
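The S3 idea above can be sketched briefly. The function and class names below follow the package's vocabulary, but this is an illustration of the dispatch pattern, not the real implementation:

```r
# Sketch of S3 dispatch: accept a bare URL string or a tdmurl-classed
# object, and fail early with a clear message for anything else.
crm_text2 <- function(url, ...) UseMethod("crm_text2")

crm_text2.character <- function(url, ...) {
  # wrap a bare URL so downstream code sees a single shape
  structure(list(pdf = url), class = "tdmurl", type = "pdf")
}

crm_text2.tdmurl <- function(url, ...) url

crm_text2.default <- function(url, ...) {
  stop("crm_text2 accepts a URL string or crm_links() output, not ",
       class(url)[1], call. = FALSE)
}
```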

curl problem

I have this type of error when trying to get full-text data.
this is my command:

crm_text(crm_links('10.1016/s1090-9516(97)90007-9'))

And this is my error:

Error in curl::curl_fetch_memory(x$url$url, handle = x$url$handle) :
Protocol "httpss" not supported or disabled in libcurl

This is my sessionInfo():

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] crminer_0.2.0 lubridate_1.7.4 XML_3.99-0.3 svMisc_1.1.0 rplos_0.8.6
[6] aRxiv_0.5.19 rcrossref_0.9.2 fulltext_1.4.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 pdftools_2.3 utf8_1.1.4 assertthat_0.2.1
[5] digest_0.6.25 mime_0.9 R6_2.4.1 plyr_1.8.5
[9] httr_1.4.1 ggplot2_3.2.1 pillar_1.4.3 rlang_0.4.4
[13] lazyeval_0.2.2 curl_4.3 rstudioapi_0.11 miniUI_0.1.1.1
[17] whisker_0.4 rentrez_1.2.2 DT_0.12 qpdf_1.1
[21] urltools_1.7.3 stringr_1.4.0 htmlwidgets_1.5.1 triebeard_0.3.0
[25] munsell_0.5.0 shiny_1.4.0 compiler_3.6.1 httpuv_1.5.2
[29] pkgconfig_2.0.3 askpass_1.1 htmltools_0.4.0 tidyselect_1.0.0
[33] tibble_2.1.3 solrium_1.1.4 httpcode_0.2.0 microdemic_0.5.0
[37] fansi_0.4.1 crayon_1.3.4 dplyr_0.8.4 hoardr_0.5.2
[41] later_1.0.0 rappdirs_0.3.1 crul_0.9.0 grid_3.6.1
[45] jsonlite_1.6.1 xtable_1.8-4 gtable_0.3.0 lifecycle_0.1.0
[49] magrittr_1.5 storr_1.2.1 scales_1.1.0 bibtex_0.4.2.2
[53] cli_2.0.2 stringi_1.4.6 reshape2_1.4.3 fauxpas_0.2.0
[57] promises_1.1.0 xml2_1.2.2 vctrs_0.2.3 tools_3.6.1
[61] glue_1.3.1 purrr_0.3.3 fastmap_1.0.1 colorspace_1.4-1

Crossref API reports the mimetype 'unspecified' for some Wiley full text PDFs, and therefore crm_pdf fails

This does work:

l <- crm_links("10.2903/j.efsa.2016.4556",type="all")
crm_pdf(l)

while this does not:

l <- crm_links("10.2903/j.efsa.2014.3550",type="all")
crm_pdf(l)

The root cause of the error is that the Crossref API returns content-type 'unspecified' in the second case.

"message" : {
      "link" : [
         {
            "URL" : "https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.2903%2Fj.efsa.2014.3550",
            "content-version" : "vor",
            "content-type" : "unspecified",
            "intended-application" : "text-mining"
         }
      ],

I can get it working by manually overriding the content type, like this:

l <- setNames(l, "pdf")
  attr(l, "type") <- "pdf"
  text <- crm_text(l, type = "pdf")

but this is of course a hack.

After investigating the code, I think there is an inconsistency between crm_links(), which returns a mime-type of 'unspecified', and crm_text(), which cannot handle 'unspecified'.

I think crm_text() should be changed to handle 'unspecified' and just plainly download the file and write it to disk, in the same way a download with curl does. Using curl, the mime-type being 'unspecified' does not stop me from downloading.
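Concretely, the proposal amounts to adding a fallback branch for 'unspecified' instead of erroring. A toy sketch of the dispatch, where the handler names are hypothetical placeholders:

```r
# Sketch: route content types to handlers; 'unspecified' falls back
# to a plain raw download rather than an error (names illustrative).
pick_handler <- function(type) {
  switch(type,
         pdf         = "extract_pdf",
         xml         = "parse_xml",
         plain       = "read_text",
         unspecified = "download_raw",  # proposed: just fetch the bytes
         stop("unsupported content-type: ", type))
}

pick_handler("unspecified")
# "download_raw"
```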

I will take a look at this and send a PR if I find a solution.

Sage Journals

Thank you for the library - it's really great.

Does it allow the download of research papers from Sage Journals?

According to the Sage Journals Text and Data Mining policy, it is recommended to download articles through the CrossRef Text and Data Mining API.

I was therefore hoping to pass the DOI for papers from SAGE Journals, but unsure if/how to insert or obtain an authentication token?

Kind Regards

fix tests

skip and/or use vcr for http request calls

elsevier full text issues

Use case from email.

The user gave examples of DOIs for a journal they have access to: they can access the PDFs in the browser, but cannot access the full text via API calls. The DOIs that are not accessible via the API all appear to be in the range 1993-2003. Here are 5 example DOIs for this scenario:

  • 10.1006/jeth.1993.1066
  • 10.1006/jeth.1994.1072
  • 10.1006/jeth.1995.1078
  • 10.1006/jeth.1996.0123
  • 10.1006/jeth.1997.2332

The PDFs for these DOIs do exist, but as far as I can tell there's no way to figure out the URLs for those PDFs.

Chicago press full text issue

Session Info
crminer_0.3.5.91
> library(crminer)
> link <- crm_links("10.1086/250113")
> link
$unspecified
<url> http://www.journals.uchicago.edu/doi/pdf/10.1086/250113

> ft <- crm_text(link, "pdf", overwrite_unspecified = T)
using cached file: /Users/xiefangzhou/Library/Caches/R/crminer/250113.pdf
date created (size, mb): 2020-06-12 22:59:56 (0)
Extracting text from pdf...
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.

Sorry for posting this, as it is clearly similar to #41 and others, but this time it happens for U Chicago Press. The full-text link can be copied and pasted into a web browser and opened as a PDF file.

PDFs from Wiley

Hello,

I'm trying to download a list of all journal articles from the Wiley journal 'Global Ecology & Biogeography' for some subsequent text mining analysis. My current aim is to download the PDFs for the DOIs. I believe I've managed to get a list of DOIs using the following code:

library(rcrossref)
library(crminer)
library(purrr)
library(dplyr)

geb_papers <- cr_works(filter = c(issn='1466-822X')) 

n_pages <- ceiling(geb_papers$meta$total_results / geb_papers$meta$items_per_page)

geb_dois<- map_dfr(1:n_pages, function(x) {
  # get the start record for the page
  strt <- (x - 1)*20 + 1
  dois <- cr_works(filter = c(issn='1466-8238'), offset = strt)$data %>% 
    filter(type == "journal-article")
  return(dois)
})

dois<-geb_dois$doi

So far, so good (I think). The next stage is to then use crm_links() on the list of DOIs like so:

links<-lapply(dois, crm_links, type="pdf")

If I query the object links, I notice that there is no PDF link for the first item in the list. If we trace back the DOI for this article we get:

dois[[1]]
[1] "10.1111/geb.12139"

Stick that DOI into a browser and you'll be taken to the article. So there's something there and, in theory, it should be downloadable, correct?

Any ideas what may be wrong? The above code should be reproducible so I was hoping some kind person would be able to spare 5 minutes to run it on a subset of the data (perhaps the first 10 entries to speed things up) to see whether they hit the same issue. Out of the first 20 elements in the list links I have 9 articles missing PDF links, which is quite a high percentage.

To confirm, I have an auth token for crossref in .Renviron, so I don't think it's an authentication issue. This is also backed up with the fact that the second element in links has a PDF link and can be downloaded when using crm_pdf(links[[2]]).

Thanks in advance!
Simon

Some crm_pdf/cr_text tests failing

Probably related to the many recent changes in crm_text and crm_pdf, but I haven't been able to sort out what's going on. It seems fine when commenting out the vcr usage, though, so maybe something to do with file caching/writing to disk.

click through...

Using the Click-Through Service

Some publishers require you to use the Crossref click-through service, which allows you to agree to supplementary licenses. For more information see the Click-Through Service documentation. When you use the click-through service you are given a token, which you should supply as a header when you make the full-text request. Here is an example request using a click-through service token:

curl -H "CR-TDM-Client-Token: hZqJDbcbKSSRgRG_PJxSBAx" -k "https://annalsofpsychoceramics.labs.crossref.org/fulltext/515151" -D - -L -O

Not sure what I was originally trying to say here; I guess just make sure to support click-through as well as possible.

remove unspecified from type param in crm_text

"The user shouldn't be able to pass in unspecified, but some URLs passed in as a result of crm_links will have unspecified, but the type parameter is meant to say which type you want if there are many options (e.g, xml if there's plain and xml), or to override the unspecified type (e.g., you know the link is for pdf, so put type = "pdf" AND overwrite_unspecified = TRUE)"

oxford full text issue

Session Info
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3 

Sorry for my repeated issue posting. This time I am working on journals from Oxford University Press.

> l <- crm_links("10.1093/icc/4.1.1-a")
> l
$unspecified
<url> http://academic.oup.com/icc/article-pdf/4/1/1/6768751/4-1-1b.pdf
> crm_text(l, "pdf", overwrite_unspecified = T)
Downloading pdf...
Error in curl::curl_fetch_disk(x$url$url, x$disk, handle = x$url$handle) : 
  Recv failure: Operation timed out

I can confirm that I can open this link in a browser, but calling the crm_text() function throws a timeout error. I tried curl -o in the terminal and got the same timeout error.

I then tried to run RSelenium browser and fetch that full-text link. It displayed the article (in PDF) properly in the automated chromedriver.

library(RSelenium)
browser <- remoteDriver(port = 5556, browserName="chrome")
browser$open()
browser$navigate( as.character(l$unspecified))

I think their server does some JavaScript-based checking, and curl-based HTTP requests fail it. (I am not very familiar with this in R, but I guess it is the same as with the Python "requests" package, which cannot deal with dynamically rendered elements.) I believe the current workaround would be to use RSelenium to download the PDF and then extract plain text from it.

I wonder if there are better methods to deal with this without using Selenium?

crminer generates invalid link for DOIs from Wiley

DOI <- c("10.1007/S10531-017-1376-Y","10.1002/ECS2.1309","10.1614/IPSM-D-14-00048.1","10.1890/14-0922.1","10.1093/AOBPLA/PLU081","10.1007/S10530-014-0705-2","10.2111/REM-D-13-00140.1")
links <- sapply(DOI, crminer::crm_links)

Above is a list of DOIs, some of which are from Wiley. crminer will generate links for the Wiley DOIs, but the links labeled as $pdf are invalid. In some cases crminer generates valid links labeled as unspecified, but in other cases it doesn't, and I can't figure out enough of a pattern to exploit that usefully.

cambridge full text issue

Session Info
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] crminer_0.3.3.93

loaded via a namespace (and not attached):
 [1] hoardr_0.5.2    compiler_3.6.3  R6_2.4.1        tools_3.6.3     httpcode_0.3.0  curl_4.3       
 [7] rappdirs_0.3.1  Rcpp_1.0.4.6    urltools_1.7.3  pdftools_2.3    triebeard_0.3.0 crul_0.9.0     
[13] qpdf_1.1        jsonlite_1.6.1  digest_0.6.25   askpass_1.1    
> doi <- "10.1017/s0081305200012255"
> link <- crm_links(doi)
> crm_text(link)
Error in crm_text.list(link) : no links for type xml
> link
$unspecified
<url> https://www.cambridge.org/core/services/aop-cambridge-core/content/view/S0081305200012255

> crm_text(link, "pdf", overwrite_unspecified = T)
Downloading pdf...
Extracting text from pdf...
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : PDF parsing failure.

I think this is connected to #41 and #40?

crm_plain doesn't seem to work with url input for elsevier

and possibly others, e.g.,

url <- "https://api.elsevier.com/content/article/PII:S0370269310012608?httpAccept=text/plain"
crm_plain(url)
#> [1] "<service-error><status><statusCode>INVALID_INPUT</statusCode><statusText>View parameter specified in request is not valid</statusText></status></service-error>"

I'm guessing it's due to missing attributes (probably doi and crossref member in particular), see:

link <- crm_links("10.1016/j.physletb.2010.10.049", "plain")
z <- as_tdmurl(url, "plain")

str(link)
#> List of 1
#>  $ plain: chr "https://api.elsevier.com/content/article/PII:S0370269310012608?httpAccept=text/plain"
#>  - attr(*, "class")= chr "tdmurl"
#>  - attr(*, "type")= chr "plain"
#>  - attr(*, "doi")= chr "10.1016/j.physletb.2010.10.049"
#>  - attr(*, "member")= chr "78"
#>  - attr(*, "intended_application")= chr "text-mining"

str(z)
#> List of 1
#>  $ plain: chr "https://api.elsevier.com/content/article/PII:S0370269310012608?httpAccept=text/plain"
#>  - attr(*, "class")= chr "tdmurl"
#>  - attr(*, "type")= chr "plain"
