mjwestgate / synthesisr

Data import and deduplication for evidence synthesis projects

R 100.00%

synthesisr's Introduction

Martin Westgate

Hi! I'm Martin. I lead the Science & Decision Support Team at the Atlas of Living Australia. You can find out more about my work at martinwestgate.com.


synthesisr's Issues

Error with `read_refs()` when trying to parse an APA PsycNet RIS file

I encountered an error while trying to parse an exported RIS file from APA PsycNet using read_refs().

Error in end_rows[1:(length(end_rows) - 1)] : 
  only 0's may be mixed with negative subscripts

Tip: using 1:(length(end_rows) - 1) is bad practice. Use, for example, seq_len(length(end_rows) - 1) instead.

This error occurs in the function synthesisr:::prep_ris(). The function puts the ER (end_row) in the wrong column of the z_dframe object. If you debug this function and set the correct index for the end_row, all goes well.

I don't have the time to dig deeper into this problem right now. I will try to do this later.
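For anyone hitting the same message, here is a minimal sketch of why that indexing pattern fails. It assumes end_rows ends up empty, which is a guess about what happens inside prep_ris() and not something confirmed from the source, and it uses a made-up vector instead of a real RIS file. Since seq_len() also rejects a negative length, head(x, -1) is shown as one option that tolerates empty input.

```r
# Assumption: no ER rows were detected, so the vector is empty
end_rows <- integer(0)

# 1:(0 - 1) counts down and yields c(1, 0, -1),
# which mixes positive and negative subscripts
idx <- 1:(length(end_rows) - 1)
failed <- inherits(try(end_rows[idx], silent = TRUE), "try-error")

# head() with n = -1 drops the last element and
# simply returns an empty vector when the input is empty
all_but_last_empty <- head(end_rows, -1)
all_but_last <- head(c(10L, 20L, 30L), -1)
```

Whether dropping the last end row is the right logic for prep_ris() is a separate question; the sketch only covers the subscript error itself.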

vectorization in fuzz_m_ratio()

The documentation for fuzz_m_ratio() says:

Arguments
a A character vector of items to match to b.

b A character vector of items to match to a.

method The method to use for fuzzy matching.

Value
Returns a score of same length as b, giving the proportional dissimilarity between a and b.

Problem

I am not sure how to interpret the fact that a is permitted to be a character vector with more than a single item.

I guess that the most common use case of the fuzz_ functions is with a as a single item. In this case, fuzz_m_ratio() works as expected and recycles a over all items in b:

b <- c("a", "b", "ab")
fuzz_m_ratio("a", b)

[1] 0.0000000 1.0000000 0.3333333

According to the standard R vectorization rules (see for example R for Data Science), if a is a vector with the same number of items as b, then the items from a and b should be paired up and the result should be the pairwise dissimilarity.

But this isn't what happens. For example:

fuzz_m_ratio(c("a", "b", "ab"), b)

[1] 1 1 1

Here, each item in a is identical to the corresponding item in b so we should get 0 0 0 dissimilarity.

It isn't the case that the function is simply reporting similarity instead of dissimilarity here. For example:

fuzz_m_ratio(c("a", "b", "x"), b)

[1] 1 1 1

Here some items from a match their counterparts in b and some don't, so we should get both 0s and 1s in the answer.

It looks as though the same inconsistencies occur for at least some of the other fuzz_ functions too.

Suggestions

I am not sure what is intended for the case of multiple items in a, and from the source code I can't quite work out what is actually being done. Possibly the function is only intended for single-item a. If this is the case, then maybe the documentation needs amending, and perhaps also an error or warning message if a has multiple items, since some R users will probably expect the standard vectorization rules.
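The first option (rejecting multi-item a) could look roughly like the sketch below. Everything in it is hypothetical: match_single() and toy_distance() are illustrative names, not synthesisr functions, and match_fun just stands in for whichever fuzz_ function is being wrapped.

```r
# Hypothetical wrapper: `match_fun` stands in for any fuzz_ function
match_single <- function(a, b, match_fun) {
  if (!is.character(a) || length(a) != 1L) {
    stop("`a` must be a single character string, not a vector of length ",
         length(a))
  }
  match_fun(a, b)
}

# Toy distance used only for illustration: 0 if identical, 1 otherwise
toy_distance <- function(a, b) as.numeric(a != b)

# A single-item `a` is passed through and recycled over `b`
single_result <- match_single("a", c("a", "b", "ab"), toy_distance)

# A multi-item `a` is rejected with an explicit error
multi_failed <- inherits(
  try(match_single(c("a", "b"), "a", toy_distance), silent = TRUE),
  "try-error"
)
```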

On the other hand, if vectorization is wanted here, then maybe one general solution is to first create a non-vectorized version of the function that calculates the distance for a single pair of strings, for example:

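fmr_nonvectorized <- function(str1, str2){
  # Number of characters in each string
  lengths <- nchar(c(str1, str2))
  len_shortest <- min(lengths)
  # seq_len() keeps the index empty when the shorter string has no characters
  idx <- seq_len(len_shortest)
  # Compare the two strings position by position over the shared length
  char_matches <- unlist(strsplit(str1, ""))[idx] == unlist(strsplit(str2, ""))[idx]
  similarity <- 2 * sum(char_matches) / sum(lengths)
  return(1 - similarity)
}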

fmr_nonvectorized("a", "ab")

[1] 0.3333333

And then apply it to two vectors using mapply(), for example:

fmr_vectorized <- function(a, b){
  distances <- mapply(fmr_nonvectorized, a, b, USE.NAMES = FALSE)
  return(distances)
}

This gives the standard R-style vectorization rules for the examples above:

fmr_vectorized("a", b)

[1] 0.0000000 1.0000000 0.3333333

fmr_vectorized(c("a", "b", "ab"), b)

[1] 0 0 0

fmr_vectorized(c("a", "b", "x"), b)

[1] 0 0 1

fmr_vectorized(c("x", "x", "x"), b)

[1] 1 1 1

And as a bonus this will give the standard warning if the longer of a and b is not an integer multiple of the other:

fmr_vectorized(c("a", "b"), b)

[1] 0.0000000 0.0000000 0.3333333
Warning message:
In mapply(fmr_nonvectorized, a, b, USE.NAMES = FALSE) :
  longer argument not a multiple of length of shorter

Import issue (due to Sys.setlocale)

If I import the following RIS file with read_ref(), authors are returned as the first column - which then breaks write_refs(), where authors are written first, so that re-import (at least with read_ref) fails.

library(synthesisr)

tmp <- tempfile()

download.file("https://raw.githubusercontent.com/ESHackathon/CiteSource/main/tests/testthat/data/final.ris", tmp)

citations <- read_ref(tmp, return_df = TRUE)
write_refs(citations, file = "test-export.ris")
citations <- read_ref("test-export.ris", return_df = TRUE)
#> Error in data.frame(start = which(z_dframe$ris == start_tag), end = end_rows): arguments imply differing number of rows: 101, 17

After quite a lot of troubleshooting, I realized that this is because of Sys.setlocale("LC_ALL", "C") - if that is set, the first characters of the file are read as \357 \273 \277 (the UTF-8 byte order mark), so that the TY in that row is no longer recognised. Given that this breaks everything, I wonder whether it would be worth stripping special characters there? Or not using Sys.setlocale at all ... not sure why it is needed and thus if there is a safer workaround.

Also, as it currently stands, the function silently changes the locale even if it has been customised before - which is not good practice (arguably against CRAN's guidance not to modify the global environment). So the following might be better - though it may need to be set for each required locale category separately?

    old_loc <- Sys.getlocale("LC_CTYPE")
    invisible(Sys.setlocale("LC_CTYPE", "C"))
    on.exit(invisible(Sys.setlocale("LC_CTYPE", old_loc)))
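On the idea of stripping the leading bytes: one possible approach, sketched here under the assumption that the file is UTF-8, is to let base R drop the byte order mark while reading by opening the connection with encoding = "UTF-8-BOM". The file written below is a made-up two-line stand-in, not the CiteSource test file, and this is not how synthesisr currently reads files.

```r
# Write a tiny stand-in file that starts with the UTF-8 byte order mark
tmp <- tempfile(fileext = ".ris")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("TY  - JOUR\nER  - \n")),
  tmp
)

# "UTF-8-BOM" tells the connection to remove a BOM if one is present
con <- file(tmp, encoding = "UTF-8-BOM")
lines <- readLines(con)
close(con)

# The first line now starts with the TY tag and not the BOM bytes
starts_with_ty <- startsWith(lines[1], "TY")
```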

read_refs() introducing 'and' in author lists between first and last names

read_refs() is introducing unwarranted 'and' between first and last names of authors.

read_refs appears to duplicate records when reading in

Hi Martin,

I'm seeing some odd behaviour when reading in a bibliography file (.txt exported in RIS format from EndNote) with synthesisr::read_refs compared to revtools::read_bibliography. The file I am reading in has 729 records and reads fine with revtools::read_bibliography, but synthesisr::read_refs adds a duplicate record, making it 730. See R code + outputs below:

EnvironmentComplete_revtools<-revtools::read_bibliography('3_searching/4_search_outputs/main_searches/raw/EnvironmentComplete_300920.txt')
nrow(EnvironmentComplete_revtools)
[1] 729
EnvironmentComplete_synthesisr<-synthesisr::read_refs('3_searching/4_search_outputs/main_searches/raw/EnvironmentComplete_300920.txt')
nrow(EnvironmentComplete_synthesisr)
[1] 730

In this example, synthesisr::read_refs is adding 1 record, but for other bib files it can be more. I attach the file to see if you can figure it out! Good luck, so long and thanks for all the packages!

Cheers
Matt

EnvironmentComplete_300920.txt

`read_refs()` doesn't load all RIS files properly

For a systematic review (duh 😬) we're loading RIS files exported from:

ERIC (through its own interface, if I remember correctly)
PubMed (definitely through its own interface)
PsycINFO (through EBSCO)
Bielefeld Academic Search Engine (its own interface)
OpenGrey (through Exalead)
OAIster (through WorldCat)
Cinahl (through EBSCO)
Embase (through Ovid)
Reference search (through reference lists and Google Scholar)

These work fine for the first few, but the fields aren't imported properly from Cinahl and Embase. I can't figure out why - the Cinahl file, for example, seems pretty straightforward RIS (https://gitlab.com/extending-the-earcheck/living-review/-/blob/master/queries/literature_search_02/CINAHL_Ebsco_N236.ris), e.g.:

TY  - JOUR
ID  - 147887838
T1  - Hearing Problems Among the Members of the Defence Forces in Relation to Personal and Occupational Risk Factors.
AU  - Luha, Assar
AU  - Kaart, Tanel
AU  - Merisalu, Eda
AU  - Indermitte, Ene
AU  - Orru, Hans
Y1  - 2020/11//Nov/Dec2020
N1  - Accession Number: 147887838. Language: English. Entry Date: In Process. Revision Date: 20210107. Publication Type: journal article; research. Journal Subset: Biomedical; Expert Peer Reviewed; Peer Reviewed; USA. NLM UID: 2984771R. 
SP  - e2115
EP  - e2123
JO  - Military Medicine
JF  - Military Medicine
JA  - MILIT MED
VL  - 185
IS  - 11/12
PB  - Oxford University Press / USA
AB  - Introduction: The Defence Forces' members are exposed to high-level noise that increases their risk of hearing loss (HL). Besides military noise, the other risk factors include age and gender, ototoxic chemicals, vibration, and chronic stress. The current study was designed to study the effects of personal, work conditions-related risk factors, and other health-related traits on the presence of hearing problems.Materials and Methods: A cross-sectional study among active military service members was carried out. Altogether, 807 respondents completed a questionnaire about their health and personal and work-related risk factors in indoor and outdoor environments. The statistical analysis was performed using statistical package of social sciences (descriptive statistics) and R (correlation and regression analysis) software.Results: Almost half of the active service members reported HL during their service period. The most important risk factors predicting HL in the military appeared to be age, gender, and service duration. Also, working in a noisy environment with exposure to technological, vehicle, and impulse noise shows a statistically significant effect on hearing health. Moreover, we could identify the effect of stress on tinnitus and HL during the service period. Most importantly, active service members not using hearing protectors, tend to have more tinnitus than those who use it.Conclusions: The members of the Defence Forces experience noise from various sources, most of it resulting from outdoor activities. Personal and work conditions-related risk factors as well as stress increase the risk of hearing problems.
SN  - 0026-4075
AD  - Institute of Technology , Estonian University of Life Sciences, Kreutzwaldi 56/1, Tartu 51006, Estonia
AD  - Institute of Veterinary Medicine and Animal Sciences , Estonian University of Life Sciences , Kreutzwaldi 62, Tartu 51006, Estonia
AD  - Institute of Family Medicine and Public Health , University of Tartu , Ravila 19, Tartu 50411, Estonia
U2  - PMID: NLM32879984.
DO  - 10.1093/milmed/usaa224
UR  - http://login.ezproxy.ub.unimaas.nl/login?url=https://search.ebscohost.com/login.aspx?direct=true&db=cin20&AN=147887838&site=ehost-live&scope=site
DP  - EBSCOhost
DB  - cin20
ER  -

But if I then run:

refs <- synthesisr::read_refs(here::here("queries", "literature_search_02", "CINAHL_Ebsco_N236.ris"));
names(refs);

It shows:

 [1] "author"         "address"        "date_published" "issue"          "abstract"      
 [6] "DB"             "DO"             "EP"             "ID"             "JA"            
[11] "JF"             "JO"             "PB"             "SN"             "SP"            
[16] "TY"             "UR"             "VL"             "KW"             "CY"            
[21] "AV"

So it doesn't recognize the T1 field as title - but it does drop it from the data frame for some reason. I hope to figure out how the import functions were designed exactly so I can debug this myself (and submit a pull request), but I'm not sure I'll manage, so I'm also posting this here in case others run into similar problems.
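To make the expected behaviour concrete, here is a small sketch of the kind of tag-to-field mapping I had in mind, where both TI and T1 resolve to title. The lookup table is hypothetical and much shorter than anything synthesisr uses internally; it is only meant to show what should happen to T1, not how the package does it.

```r
# Hypothetical lookup table; synthesisr's own tag tables may differ
tag_lookup <- c(TI = "title", T1 = "title", AU = "author", AB = "abstract")

# Made-up RIS lines in the same "XX  - value" layout as the record above
ris_lines <- c(
  "TY  - JOUR",
  "T1  - Example title",
  "AU  - Luha, Assar",
  "ER  - "
)

# Extract the two-character tag at the start of each line
tags <- sub("^([A-Z][A-Z0-9])  - ?.*$", "\\1", ris_lines)

# Known tags get a readable field name; unknown tags keep their code
fields <- unname(ifelse(tags %in% names(tag_lookup), tag_lookup[tags], tags))
```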

Import breaks when bibtex field is empty

Currently, empty bibtex fields break the import - for instance, when Crossref returns an empty author field.

library(synthesisr)
bib <- "@article{68,
  author = {},
  title = {Evaluation of technology transferring: The experiences of the first Navy Domestic Technology Transfair. Final report},
  journal = {Reviews},
  publisher = {Office of Scientific and Technical Information (OSTI)},
  date = {2003},
  year = {2003},
  address = {Arlington, VA},
  doi = {10.2172/10138039}
}"

t <- tempfile(fileext = ".bib")
writeLines(bib, t)
res <- read_ref(t)
#> Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0

Created on 2023-01-26 with reprex v2.0.2
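One way empty fields could be tolerated is to turn an empty pair of braces into NA so that every field still contributes exactly one value. The helper below is a hypothetical sketch for single-line fields only; parse_bib_field() is not a synthesisr function, and the real parser handles many more cases.

```r
# Hypothetical helper: parse one "name = {value}," line of a bibtex entry
parse_bib_field <- function(line) {
  pattern <- "^[[:space:]]*([[:alnum:]_]+)[[:space:]]*=[[:space:]]*[{](.*)[}],?[[:space:]]*$"
  m <- regmatches(line, regexec(pattern, line))[[1]]
  if (length(m) == 0) {
    return(NULL)  # not a field line (e.g. "@article{68," or "}")
  }
  # An empty value becomes NA, so the field keeps length one
  value <- if (nzchar(m[3])) m[3] else NA_character_
  setNames(value, m[2])
}

empty_author <- parse_bib_field("  author = {},")
year <- parse_bib_field("  year = {2003},")
header <- parse_bib_field("@article{68,")
```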

read_refs() incorrectly splitting abstract over multiple fields

read_refs() is incorrectly splitting the abstract in this record across multiple fields:
10.1111/j.1469-8137.2004.01201.x

Push update to CRAN? (file incorrectly read as pubmed in CRAN version)

TLDR: Any chance that the current synthesisr version could be published to CRAN soon?

Why?
In developing the CiteSource package, which relies on synthesisr, we stumbled over a Dimensions file that is wrongly read as PubMed. After a fair bit of troubleshooting, I realised that this is already solved in the GitHub version - but it would be great if we could just rely on the CRAN version ...

Single-line bibtex entries break import

parse_bibtex appears to attempt to skip entries with fewer than 4 lines. However, it does not do so successfully, so the following import breaks. Perhaps it should skip them if such entries are deemed invalid - but then an explicit error message would be helpful.

library(synthesisr)

bib <- "
@article{5,
  author = {Altman, Irwin and Haythorn, William},
  title = {The Effects of Social Isolation and Group Composition on Performance},
  journal = {Human Relations},
  publisher = {SAGE Publications},
  date = {1967-11},
  year = {1967},
  month = {11},
  pages = {313-340},
  volume = {20},
  number = {4},
  doi = {10.1177/001872676702000401}
}

@misc{6,
  author = {Williams, Katherine and Reilly, Charles and Ill}
}"

t <- tempfile(fileext = ".bib")
writeLines(bib, t)

res <- read_ref(t)
#> Error in names(x_final) <- ref_names: 'names' attribute [2] must be the same length as the vector [1]

Created on 2023-01-26 with reprex v2.0.2

add functions from current versions of revtools and litsearchr

read_refs() failing to read in first field of first record

The first record isn't being read in properly - TY-JOUR is being shown in a 'ZZ' field instead of being recognised as source_type. I think this is because few RIS files have a field code signifying the start of the file; 'TY - ' is simply the start of the text file.

See this example from EMBASE: https://gitlab.com/extending-the-earcheck/living-review/-/blob/master/search/literature_search_02/Embase_290521_N974.RIS?expanded=true&viewer=simple

Accept custom data.frame in write_refs

It would be great to accept a custom data.frame for write_refs ... or otherwise allow the user to change the tag naming.

It looks like the custom dataframe was supposed to be implemented (see below) - happy to submit a PR if you would like to allow this to be passed on to write_ris?

synthesisr/R/write_refs.R

Lines 146 to 149 in c406bc9

    
    }else if(inherits(tag_naming, "data.frame")){
      if(any(!(c("code", "field") %in% colnames(tag_naming)))){
        stop("if a data.frame is supplied to replace_tags, it must contain columns 'code' & 'field'.")
      }

read_refs from different sources throws error

If I take several files (one from Scopus, one from WoS, for example) and use read_refs on them, I get an error:

Error in `[.data.frame`(result, , cn) : undefined columns selected

I checked using the same file twice. There is no error in that case.
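My guess, and it is only a guess, is that the column sets differ between sources and the merge step selects columns that one of the data frames lacks. A minimal base R sketch of merging data frames with different columns is below; bind_fill() is a hypothetical helper and the two data frames are made up, so this is not how read_refs combines files.

```r
# Hypothetical helper: row-bind data frames whose columns differ,
# filling columns that a data frame lacks with NA
bind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  filled <- lapply(dfs, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- rep(NA, nrow(df))
    }
    df[all_cols]  # same column order for every data frame
  })
  do.call(rbind, filled)
}

# Made-up stand-ins for two sources with different fields
scopus <- data.frame(title = "A", doi = "10.1/a")
wos <- data.frame(title = "B", source = "WoS")

combined <- bind_fill(list(scopus, wos))
```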

Check for data.frame class in write_refs() too conservative? (rejects tibbles)

We are looking to use write_refs() in our ESHackathon/CiteSource package. For that, I was passing a tibble to write_refs and wondered why it fails - since it also has the data.frame class. This is because your check for classes depends on their precise position when there is more than one ... instead of

synthesisr/R/write_refs.R

Line 127 in c406bc9

if(!any(c("bibliography", "data.frame") == class(x))) {

it might be better to use

if (!inherits(x, c("bibliography", "data.frame"))) {

That would allow tibbles and generally be more robust ...
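A quick illustration of the difference, using a plain list with a tibble-like class vector so that the tibble package is not needed (the object x here is made up for the example):

```r
# Mimic the class vector of a tibble without depending on the tibble package
x <- structure(list(a = 1), class = c("tbl_df", "tbl", "data.frame"))

# Element-wise comparison recycles the shorter vector and depends on position,
# so "data.frame" in third place is never matched
old_check <- suppressWarnings(any(c("bibliography", "data.frame") == class(x)))

# inherits() returns TRUE if any of the named classes appears anywhere in class(x)
new_check <- inherits(x, c("bibliography", "data.frame"))
```

Here old_check is FALSE, so the object is rejected, while new_check is TRUE.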

detect_database() function: expected_columns not found

When running synthesisr (via litsearchr), importing files results in the error:

Error in all.equal(colnames(df), expected_columns) : object 'expected_columns' not found

I think this might be because of the recent changes to recognize preformatted files.

On line 144 of import_functions.R:
all.equal(colnames(df), expected_columns)

I think because expected_columns isn't defined anywhere there, it is throwing an error.

I am an R noob so I'm not sure what is going on, but when I am exploring the package Environment (in RStudio), I can see the expected_columns variable does have all the values listed in it. I'm just not sure why the function can't access that when it runs.

Thanks again for your amazing packages!
