ropensci / openalexR

Getting bibliographic records from OpenAlex

Home Page: https://docs.ropensci.org/openalexR/

License: Other

R 100.00%
bibliographic-data bibliographic-database bibliometrics bibliometrix science-mapping

openalexr's Introduction

openalexR


openalexR helps you interface with the OpenAlex API to retrieve bibliographic information about publications, authors, institutions, sources, funders, publishers, topics, and concepts with 5 main functions:

  • oa_fetch: composes the three functions below so the user can execute everything in one step, i.e., oa_query |> oa_request |> oa2df

  • oa_query: generates a valid query, written following the OpenAlex API syntax, from a set of arguments provided by the user.

  • oa_request: downloads a collection of entities matching the query created by oa_query or manually written by the user, and returns a JSON object in a list format.

  • oa2df: converts the JSON object into a classical bibliographic tibble/data frame.

  • oa_random: gets a random entity; e.g., oa_random("works") gives a different work each time you run it.
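As a sketch of how these pieces fit together (the exact intermediate signatures may differ by package version), the composed pipeline and the one-step call look like:

```r
library(openalexR)

# oa_query only builds the request URL; no network call yet
query <- oa_query(entity = "works", doi = "10.1016/j.joi.2017.08.007")
query

# The next two steps hit the API:
# res   <- oa_request(query)            # download matching records (JSON as list)
# works <- oa2df(res, entity = "works") # convert to a bibliographic tibble

# Equivalent one-step call:
# works <- oa_fetch(entity = "works", doi = "10.1016/j.joi.2017.08.007")
```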

📜 Citation

If you use openalexR in research, please cite:

Aria, M., Le T., Cuccurullo, C., Belfiore, A. & Choe, J. (2024), openalexR: An R-Tool for Collecting Bibliometric Data from OpenAlex, The R Journal, 15(4), 167-180, DOI: https://doi.org/10.32614/RJ-2023-089.

🙌 Support OpenAlex

If OpenAlex has helped you, consider writing a Testimonial which will help support the OpenAlex team and show that their work is making a real and necessary impact.

⚙️ Setup

You can install the development version of openalexR from GitHub with:

install.packages("remotes")
remotes::install_github("ropensci/openalexR")

You can install the released version of openalexR from CRAN with:

install.packages("openalexR")

Before we go any further, we highly recommend you set openalexR.mailto option so that your requests go to the polite pool for faster response times. If you have OpenAlex Premium, you can add your API key to the openalexR.apikey option as well. These lines best go into .Rprofile with file.edit("~/.Rprofile").

options(openalexR.mailto = "example@email.com")
options(openalexR.apikey = "EXAMPLE_APIKEY")

Alternatively, you can open .Renviron with file.edit("~/.Renviron") and add:

openalexR.mailto = example@email.com
openalexR.apikey = EXAMPLE_APIKEY

The examples below also use dplyr and ggplot2, so we load all three packages:

library(openalexR)
library(dplyr)
library(ggplot2)

🌿 Examples

There are different filters/arguments you can use in oa_fetch, depending on which entity you’re interested in: works, authors, sources, funders, institutions, or concepts. We show a few examples below.

📚 Works

Goal: Download all information about a given set of publications (known DOIs).

Use doi as a works filter:

works_from_dois <- oa_fetch(
  entity = "works",
  doi = c("10.1016/j.joi.2017.08.007", "https://doi.org/10.1007/s11192-013-1221-3"),
  verbose = TRUE
)
#> Requesting url: https://api.openalex.org/works?filter=doi%3A10.1016%2Fj.joi.2017.08.007%7Chttps%3A%2F%2Fdoi.org%2F10.1007%2Fs11192-013-1221-3
#> Getting 1 page of results with a total of 2 records...

We can view the output tibble/data frame, works_from_dois, interactively in RStudio or inspect it with base functions like str or head. We also provide the experimental show_works function to simplify the result (e.g., remove some columns, keep first/last author) for easy viewing.

Note: the following table is wrapped in knitr::kable() to be displayed nicely in this README, but you will most likely not need this function.

# str(works_from_dois, max.level = 2)
# head(works_from_dois)
# show_works(works_from_dois)

works_from_dois |>
  show_works() |>
  knitr::kable()
| id | display_name | first_author | last_author | so | url | is_oa | top_concepts |
|---|---|---|---|---|---|---|---|
| W2755950973 | bibliometrix : An R-tool for comprehensive science mapping analysis | Massimo Aria | Corrado Cuccurullo | Journal of informetrics | https://doi.org/10.1016/j.joi.2017.08.007 | FALSE | Workflow, Bibliometrics, Software |
| W2038196424 | Coverage and adoption of altmetrics sources in the bibliometric community | Stefanie Haustein | Jens Terliesner | Scientometrics | https://doi.org/10.1007/s11192-013-1221-3 | FALSE | Altmetrics, Bookmarking, Social media |

Goal: Download all works published by a set of authors (known ORCIDs).

Use author.orcid as a filter (either the canonical form with https://orcid.org/ or the bare ID will work):

works_from_orcids <- oa_fetch(
  entity = "works",
  author.orcid = c("0000-0001-6187-6610", "0000-0002-8517-9411"),
  verbose = TRUE
)
#> Requesting url: https://api.openalex.org/works?filter=author.orcid%3A0000-0001-6187-6610%7C0000-0002-8517-9411
#> Getting 2 pages of results with a total of 260 records...
#> Warning in oa_request(oa_query(filter = filter_i, multiple_id = multiple_id, : 
#> The following work(s) have truncated lists of authors: W4230863633.
#> Query each work separately by its identifier to get full list of authors.
#> For example:
#>   lapply(c("W4230863633"), \(x) oa_fetch(identifier = x))
#> Details at https://docs.openalex.org/api-entities/authors/limitations.
works_from_orcids |>
  show_works() |>
  knitr::kable()
| id | display_name | first_author | last_author | so | url | is_oa | top_concepts |
|---|---|---|---|---|---|---|---|
| W2755950973 | bibliometrix : An R-tool for comprehensive science mapping analysis | Massimo Aria | Corrado Cuccurullo | Journal of informetrics | https://doi.org/10.1016/j.joi.2017.08.007 | FALSE | Workflow, Bibliometrics, Software |
| W2741809807 | The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles | Heather Piwowar | Stefanie Haustein | PeerJ | https://doi.org/10.7717/peerj.4375 | TRUE | Citation, License, Bibliometrics |
| W2122130843 | Scientometrics 2.0: New metrics of scholarly impact on the social Web | Jason Priem | Bradely H. Hemminger | First Monday | https://doi.org/10.5210/fm.v15i7.2874 | FALSE | Bookmarking, Altmetrics, Social media |
| W3005144120 | Mapping the Evolution of Social Research and Data Science on 30 Years of Social Indicators Research | Massimo Aria | Maria Spano | Social indicators research | https://doi.org/10.1007/s11205-020-02281-3 | FALSE | Human geography, Data collection, Position (finance) |
| W2038196424 | Coverage and adoption of altmetrics sources in the bibliometric community | Stefanie Haustein | Jens Terliesner | Scientometrics | https://doi.org/10.1007/s11192-013-1221-3 | FALSE | Altmetrics, Bookmarking, Social media |
| W2408216567 | Foundations and trends in performance management. A twenty-five years bibliometric analysis in business and public administration domains | Corrado Cuccurullo | Fabrizia Sarto | Scientometrics | https://doi.org/10.1007/s11192-016-1948-8 | FALSE | Domain (mathematical analysis), Content analysis, Public domain |

Goal: Download all works that have been cited more than 50 times, published between 2020 and 2021, and include the strings “bibliometric analysis” or “science mapping” in the title. Maybe we also want the results to be sorted by total citations in a descending order.

works_search <- oa_fetch(
  entity = "works",
  title.search = c("bibliometric analysis", "science mapping"),
  cited_by_count = ">50",
  from_publication_date = "2020-01-01",
  to_publication_date = "2021-12-31",
  options = list(sort = "cited_by_count:desc"),
  verbose = TRUE
)
#> Requesting url: https://api.openalex.org/works?filter=title.search%3Abibliometric%20analysis%7Cscience%20mapping%2Ccited_by_count%3A%3E50%2Cfrom_publication_date%3A2020-01-01%2Cto_publication_date%3A2021-12-31&sort=cited_by_count%3Adesc
#> Getting 2 pages of results with a total of 258 records...
works_search |>
  show_works() |>
  knitr::kable()
| id | display_name | first_author | last_author | so | url | is_oa | top_concepts |
|---|---|---|---|---|---|---|---|
| W3160856016 | How to conduct a bibliometric analysis: An overview and guidelines | Naveen Donthu | Weng Marc Lim | Journal of business research | https://doi.org/10.1016/j.jbusres.2021.04.070 | TRUE | Bibliometrics, Field (mathematics), Resource (disambiguation) |
| W3038273726 | Investigating the emerging COVID-19 research trends in the field of business and management: A bibliometric analysis approach | Surabhi Verma | Anders Gustafsson | Journal of business research | https://doi.org/10.1016/j.jbusres.2020.06.057 | TRUE | Bibliometrics, Field (mathematics), Empirical research |
| W3001491100 | Software tools for conducting bibliometric analysis in science: An up-to-date review | José A. Moral-Muñoz | Manuel J. Cobo | El Profesional de la información | https://doi.org/10.3145/epi.2020.ene.03 | TRUE | Bibliometrics, Visualization, Set (abstract data type) |
| W2990450011 | Forty-five years of Journal of Business Research: A bibliometric analysis | Naveen Donthu | Debidutta Pattnaik | Journal of business research | https://doi.org/10.1016/j.jbusres.2019.10.039 | FALSE | Publishing, Bibliometrics, Empirical research |
| W3044902155 | Financial literacy: A systematic review and bibliometric analysis | Kirti Goyal | Satish Kumar | International journal of consumer studies | https://doi.org/10.1111/ijcs.12605 | FALSE | Financial literacy, Content analysis, Citation |
| W2990688366 | A bibliometric analysis of board diversity: Current status, development, and future research directions | H. Kent Baker | Arunima Haldar | Journal of business research | https://doi.org/10.1016/j.jbusres.2019.11.025 | FALSE | Diversity (politics), Ethnic group, Bibliometrics |

🧑 Authors

Goal: Download author information when we know their ORCID.

Here, instead of author.orcid like earlier, we have to use orcid as an argument. This may be a little confusing, but again, a different entity (authors instead of works) requires a different set of filters.

authors_from_orcids <- oa_fetch(
  entity = "authors",
  orcid = c("0000-0001-6187-6610", "0000-0002-8517-9411")
)

authors_from_orcids |>
  show_authors() |>
  knitr::kable()
| id | display_name | orcid | works_count | cited_by_count | affiliation_display_name | top_concepts |
|---|---|---|---|---|---|---|
| A5069892096 | Massimo Aria | 0000-0002-8517-9411 | 192 | 8282 | University of Naples Federico II | Physiology, Pathology and Forensic Medicine, Periodontics |
| A5023888391 | Jason Priem | 0000-0001-6187-6610 | 67 | 2541 | OurResearch | Statistics, Probability and Uncertainty, Information Systems, Communication |

Goal: Acquire information on the authors of this package.

We can use other filters such as display_name and has_orcid:

authors_from_names <- oa_fetch(
  entity = "authors",
  display_name = c("Massimo Aria", "Jason Priem"),
  has_orcid = TRUE
)
authors_from_names |>
  show_authors() |>
  knitr::kable()
| id | display_name | orcid | works_count | cited_by_count | affiliation_display_name | top_concepts |
|---|---|---|---|---|---|---|
| A5069892096 | Massimo Aria | 0000-0002-8517-9411 | 192 | 8282 | University of Naples Federico II | Physiology, Pathology and Forensic Medicine, Periodontics |
| A5023888391 | Jason Priem | 0000-0001-6187-6610 | 67 | 2541 | OurResearch | Statistics, Probability and Uncertainty, Information Systems, Communication |

Goal: Download the author records of all scholars who work at the University of Naples Federico II (OpenAlex ID: I71267560) and have published at least 500 works.

Let’s first check how many records match the query, then download the entire collection. We can do this by first defining a list of arguments, then adding count_only (default FALSE) to this list:

my_arguments <- list(
  entity = "authors",
  last_known_institutions.id = "I71267560",
  works_count = ">499"
)

do.call(oa_fetch, c(my_arguments, list(count_only = TRUE)))
#>      count db_response_time_ms page per_page
#> [1,]    36                 177    1        1
if (do.call(oa_fetch, c(my_arguments, list(count_only = TRUE)))[1] > 0) {
  do.call(oa_fetch, my_arguments) |>
    show_authors() |>
    knitr::kable()
}
| id | display_name | orcid | works_count | cited_by_count | affiliation_display_name | top_concepts |
|---|---|---|---|---|---|---|
| A5063152727 | L. Lista | 0000-0001-6471-5492 | 2374 | 73504 | INFN Sezione di Napoli | Nuclear and High Energy Physics, Nuclear and High Energy Physics, Nuclear and High Energy Physics |
| A5069689088 | C. Sciacca | 0000-0002-8412-4072 | 2372 | 60702 | INFN Sezione di Napoli | Nuclear and High Energy Physics, Nuclear and High Energy Physics, Nuclear and High Energy Physics |
| A5019451576 | Alberto Orso Maria Iorio | 0000-0002-3798-1135 | 1227 | 29599 | INFN Sezione di Napoli | Nuclear and High Energy Physics, Nuclear and High Energy Physics, Nuclear and High Energy Physics |
| A5078843367 | G. De Nardo | NA | 968 | 28236 | University of Naples Federico II | Nuclear and High Energy Physics, Nuclear and High Energy Physics, Nuclear and High Energy Physics |
| A5076706548 | Salvatore Capozziello | 0000-0003-4886-2024 | 930 | 34384 | University of Naples Federico II | Astronomy and Astrophysics, Nuclear and High Energy Physics, Astronomy and Astrophysics |
| A5023058736 | Francesco Fienga | 0000-0001-5978-4952 | 846 | 17271 | University of Naples Federico II | Nuclear and High Energy Physics, Nuclear and High Energy Physics, Nuclear and High Energy Physics |

🍒 Example analyses

Goal: Track the popularity of Biology concepts over time.

We first download the records of all level-1 concepts/keywords that concern over one million works:

library(gghighlight)
concept_df <- oa_fetch(
  entity = "concepts",
  level = 1,
  ancestors.id = "https://openalex.org/C86803240", # Biology
  works_count = ">1000000"
)

concept_df |>
  select(display_name, counts_by_year) |>
  tidyr::unnest(counts_by_year) |>
  filter(year < 2022) |>
  ggplot() +
  aes(x = year, y = works_count, color = display_name) +
  facet_wrap(~display_name) +
  geom_line(linewidth = 0.7) +
  scale_color_brewer(palette = "Dark2") +
  labs(
    x = NULL, y = "Works count",
    title = "Virology spiked in 2020."
  ) +
  guides(color = "none") +
  gghighlight(
    max(works_count) > 200000,
    min(works_count) < 400000,
    label_params = list(nudge_y = 10^5, segment.color = NA)
  )
#> label_key: display_name

Goal: Rank institutions in Italy by total number of citations.

We want to download all records of Italian institutions (country_code:it) that are classified as educational (type:education). Again, we check how many records match the query, then download the collection:

italy_insts <- oa_fetch(
  entity = "institutions",
  country_code = "it",
  type = "education",
  verbose = TRUE
)
#> Requesting url: https://api.openalex.org/institutions?filter=country_code%3Ait%2Ctype%3Aeducation
#> Getting 2 pages of results with a total of 232 records...
italy_insts |>
  slice_max(cited_by_count, n = 8) |>
  mutate(display_name = forcats::fct_reorder(display_name, cited_by_count)) |>
  ggplot() +
  aes(x = cited_by_count, y = display_name, fill = display_name) +
  geom_col() +
  scale_fill_viridis_d(option = "E") +
  guides(fill = "none") +
  labs(
    x = "Total citations", y = NULL,
    title = "Italian references"
  ) +
  coord_cartesian(expand = FALSE)

And what do they publish on?

# The package wordcloud needs to be installed to run this chunk
# library(wordcloud)

concept_cloud <- italy_insts |>
  select(inst_id = id, topics) |>
  tidyr::unnest(topics) |>
  filter(name == "field") |>
  select(display_name, count) |>
  group_by(display_name) |>
  summarise(score = sqrt(sum(count)))

pal <- c("black", scales::brewer_pal(palette = "Set1")(5))
set.seed(1)
wordcloud::wordcloud(
  concept_cloud$display_name,
  concept_cloud$score,
  scale = c(2, .4),
  colors = pal
)

Goal: Visualize big journals’ topics.

We first download all records of sources that have published more than 200,000 works, then visualize the scored concepts of the most-cited journals among them:

# The package ggtext needs to be installed to run this chunk
# library(ggtext)

jours_all <- oa_fetch(
  entity = "sources",
  works_count = ">200000",
  verbose = TRUE
)

clean_journal_name <- function(x) {
  x |>
    gsub("\\(.*?\\)", "", x = _) |>
    gsub("Journal of the|Journal of", "J.", x = _) |>
    gsub("/.*", "", x = _)
}

jours <- jours_all |>
  filter(type == "journal") |>
  slice_max(cited_by_count, n = 9) |>
  distinct(display_name, .keep_all = TRUE) |>
  select(jour = display_name, topics) |>
  tidyr::unnest(topics) |>
  filter(name == "field") |>
  group_by(id, jour, display_name) |> 
  summarise(score = (sum(count))^(1/3), .groups = "drop") |> 
  left_join(concept_abbrev, by = join_by(id, display_name)) |>
  mutate(
    abbreviation = gsub(" ", "<br>", abbreviation),
    jour = clean_journal_name(jour),
  ) |>
  tidyr::complete(jour, abbreviation, fill = list(score = 0)) |>
  group_by(jour) |>
  mutate(
    color = if_else(score > 10, "#1A1A1A", "#D9D9D9"), # CCCCCC
    label = paste0("<span style='color:", color, "'>", abbreviation, "</span>")
  ) |>
  ungroup()

jours |>
  ggplot() +
  aes(fill = jour, y = score, x = abbreviation, group = jour) +
  facet_wrap(~jour) +
  geom_hline(yintercept = c(25, 50), colour = "grey90", linewidth = 0.2) +
  geom_segment(
    aes(x = abbreviation, xend = abbreviation, y = 0, yend = 55),
    color = "grey95"
  ) +
  geom_col(color = "grey20") +
  coord_polar(clip = "off") +
  theme_bw() +
  theme(
    plot.background = element_rect(fill = "transparent", colour = NA),
    panel.background = element_rect(fill = "transparent", colour = NA),
    panel.grid = element_blank(),
    panel.border = element_blank(),
    axis.text = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  ggtext::geom_richtext(
    aes(y = 75, label = label),
    fill = NA, label.color = NA, size = 3
  ) +
  scale_fill_brewer(palette = "Set1", guide = "none") +
  labs(y = NULL, x = NULL, title = "Journal clocks")

❄️ Snowball search

The user can also perform snowballing with oa_snowball. Snowballing is a literature search technique where the researcher starts with a set of articles and finds articles that cite or are cited by the original set. oa_snowball returns a list of 2 elements: nodes and edges. Similar to oa_fetch, oa_snowball finds and returns information on a core set of articles satisfying certain criteria, but unlike oa_fetch, it also returns information on the articles that cite and are cited by this core set.

# The packages ggraph and tidygraph need to be installed to run this chunk
library(ggraph)
library(tidygraph)
#> 
#> Attaching package: 'tidygraph'
#> The following object is masked from 'package:stats':
#> 
#>     filter
snowball_docs <- oa_snowball(
  identifier = c("W1964141474", "W1963991285"),
  verbose = TRUE
)
#> Requesting url: https://api.openalex.org/works?filter=openalex%3AW1964141474%7CW1963991285
#> Getting 1 page of results with a total of 2 records...
#> Collecting all documents citing the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cites%3AW1963991285%7CW1964141474
#> Getting 3 pages of results with a total of 540 records...
#> Collecting all documents cited by the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cited_by%3AW1963991285%7CW1964141474
#> Getting 1 page of results with a total of 91 records...
ggraph(graph = as_tbl_graph(snowball_docs), layout = "stress") +
  geom_edge_link(aes(alpha = after_stat(index)), show.legend = FALSE) +
  geom_node_point(aes(fill = oa_input, size = cited_by_count), shape = 21, color = "white") +
  geom_node_label(aes(filter = oa_input, label = id), nudge_y = 0.2, size = 3) +
  scale_edge_width(range = c(0.1, 1.5), guide = "none") +
  scale_size(range = c(3, 10), guide = "none") +
  scale_fill_manual(values = c("#a3ad62", "#d46780"), na.value = "grey", name = "") +
  theme_graph() +
  theme(
    plot.background = element_rect(fill = "transparent", colour = NA),
    panel.background = element_rect(fill = "transparent", colour = NA),
    legend.position = "bottom"
  ) +
  guides(fill = "none")
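Beyond plotting, the nodes and edges described above can be inspected directly. A sketch (column contents may vary across package versions; re-running oa_snowball requires a network connection):

```r
library(openalexR)

# Re-create the snowball result from above (network call)
snowball_docs <- oa_snowball(identifier = c("W1964141474", "W1963991285"))

# Works in the snowball: the core set plus works citing or cited by it
head(snowball_docs$nodes$id)

# Citation links between them: one row per citing/cited pair
head(snowball_docs$edges)
```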

🌾 N-grams

OpenAlex offers (limited) support for fulltext N-grams of Work entities (these have IDs starting with "W"). Given a vector of work IDs, oa_ngrams returns a dataframe of N-gram data (in the ngrams list-column) for each work.

ngrams_data <- oa_ngrams(
  works_identifier = c("W1964141474", "W1963991285"),
  verbose = TRUE
)

ngrams_data
#> # A tibble: 2 × 4
#>   id                               doi                              count ngrams
#>   <chr>                            <chr>                            <int> <list>
#> 1 https://openalex.org/W1964141474 https://doi.org/10.1016/j.conb.…  2733 <df>  
#> 2 https://openalex.org/W1963991285 https://doi.org/10.1126/science…  2338 <df>
lapply(ngrams_data$ngrams, head, 3)
#> [[1]]
#>                                        ngram ngram_count ngram_tokens
#> 1                 brain basis and core cause           2            5
#> 2                     cause be not yet fully           2            5
#> 3 include structural and functional magnetic           2            5
#>   term_frequency
#> 1   0.0006637902
#> 2   0.0006637902
#> 3   0.0006637902
#> 
#> [[2]]
#>                                          ngram ngram_count ngram_tokens
#> 1          intact but less accessible phonetic           1            5
#> 2 accessible phonetic representation in Adults           1            5
#> 3       representation in Adults with Dyslexia           1            5
#>   term_frequency
#> 1   0.0003756574
#> 2   0.0003756574
#> 3   0.0003756574
ngrams_data |>
  tidyr::unnest(ngrams) |>
  filter(ngram_tokens == 2) |>
  select(id, ngram, ngram_count) |>
  group_by(id) |>
  slice_max(ngram_count, n = 10, with_ties = FALSE) |>
  ggplot(aes(ngram_count, forcats::fct_reorder(ngram, ngram_count))) +
  geom_col(aes(fill = id), show.legend = FALSE) +
  facet_wrap(~id, scales = "free_y") +
  labs(
    title = "Top 10 fulltext bigrams",
    x = "Count",
    y = NULL
  )

oa_ngrams can sometimes be slow because the N-grams data can get pretty big, but given that the N-grams are ["cached via CDN"](https://docs.openalex.org/api-entities/works/get-n-grams#api-endpoint), you may also consider parallelizing this special case (oa_ngrams does this automatically if you have {curl} >= v5.0.0).

💫 About OpenAlex

[OpenAlex entity-relationship schema diagram]

Schema credits: @dhimmel

OpenAlex is a fully open catalog of the global research system. It’s named after the ancient Library of Alexandria. The OpenAlex dataset describes scholarly entities and how those entities are connected to each other. There are five types of entities:

  • Works are papers, books, datasets, etc; they cite other works

  • Authors are people who create works

  • Sources are journals and repositories that host works

  • Institutions are universities and other orgs that are affiliated with works (via authors)

  • Concepts tag Works with a topic

🤝 Code of Conduct

Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

👓 Acknowledgements

The package hex sticker was made with Midjourney and thus inherits a CC BY-NC 4.0 license.

openalexr's People

Contributors

adam3smith, maelle, massimoaria, trangdata, yhan818, yjunechoe


openalexr's Issues

pkgcheck results - main

Checks for openalexR (v1.0.2.9000)

git hash: a938e831

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 92.4%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

Add examples for working with complex tibbles

It would be really helpful I think to add an example or two to the readme showing how data in some of the sub-fields can be extracted from the large, complicated tibble returned by oa_fetch.

For example I'm struggling to figure out how to extract a list of institution_id's associated with a set of works. Or the latitude and longitude from a list of institutions.
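One pattern that may help here is tidyr::unnest on the relevant list-column. A sketch on a synthetic tibble (the column names below are stand-ins; the actual nested column names in oa_fetch output vary by package version):

```r
library(dplyr)
library(tidyr)

# Synthetic stand-in for a works tibble: one row per work, with a nested
# data frame of author/institution information per work
works <- tibble(
  id = c("W1", "W2"),
  authorships = list(
    tibble(author = c("A1", "A1"), institution_id = c("I71267560", "I27837315")),
    tibble(author = "A2", institution_id = "I71267560")
  )
)

# Unnest to one row per work/author/institution, then collect unique ids
inst_ids <- works |>
  unnest(authorships) |>
  distinct(institution_id) |>
  pull(institution_id)

inst_ids
```

The same unnest-then-select pattern applies to other nested sub-fields, such as institution geolocation columns.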

oaApiRequest will throw an error if the query url doesn't have initial parameters

Hi Massimo! Thanks for all your great work on this.

I noticed two things in the oaApiRequest source code that will lead to a 404 error. First, Line 155 assumes that a query parameter has already been passed by leading with an &. The user could work around this by adding a mailto = argument, but on Line 147 the string is never appended because reassignment doesn't occur.

The first issue is always difficult because you never know what the user will pass. Maybe you could take care of this using grepl and a regex pattern? Then moving the "cursor=*" before the mailto given that it is a constant.

oaApiRequest <- function(query_url,
                         total.count = FALSE,
                         mailto = NULL,
                         verbose = FALSE) {
    
    ua <- httr::user_agent(cfg()$user_agent)
    
    ## >> ---- suggestion
    query_anchor <- if (grepl("\\?", query_url)) "&" else "?"
    
    query_url <- paste0(query_url, query_anchor, "cursor=*")
   ## End suggestion ---- << 

    if (!is.null(mailto)) {
        if (isValidEmail(mailto)) {
            ## >> Then this would change
            query_url <- paste0(query_url, "&mailto=", mailto)
        } else {
            message(mailto, " is not a valid email address")
        }
    }
    
    if (verbose == TRUE) {
        message("Requesting url: ", query_url)
    }
    
    res <- oa_request(query_url, ua)
    
# ...
}

Another suggestion might be to not hardcode the user agent and let RCurl take care of it. If this package gets really popular, OpenAlex could flag that UA because so many people are using it.

cfg <- function(.ua =  base::getOption("HTTPUserAgent")) {
    ##>> maybe something like this
    if (is.null(.ua) || length(.ua) == 0L) {
        .ua <-
            paste0(
                "curl/",
                curl::curl_version()[[1]],
                " RCurl/",
                packageVersion("RCurl"),
                " httr/",
                packageVersion("httr")
            )
    }
    
    res <- list(user_agent = .ua)
    
    if (Sys.getenv("OPENALEX_USERAGENT") != "") {
        res$user_agent <- Sys.getenv("OPENALEX_USERAGENT")
    }
    return(res)
}

oa_fetch order of arguments

I came across this when revising the package for ropensci. Curious to hear what you think @massimoaria @yjunechoe. 🌈

Small change, but I was thinking of switching the position of entity and identifier in oa_fetch. https://github.com/massimoaria/openalexR/blob/33bb462c773dc2e091336f9a4dd284fedf3bcefc/R/oa_fetch.R#L58-L59
I think it would be natural to write oa_fetch("works", ...) without having to specify entity =. Plus, the identifier argument is rarely used. I don't think this will affect the behavior of the function much, but it will let us write queries a little faster from now on.

Let me know what you think.

pkgcheck results - main

Checks for openalexR (v1.0.2)

git hash: c0d32fea

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 92.4%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

Error with oa2df

I noticed that the oa_fetch function sometimes errors out during the "converting" step.

Here is my example query and the error message:

oa_fetch(
  entity = "works",
  title.search = "country of origin",
  publication_year = 2022,
  verbose = T
) 
Requesting url: https://api.openalex.org/works?filter=title.search%3Acountry%20of%20origin%2Cpublication_year%3A2022
Getting 1 page of results with a total of 149 records...
Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = FALSE,  : 
  numbers of columns of arguments do not match

I've done some debugging and noticed that the conversion using oa2df is getting stuck on 1 of the records.

Duplicate package name

I've maybe missed something, but I was looking for an easy way to pass a list of author identifiers and get works. I can see how to get the counts of works, lots of author details, and works given a work identifier, but not this.

In looking for the lazy solution (to save me writing queries), there is another package https://github.com/ekmaloney/openalexR which does indeed have this function. I wonder if the packages could be usefully merged and/or one renamed? Having just tried to run both (or rename post-install), it's not as easy as I'd hoped.

Thanks for the great package, it's incredibly useful and so much better to have open data than be reliant on commercial tools.

Update oa_random for multiple items

OpenAlex can now return more than one random work/author/etc. per API call.

It would be helpful to update oa_random to allow filters, a seed argument(?), and multiple results for use cases such as this.

However, an important note from OpenAlex:

Depending on your query, random results with a seed value may change over time due to new records coming into OpenAlex.

I think the seed argument makes sense for queries that happen close together (within a day), but we would have to warn the user about its use (at least the first time) and how their query may not be reproducible at a later date.

pkgcheck results - main

Checks for openalexR (v1.0.2.9000)

git hash: ed9b3c4e

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 89.9%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

Parse `concepts$score` as numeric

Currently concepts$score gets parsed as character by default, but in my experience the API consistently returns valid numeric values. For convenience, it would be nice to always convert this column when concepts exists for a paper.

paper <- oa_fetch("W2755950973", "works")
paper$concepts
#> [[1]]
#>                                 id                               wikidata     display_name level     score
#> 1   https://openalex.org/C41008148   https://www.wikidata.org/wiki/Q21198 Computer science     0 0.6361582
#> 2 https://openalex.org/C2522767166 https://www.wikidata.org/wiki/Q2374463     Data science     1 0.3629617
sapply(paper$concepts[[1]], typeof)
#>           id     wikidata display_name        level        score 
#>  "character"  "character"  "character"    "integer"  "character"
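Until the column is parsed upstream, a user-side workaround is a simple coercion. A base-R sketch, using a synthetic stand-in shaped like the output above:

```r
# Synthetic stand-in for paper$concepts from the example above
concepts <- list(
  data.frame(
    id = "https://openalex.org/C41008148",
    wikidata = "https://www.wikidata.org/wiki/Q21198",
    display_name = "Computer science",
    level = 0L,
    score = "0.6361582"
  )
)

# Coerce the score column to numeric in each per-paper data frame
concepts <- lapply(concepts, function(d) {
  d$score <- as.numeric(d$score)
  d
})

typeof(concepts[[1]]$score)  # score is now "double"
```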

pkgcheck results - main

Checks for openalexR (v1.0.2.9000)

git hash: 33bb462c

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 86.2%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

pkgcheck results - main

Checks for openalexR (v1.0.2.9000)

git hash: 04bde6d8

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 86.2%.
  • ✖️ R CMD check found 1 error.
  • ✔️ R CMD check found no warnings.

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE

Authors' display name is an exact match. Is there any way to do a fuzzy search?

In your code example for searching authors by name, you have entity = "authors", display_name = c("Massimo Aria", "Jason Priem"). The display_name filter is an exact match.

However, some authors have middle names. If the middle name is in OpenAlex's dataset, the search will return nothing. I have identified a few authors who have a middle name listed.

Does your package allow searching for something like "Massimo * Aria" (where * matches zero or more characters for the middle name)?
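One possible workaround, assuming the OpenAlex `display_name.search` filter is passed through by `oa_fetch` (a search filter matches on tokens rather than the exact string, so it tolerates missing middle names), would be something like:

```r
library(openalexR)

# Hedged sketch: search-based name matching instead of the exact-match
# display_name filter; the filter name is taken from the OpenAlex API docs
authors <- oa_fetch(
  entity = "authors",
  display_name.search = "Massimo Aria"
)
```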

pkgcheck results - main

Checks for openalexR (v1.0.2.9000)

git hash: f9e85639

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 91.5%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

pkgcheck results - main

Checks for openalexR (v1.0.2.9000)

git hash: 8c515004

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 91.7%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

Error from `oa_fetch` during conversion: `l_inst[inst_idx][[1]]` : subscript out of bounds

Unless I'm mistaken, this should be a valid query for works with authors in both Canada and Korea, related to artificial intelligence.

library('openalexR')
library('dplyr')
oa_fetch(
  entity='works',
  authorships.institutions.country_code='CA',
  authorships.institutions.country_code='KR',
  count_only = FALSE,
  verbose = TRUE,
  concepts.id='C154945302'
)

If I set count_only = TRUE, I see there are 1,842 works, which sounds like it's in the right ballpark at least. When I run the code above to get the actual works, however, I get the following error message:

Requesting url: https://api.openalex.org/works?filter=authorships.institutions.country_code%3ACA%2Cauthorships.institutions.country_code%3AKR%2Cconcepts.id%3AC154945302
About to get a total of 10 pages of results with a total of 1842 records.
  OpenAlex downloading [=====================] 100% eta:  0s
  converting [========================>------]  80% eta:  3s
Error in l_inst[inst_idx][[1]] : subscript out of bounds

I just updated the package to the latest developer version per the instructions in the readme.

pkgcheck results - for-ropensci

Checks for openalexR (v1.0.2)

git hash: d81b0d0b

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 92.4%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

Error in oa2df

Thanks for such a useful package! I used it a few weeks ago and it worked, but now, when using oa2df to convert the retrieved info into a data frame, I get an error message:
Error in *tmp*[[jj]] : subscript out of bounds
Thanks!

Empty `concepts` should be coded as `NULL` not `NA`

Since concepts is a list-column, missingness should be coded as NULL instead of NA.

The offending line is in works2df(): if the fetched $concepts element of a work is empty (list()), subs_na() replaces it with NA:

https://github.com/massimoaria/openalexR/blob/7e5f70ae4f53509126305ec1359b828ff845d15e/R/oa2df.R#L163-L167

NA values in list columns cause problems for unnesting and rowwise/map workflows:

x <- oa_fetch(c("W2755950973", "W4296015844"), "works")
x$concepts
#> [[1]]
#>                                 id                               wikidata     display_name level     score
#> 1   https://openalex.org/C41008148   https://www.wikidata.org/wiki/Q21198 Computer science     0 0.6361582
#> 2 https://openalex.org/C2522767166 https://www.wikidata.org/wiki/Q2374463     Data science     1 0.3629617
#> 
#> [[2]]
#> [1] NA

# Unnesting counts `NA` as a value
x |> 
  select(concepts) |> 
  unnest(concepts)
#> # A tibble: 3 × 5
#>   id                               wikidata                               display_name     level score    
#>   <chr>                            <chr>                                  <chr>            <int> <chr>    
#> 1 https://openalex.org/C41008148   https://www.wikidata.org/wiki/Q21198   Computer science     0 0.6361582
#> 2 https://openalex.org/C2522767166 https://www.wikidata.org/wiki/Q2374463 Data science         1 0.3629617
#> 3 <NA>                             <NA>                                   <NA>                NA <NA>

# `is.na()` check on a data.frame is not scalar 
x |> 
  rowwise() |> 
  filter(!is.na(concepts) && "Data science" %in% concepts$display_name) |> 
  select(display_name, concepts)
#> Warning in !is.na(concepts) && "Data science" %in% concepts$display_name: 'length(x) = 10 > 1' in coercion to 'logical(1)'
#> # A tibble: 1 × 2
#> # Rowwise: 
#>   display_name                                                        concepts    
#>   <chr>                                                               <list>      
#> 1 bibliometrix : An R-tool for comprehensive science mapping analysis <df [2 × 5]>

is.na(x$concepts[[1]])
#>         id wikidata display_name level score
#> [1,] FALSE    FALSE        FALSE FALSE FALSE
#> [2,] FALSE    FALSE        FALSE FALSE FALSE

Both issues are solved if missing concepts are coded as NULL

x$concepts[2] <- list(NULL)
x$concepts
#> [[1]]
#>                                 id                               wikidata     display_name level     score
#> 1   https://openalex.org/C41008148   https://www.wikidata.org/wiki/Q21198 Computer science     0 0.6361582
#> 2 https://openalex.org/C2522767166 https://www.wikidata.org/wiki/Q2374463     Data science     1 0.3629617
#> 
#> [[2]]
#> [1] NULL

x |> 
  select(concepts) |> 
  unnest(concepts)
#> # A tibble: 2 × 5
#>   id                               wikidata                               display_name     level score    
#>   <chr>                            <chr>                                  <chr>            <int> <chr>    
#> 1 https://openalex.org/C41008148   https://www.wikidata.org/wiki/Q21198   Computer science     0 0.6361582
#> 2 https://openalex.org/C2522767166 https://www.wikidata.org/wiki/Q2374463 Data science         1 0.3629617

x |> 
  rowwise() |> 
  filter(!is.null(concepts) && ("Data science" %in% concepts$display_name)) |> 
  select(display_name, concepts)
#> # A tibble: 1 × 2
#> # Rowwise: 
#>   display_name                                                        concepts    
#>   <chr>                                                               <list>      
#> 1 bibliometrix : An R-tool for comprehensive science mapping analysis <df [2 × 5]>

Automatic batching in oa_fetch filters

Currently, OpenAlex allows up to 50 entities to be combined in a query, so the following gives an API error when my_dois has more than 50 elements.

oa_fetch(
  doi = my_dois,
  entity = "works"
)

It would be nice if oa_fetch automatically batched these entities when more than 50 are requested in a filter.
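Until automatic batching lands, a minimal sketch of manual batching (assuming `my_dois` is a character vector of DOIs) could look like:

```r
library(openalexR)

# Split the DOIs into chunks of at most 50 and bind the results together
chunks <- split(my_dois, ceiling(seq_along(my_dois) / 50))
works <- do.call(rbind, lapply(chunks, function(d) {
  oa_fetch(doi = d, entity = "works")
}))
```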

Related to the oa_snowball and ggraph tidygraph code

Hi

When running this:

ggraph(graph = as_tbl_graph(snowball_docs), layout = "stress") +
geom_edge_link(aes(alpha = after_stat(index)), show.legend = FALSE) +
geom_node_point(aes(fill = oa_input, linewidth = cited_by_count), shape = 21) +
geom_node_label(aes(filter = oa_input, label = id), nudge_y = 0.2, size = 3 ) +
scale_edge_width(range = c(0.1, 1.5), guide = "none") +
scale_size(range = c(3, 10), guide = "none") +
scale_fill_manual(values = c("#1A5878", "#C44237"), na.value = "grey", name = "") +
theme_graph() +
theme(legend.position = "bottom") +
guides(fill = "none")

I get this error:

Warning message:
Using the size aesthetic in this geom was deprecated in ggplot2 3.4.0.
Please use linewidth in the default_aes field and elsewhere instead.

Is this a problem with the ggraph package interacting with the new version of ggplot2?

Cheers

Chris Buddenhagen

biblio information seems to be missing

GREAT package, thank you! I'm trying to fill blanks in RIS files and I cannot find the volume, issue, first_page, last_page data in oa2df(). I know this data is in 'res' (the output of oaApiRequest()), but wondered if the bibliographic information was intentionally dropped from oa2df(), and if not, whether it could be added to the df output? It would save me a case-by-case workaround. Thanks!
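Until those fields are exposed in the data frame output, a rough sketch for recovering them from the raw request (assuming `res` is the list of works returned by the request function, and that each work carries a `$biblio` element, as in the raw JSON) might be:

```r
# Hypothetical extraction of bibliographic fields from the raw JSON list
`%||%` <- function(x, y) if (is.null(x)) y else x

biblio <- do.call(rbind, lapply(res, function(w) {
  b <- w$biblio
  data.frame(
    id         = w$id,
    volume     = b$volume %||% NA_character_,
    issue      = b$issue %||% NA_character_,
    first_page = b$first_page %||% NA_character_,
    last_page  = b$last_page %||% NA_character_
  )
}))
```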

Include is_retracted and is_paratext in oa2df?

Thanks for this package -- very useful. I'd like to filter out irrelevant results from a sample I'm constructing using oa_random (so I can't use the API filters), and it would be very useful to have access to is_retracted and is_paratext. Happy to create a PR (this seems pretty straightforward), but I wanted to make sure these weren't omitted for a reason.

Error in converting JSON object into a data frame

@trangdata

I just discovered this bug trying to download documents cited by a group of papers.

library(openalexR)

# example data
df <-  oa_fetch(
  entity="works",
  authorships.institutions.ror="040evg982"
)


cited <- oa_fetch(
  entity = "works",
  cited_by = df$id[1:50],
  verbose = FALSE
)
#> Error in names(x) <- paste(prefix, names(x), sep = "_"): 'names' attribute [1] must be the same length as the vector [0]
7. prepend(l$author, "au") at oa2df.R#182
6. FUN(X[[i]], ...)
5. lapply(paper$authorships, function(l) {
                l_inst <- l$institutions
                inst_idx <- lengths(l_inst) > 0
                 if (length(inst_idx) > 0 && any(inst_idx)) { ... at oa2df.R#172
4. do.call(rbind.data.frame, lapply(paper$authorships, function(l) {
                 l_inst <- l$institutions
                 inst_idx <- lengths(l_inst) > 0
                 if (length(inst_idx) > 0 && any(inst_idx)) { ... at oa2df.R#172
3. works2df(data, abstract, verbose) at oa2df.R#52
2. oa2df(res, entity = entity, abstract = abstract, count_only = count_only,
group_by = group_by, verbose = verbose) at oa_fetch.R#116
1. oa_fetch(entity = "works", cited_by = df$id[1:50], verbose = FALSE)

Created on 2022-10-19 with reprex v2.0.2

associated_institutions returned as a single nested row rather than a nested data.frame

Hi there,

Firstly, thank you for this package. It's awesome.

One issue I have found is that the nested associated_institutions table returned from institutions2df appears to be one long row rather than a table when there are multiple entries.

library(openalexR)

inst <- oa_fetch(identifier = "I1292875679")
inst$associated_institutions

This institution has 37 associated institutions, but each field is returned in its own numbered column (id, id.1, id.2, ...) rather than in a single id column.

> inst$associated_institutions
[[1]]
id                       ror         display_name country_code     type relationship                             id.1
1 https://openalex.org/I4210138055 https://ror.org/03n17ds51 Agriculture and Food           AU facility        child https://openalex.org/I4210166208
ror.1                   display_name.1 country_code.1   type.1 relationship.1                             id.2                     ror.2
1 https://ror.org/05rke7t32 Animal, Food and Health Sciences             AU facility          child https://openalex.org/I4210146430 https://ror.org/04ynn1b95
display_name.2 country_code.2   type.2 relationship.2                             id.3                     ror.3                        display_name.3
1 Astronomy and Space             AU facility          child https://openalex.org/I1299164729 https://ror.org/05qajvd42 Australia Telescope National Facility
country_code.3   type.3 relationship.3                             id.4                     ror.4                             display_name.4 country_code.4
1             AU facility          child https://openalex.org/I1338668087 https://ror.org/02aseym49 Australian Centre for Disease Preparedness             AU
type.4 relationship.4                             id.5                     ror.5                               display_name.5 country_code.5  type.5
1 facility          child https://openalex.org/I4210106161 https://ror.org/01qv3ez98 Australian National Algae Culture Collection             AU archive
relationship.5                             id.6                     ror.6                      display_name.6 country_code.6  type.6 relationship.6
1          child https://openalex.org/I4210145531 https://ror.org/05hdbs804 Australian National Fish Collection             AU archive          child
id.7                     ror.7                display_name.7 country_code.7  type.7 relationship.7
1 https://openalex.org/I4210130210 https://ror.org/02gkh1e90 Australian National Herbarium             AU archive          child
id.8                     ror.8                        display_name.8 country_code.8  type.8 relationship.8
1 https://openalex.org/I4210089744 https://ror.org/00c8nx045 Australian National Insect Collection             AU archive          child
id.9                     ror.9                          display_name.9 country_code.9  type.9 relationship.9
1 https://openalex.org/I4210158368 https://ror.org/059mabc80 Australian National Wildlife Collection             AU archive          child
id.10                    ror.10                      display_name.10 country_code.10  type.10 relationship.10
1 https://openalex.org/I4210138528 https://ror.org/03rzhkf33 Australian Resources Research Centre              AU facility           child
id.11                    ror.11             display_name.11 country_code.11 type.11 relationship.11
1 https://openalex.org/I4210165867 https://ror.org/05p3fde54 Australian Tree Seed Centre              AU archive           child
id.12                    ror.12                     display_name.12 country_code.12  type.12 relationship.12
1 https://openalex.org/I4210141844 https://ror.org/04ywhbc61 Australian e-Health Research Centre              AU facility           child
id.13                    ror.13              display_name.13 country_code.13  type.13 relationship.13
1 https://openalex.org/I4210142128 https://ror.org/03jh4jw93 CSIRO Health and Biosecurity              AU facility           child
id.14                    ror.14      display_name.14 country_code.14  type.14 relationship.14                            id.15
1 https://openalex.org/I4210161554 https://ror.org/057xz1h85 CSIRO Land and Water              AU facility           child https://openalex.org/I4210154771
ror.15     display_name.15 country_code.15  type.15 relationship.15                            id.16                    ror.16
1 https://ror.org/04sx9wp33 CSIRO Manufacturing              AU facility           child https://openalex.org/I1281210470 https://ror.org/026nh4520
display_name.16 country_code.16  type.16 relationship.16                            id.17                    ror.17  display_name.17
1 CSIRO Oceans and Atmosphere              AU facility           child https://openalex.org/I4210155681 https://ror.org/051hpv692 CSIRO Publishing
country_code.17 type.17 relationship.17                            id.18                    ror.18            display_name.18 country_code.18  type.18
1              AU   other           child https://openalex.org/I4210118967 https://ror.org/02cgy3m12 CSIRO Scientific Computing              AU facility
relationship.18                            id.19                    ror.19      display_name.19 country_code.19    type.19 relationship.19
1           child https://openalex.org/I2800098157 https://ror.org/029dswp54 Central Land Council              AU government           child
id.20                    ror.20                display_name.20 country_code.20 type.20 relationship.20
1 https://openalex.org/I4210118312 https://ror.org/02xhx4j26 Centre for Marine Socioecology              AU   other           child
id.21                    ror.21                                display_name.21 country_code.21  type.21 relationship.21
1 https://openalex.org/I4210115089 https://ror.org/029pamw34 Centre for Southern Hemisphere Oceans Research              AU facility           child
id.22                    ror.22                                           display_name.22 country_code.22  type.22
1 https://openalex.org/I4210109753 https://ror.org/01ew37b76 Collaboration for Australian Weather and Climate Research              AU facility
relationship.22                          id.23                    ror.23 display_name.23 country_code.23 type.23 relationship.23
1           child https://openalex.org/I42894916 https://ror.org/03q397159          Data61              AU   other           child
id.24                    ror.24    display_name.24 country_code.24  type.24 relationship.24                            id.25
1 https://openalex.org/I4210119913 https://ror.org/02bbj5z24 Division of Energy              AU facility           child https://openalex.org/I4210165412
ror.25                 display_name.25 country_code.25  type.25 relationship.25                            id.26
1 https://ror.org/05xx7en86 Division of Fossil Fuels Energy              AU facility           child https://openalex.org/I4210166020
ror.26    display_name.26 country_code.26  type.26 relationship.26                            id.27                    ror.27
1 https://ror.org/05mbqa235 Ecosystem Sciences              AU facility           child https://openalex.org/I4210101388 https://ror.org/0152bt112
display_name.27 country_code.27  type.27 relationship.27                            id.28                    ror.28
1 Health Sciences and Nutrition              AU facility           child https://openalex.org/I4210128581 https://ror.org/034x2fx50
display_name.28 country_code.28  type.28 relationship.28                            id.29                    ror.29
1 Information and Communication Technologies Centre              AU facility           child https://openalex.org/I4210106369 https://ror.org/01mae9353
display_name.29 country_code.29  type.29 relationship.29                            id.30                    ror.30                 display_name.30
1 Marine National Facility              AU facility           child https://openalex.org/I4210129794 https://ror.org/03rs0fg31 Materials Science & Engineering
country_code.30  type.30 relationship.30                            id.31                    ror.31   display_name.31 country_code.31  type.31
1              AU facility           child https://openalex.org/I4210130959 https://ror.org/039b65w79 Mineral Resources              AU facility
relationship.31                            id.32                    ror.32                  display_name.32 country_code.32 type.32 relationship.32
1           child https://openalex.org/I4210136678 https://ror.org/041v1ea26 NCMI Information and Data Centre              AU archive           child
id.33                    ror.33                 display_name.33 country_code.33  type.33 relationship.33
1 https://openalex.org/I4210126632 https://ror.org/031ebne21 National Measurement Laboratory              AU facility           child
id.34                    ror.34 display_name.34 country_code.34  type.34 relationship.34                            id.35
1 https://openalex.org/I4210152679 https://ror.org/05jg9pj51  Plant Industry              AU facility           child https://openalex.org/I4210158855
ror.35                                       display_name.35 country_code.35    type.35 relationship.35                            id.36
1 https://ror.org/056naxb15 Department of Industry, Science, Energy and Resources              AU government          parent https://openalex.org/I4210102203
ror.36           display_name.36 country_code.36 type.36 relationship.36
1 https://ror.org/018n2ja79 Atlas of Living Australia              AU archive         related

Looking at the code, I think associated_institutions should be processed using type = "rbind_df" rather than type = "row_df"

I'm not sure whether an institution can have multiple geo entries, but if it can, that nested table may suffer from the same problem.
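Until the parsing is switched over, a hedged workaround for reshaping the single wide row back into a long table (assuming six fields per institution in a fixed order, as in the output above) could be:

```r
# Hypothetical reshape: the one-row wide data frame has columns in repeating
# groups of six (id, ror, display_name, country_code, type, relationship)
wide <- inst$associated_institutions[[1]]
fields <- c("id", "ror", "display_name", "country_code", "type", "relationship")
groups <- split(seq_along(wide), ceiling(seq_along(wide) / length(fields)))
long <- do.call(rbind, lapply(groups, function(idx) {
  d <- wide[idx]
  names(d) <- fields
  d
}))
```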

Cheers

pkgcheck results - main

Checks for openalexR (v1.0.2.9000)

git hash: 520ef390

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 86.2%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

pkgcheck results - main

Checks for openalexR (v1.0.2.9000)

git hash: 7e5f70ae

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 91.7%.
  • ✖️ R CMD check found 1 error.
  • ✔️ R CMD check found no warnings.

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE

Package webpage

Hi @massimoaria thank you for a great package! 🚀

I'm learning more about OpenAlex and would love to contribute to your package while I'm learning. Would you be open to that? I can help make a webpage for your package and standardize some aspects such as tests, LICENSE, or creating a README.Rmd (which would allow for automatic syntax highlighting).

Let me know what you think! Again, awesome work!!! 💯

Breaking change

A recent commit has introduced a breaking change (possibly deliberate). Apologies, I didn't very systematically record things as I went. I've just re-checked: this previously worked, where members$ORCID is a column of ORCID IDs (without the URL prepended). I tried rolling back, and the function still works at commit openalexR@0114bb3b3dd2c26c98b4aa8ee2da6162189a0ccf

oa_fetch( entity = "authors", orcid = members$ORCID )

Not sure if it's related, but I'm having issues using oa_fetch with multiple OA identifiers (fixed for now by using lapply over the list of them and bind_rows), after which oa2df gives Error in if (!is.na(paper$biblio[1])) { : argument is of length zero

The oa2bibliometrix function either ran but stripped authors and gave odd rownames (sorry, I didn't explore which field they came from), or gave Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 865, 0 In addition: Warning messages: 1: Unknown or uninitialised column: CR. 2: Unknown or uninitialised column: concept.

pkgcheck results - main

Checks for openalexR (v1.0.2)

git hash: e967011c

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✖️ Package coverage failed
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE

Snowball with multiple inputs sometimes return NA in `edges`

I haven't hunted down the exact source of this bug yet but here's the smallest reprex I can make:

library(openalexR)

set <- c("W1992473459", "W4246185059", "W3119850932")
snowballed <- oa_snowball(set) # finds ~200 papers

library(dplyr)

snowballed$edges |> 
  filter(if_any(everything(), is.na))
#> # A tibble: 1 × 2
#>   from        to   
#>   <chr>       <chr>
#> 1 W4246185059 <NA>

Note that the "from" paper here is one of the snowballed inputs, so it looks like OpenAlex/{openalexR} "finds" a citation from one of the input papers to some paper that's NA here for some reason.

Among other things, this causes a pretty cryptic error in tidygraph::as_tbl_graph(), which goes away if you ensure that no edge is to/from NA:

library(tidygraph)
as_tbl_graph(snowballed)
#> Error in (function (edges, n = max(edges), directed = TRUE) : At core/constructors/basic_constructors.c:72 : Invalid (non-finite or NaN) vertex index when creating graph. Invalid value
snowballed$edges <- na.omit(snowballed$edges)
as_tbl_graph(snowballed)
#> # A tbl_graph: 210 nodes and 209 edges
#> #
#> # A directed acyclic simple graph with 2 components
#> #
#> # Node Data: 210 × 29 (active)
#>   id    displa… author ab    public… releva… so    so_id publis… issn  url   first_… last_p… volume issue is_oa
#>   <chr> <chr>   <list> <lgl> <chr>   <lgl>   <chr> <chr> <chr>   <lis> <chr> <chr>   <chr>   <chr>  <chr> <lgl>
#> 1 W199… The de… <df>   NA    2009-0… NA      Cogn… http… Elsevi… <chr> <NA>  <NA>    <NA>    <NA>   <NA>  FALSE
#> 2 W424… Role o… <df>   NA    1988-0… NA      Chil… http… Wiley   <chr> http… 897     897     59     4     FALSE
#> 3 W311… Learni… <df>   NA    2021-0… NA      Cogn… http… Elsevi… <chr> <NA>  104576  104576  210    <NA>  FALSE
#> 4 W214… Bootst… <df>   NA    2010-0… NA      Cogn… http… Wiley   <chr> <NA>  752     775     34     5     FALSE
#> 5 W206… Separa… <df>   NA    1991-0… NA      Cogn… http… Elsevi… <chr> <NA>  263     298     23     2     FALSE
#> 6 W196… Where … <df>   NA    2010-0… NA      Jour… http… Informa <chr> <NA>  356     373     11     3     FALSE
#> # … with 204 more rows, and 13 more variables: cited_by_count <int>, counts_by_year <list>,
#> #   publication_year <int>, cited_by_api_url <chr>, ids <list>, doi <chr>, type <chr>, referenced_works <list>,
#> #   related_works <list>, is_paratext <lgl>, is_retracted <lgl>, concepts <list>, oa_input <lgl>
#> #
#> # Edge Data: 209 × 2
#>    from    to
#>   <int> <int>
#> 1     4     1
#> 2     5     2
#> 3     6     1
#> # … with 206 more rows

Just for the record here's the actual larger snowball search I ran where I first saw the bug (not ran here):

big_set <- c("W2059799772", "W2018234095", "W2019203623", "W4252968383", "W4238563879",
             "W1992473459", "W4246185059", "W3119850932", "W4229439296")
# finds ~5000 papers; takes a while
oa_snowball(big_set)$edges |> 
  filter(if_any(everything(), is.na))
#> # A tibble: 3 × 2
#>   from        to   
#>   <chr>       <chr>
#> 1 W4252968383 <NA> 
#> 2 W4238563879 <NA> 
#> 3 W4246185059 <NA> 

oa_ngrams errors on Mac

@yjunechoe
I tried the oa_ngrams function on my MacBook M1, with both curl 5 and curl <5, and it always returns these errors:

library(openalexR)

id <- c('https://openalex.org/W2150220236','https://openalex.org/W2120109270','https://openalex.org/W2755950973','https://openalex.org/W2061474427','https://openalex.org/W3125707221','https://openalex.org/W2062021443','https://openalex.org/W1767272795','https://openalex.org/W2068452509','https://openalex.org/W2108680868','https://openalex.org/W2019753053')

ngrams_data <- oa_ngrams(id, options("oa_ngrams.message.curlv5" = TRUE))
#> Warning in file.remove(ngrams_files$destfile): cannot remove file
#> '/var/folders/l3/zmw6hy357xj2vgwrfl9ggbkc0000gn/T//RtmpRzXryV/https://openalex.org/W2150220236',
#> reason 'No such file or directory'
#> Warning in file.remove(ngrams_files$destfile): cannot remove file
#> '/var/folders/l3/zmw6hy357xj2vgwrfl9ggbkc0000gn/T//RtmpRzXryV/https://openalex.org/W2120109270',
#> reason 'No such file or directory'
#> Warning in file.remove(ngrams_files$destfile): cannot remove file
#> '/var/folders/l3/zmw6hy357xj2vgwrfl9ggbkc0000gn/T//RtmpRzXryV/https://openalex.org/W2755950973',
#> reason 'No such file or directory'
#> Warning in file.remove(ngrams_files$destfile): cannot remove file
#> '/var/folders/l3/zmw6hy357xj2vgwrfl9ggbkc0000gn/T//RtmpRzXryV/https://openalex.org/W2061474427',
#> reason 'No such file or directory'
#> Warning in file.remove(ngrams_files$destfile): cannot remove file
#> '/var/folders/l3/zmw6hy357xj2vgwrfl9ggbkc0000gn/T//RtmpRzXryV/https://openalex.org/W3125707221',
#> reason 'No such file or directory'
#> Warning in file.remove(ngrams_files$destfile): cannot remove file
#> '/var/folders/l3/zmw6hy357xj2vgwrfl9ggbkc0000gn/T//RtmpRzXryV/https://openalex.org/W2062021443',
#> reason 'No such file or directory'
#> Warning in file.remove(ngrams_files$destfile): cannot remove file
#> '/var/folders/l3/zmw6hy357xj2vgwrfl9ggbkc0000gn/T//RtmpRzXryV/https://openalex.org/W1767272795',
#> reason 'No such file or directory'
#> Warning in file.remove(ngrams_files$destfile): cannot remove file
#> '/var/folders/l3/zmw6hy357xj2vgwrfl9ggbkc0000gn/T//RtmpRzXryV/https://openalex.org/W2068452509',
#> reason 'No such file or directory'
#> Warning in file.remove(ngrams_files$destfile): cannot remove file
#> '/var/folders/l3/zmw6hy357xj2vgwrfl9ggbkc0000gn/T//RtmpRzXryV/https://openalex.org/W2108680868',
#> reason 'No such file or directory'
#> Warning in file.remove(ngrams_files$destfile): cannot remove file
#> '/var/folders/l3/zmw6hy357xj2vgwrfl9ggbkc0000gn/T//RtmpRzXryV/https://openalex.org/W2019753053',
#> reason 'No such file or directory'

Created on 2023-01-17 with reprex v2.0.2

Snowball search by author(s)

@trangdata
Talking with Corrado Cuccurullo, we realized that it would be a good idea to add the possibility of performing a snowball search by author, searching all of their citing or cited documents.
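A rough sketch of what this could look like from the user side today, assuming an example author ID and that `oa_snowball()` accepts a vector of work IDs as its first argument:

```r
library(openalexR)

# Hypothetical two-step snowball by author: fetch the author's works,
# then snowball on those work IDs
author_works <- oa_fetch(
  entity = "works",
  author.id = "A2013114412"  # example author ID; substitute your own
)
snowball <- oa_snowball(author_works$id)
```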

Flattened 1-table snowball output

Moving this out of #9 as a separate issue

Input (from output of oa_snowball())

Let's say we conduct a snowball search of papers B and D which finds A and C, where A cites B and C is both cited by B and cites D.

Search graph

graph LR;
  B -. forward .->A
  B -- backward --> C
  D -. forward .-> C
  
style A fill:#fff
style C fill:#fff

Edges

| from | to |
|------|----|
| A    | B  |
| B    | C  |
| C    | D  |

Nodes

| id | ... | oa_input |
|----|-----|----------|
| A  | ... | FALSE    |
| B  | ... | TRUE     |
| C  | ... | FALSE    |
| D  | ... | TRUE     |

Proposed output (of to_disk())

The motivation behind to_disk() is to augment the paper metadata in $nodes with additional information about connectivity/relevance/importance of each paper in $edges, in a single-table format for interactive use.

What if to_disk() adds a column called connections holding semicolon-separated input IDs that discovered each paper? This collapses cites/cited_by relationships and just shows information about "proximity" to the input set. So for the above graph, the to_disk() representation would look something like:

| id | ... | oa_input | connections |
|----|-----|----------|-------------|
| A  | ... | FALSE    | B           |
| B  | ... | TRUE     |             |
| C  | ... | FALSE    | B; D        |
| D  | ... | TRUE     |             |

If inputs cite each other, connections is also where that information can be stored. Additionally, we might add an n_connections column for ease of sorting when researchers prioritize papers for screening in Excel/Google Sheets.
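The proposed connections and n_connections columns can be derived from the snowball edges and nodes alone. A minimal base-R sketch of the idea, on toy data mirroring the example graph (edge direction is deliberately ignored, since both citing and cited-by count as a connection here):

```r
# Toy data mirroring the example graph above
nodes <- data.frame(
  id       = c("A", "B", "C", "D"),
  oa_input = c(FALSE, TRUE, FALSE, TRUE)
)
edges <- data.frame(
  from = c("A", "B", "C"),
  to   = c("B", "C", "D")
)

inputs <- nodes$id[nodes$oa_input]

# For each node, collect the input IDs it shares an edge with,
# ignoring direction (cites and cited-by both count)
connections <- vapply(nodes$id, function(id) {
  neighbors <- c(edges$to[edges$from == id], edges$from[edges$to == id])
  paste(sort(intersect(neighbors, inputs)), collapse = "; ")
}, character(1))

nodes$connections   <- connections
nodes$n_connections <- lengths(strsplit(connections, "; "))
```

Running this on the toy graph gives connections of "B", "", "B; D", "" for A through D, matching the proposed table.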

Missing Abstracts

The ab field used to contain the abstract of the publication. For some reason it is now all NAs. Is it a deprecated field? Here is a comparison of metadata retrieved some months ago vs. today (after updating to the ropensci instead of the massimoaria version of openalexR).
[Screenshots from 2023-02-28: metadata retrieved previously vs. today]
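For background: the OpenAlex API does not return abstracts as plain text. Works carry an abstract_inverted_index mapping each word to its positions, which openalexR reconstructs into ab — so ab is NA whenever the index itself is missing from the API response. A minimal sketch of that reconstruction, on a made-up index:

```r
# Made-up abstract_inverted_index, shaped like the OpenAlex works field
inv <- list(
  "Bibliometric" = c(0L),
  "data"         = c(1L),
  "from"         = c(2L),
  "OpenAlex"     = c(3L)
)

# Repeat each word once per recorded position, then sort by position
positions <- unlist(inv, use.names = FALSE)
words     <- rep(names(inv), lengths(inv))
ab        <- paste(words[order(positions)], collapse = " ")
ab
#> [1] "Bibliometric data from OpenAlex"
```

If the raw JSON from oa_request() shows abstract_inverted_index as NULL, the NAs come from the API side rather than from the package.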

Unhelpful rownames from oa2bibliometrix()

I'm not sure whether the rownames are needed for downstream bibliometrix analyses, but perhaps we should remove them (the SR column already carries this information). What do you think @massimoaria?

library(openalexR)
dat <- oa2bibliometrix(oa_fetch(
  entity = "works",
  cites = "W2755950973",
  from_publication_date = "2022-01-01",
  to_publication_date = "2022-01-31"
))

head(rownames(dat))
#> [1] "NA, , V99313352"   "NA, , V4210226067" "NA, , V201530359" 
#> [4] "NA, , V59624048"   "NA, , V121203305"  "NA, , V68497187"

Created on 2022-09-10 with reprex v2.0.2
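For anyone bitten by this in the meantime, resetting the rownames is a one-liner on the user side. A sketch with a stand-in data frame (the SR values are made up; the rowname strings are copied from the output above):

```r
# Stand-in for an oa2bibliometrix() result with the unhelpful rownames
dat <- data.frame(SR = c("DOE J, 2022", "ROE R, 2022"), TC = c(10, 2))
rownames(dat) <- c("NA, , V99313352", "NA, , V4210226067")

# Dropping the rownames restores the default 1..n sequence;
# the SR column still carries the reference label
rownames(dat) <- NULL
rownames(dat)
#> [1] "1" "2"
```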

Multiple author institutions lost from works

Hi there, oa2df() appears to be dropping subsequent institutional affiliations from authors when returning works.

See this example:
https://explore.openalex.org/works/W2898962279

The third author, Tim McVicar, is affiliated with both the Australian Research Council and CSIRO Land and Water.

The raw JSON from oa_request() includes both affiliations

library(openalexR)
library(dplyr)
oa_query(identifier = "W2898962279") %>% oa_request()

(output below is just the relevant subset because it's long)

$authorships[[3]]
$authorships[[3]]$author_position
[1] "middle"

$authorships[[3]]$author
$authorships[[3]]$author$id
[1] "https://openalex.org/A2013114412"

$authorships[[3]]$author$display_name
[1] "Tim R. McVicar"

$authorships[[3]]$author$orcid
[1] "https://orcid.org/0000-0002-0877-8285"


$authorships[[3]]$institutions
$authorships[[3]]$institutions[[1]]
$authorships[[3]]$institutions[[1]]$id
[1] "https://openalex.org/I1337719021"

$authorships[[3]]$institutions[[1]]$display_name
[1] "Australian Research Council"

$authorships[[3]]$institutions[[1]]$ror
[1] "https://ror.org/05mmh0f86"

$authorships[[3]]$institutions[[1]]$country_code
[1] "AU"

$authorships[[3]]$institutions[[1]]$type
[1] "government"


$authorships[[3]]$institutions[[2]]
$authorships[[3]]$institutions[[2]]$id
[1] "https://openalex.org/I4210161554"

$authorships[[3]]$institutions[[2]]$display_name
[1] "CSIRO Land and Water"

$authorships[[3]]$institutions[[2]]$ror
[1] "https://ror.org/057xz1h85"

$authorships[[3]]$institutions[[2]]$country_code
[1] "AU"

$authorships[[3]]$institutions[[2]]$type
[1] "facility"



$authorships[[3]]$raw_affiliation_string
[1] "Australian Research Council Centre of Excellence for Climate System Science, Sydney, Australia"

But when using oa_fetch() the flattening process appears to lose CSIRO Land and Water.

oa_fetch("W2898962279")$author[[1]]$institution_display_name
[1] "Princeton University"        "ETH Zurich"                  "Australian Research Council" "Princeton University"        "Princeton University"        "Princeton University"   
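Until the flattening is fixed, all affiliations can be recovered from the raw list returned by oa_request(). A sketch on a hard-coded subset of the authorship entry shown above (in practice the list would come from oa_query() |> oa_request()):

```r
# Hard-coded subset of authorships[[3]] from the JSON above
authorship <- list(
  author = list(display_name = "Tim R. McVicar"),
  institutions = list(
    list(display_name = "Australian Research Council"),
    list(display_name = "CSIRO Land and Water")
  )
)

# Collect every institution for this author, not just the first one
inst_names <- vapply(
  authorship$institutions,
  function(x) x$display_name,
  character(1)
)
inst_names
#> [1] "Australian Research Council" "CSIRO Land and Water"
```

A fix in the package would presumably do something equivalent during flattening, producing one author-institution row per affiliation instead of one row per author.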

[Screenshot: author table in RStudio]
