arcticreport's Introduction

arcticreport

This package generates reporting metrics for the NSF Arctic Data Center.

Installation

remotes::install_github("NCEAS/arcticreport")

Acknowledgments

Work on this package was supported by:

  • NSF-PLR grant #1546024 to M. B. Jones, S. Baker-Yeboah, J. Dozier, M. Schildhauer, and A. Budden


arcticreport's People

Contributors

dvirlar2, jeanetteclark, justinkadi, mbjones


arcticreport's Issues

Decide on a caching mechanism for query results

The two primary query functions in the package, query_objects and query_version_chains, take 20 minutes and 100 minutes to run, respectively. query_objects returns a data.frame with a row for every object in the ADC. query_version_chains takes the result of query_objects and assigns an arbitrary series identifier to each version chain. The rest of the functionality in the package is slicing, dicing, summarizing, and plotting metrics based on those two tables.

Since the functions take so long to run, it is not practical to run them often. For CI, we could build a status page that runs everything once a day or so and fills in tables for the quarterly metrics when those milestones come up. For local testing, or for creating one-off plots, it would be beneficial to set up a standard way of caching the query results for ease of use.

Open to any suggestions. The bigger of the two tables is about 100 MB when saved to disk.
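One lightweight option would be to cache each query result as an RDS file and refresh it only when it goes stale. A minimal sketch in base R (the cache directory and the max_age_days threshold are illustrative assumptions, not existing package API):

```r
# Load a cached query result if it is fresh enough; otherwise
# re-run the (slow) query and save the result for next time.
cached_query <- function(name, query_fn,
                         cache_dir = file.path(tempdir(), "arcticreport_cache"),
                         max_age_days = 7) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, paste0(name, ".rds"))
  if (file.exists(path)) {
    age <- difftime(Sys.time(), file.mtime(path), units = "days")
    if (as.numeric(age) < max_age_days) {
      return(readRDS(path))  # fresh cache hit, skip the slow query
    }
  }
  result <- query_fn()  # slow path: 20-100 minutes for the real queries
  saveRDS(result, path) # roughly 100 MB on disk for the larger table
  result
}

# Hypothetical usage:
# objects <- cached_query("objects", query_objects)
# chains  <- cached_query("chains", function() query_version_chains(objects))
```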

simplify process for getting file counts

The current report file, arcticreport.Rmd, instructs the user to ssh to datateam, run a couple of complex commands, bring the results back into the running R session under specific names, and then run some R code to transform them into a usable data frame containing the package identifiers, upload dates, and sizes of the datasets stored on the filesystem. It would be helpful to simplify this significantly.
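One way to simplify this would be to run the remote command over ssh directly from R and parse its output in a single step, so the user never leaves the R session. A rough sketch, in which the host name and du invocation are placeholders (the real commands live in arcticreport.Rmd):

```r
# Parse "du -ab"-style output ("<bytes>\t<path>") into a data frame.
parse_du_output <- function(lines) {
  parts <- strsplit(lines, "\t", fixed = TRUE)
  data.frame(
    size = as.numeric(vapply(parts, `[`, character(1), 1)),
    path = vapply(parts, `[`, character(1), 2),
    stringsAsFactors = FALSE
  )
}

# Run the remote command over ssh and parse in one step. The host
# and command shown here are illustrative placeholders only.
remote_file_sizes <- function(host = "datateam", cmd = "du -ab /data") {
  parse_du_output(system2("ssh", c(host, shQuote(cmd)), stdout = TRUE))
}
```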

make `plot_cumulative_metric` smarter

The function works in basic cases (you can plot the cumulative count or size of either data files or metadata files), but it could be smarter about:

  • units (for the size metric)
  • axis labels
  • where in time the plot starts
  • placement/size of the ADC start line
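For the units point, a small helper that picks a human-readable byte unit from the data's maximum could be dropped into the plotting code. A sketch (the function name and approach are suggestions, not existing package code):

```r
# Choose a display unit and divisor so the largest value plots as a
# small number, e.g. 3e12 bytes scales to terabytes.
pick_byte_unit <- function(max_bytes) {
  units <- c("B", "KB", "MB", "GB", "TB", "PB")
  power <- max(0, min(floor(log(max_bytes, 1024)), length(units) - 1))
  list(unit = units[power + 1], divisor = 1024^power)
}

# Usage: scale the y axis before plotting, label it with u$unit
# u <- pick_byte_unit(max(objects$size))
# objects$size_scaled <- objects$size / u$divisor
```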

misc. metrics count issues

Carried over from issue #11:

I found an issue where a bunch of rows (around 500) had mostly missing values in the objects data frame, which caused an error when converting dateUploaded to a date object. For now I just filtered them out in arcticreport/R/plot_cumulative_volume.R (line 17 at cedd9f3):

dplyr::filter(!is.na(.data$dateUploaded)) %>%

but we should really dive in and find out where these rows originate and why the fields are missing:

  • Determine why fields are missing from a bunch of rows and fix the issue

Additionally, some of the tests I wrote with hard-coded values for things like the number of datasets in a particular month are failing. This means that at some point we calculated that there were, say, 100 datasets submitted between January 1 and February 1 of 2021, but for some reason we now calculate that 103 datasets were submitted then. We need to figure out what is causing our counts to change slightly; it might be related to the issue above.
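As a first diagnostic step, it may help to isolate the problem rows and summarize which other fields are also NA, rather than silently dropping them. A sketch in base R (assuming the objects data frame described above, with a dateUploaded column):

```r
# Summarize missingness in the objects data frame so the ~500
# problem rows can be inspected rather than silently filtered out.
summarize_missing <- function(objects) {
  bad <- objects[is.na(objects$dateUploaded), , drop = FALSE]
  list(
    n_bad = nrow(bad),
    na_counts = colSums(is.na(bad))  # which other fields are also NA?
  )
}
```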

review list of creator names that are removed

The count_creators function has a list of creators that are removed from the count. The comment in the code says:

# Grep-based filters
# Bryce created these (and we can expand these) based upon what I saw in the results
# that looked like organizations or non-persons of some sort or another

We should review this list against the list of unique creators and decide if we want to expand, revise, or altogether remove this list.
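One way to review the list is to apply the same grep patterns to the full set of unique creators and inspect what gets caught versus what slips through. A sketch (the patterns shown are illustrative examples of organization-like names, not the package's actual list):

```r
# Split creators into "filtered" and "kept" so the filter list can be
# audited by eye. The patterns here are illustrative, not the real list.
audit_creator_filters <- function(creators,
                                  patterns = c("university", "survey",
                                               "institute", "center")) {
  regex <- paste(patterns, collapse = "|")
  hit <- grepl(regex, creators, ignore.case = TRUE)
  list(filtered = unique(creators[hit]), kept = unique(creators[!hit]))
}
```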

make `query_objects` faster

I had the idea on a call that we can make query_objects much faster by keeping the parts of the cache that are still relevant. The changes that would need to be made:

  • add dateModified to the fields returned by the query function
  • if a cache is found, keep all cached objects with a dateModified older than the datetime the cache was last saved
  • query only for objects with a dateModified more recent than the datetime the cache was last saved
  • save the cache
  • if no cache is found, then make sure the datetime you are querying against is very far in the past so that you get all of the objects
  • remove the cache_tolerance parameter
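Assuming the cache's file modification time stands in for the datetime it was last saved, the steps above can be sketched as follows. Here query_objects_since is a hypothetical variant of the query that filters on dateModified; the real change would live inside query_objects itself:

```r
# Incrementally refresh the objects cache: keep cached rows whose
# dateModified predates the last cache write, query only for newer
# objects, merge, and re-save. query_objects_since() is hypothetical.
update_objects_cache <- function(cache_path = "objects.rds") {
  if (file.exists(cache_path)) {
    cached <- readRDS(cache_path)
    since <- file.mtime(cache_path)
    kept <- cached[cached$dateModified < since, , drop = FALSE]
  } else {
    kept <- NULL
    since <- as.POSIXct("1900-01-01")  # far past: fetch everything
  }
  fresh <- query_objects_since(since)  # hypothetical filtered query
  result <- rbind(kept, fresh)
  saveRDS(result, cache_path)
  result
}
```

This also removes the need for a cache_tolerance parameter, since the cache's own timestamp determines what gets re-queried.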
