arcticreport's Introduction

arcticreport

This package generates reporting metrics for the NSF Arctic Data Center.

Installation

remotes::install_github("NCEAS/arcticreport")

Acknowledgments

Work on this package was supported by:

  • NSF-PLR grant #1546024 to M. B. Jones, S. Baker-Yeboah, J. Dozier, M. Schildhauer, and A. Budden


arcticreport's People

Contributors

dvirlar2, jeanetteclark, justinkadi, mbjones


arcticreport's Issues

Decide on a caching mechanism for query results

The two primary query functions in the package, query_objects and query_version_chains, take 20 minutes and 100 minutes to run, respectively. query_objects returns a data.frame with a row for every object in the ADC. query_version_chains takes the result of query_objects and assigns an arbitrary series identifier to each version chain. The rest of the functionality in the package is slicing, dicing, summarizing, and plotting metrics based on those two tables.

Since the functions take so long to run, it is not practical to run them often. For CI, we could build a status page that runs everything once a day or so and fills in tables for the quarterly metrics when those milestones come up. For local testing, or for creating one-off plots, it would be beneficial to set up a standard way of caching the query results for ease of use.

Open to any suggestions. The bigger of the two tables is about 100 MB when saved to disk.
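One lightweight option would be to cache each query result as an RDS file and refresh it only when it goes stale. A minimal sketch in base R (the cache directory and the max_age_days threshold are illustrative assumptions, not existing package API):

```r
# Load a cached query result if it is fresh enough; otherwise
# re-run the (slow) query and save the result for next time.
cached_query <- function(name, query_fn,
                         cache_dir = file.path(tempdir(), "arcticreport_cache"),
                         max_age_days = 7) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, paste0(name, ".rds"))
  if (file.exists(path)) {
    age <- difftime(Sys.time(), file.mtime(path), units = "days")
    if (as.numeric(age) < max_age_days) {
      return(readRDS(path))  # fresh cache hit, skip the slow query
    }
  }
  result <- query_fn()  # slow path: 20-100 minutes for the real queries
  saveRDS(result, path) # roughly 100 MB on disk for the larger table
  result
}

# Hypothetical usage:
# objects <- cached_query("objects", query_objects)
# chains  <- cached_query("chains", function() query_version_chains(objects))
```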

simplify process for getting file counts

The current report file, arcticreport.Rmd, instructs the user to ssh to datateam, run a couple of complex commands, bring the results back into the running R session under specific names, and then run some R code to transform them into a usable data frame containing the package identifiers, upload dates, and sizes of the datasets stored on the filesystem. It would be helpful to simplify this significantly.
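One way to simplify this would be to run the remote command over ssh directly from R and parse its output in a single step, so the user never leaves the R session. A rough sketch, in which the host name and du invocation are placeholders (the real commands live in arcticreport.Rmd):

```r
# Parse "du -ab"-style output ("<bytes>\t<path>") into a data frame.
parse_du_output <- function(lines) {
  parts <- strsplit(lines, "\t", fixed = TRUE)
  data.frame(
    size = as.numeric(vapply(parts, `[`, character(1), 1)),
    path = vapply(parts, `[`, character(1), 2),
    stringsAsFactors = FALSE
  )
}

# Run the remote command over ssh and parse in one step. The host
# and command shown here are illustrative placeholders only.
remote_file_sizes <- function(host = "datateam", cmd = "du -ab /data") {
  parse_du_output(system2("ssh", c(host, shQuote(cmd)), stdout = TRUE))
}
```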

make `plot_cumulative_metric` smarter

The function works in basic cases (you can plot the cumulative count or size of either data files or metadata files), but it could be smarter about:

  • units (for the size metric)
  • axis labels
  • where in time the plot starts
  • placement/size of the ADC start line
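For the units point, a small helper that picks a human-readable byte unit from the data's maximum could be dropped into the plotting code. A sketch (the function name and approach are suggestions, not existing package code):

```r
# Choose a display unit and divisor so the largest value plots as a
# small number, e.g. 3e12 bytes scales to terabytes.
pick_byte_unit <- function(max_bytes) {
  units <- c("B", "KB", "MB", "GB", "TB", "PB")
  power <- max(0, min(floor(log(max_bytes, 1024)), length(units) - 1))
  list(unit = units[power + 1], divisor = 1024^power)
}

# Usage: scale the y axis before plotting, label it with u$unit
# u <- pick_byte_unit(max(objects$size))
# objects$size_scaled <- objects$size / u$divisor
```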

misc. metrics count issues

Carried over from issue #11:

I found an issue where a bunch of rows (around 500) had mostly missing values in the objects data frame, which caused an error when converting dateUploaded to a date object. For now I just filtered them out in arcticreport/R/plot_cumulative_volume.R (line 17 at cedd9f3):

dplyr::filter(!is.na(.data$dateUploaded)) %>%

but we should really dive in and find out where these rows originate and why the fields are missing:

  • Determine why fields are missing from a bunch of rows and fix the issue

Additionally, some of the tests I wrote with hard-coded values for things like the number of datasets in a particular month are failing. This means that at some point we calculated that there were, say, 100 datasets submitted between January 1 and February 1 of 2021, but for some reason we now calculate that 103 datasets were submitted then. We need to figure out what is causing our counts to change slightly; it might be related to the issue above.
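As a first diagnostic step, it may help to isolate the problem rows and summarize which other fields are also NA, rather than silently dropping them. A sketch in base R (assuming the objects data frame described above, with a dateUploaded column):

```r
# Summarize missingness in the objects data frame so the ~500
# problem rows can be inspected rather than silently filtered out.
summarize_missing <- function(objects) {
  bad <- objects[is.na(objects$dateUploaded), , drop = FALSE]
  list(
    n_bad = nrow(bad),
    na_counts = colSums(is.na(bad))  # which other fields are also NA?
  )
}
```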

review list of creator names that are removed

The count_creators function has a list of creators that are removed from the count. The comment in the code says:

# Grep-based filters
# Bryce created these (and we can expand these) based upon what I saw in the results
# that looked like organizations or non-persons of some sort or another

We should review this list against the list of unique creators and decide if we want to expand, revise, or altogether remove this list.
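One way to review the list is to apply the same grep patterns to the full set of unique creators and inspect what gets caught versus what slips through. A sketch (the patterns shown are illustrative examples of organization-like names, not the package's actual list):

```r
# Split creators into "filtered" and "kept" so the filter list can be
# audited by eye. The patterns here are illustrative, not the real list.
audit_creator_filters <- function(creators,
                                  patterns = c("university", "survey",
                                               "institute", "center")) {
  regex <- paste(patterns, collapse = "|")
  hit <- grepl(regex, creators, ignore.case = TRUE)
  list(filtered = unique(creators[hit]), kept = unique(creators[!hit]))
}
```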

make `query_objects` faster

I had the idea on a call that we can make query_objects much faster by keeping the parts of the cache that are still relevant. The changes that would need to be made:

  • add dateModified to the fields returned by the query function
  • if a cache is found, keep all cached objects with a dateModified older than the datetime the cache was last saved
  • query only for objects with a dateModified more recent than the datetime the cache was last saved
  • save the cache
  • if no cache is found, then make sure the datetime you are querying against is very far in the past so that you get all of the objects
  • remove the cache_tolerance parameter
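Assuming the cache's file modification time stands in for the datetime it was last saved, the steps above can be sketched as follows. Here query_objects_since is a hypothetical variant of the query that filters on dateModified; the real change would live inside query_objects itself:

```r
# Incrementally refresh the objects cache: keep cached rows whose
# dateModified predates the last cache write, query only for newer
# objects, merge, and re-save. query_objects_since() is hypothetical.
update_objects_cache <- function(cache_path = "objects.rds") {
  if (file.exists(cache_path)) {
    cached <- readRDS(cache_path)
    since <- file.mtime(cache_path)
    kept <- cached[cached$dateModified < since, , drop = FALSE]
  } else {
    kept <- NULL
    since <- as.POSIXct("1900-01-01")  # far past: fetch everything
  }
  fresh <- query_objects_since(since)  # hypothetical filtered query
  result <- rbind(kept, fresh)
  saveRDS(result, cache_path)
  result
}
```

This also removes the need for a cache_tolerance parameter, since the cache's own timestamp determines what gets re-queried.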
