epinowcast / epinowcast Goto Github PK

Tools to enable flexible and efficient hierarchical nowcasting of epidemiological time-series using a semi-mechanistic Bayesian model with support for a range of reporting and generative processes.

Home Page: https://package.epinowcast.org/

License: Other

R 80.08% Dockerfile 0.46% Shell 3.46% Stan 14.59% TeX 1.41%

nowcasting stan cmdstanr epidemiology outbreak-analysis pandemic-preparedness effective-reproduction-number-estimation infectious-disease-surveillance real-time-infectious-disease-modelling

epinowcast's Introduction

Flexible Hierarchical Nowcasting

Summary

Tools to enable flexible and efficient hierarchical nowcasting of right-truncated epidemiological time-series using a semi-mechanistic Bayesian model with support for a range of reporting and generative processes. Nowcasting, in this context, is gaining situational awareness using currently available observations and the reporting patterns of historical observations. This can be useful when tracking the spread of infectious disease in real-time: without nowcasting, changes in trends can be obfuscated by partial reporting or their detection may be delayed due to the use of simpler methods like truncation. While the package has been designed with epidemiological applications in mind, it could be applied to any set of right-truncated time-series count data.

Installation

Installing the package

You can install the latest released version using the normal R function, though you need to point to r-universe instead of CRAN:

install.packages(
  "epinowcast", repos = "https://epinowcast.r-universe.dev"
)

Alternatively, you can use the remotes package to install the development version from Github (warning! this version may contain breaking changes and/or bugs):

remotes::install_github(
  "epinowcast/epinowcast", dependencies = TRUE
)

Similarly, you can install historical versions by specifying the release tag (e.g. this installs 0.2.0):

remotes::install_github(
  "epinowcast/epinowcast", dependencies = TRUE, ref = "v0.2.0"
)

Note: You can also use that last approach to install a specific commit if needed, e.g. if you want to try out a specific unreleased feature, but not the absolute latest developmental version.

Installing CmdStan

If you wish to do model fitting and nowcasting, you will need to install CmdStan, which also entails having a suitable C++ toolchain setup. We recommend using the cmdstanr package. The Stan team provides instructions in the Getting started with cmdstanr vignette, with other details and support at the package site along with some key instructions available in the Stan resources package vignette, but the brief version is:

# if you not yet installed `epinowcast`, or you installed it without `Suggests` dependencies
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
# once `cmdstanr` is installed:
cmdstanr::install_cmdstan()

Note: You can speed up CmdStan installation using the cores argument. If you are installing a particular version of epinowcast, you may also need to install a past version of CmdStan, which you can do with the version argument.

Alternative: Docker

We also provide a Docker image with epinowcast and all dependencies installed. You can use this image to run epinowcast without installing dependencies.

Resources

As you use the package, the documentation available via ?enw_ should be your first stop for troubleshooting. We also provide a range of other documentation, case studies, and community spaces to ask (and answer!) questions:

Package Website

The epinowcast website includes a function reference, model outline, and case studies using the package. The site mainly concerns the release version, but you can also find documentation for the latest development version.

R Vignettes

We have created package vignettes to help you get started nowcasting and to highlight other features with case studies.

Organisation Website

Our organisation website includes links to other resources, guest posts, and seminar schedule for both upcoming and past recordings.

Community Forum

Our community forum has areas for question and answer and considering new methods and tools, among others. If you are generally interested in real-time analysis of infectious disease, you may find this useful even if do not use epinowcast.

Package Analysis Scripts

In addition to the vignettes, the package also comes with example analyses. These are not as polished as the vignettes, but we typically explore new features with these and they may help you if you are using a development version. After installing epinowcast, you can find them via:

list.files(
  system.file("examples", package = "epinowcast"), full.names = TRUE
)

Contributing

We welcome contributions and new contributors! We particularly appreciate help on identifying and identified issues. Please check and add to the issues, and/or add a pull request and see our contributing guide for more information.

If you need a different underlying model for your work: epinowcast lets you pass your own models! If you do try new model parameterisations that expand the overall flexibility or improve the defaults, please let us know either here or on the community forum. We always like to hear about new use-cases, whether or not they are directed at the core epinowcast applications.

How to make a bug report or feature request

Please briefly describe your problem and what output you expect in an issue. If you have a question, please don’t open an issue. Instead, ask on our Q and A page. See our contributing guide for more information.

Code of Conduct

Please note that the epinowcast project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Citation

If making use of our methodology or the methodology on which ours is based, please cite the relevant papers from our model outline. If you use epinowcast in your work, please consider citing it with citation("epinowcast").

Contributors

All contributions to this project are gratefully acknowledged using the allcontributors package following the all-contributors specification. Contributions of any kind are welcome!

epinowcast's People

Contributors

Stargazers

Watchers

epinowcast's Issues

Update the expectation random walk to use the cumulative sum formulation

It is more efficient to encode random walks using a cumulative sum formulation. There should be no downsides to this change.

This section of the code

https://github.com/epiforecasts/epinowcast/blob/38ab91013fe105935c11bf2f9c199a2fbf0c2110/inst/stan/epinowcast.stan#L131

needs to be updated to use the same approach as seen here.

https://github.com/epiforecasts/epinowcast/blob/a48ae0a9ee0652d8b1ffbf6220484ec9c1de2114/inst/stan/epinowcast.stan#L141

Redesign interface

This is a proposal for a new interface. The basic idea is to build the model up as a series of chunks related to the sub-modules. There are open questions about how much of the data munging process (i.e which formulas use which data) the user should be exposed to.

A possible candidate interface:

epinowcast(
    data, model = enw_model(), reporting = enw_reporting(...), expectation = enw_expectation(...),
    missing = enw_missing(...), observation = enw_obs(...), fit = enw_fit(...)
)

enw_reporting( 
  parametric = ~ 1, | ~ 0 # none
  distribution = "log-normal", # | "none" = no parametric baseline hazard
  non_parametric = ~ 0 # (none) | ~ d (each day i.e discrete) | ~ d + age_group (cox model assuming proportional hazards by age group)
  reporting = ~ rw(week, by = age_group), | ~ 0 (or ~1) #none
  structural = ~ saturdays # (known days reporting cannot happen and hazard to -Inf),
  data = obs
)

enw_expectation(
  ~ rw(date, by = age_group), 
  offset = enw_offset(weights = 1),
  order = 2, #i.e second order = a growth rate model, # third order is difference in growth rate model and so on
  family = "none",  # here could have for example "poisson"
  delay_to_report = 1  # here could have the delay from inf to report as a pmf. Later extend to uncertain dists
  data = obs
)

enw_missing(
   ~ 1, 
   data = obs
)

enw_obs(
  baseline = ~ 1 # we could offfer the ability to model the variance parameter if present by baseline (i.e delay,) report and reference date
  report = ~ 1
  reference = ~ 1
  family = "negbin" #other options could be Poisson etc or even potentially non-count options. 
)

enw_fit(
  init = epinowcast::enw_init,
  fit = epinowcast::enw_sample,
  as_data_list = epinowcast::enw_as_data_list,
  ... # additional parameters passed to fit when called.
)

Add support for variational inference

Variational inference is available as a method in cmstanr but is marked experimental. It is likely that variational inference won't work as well as HMC in many circumstances but for some users, the performance/run-time trade-off may be attractive.

See more: https://arxiv.org/abs/1506.03431
cmdstanr interface: https://mc-stan.org/cmdstanr/reference/model-method-variational.html

Example datasets

At the moment some data.table is used at the beginning of package examples etc to filter data. I am a little concerned this may put off some users who won't realise that data.table knowledge is not at all needed to use the packege.

One potential solution to this (and to reduce repeated code in general) is to have a few variants of the current example dataset. Potentially these could be one at the national level without stratification by age, and one at the national level with stratification by age, and the current dataset (which is all states + by age).

Another option would be to write a small helper function to load and filter package data for the user?

I don't think fallling back to base is likely to reduce users fear.

Update to new scoringutils interface

@nikosbosse has just released a new version of scoringutils which has an updated interface. We need to update the optional function supplied in the package to make using scoringutils easier for package users to use this interface.

https://github.com/epiforecasts/epinowcast/blob/735c6c48ec1eb93edeb72ca06b7325493b44672b/R/model-validation.R#L19

scoringutils changelog: https://epiforecasts.io/scoringutils/news/index.html

Add function to extract posterior draws for nowcast

This is one a useful helper (as linking with dates is quite fiddly and annoying) and two needed for Germany nowcasting where the target is the 7-day average and so a weekly sum of posterior samples needs to be made prior to taking the summary

Bug: Pkgdown site failing to build

Currently the pkgdown site failing to build (see https://github.com/epiforecasts/epinowcast/runs/5998039307?check_suite_focus=true). This looks like it was introduced in #41 (tagging @adrian-lison).

From a quick skim it looks like it is because the package index is now incomplete. This can be fixed by adding a concept (for example util) to the documentation of the new remove_profiling function.

Add tests

Add tests using forecast.vocs as an example (as testing long-running models is tricky and this provides an example).

Add tests for inference

Add tests for core model fitting (using forecast.vocs as a model)

Add actual function documentation

Most functions are not documented fully or in reality. This is not ideal.

Warnings at initialisation for complex models

Currently, there are warnings thrown when models become complex. This could indicate an issue and so the investigation is needed to try and resolve this.

Add tests of visualisation

Add tests of visualisations using vdiffr

Update contributing guide to reflect develop, main branch interaction.

We need to update the contributing guide to reflect the use of the develop branch as an unstable development branch where all changes should be targeted and the main branch as a stable release branch (due to not having the ability to release to CRAN and the potential for breaking changes).

Support for explicit reporting schedule

Some datasets may have an explicit reporting schedule (such as no reporting on weekends or holidays). A simplistic approach to account for this is to fit a, for example, day of the week reporting pattern but where the reporting schedule is known this leads to needless model complexity and potentially numeric instability. This approach may also have issues when only special days have no reports (such as bank holidays in the UK or religious festivals).

An alternative is to explicitly define the reporting schedule for the model by setting the hazard for that day to be 0. Taking this approach has been shown (in unpublished work?) to increase stability and improve performance.

In terms of implementation in epinowcast there are two options. The first is to use the current report day model with a special setting so that instead of estimating logit hazards they are set to -Inf for specified terms. The second is to add an additional model definition argument (potentially reporting_schedule or similar) that declares the schedule for the model explicit. This is likely the most flexible approach.

One approach to implementation would be (note the precise look-up behaviour needed requires some looking at the current report day model code in detail):

Implementing a simulated dataset for testing or adding a new example dataset + checking this doesn't break the current implementation.
Adding support for a reporting schedule vector to expected_observations to set the hazard to 0 if no reporting is possible

https://github.com/epiforecasts/epinowcast/blob/main/inst/stan/functions/expected-observations.stan

Adding support to obs_lmpf to allocate report day effects and similarly in generated quantities.

https://github.com/epiforecasts/epinowcast/blob/main/inst/stan/functions/obs_lpmf.stan
https://github.com/epiforecasts/epinowcast/blob/081db1f2ac582439375df4e19f458f1272230e53/inst/stan/epinowcast.stan#L176

Passing in a vector of 0,-Infs to the stan code along with a look up that links report days with individual nowcasts
Adding an argument to enw_as_data_list to pass in the required vector of reporting schedules by dates + allocation per nowcast
https://github.com/epiforecasts/epinowcast/blob/081db1f2ac582439375df4e19f458f1272230e53/R/model.R#L110
Adding a new function enw_reporting_schedule that takes the metareport data and a formula argument specifying the report schedule (for example ~ weekends + holidays). This would then use a similar approach to enw_formula to produce a design matrix (with no intercept) for any row with non-zero columns it would imply no reporting (i.e a -Inf in the vector and 0 otherwise) in the returned reporting schedule vector to be ingested by stan. This function would also need some additional thinking to make sure nowcasts as linked to appropriate reporting days (as done in enw_formula etc).

https://github.com/epiforecasts/epinowcast/blob/081db1f2ac582439375df4e19f458f1272230e53/R/model-design-tools.R#L182

Adding an example to the epinowcast docs highlighting this option
Updating the model definition vignette with this new feature.
Ideally adding some meaningful unit tests of this feature (though optional as whole package is currently not unit tested).

An example dataset where this occurs is Swiss hospital admissions: https://www.nzz.ch/schweiz/omikron-variante-wie-verlaesslich-sind-die-spitalzahlen-noch-ld.1664007

Add package hex logo

Everyone loves a hex logo. I would think that a simple logo using a stylised version of the plot in the README might be quite nice though very open to other ideas.

Implementation wise something like this: https://github.com/epiforecasts/covidregionaldata/blob/master/inst/make_hexsticker.R

which ends up looking like this: https://github.com/epiforecasts/covidregionaldata#readme

Add safe compile time optimisations

See here: https://discourse.mc-stan.org/t/how-to-use-stanc3-compiler-optimization-flag-o-o1-in-cmdstanr/26365/3

Support exponential delay distributions

At the moment only lognormal and gamma delay distributions are supported. Given many example datasets exhibit near exponential reporting distributions and that this distribution depends on only a single parameter (and hence is easier to fit) it makes sense to support it. Note that this should in theory be similar to a model with a constant hazard (whilst being less efficient). This means that when this model is implemented it is unlikely that users will want to use the parametric approach.

This requires:

Updating the internal stan code to allow for logsd to be an optional parameter.
Add support for estimating the exponential pdf.
Add R code supporting passing initial conditions
Add R code adding the exponential distribution to the user interface.

Add additional preprocessing tests

Currently preprocessing is tested as a single unit in an integration test. Further unit testing of each function is really required to ensure everything functions as expected.

Add Felix as author

Add @FelixGuenther as an author. Need name, email, and orcid.

Add true formula interface

At the moment the formula interface is flexible but not nice to use. Adding a true formula interface would improve the workflow with the package. This needs:

Support fixed effects by passing directly model.matrix
Support random effects (passing (1 + x | y) correctly)
Support random walks (likely need to add a rw helper and extract this from the formula
Continue to support the current interface as likely above changes will not support all possible models.

New title and DESCRIPTION

The package's current title is not really correct as what is being correct is not censoring but truncation. It seems like there are better titles available especially as the package expands into being a more general real-time analysis toolkit. At the same time it makes sense to update the package description (found in both the DESCRIPTION file and at the beginning of the README) to match this new title.

`0.1.0` release checklist

This issue summarises the work needed ahead of the release of 0.1.0 which is aiming to be the last release (at least for the moment) with major breaking changes. I think timeline wise sometime next week is a sensible aim. If there are parts of this we have yet to implement (with major features being the most likely thing to be missing). I'd suggest we release and put these in the next version.

Note should probably have used a project for this and will do so in the future!

Add observations visualisations and vignette

Add flexible tools to enable visualising observations both in general and specifically for various dimensions of reporting delays (i.e over the delays, over time etc). This should be done hand in hand with a vignette outlining there use.

Clarify that reference/occurrence date are the same thing

The documentation / example data sets use date of reference and occurrence interchangeably where it seems they're the same. I'd suggest to either decide on one and use consistently (preferred) or else to clarify at first use in each document that they're the same.

Diagnose and correction occasional convergence issues

Extended model-fitting has shown that complex models occasionally have fitting issues which are shown by a high number of divergent transitions and large R hats. In terms of nowcasts, this appears to lead to low quantiles with spuriously low estimates.

This issue needs more detail but fits with issues can be seen here: https://github.com/epiforecasts/eval-germany-sp-nowcasting/blob/main/data/diagnostics/high-divergent-transitions.csv

A summary report on nowcasting work in Germany includes these tables formatted for viewing: https://epiforecasts.io/eval-germany-sp-nowcasting/real-time/

Rtools version out of sync with cmdstanr

Rtools is out of sync with cmdstanr. This causes the CI checks to fail.

Warnings about un-initialised parameters

Running the example from the README yields warnings about missing init values.

nowcast <- epinowcast(pobs,
  model = model,
  report_effects = report_effects,
  reference_effects = reference_effects,
  save_warmup = FALSE, pp = TRUE,
  chains = 2, threads_per_chain = 2,
  iter_sampling = 500, iter_warmup = 500,
  show_messages = FALSE, refresh = 0
)

Init values were only set for a subset of parameters. 
Missing init values for the following parameters:
 - chain 1: logmean_eff, logsd_eff, logmean_sd, logsd_sd
 - chain 2: logmean_eff, logsd_eff, logmean_sd, logsd_sd

Running MCMC with 2 parallel chains, with 2 thread(s) per chain...

Chain 2 finished in 40.2 seconds.
Chain 1 finished in 43.4 seconds.

Both chains finished successfully.
Mean chain execution time: 41.8 seconds.
Total execution time: 43.7 seconds.

List of uses cases

It would be great to collect package use cases both in order for users to be inspired and also to motivate support for the package from people concerned with bookkeeping. It isn't clear exactly what form this could take. Some options:

A vignette with a free text list
A vignette with an interactive html table
A GitHub discussion section
Other

Ideally, it would be useful to have some general metadata about the use cases for example disease, purpose (research or situational awareness etc), package extension linked to (this could be a nice way maybe of motivating contributions).

Add tests for postprocessing

Add tests for post processing (using forecast.vocs as a model)

Support the Weibull distribution

The Weibull distribution is often used as a baseline hazard model in survival analysis. It also has a tractable form that can be specified directly as part of the hazard (though care is needed here for a discrete version)

https://en.wikipedia.org/wiki/Weibull_distribution

`enw_retrospective_data` potentially not working as expected

Second guessing as there is no function documentation but I'd expect

retro_nat_germany <- enw_retrospective_data(
  nat_germany_hosp,
  rep_days = 30, ref_days = 40
)

to return a data set with 30 reporting days and 40 occurrence dates. Instead I get

> length(unique(retro_nat_germany$report_date))
[1] 41
> length(unique(retro_nat_germany$reference_date))
[1] 41

~~It seems to me these two lines should both have a > sign, but perhaps I might be misunderstanding what is being done here~~

Looking at the use of this alongside enw_latest_data in the README it seems rep_days is intended as a shift back and then ref_days as the number of reference days to include backwards. In that case changing this to be a > gives length 40 (the same change would have to be made in enw_latest_data
https://github.com/epiforecasts/epinowcast/blob/a5616b15857d790eced0f1289b498d42fa0ca6d7/R/preprocess.R#L176

Warnings about deprecated array declarations

Compiling the model throws warnings about deprecated array declarations.

> library(epinowcast )
> model <- enw_model(threads = TRUE)
Compiling Stan program...
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/include_1/stan/functions/zero_truncated_normal.stan', line 8, column 32, included from
'/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 5, column 0: Declaration
    of arrays by placing brackets after a type is deprecated and will be
    removed in Stan 2.32.0. Instead use the array keyword before the type.
    This can be changed automatically using the auto-format flag to stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/include_1/stan/functions/obs_lpmf.stan', line 1, column 14, included from
'/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 7, column 0: Declaration
    of arrays by placing brackets after a type is deprecated and will be
    removed in Stan 2.32.0. Instead use the array keyword before the type.
    This can be changed automatically using the auto-format flag to stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/include_1/stan/functions/obs_lpmf.stan', line 1, column 47, included from
'/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 7, column 0: Declaration
    of arrays by placing brackets after a type is deprecated and will be
    removed in Stan 2.32.0. Instead use the array keyword before the type.
    This can be changed automatically using the auto-format flag to stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/include_1/stan/functions/obs_lpmf.stan', line 1, column 58, included from
'/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 7, column 0: Declaration
    of arrays by placing brackets after a type is deprecated and will be
    removed in Stan 2.32.0. Instead use the array keyword before the type.
    This can be changed automatically using the auto-format flag to stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v
830c00000gp/T/Rtmpb3d2EJ/include_1/stan/functions/obs_lpmf.stan', line 1, column 68, included from
'/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 7, column 0: Declaration
    of arrays by placing brackets after a type is deprecated and will be
    removed in Stan 2.32.0. Instead use the array keyword before the type.
    This can be changed automatically using the auto-format flag to stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/include_1/stan/functions/obs_lpmf.stan', line 2, column 14, included from
'/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 7, column 0: Declaration
    of arrays by placing brackets after a type is deprecated and will be
    removed in Stan 2.32.0. Instead use the array keyword before the type.
    This can be changed automatically using the auto-format flag to stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/include_1/stan/functions/obs_lpmf.stan', line 2, column 32, included from
'/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 7, column 0: Declaration
    of arrays by placing brackets after a type is deprecated and will be
    removed in Stan 2.32.0. Instead use the array keyword before the type.
    This can be changed automatically using the auto-format flag to stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/include_1/stan/functions/obs_lpmf.stan', line 2, column 42, included from
'/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 7, column 0: Declaration
    of arrays by placing brackets after a type is deprecated and will be
    removed in Stan 2.32.0. Instead use the array keyword before the type.
    This can be changed automatically using the auto-format flag to stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/include_1/stan/functions/obs_lpmf.s
tan', line 2, column 52, included from
'/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 7, column 0: Declaration
    of arrays by placing brackets after a type is deprecated and will be
    removed in Stan 2.32.0. Instead use the array keyword before the type.
    This can be changed automatically using the auto-format flag to stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/include_1/stan/functions/obs_lpmf.stan', line 3, column 43, included from
'/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 7, column 0: Declaration
    of arrays by placing brackets after a type is deprecated and will be
    removed in Stan 2.32.0. Instead use the array keyword before the type.
    This can be changed automatically using the auto-format flag to stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/include_1/stan/functions/obs_lpmf.stan', line 9, column 2, included from
'/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 7, column 0: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 16, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 17, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before 
the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 18, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 19, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 20, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 23, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 24, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.s
tan', line 25, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    
type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 28, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 37, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 49, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 50, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 51, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', li
ne 52, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 53, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 54, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 55, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 68, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 69, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the

    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 70, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 88, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 178, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 180, column 2: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc
Warning in '/var/folders/gd/x84kkjzd6bn9rlf3f2v830c00000gp/T/Rtmpb3d2EJ/model-10e86bd5a0c6.stan', line 186, column 4: Declaration
    of arrays by placing brackets after a variable name is deprecated and
    will be removed in Stan 2.32.0. Instead use the array keyword before the
    type. This can be changed automatically using the auto-format flag to
    stanc

Hazard specification of parametric baseline hazard

At the moment the parametric baseline hazard is calculated by using the CDF of the distributions to get the discrete probability of report. It may be possible to instead directly specify this as a hazard. This may be optimal when combined with other hazard models (like proportional changes in hazard over time etc).

More efficient parametric approach

Is there a more efficient/natural method for defining logit hazard which is equivalent to the current approach via the pmf.

Rename group variable to be more unique

group and old_group may not be unique in users data. It is sensible to rename these in order to try and avoid this. As this is a breaking change ideally it should form part of the 0.1.0 release. As @sbfnk suggests .group seems like a good choice.

https://github.com/epiforecasts/epinowcast/blob/a48ae0a9ee0652d8b1ffbf6220484ec9c1de2114/R/preprocess.R#L352

I'd suggest .group and .old_group as column names to avoid conflict.

Originally posted by @sbfnk in #64 (comment)

Add function to compute empirical share of cases with missing reference date

Add function enw_add_share_missing to preprocess.R that computes the share of cases with missing reference date.

There are two main options:

The easy, approximate one: compute share by reporting date and shift backwards (by mean empirical backwards reporting delay estimated over all dates)
The complicated, more exact one: compute empirical backward reporting delay distribution for each reporting day, shift cases with missing reference date backwards according to the distribution to get their reference dates, compute empirical share by reference date

The complicated version could also be the more instable one as case numbers are low, because then the empirical backward delays cannot be estimated accurately. On the other hand, the easy version will be biased in situations where the delay changes a lot.

One further aspect to decide on is whether to add this to obs or somewhere else. Technically, the share of cases with missing reference date is only indexed by reference date, so it does not fit directly to the structure of obs...

Update citation

All authors need to be added to the package citation ahead of 0.1.0

Explore linkage with survival models

There is a clear linkage with survival models. Exploring this in more detail may give some insights into more effective/natural parametric approaches than that currently implemented. It may also suggest other optimizations as these models have a long history.

Extend to arbitrary time frames for input data

The package works on daily data, and daily data only. While most use cases that I can think of will indeed involve daily data, this is not necessarily the case universally, and it limits the scope of what the package can do if there is no way for time steps do be anything shorter (e.g. hour) or longer (e.g. month) than a day.

In order to be open to other options here in the future I would suggest to design the workflow such that all dates are converted to integer time steps at the preprocessing stage, and all steps following preprocessing use these integer times only - this would keep the code more modular and mean that if someone wants to implement other options later this only has to be changed in preprocessing (which in my opinion could be in a separate package anyway - but that is a separate discussion).

Support for missing reference dates MVP

Design motivation

Some datasets may include counts of cases with missing reference dates. For these cases, the date of reporting is known but the date of occurrence is for some reason unknown. It would be a valuable feature to support such data.

One way to account for such "missing data" in nowcasting is imputation during preprocessing. This is typically done by estimating the delay distribution from historical data and imputing reference dates by subtracting random draws of the delay distribution from the reporting date. Regression models can be used to estimate the delay distribution, accounting for changes over time, on weekdays, group effects etc. This however means that two potentially sophisticated models are involved (the imputation model and nowcasting model), and to quantify the overall uncertainty, a multiple imputation scheme is required which can be computationally demanding if the nowcasting model must be rerun. More importantly however, subtracting delays from reporting days does not necessarily reconstruct the underlying time series by reference date. This is the same issue as with back-projection approaches used in R estimation (see https://doi.org/10.1371/journal.pcbi.1008409).

It thus seems more favorable to support missing reference dates directly in the nowcasting model by implementing a generative model of missingness and adding a likelihood term for the cases with missing reference date. In the following, an MVP version for epinowcast of such a model is outlined.

Changes to the interface (preprocessing)

Implement an artificial/simulated dataset with cases with missing reference dates (NA) for testing and showcasing purposes. Ideally, also add a real-world dataset with cases with missing reference dates.
Adjust enw_preprocess_data to accept entries where the reference date is NA and gather them into a vector obs_miss that is indexed by reporting day. Add obs_miss to the returned enw_preprocess_data object.
Adjust enw_obs_as_data_list such that the obs_miss in pobs are also returned in the data list for stan.

Changes to the model

Add int obs_miss[rd] to the data block. This array will contain the number of cases with missing reference date reported on a certain reporting date.
Add vector<lower=0,upper=1>[t] alpha[g] to the transformed parameters block. This will represent the share of cases with (eventually) recorded reference date information that occurred on a given reference date for a given group. In turn, 1-alpha will represent the share of cases with (eventually) missing reference date information.
Implement a per-group, non-centered random walk prior on the logit scale for alpha. This will require corresponding parameters for the raw random walk. Also, add real eobs_miss_lsd_p[2] as priors for the random walk sd to the data block. The random walk will be similar to https://github.com/epiforecasts/epinowcast/blob/20274a8bb7effcae76d35cbcfd3eba5186d34dc3/inst/stan/epinowcast.stan#L106-L116
Remove the currently implemented map-reduce structure from the likelihood. The exp_obs from all reference dates (for a given group) are needed to compute the expectation for obs_miss by convolving over all maxDelay days before a reporting date. Hence it make sense first implement a working version without within-chain parallelization and then add it back at a later stage.
Add a multiplication of exp_obs with the corresponding alpha to get the expected numbers of cases with (eventually) recorded reference date information for the different delays, before putting it in the likelihood for obs: https://github.com/epiforecasts/epinowcast/blob/20274a8bb7effcae76d35cbcfd3eba5186d34dc3/inst/stan/functions/obs_lpmf.stan#L17
Add a multiplication of exp_obs with the corresponding 1-alpha to get the expected numbers of cases with (eventually) missing reference date information for the different delays. Then add them to a vector exp_obs_miss that is indexed by reporting day (matching the reporting day with the right delay from the current reference date). Looping over the different reference dates (exp_obs_miss initialization is outside of the loop), this will realize the convolution.
Implement a second likelihood term (poisson or neg_binomial) for obs_miss given exp_obs_miss as expectation. IMPORTANT: only include the likelihood for obs_miss at t_0 + maxDelay or later! Earlier obs_miss are not supported in the MVP version since this would require an extension of the model to times before t_0.
If a neg_binomial likelihood is used and an independent overdispersion for the missing data is desired, add a phi_miss parameter with suitable prior (having partially pooled overdispersion parameters would probably be overengineered).
Adjust generated quantities in a similar way to the likelihood computation such that pp_obs_tmp only represents the cases with known reference date and add a pp_obs_miss_tmp for the cases with missing reference date. Analogously, distinguish between pp_inf_obs and pp_inf_obs_miss. Adjust the log-likelihood computation such that both likelihoods are taken into account.

Changes to the interface (postprocessing)

Adjust plotting and summary functions such that the different generated quantities for known and missing cases are handled separately.
Optionally add support for plotting the estimated shares of known/missing cases.
Ensure that only the observations and predictions for the cases with known reference date are used in the evaluation procedures.

Further remarks

The generative model suggested here comes with the standard "missing-at-random" assumption, i.e. the reporting delay is assumed to be independent of the missingness.
Splitting up the likelihood into cases with known and with missing reference date is only theoretically justified if we assume that reported cases are Poisson distributed (since the sum of Poisson distributed RVs is also Poisson). Switching this to negative binomial means some slight inaccuracies in that regard. However this does not mean that using a negative binomial cannot offer improved performance on real-world data. It is only less principled.
As noted above, this MVP only supports obs_miss at t_0 + maxDelay or later. Nevertheless, I suggest to write the whole interface as if also earlier observations could be included, because we can then extend the model later on without touching the interface again. This only means that the details of the likelihood should be clearly communicated to the user.

Add flexible forecasting model

At the moment the expected notification model is hardcoded to be an independent random walk for each group but there is no strong reason for this. The package functionality for design matrices etc should be easily resuable to firstly default to the current model but also to support many other model formulations. The steps to this are:

Add a combine_effects step in the stan code
Add forecasting design matrices
Remove current hardcoded random walk and replace with combined effects
Add forecast model effects input to epinowcast and enw_as_data_list
Add new inits and stan data list
Add a helper function to make it easier to specify an independent random walk.

This would bring the specification of the forecasting model into line with the specification of the delay distribution parameters and the non-parametric day of report model.

Make specific issues for areas of the package missing documentation and examples.

Areas of the package are missing documentation and examples. All functions which have placeholder documentation need these replaced. The first step to doing this is to split out this task into sub-tasks based on areas of the package (for example by .R file). Note that the most up to date work on this is on the develop branch.

Add future forecasting ability

Add the ability to extend beyond the last observed data. This requires:

Extending enw_preprocess_data() to have a horizon argument. This needs to extend the metadata by this number of days and the observed data
Update the stan code to skip unobserved future dates in the likelihood
Check support in postprocessing and summary functions
Update quick to include this.

More informative error for negative new confirmed cases

When the difference between cumulative confirmed cases for a certain reference date is negative, enw_reporting_triangle will throw an error. However, there could be two different ways to arrive at such a scenario:

the reporting data indeed has "negative" reports for certain reference dates, for example due to corrections
the user has accidentally provided data in a wrong format, i.e. "confirmed" is actually "new confirmed" already

I am not sure how much we want to "shepherd" the user, but it could be useful to check for the latter case and throw a different error message than in the first case.

Bug: When profile = TRUE generated quantities are not present

When profile = TRUE and profiling is in use as introduced in #10 then generated quantities are no longer stored. This means that nowcasts etc do not exist and plotting is not possible. As this feature is only used during development for a limited range of purposes this is not a must fix. Flagging @adrian-lison who implemented #10 and may have some insights.

The problem can be seen by running the readme example code with profile = TRUE in enw_model.

Profile stan code to increase efficiency

Using profiling as supported in cmdstanr to identify which areas of the current model are long-running. Look through the stan documentation to identify if there optimizations that can be made to improve performance.

Update model syntax to new cmdstan version

The new cmdstan version has rolled out some syntax updates meaning that the model now throws warnings when complied. These currently have no performance impact but will cause an issue in future versions of stan.

Add support for Gaussian processes

Placeholder whilst this gets fleshed out. GPs for nowcasting have been shown to work well and are robustly implemented in stan in a range of places. Please feel free to jump in with resources and discussion.

Sources:

https://proceedings.mlr.press/v161/hawryluk21a/hawryluk21a.pdf
https://www.science.org/doi/10.1126/science.abh2644
https://github.com/ihawryluk/GP_nowcasting/blob/main/stan_models/2comp_longshort_SE.stan
https://paul-buerkner.github.io/brms/reference/gp.html
https://github.com/epiforecasts/EpiNow2/blob/time-varying-secondary/inst/stan/functions/gaussian_process.stan

Add support for approximate leave future out.

At the moment the package supports using loo to produce estimates of the leave one out information criterion. This is not appropriate for most nowcasting tasks where the aim is to nowcast future data. This means that most model selection needs to be done by fitting multiple nowcasting models and evaluating the results. Doing this is computationally costly and not tractable for many users

Instead recent work on approximate LFO may be useful but this requires implementing on a model-specific basis. However, it is likely it can be implemented for at least the family of models supported in epinowcast. Potentially, some theory work is needed here as we are estimating multiple targets at once for each nowcasting vs a single time point as given in the examples below.

Resources:

https://mc-stan.org/loo/articles/loo2-lfo.html
https://arxiv.org/abs/1902.06281

Add non-parametric reference date model

Currently only the parametric reference date model has been implemented. For completeness it would be useful to update the model to include the non-parametric component defined in the model definition vignette.

This requires:

Adding an effects argument and data list to the input
Adding new initialisation parameters
Adding another layer of logit hazards in the stan code
Checking the quick start and adding commentary on this approach to the case study.
Make it possible to only have a non-parametric model
Consider what the default model should be?

epinowcast / epinowcast Goto Github PK

epinowcast's Introduction

Flexible Hierarchical Nowcasting

Summary

Installation

Resources

Contributing

How to make a bug report or feature request

Code of Conduct

Citation

Contributors

Code

Issue Authors

Issue Contributors

epinowcast's People

Contributors

Stargazers

Watchers

Forkers

epinowcast's Issues

Design motivation

Changes to the interface (preprocessing)

Changes to the model

Changes to the interface (postprocessing)

Further remarks

Recommend Projects

Recommend Topics

Recommend Org