appliedepi / epirhandbook_eng
The repository for the English version of the Epidemiologist R Handbook
License: Other
Once this package hits CRAN we should consider updating the Sankey diagrams in the handbook:
https://github.com/davidsjoberg/ggsankey
On the Common errors page, the links for interpreting error messages don't work.
Line 11 lists StackExchange.com, stackoverflow.com, and community.rstudio.com as plain text; these should be working hyperlinks to StackExchange.com, stackoverflow.com, and community.rstudio.com.
Create a new repo that pulls all the html outputs from translations into subfolders so that website can be built on a single domain with landing page.
New bookdown version has option to specify folder for output - so worth considering a wrapper script for rendering.
Also need to consider how this will interact with #8
We should probably consider signing up for the beta version of GitHub project planning.
It seems quite functional, and from the FAQ it appears there will be a free version... thoughts?
Hello,
When rendering epicurves.Rmd (alone or as part of the entire book), I get the error below.
I am using R 4.1.1 with incidence2 1.2.1, and later incidence2 1.2.2.
Error in (function (cond) :
error in evaluating the argument 'x' in selecting a method for function 'plot': cumulate() was deprecated in incidence2 1.2.0 and is now defunct.
Lines 757-760 of epicurves.Rmd:
wkly_inci %>%
  cumulate() %>%
  plot()
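Since cumulate() is defunct, the cumulative counts can still be computed by hand. This is a minimal base-R sketch (a hypothetical stand-in using cumsum() on made-up weekly counts, not the incidence2 API):

```r
# Hypothetical stand-in for the defunct cumulate(): compute a running
# total of weekly case counts with base R's cumsum().
weekly_counts <- c(3, 7, 12, 5, 9)    # made-up weekly case counts
cumulative <- cumsum(weekly_counts)   # cumulative incidence per week
cumulative
# [1]  3 10 22 27 36
```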
Hi there,
I appreciate your work on the tidyverse-based Epidemiologist R Handbook and have learned a lot from it.
When I got to https://epirhandbook.com/pivoting-data.html#pivoting-data-of-multiple-classes, I found code like this:
df_long <-
df %>%
pivot_longer(
cols = -id,
names_to = c("observation", ".value"),
names_sep = "_"
)
df_long
df_long <-
df_long %>%
mutate(
date = date %>% lubridate::as_date(),
observation =
observation %>%
str_remove_all("obs") %>%
as.numeric()
)
df_long
I would like to recommend the following more idiomatic code:
#Import data
obs <-
structure(
list(
id = c("A", "B", "C"),
obs1_date = c("2021-04-23", "2021-04-23", "2021-04-23"),
obs1_status = c("Healthy", "Healthy", "Missing"),
obs2_date = c("2021-04-24", "2021-04-24", "2021-04-24"),
obs2_status = c("Healthy", "Healthy", "Healthy"),
obs3_date = c("2021-04-25", "2021-04-25", "2021-04-25"),
obs3_status = c("Unwell", "Healthy", "Healthy")),
row.names = c(NA,-3L),
class = c("tbl_df", "tbl", "data.frame"))
#Tidy data
obs %>%
pivot_longer(
2:last_col(),
names_to = c("obs", ".value"),
names_pattern = "obs(.)_(.+)",
names_transform = list(obs = as.integer),
values_transform = list(date = as.Date))
Best wishes,
Tony
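To see what the suggested names_pattern = "obs(.)_(.+)" actually captures, here is a small base-R illustration using regexec()/regmatches() on the column names from the example data (an editorial aside, not part of the handbook code):

```r
# The regex "obs(.)_(.+)" captures the observation number (group 1)
# and the value name (group 2) from each column name.
cols <- c("obs1_date", "obs1_status", "obs2_date")
m <- regmatches(cols, regexec("obs(.)_(.+)", cols))
obs_num  <- sapply(m, `[`, 2)   # "1" "1" "2"            -> the "obs" column
value_nm <- sapply(m, `[`, 3)   # "date" "status" "date" -> the ".value" names
```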
From email from [email protected] to [email protected] on 8 August
Thanks very much to Thuan from the Vietnamese team for pointing this out:
The code on the 'Standardised rates' page does not work. I eventually found a minor error in the code:
# Remove specific string from column values
standard_pop_clean <- standard_pop_data %>%
mutate(
age_cat5 = str_replace_all(age_cat5, "years", ""), # remove "year"
age_cat5 = str_replace_all(age_cat5, "plus", ""), # remove "plus"
age_cat5 = str_replace_all(age_cat5, " ", "")) %>% # remove " " space
rename(pop = WorldStandardPopulation) # change col name to "pop", as this is expected by dsr package
Simply change age_cat5 to AgeGroup in the first line of mutate() and everything works, as shown in the highlighted text below.
standard_pop_clean <- standard_pop_data %>%
mutate(
age_cat5 = str_replace_all(AgeGroup, "years", ""), # remove "year"
age_cat5 = str_replace_all(age_cat5, "plus", ""), # remove "plus"
age_cat5 = str_replace_all(age_cat5, " ", "")) %>% # remove " " space
rename(pop = WorldStandardPopulation) # change col name to "pop", as this is expected by dsr package
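The corrected cleaning step can be sketched in base R to show what it produces; the AgeGroup values below are made up for illustration:

```r
# Base-R sketch of the string cleaning: start from AgeGroup (which
# exists), not age_cat5 (which does not yet exist inside mutate()).
standard_pop_data <- data.frame(AgeGroup = c("0-4 years", "85 plus years"))
age_cat5 <- gsub("years", "", standard_pop_data$AgeGroup)  # remove "years"
age_cat5 <- gsub("plus",  "", age_cat5)                    # remove "plus"
age_cat5 <- gsub(" ",     "", age_cat5)                    # remove spaces
age_cat5
# [1] "0-4" "85"
```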
Use accuracy = 0.1 or 0.01 within scales functions such as scales::percent(), used on a normal data frame, to adjust the number of decimal places shown.
see https://stackoverflow.com/questions/53072282/how-to-prevent-scalespercent-from-adding-decimal
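For intuition, the effect of the accuracy argument can be mimicked in base R; pct() below is a hypothetical helper written for this sketch, not part of the scales package:

```r
# Round to a fixed number of decimal places before formatting,
# mimicking accuracy = 0.1 (1 decimal) or 0.01 (2 decimals).
pct <- function(x, digits = 1) {
  paste0(format(round(100 * x, digits), nsmall = digits), "%")
}
pct(0.1234)      # "12.3%"  (like accuracy = 0.1)
pct(0.1234, 2)   # "12.34%" (like accuracy = 0.01)
```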
# use the code below to automatically make epidemic curves by commune, for EVERY department (iteration)
# The plots are saved into the "png" folder in the R project.
# Define vector of unique department names
dept_names <- palu %>%
pull(Département) %>%
unique()
# "map" a function across each of the department names.
# The function filters the dataset for the department and creates/saves the plot
dept_plots <- palu %>% # begin with the complete dataset
group_split(Département) %>% # split into different datasets by Departement
purrr::map(~ggsave( # the function that is iterated is ggsave() to save the plot
# within the ggsave(), the file name is created as:
filename = here::here(
"png", # the folder "png"
str_glue("paludisme_par_mois_{first(.$Département)}.png")), # and then a dynamic file name that contains the department name.
# and the plot to save is created from the split data, created above (.x)
plot = ggplot(data = .x,
mapping = aes(x = mois, y = nombre_cas, group = annee))+
geom_line(aes(color = annee), size = 2, alpha = 0.6)+
facet_wrap(~Communes, scales = "free_y")+ # one plot per commune
labs(
y = "Nombre de cas",
x = "Mois",
title = str_glue("Paludisme, Department {first(.x$Département)}"))+
theme_minimal(16)+
theme(axis.text.x = element_text(angle = 90)),
width = 18,
height = 10
) # end ggsave()
) # end map()
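The same split-and-save pattern can be sketched without purrr or ggplot2; this package-free version uses made-up data, base graphics, and a PDF device writing to tempdir():

```r
# Split the data by department, then save one plot per piece.
palu <- data.frame(
  departement = rep(c("Nord", "Sud"), each = 3),  # made-up departments
  mois        = rep(1:3, times = 2),              # month
  nombre_cas  = c(10, 14, 9, 5, 7, 6))            # case counts
paths <- vapply(split(palu, palu$departement), function(d) {
  f <- file.path(tempdir(),
                 sprintf("paludisme_par_mois_%s.pdf", d$departement[1]))
  pdf(f)                                  # open one graphics file per department
  plot(d$mois, d$nombre_cas, type = "l",
       xlab = "Mois", ylab = "Nombre de cas", main = d$departement[1])
  dev.off()                               # close the device, writing the file
  f
}, character(1))
basename(paths)   # one saved file name per department
```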
In the ordered y axis section of 34.3, the below code is used to order the location_name variable, however, this does not work. The location_name variable needs to be converted to a factor first and then fct_relevel will do what is expected.
You'll see that the current figures on the webpage with/without an ordered y-axis are identical.
Maybe it would also be better to put p_load(forcats) at the start of the chapter rather than in this chunk.
# load package
pacman::p_load(forcats)

# create factor and define levels manually
agg_weeks <- agg_weeks %>%
  mutate(location_name = fct_relevel(
    location_name, facility_order$location_name)
  )
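The same reordering can be shown in base R, where factor(levels = ...) plays the role of fct_relevel(); the facility names below are made up:

```r
# Create the factor with its levels in the desired display order.
facility_order <- c("Central Hospital", "Port Clinic", "Military Hospital")
location_name  <- c("Port Clinic", "Military Hospital", "Central Hospital")
location_fct   <- factor(location_name, levels = facility_order)
levels(location_fct)
# [1] "Central Hospital" "Port Clinic"      "Military Hospital"
```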
Just so we don't forget to address all the leftover issues in the archive repo.
All translations need updated Github links:
Use Find in Files tool to replace all:
epirhandbook/Epi_R_handbook to:
appliedepi/epirhandbook_eng
Also update all the various pages that link to data (opening paragraphs).
In section 37.3 Handling, the subset code produces an error:
sub_attributes <- subset(
  epic,
  node_attribute = list(
    gender = "m",
    date_infection = as.Date(c("2014-04-01", "2014-07-01"))),
  edge_attribute = list(location = "Nosocomial"))
Error in FUN(X[[i]], ...) :
Value for date_infection is not found in dataset
In addition: Warning message:
In if (!attribute %in% data) stop(paste("Value for", name_attribute, :
the condition has length > 1 and only the first element will be used
It looks like this is because the dates are stored as datetime when reading the "linelist_cleaned.xlsx" dataset.
Adding the line below when reading the data fixes the problem:
linelist <- rio::import("linelist_cleaned.xlsx") %>% mutate(across(starts_with("date"), as.Date))
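A minimal, package-free demonstration of the class mismatch behind this error:

```r
# Excel importers often return POSIXct datetimes; downstream code
# expecting class Date then fails until the columns are converted.
x <- as.POSIXct("2014-04-01", tz = "UTC")
class(x)            # "POSIXct" "POSIXt"
class(as.Date(x))   # "Date"
```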
In the strptime section (lines 106-122 of dates.Rmd), both months and minutes are written with the same format code. Months should use %m (lowercase) and minutes %M (uppercase).
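A quick base-R check of the distinction:

```r
# %m (lowercase) is the month; %M (uppercase) is the minute.
t1 <- strptime("2021-07-09 14:05", format = "%Y-%m-%d %H:%M", tz = "UTC")
format(t1, "%m")   # "07" -- month
format(t1, "%M")   # "05" -- minute
```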
Epicurves page:
* Add closed = "left" to all ggplot chunks that use a breaks = statement, to count cases from the start of each breakpoint.
* Add a bullet in the Weekly Epicurve Example section explaining the use of closed = "left" in geom_histogram(), to ensure dates are counted in the correct bins.
* Edit the multi-line date labels to include labels = scales::label_date_short().
* Remove one of the old examples using faceting.
* Edit the green faceting example to remove the facet panel border.
* Also update the corresponding section on date axes in the ggplot tips page.
Technical updates to pages, to enable rendering:
* Add drop_na(objectid) %>% to the chunk that creates case_adm3_sf (~line 605) to drop any empty rows; this prevents an error about a discrepancy between 9 and 10 rows.
* Add the OpenStreetMap:: prefix to the openmap() command around line 730.
* Welcome page
* Epicurves page
Fixing issue #56 with these changes. Also need to add a sentence referring people to the Dates page if they are reading in an Excel file.
Nice to have: in the process of reducing the repo size (#6) we could also consider standardising the repo language-name extensions to be either 3-letter ISO codes or country domains.
The weather data is missing, and the code chunks for fitting regressions return error messages.
You can normally copy/paste most chunks of code from the book into RStudio, but much of the flexdashboards page is different. It's unclear whether this is intentional, but ideally the code should be in copyable chunks rather than images.
Minor point, there is a message appearing just before "Create heat plot" section in section 34.3
It says ## Joining, by = c("location_name", "week"), and it appears as code that a user could copy/paste into R.
It would probably be best to suppress this message.
https://juliasilge.com/blog/reorder-within/
We should add something on this, perhaps at the end of the page as an advanced tip (or in ggplot2 tips page)
Are there files available to run through the survey analysis section?
I couldn't find them after installing the handbook/or on github.
It would be helpful if such files were available to ensure users are getting the same results as presented online.
# import the survey data
survey_data <- rio::import("survey_data.xlsx")

# import the dictionary into R
survey_dict <- rio::import("survey_dict.xlsx")
The data begin in linelist format, with each row having a classification. They are then transformed to "wide" format, with one column per time point (values are the classifications) plus one column for counts, and then to long format with a ggalluvial-specific function.
Data preparation:
ELR_wide <- ELR %>%
pivot_wider(
id_cols = c(country, who_region),
names_from = week,
values_from = final) %>%
drop_na() %>%
group_by(across(contains("2021"))) %>%
count() %>%
ungroup() %>%
mutate(across(.cols = -n,
.fns = ~recode(.x,
"1 - Critical" = "Critical",
"2 - Very high" = "Very high",
"3 - High" = "High",
"4 - Medium" = "Medium",
"5 - Low" = "Low",
"6 - Minimal/NA" = "Minimal",
"No Data" = "No Data"))) %>%
mutate(across(.cols = -n,
.fns = ~fct_relevel(.x, c(
"Critical",
"Very high",
"High",
"Medium",
"Low",
"Minimal",
"No Data")))) %>%
mutate(across(.cols = -n,
.fns = fct_rev))
#levels(ELR_wide$`2021-07-19`)
library(ggalluvial)
#is_alluvia_form(as.data.frame(ELR_wide), axes = 1:3, silent = TRUE)
ELR_long_alluvial_original <- to_lodes_form(data.frame(ELR_wide),
key = "date",
axes = 1:6) %>%
mutate(date = str_replace_all(date, "X2021.", "")) %>%
mutate(stratum = fct_relevel(stratum,
c("Critical",
"Very high",
"High",
"Medium",
"Low",
"Minimal",
"No Data")))
# add current ELR classifications
ELR_long_alluvial <- left_join(
ELR_long_alluvial_original,
ELR_long_alluvial_original %>%
select(-n) %>%
filter(date == max(date, na.rm=T)) %>%
rename(
current_date = date,
current_ELR = stratum),
by = c("alluvium" ))
# add earliest ELR classifications
ELR_long_alluvial <- left_join(
ELR_long_alluvial,
ELR_long_alluvial_original %>%
select(-n) %>%
filter(date == min(date, na.rm=T)) %>%
rename(
oldest_date = date,
oldest_ELR = stratum),
by = c("alluvium" ))
Data visualization with ggalluvial:
# plot showing current classification
ggplot(data = ELR_long_alluvial,
aes(x = date, stratum = stratum, alluvium = alluvium,
y = n, label = stratum)) +
geom_alluvium(aes(fill = current_ELR)) +
geom_stratum() +
geom_text(stat = "stratum", size = 1) +
theme_minimal() +
scale_fill_manual(
values = c(
"Critical" = "black",
"Very high" = "darkred",
"High" = "red",
"Medium" = "darkorange",
"Low" = "darkgreen",
"Minimal" = "green",
"No Data" = "grey"))+
labs(
title = "[INTERNAL] ELR classification trajectories\ncolored by current classification (global)",
fill = str_glue("ELR classification\non {max(ELR_long_alluvial$date, na.rm=T)}"),
caption = "Last 6 weeks of data")
New to add:
https://www.aj2duncan.com/blog/missing-data-ggplot2-barplots/
https://stackoverflow.com/questions/10834382/ggplot2-keep-unused-levels-barplot
Also make clearer in the Factors page how to keep all levels in a plot. Note that you may need to use scale_x_discrete(drop = FALSE) as well as scale_fill_*(drop = FALSE), and cover how to deal with values missing in some facets of barplots.
The solution that worked for me was:
geom_col(position = position_dodge(preserve = 'single'))+
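The underlying behaviour can be seen without ggplot2: a count only keeps an empty category if the factor still carries that level (the status values below are made up):

```r
# table() keeps the empty "Missing" level, like drop = FALSE;
# dropping unused levels first mimics the ggplot2 default.
status <- factor(c("Healthy", "Healthy", "Unwell"),
                 levels = c("Healthy", "Unwell", "Missing"))
table(status)             # "Missing" shown with count 0
table(droplevels(status)) # "Missing" gone
```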
The EpiNow2 code chunk below returned an error message:
epinow_res <- epinow(
reported_cases = cases,
generation_time = generation_time,
delays = delay_opts(incubation_period),
return_output = TRUE,
verbose = TRUE,
horizon = 21,
stan = stan_opts(samples = 750, chains = 4)
)
Error Message:
Logging threshold set at INFO for the EpiNow2 logger
Writing EpiNow2 logs to the console and: C:\Users\AKARST1\AppData\Local\Temp\RtmpcZOade/regional-epinow/2015-04-30.log1\AppData\Local\Temp\RtmpcZOade/epinow/2015-04-30.log
Logging threshold set at INFO for the EpiNow2.epinow logger
Writing EpiNow2.epinow logs to the console and: C:\Users\AKARST
Error in seq.int(0, to0 - from, by) : 'to' must be a finite number
Follow this sequence of actions (see two linked issues below):
Languages to complete:
Thanks very much to Thuan from the Vietnamese team for pointing out that the offline version of the Vietnamese translation is not available from the website (I think this just needs to be knit?).
The Vietnamese version does not currently have an offline version, although it is not strictly required.
ggplotly no longer supports levels.grates_yearweek(), which the incidence package uses, so the code for the interactive plot does not run.
It doesn't work with other packages (tsibble or aweek) either; week and month classes are not supported by ggplotly.
Currently none of the language translations have the donate button on the homepage.
As suggested by Bassem we might want to look at the pricing plans for github large file storage if our clean up in (#6) still leaves us with a heavy load.
Somehow some translated sentences went live on the R Basics page. They need to be removed and the page re-rendered.
@yuriei @ishaberry @ebuajitti and @aspina7 @hitomik723
This is a copy of an issue from epiRhandbook_jp regarding gis.Rmd (28 GIS basics):
appliedepi/epiRhandbook_jp#5
A Geographic Information System (GIS) is geography-dependent, i.e. national/local governments may maintain their own coordinate systems and provide data such as national censuses and buildings. I suggest adding local information to the Japanese version of gis.Rmd.
Below is a sample of additional information for readers in Japan.
CRS used in Japan
* Japan Plane Rectangular CS I to XIII, EPSG: 2443-2455
* JGD2011 GRS80 ellipsoid, EPSG:6668
GIS data available for Japan
* [Census](http://e-stat.go.jp/SG2/eStatGIS/page/download.html) population by address
* [Suuchi](http://nlftp.mlit.go.jp/ksj/) - a variety of features (e.g. hospitals, schools, school districts)
* [Kiban](http://www.gsi.go.jp/kiban/etsuran.html) - e.g. building and road edges
* [Geospatial Japan](https://www.geospatial.jp/ckan/dataset) - a portal of GIS data by local governments and others
In the code, SRID 4326 is used. This is the global WGS 84 lat/lon system, not a US-specific one. The Japanese government maintains its own coordinate systems, revised in 2000 and again in 2011 (the latter following the Great East Japan Earthquake).
| Coordinates | EPSG | Region |
| :--- | :--- | :------- |
| lat/lon (WGS 84) | 4326 | World |
| projected (Web Mercator) | 3857 | Web maps |
| lat/lon | 4612 | Japan (JGD2000) |
| lat/lon | 6668 | Japan (JGD2011) |
The relationships dataset in Contact Tracing section is missing.
with thanks to @babayoshihiko.
Email from [email protected] to epiRhandbook on 28 July
Hello,
Thank you very much for providing the public with The Epidemiologist R Handbook. I am a relative novice in R but am reaching out to clarify the definition for factors that you list here: https://epirhandbook.com/factors.html
The use of "ordered" and "order" in the first and second sentences, respectively, seems to suggest that factors in R are only useful for ordinal variables. Can you clarify this? I could see how this might throw some users off.
Thank you again for this wonderful reference text.
Ryan
Ryan S. Babadi, PhD, MPH
Postdoctoral Research Fellow
Department of Environmental Health
Harvard T.H. Chan School of Public Health
Hi Ryan,
Good to hear from you - I'm glad you are finding the Handbook helpful. Thanks for writing to us about this wording, I appreciate the detailed feedback.
You are correct in identifying that one can convert a column to class factor without defining an order to the factor levels. This would presumably be with the intention of setting a limited range of acceptable values. However, in my experience with R in applied epi, the vast majority of use cases for factors are centered around specifying the level order, so that's what I focused on in writing the opening text to this page.
I've made a note to revise this language in version 2 of the Handbook, to be more clear. Thanks again for writing to us,
Neale
Hi there,
One of our Japanese translators (@KoKYura) let me know about an issue in the R code chunk below in Chapter 32, Epidemic Curves.
epi_day <- incidence( # create incidence object
x = linelist, # dataset
date_index = date_onset, # date column
interval = "day" # date grouping interval
)
Error: Not implemented for class POSIXct, POSIXt
The error occurs because the date fields are imported as "POSIXct" by the preceding code, which uses the import() function.
I was wondering if we could set the variable class to Date by using readxl::read_excel() rather than import()? Or we could convert the POSIXct variables to the Date class right after the dataset is imported, with code like this:
linelist <- import("linelist_cleaned.xlsx")
linelist[] <- lapply(linelist, function(x) {
if (inherits(x, "POSIXct")) as.Date(x) else x
})
Please let me know if I can help you with modifying the code.
Thank you,
Hitomi