
hsmr's Introduction

Hospital Standardised Mortality Ratios

The Hospital Standardised Mortality Ratios publication is a quarterly publication that has been fully RAP'd (produced as a Reproducible Analytical Pipeline). The entire process is contained within this R package and git repository.

Resources

Folder Structure

All the publication files and folders are stored in the following directory:

/.../quality_indicators/hsmr/quarter_cycle

This directory should contain:

  • A "master" folder
  • A folder named after each analyst who has worked on the publication e.g. a folder called "David", "Lucinda", "Robyn" etc.

The "master" folder

The master folder is the master copy of the publication repository. This is the "production-ready" version that is used for the publication process each quarter. Within it there will be a folder called "data" which contains the output files/basefiles for all previous publications. The master copy should never be edited and should only be updated from approved changes pulled from GitHub.

The individual analyst folders

These folders also contain up-to-date copies of the repository and these are the versions which are edited each time the publication is updated or each time a change to the process has to be made. Analysts should only work in their own folders on their own development branches. Once they are content that their changes are ready for the master branch, they must create a pull request on GitHub and have other analysts from the team review their changes and, if satisfied, merge them back into the master branch. It is then that the master folder is updated.

The Repository

Files and Folders

The HSMR publication process has been wrapped up inside an R package and so a number of files/folders within the repository are there to facilitate this.

Folders

  • .git: This is the folder containing the version control history of the repository. It can be safely ignored.
  • .Rproj.user: Where project-specific temporary files are saved. This can be safely ignored.
  • data: This is where the output is saved. This folder is not tracked by git and so it is safe to have data stored here.
  • man: This is where the documentation files for the R package are stored, e.g. the help file for the completeness() function.
  • markdown: Where the markdown files and output are stored.
  • R: Where the functions are stored.
  • reference_files: Reference files/lookups specific to the package (rather than generic lookup files) are stored here, e.g. the primary diagnosis groupings lookup.
  • tests: Testing files for the package functions are saved here. This can be safely ignored.

Files

  • .gitignore: Any files that should not be tracked by git should be added to this file.
  • .Rbuildignore: This can be safely ignored.
  • setup_environment.R: This is the script which gets edited each quarter to update dates and lookup files.
  • create_smr_data.R: This is the script which uses the package to produce the SMR data for the publication.
  • create_trends_data.R: This is the script which uses the package to produce the long term trends data for the publication.
  • create_excel_tables.R: This is the script which uses the package to produce the excel tables.
  • DESCRIPTION: This is metadata for the R package. If the package is ever updated, the version number should be updated here (see the example after this list).
  • NAMESPACE: Namespace file for the package. Can be safely ignored.
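For reference, a DESCRIPTION file looks something like the snippet below (the values shown are placeholders, not the real package metadata); Version is the field to bump:

Package: hsmr
Title: Hospital Standardised Mortality Ratios Publication
Version: 0.1.1
Description: Functions and scripts used to produce the quarterly HSMR publication.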

Functions

These can be located in the R/ folder.

  • smr_wrangling.R: This is the initial data wrangling for the HSMR process.
  • smr_pmorbs.R: Creates the pmorbs and n_emerg variables.
  • smr_model.R: Runs the logistic regression model for HSMR.
  • smr_data.R: Carries out aggregation and calculates HSMR figures for all relevant hospitals.
  • create_trends.R: Calculates crude mortality (%).
  • clean_model.R: Removes any surplus data from the logistic regression model (just to make more efficient use of memory when calculating probabilities).
  • completeness.R: Generates completeness text for the publication document.
  • file_sizes.R: Generates file size text for publication document (obsolete).
  • funnel_text.R: Generates text for main points from the funnel plot data.
  • mit_available.R: Calculates key date from publication and generates relevant text for the publication report.
  • nhs_performs.R: Reformats data in order to fit the required format of the NHS Performs platform.
  • pub_date.R: Calculates the publication date(s) for the HSMR publication.
  • qtr.R: Generates text to label quarters.
  • sql_ltt.R/sql_smr.R: SQL queries for the data extracts.
  • submission_deadline.R: Calculates the date to which the data are complete.
  • yr.R: Generates text to label a 12-month period.

Running the publication

Updating the code

The package is designed to require as little human intervention as possible. To update the publication each quarter, the analyst responsible for updating the scripts/running the publication should complete the following steps:

  • Pull the most recent version of the master branch into their own folder
  • Create a fresh branch to make necessary changes
  • Update the dates in the setup_environment.R file
  • Check filepaths for lookups in create_smr_data.R and create_trends_data.R are still correct
  • Push new branch to GitHub
  • Create pull request for another analyst to review changes
  • Once changes have been approved, merge the branch into the master and delete
  • If no more changes are required, pull the updated master branch into the master folder

Running the code

  • In the master folder, open up create_smr_data.R, highlight the entire script and run
  • Check for any errors and investigate if necessary
  • Check the output in data/output looks as it should
  • In the master folder, open up create_trends_data.R, highlight the entire script and run
  • As above, check for any errors and check that the output looks as it should
  • Open create_excel_tables.R, highlight the entire script and run. This script pulls in Excel templates which can be found in the reference_files folder (without data). The output files are saved in the data/output folder.

Once this step is done, the raw data files and Excel tables for the publication have been produced. The final step is knitting the markdown documents, but that can't be done until the completeness figures are available. Once that is done:

  • In the master/markdown folder, open both .Rmd scripts and click "knit"
  • Check output
  • A couple of manual steps are required to finish off the markdown documents (adding cover page, table of contents and formatting tables correctly). They are outlined in the readme in the National Statistics Publication Templates repository.

The raw output files (.csv files with the numbers) from this process also feed into the Tableau dashboard, so once they are ready, they should be moved to the appropriate folder.

The raw output files (csv datafiles, basefiles) all have the publication date in the name, so there is no need to archive them; each time the process is re-run, new files are created. The only files which do get overwritten are the publication document files, but these are copied over to the publication folder as part of the normal publication process and so are already archived.


hsmr's Issues

.gitignore

This can wait until the very end, but should we write a more sophisticated .gitignore that allows the reference files to be tracked but explicitly ignores the output files (rather than just ignoring a 'data' folder)? Maybe using regex?
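.gitignore uses glob-style patterns rather than full regular expressions, but something along these lines should work; the exact paths are assumptions about the repository layout:

# Ignore publication output but keep reference files tracked
data/output/*
data/basefiles/*

# Keep the (empty) folder structure visible in the repository
!data/output/.gitkeep

# Usual R artefacts
.Rhistory
.RData
.Rproj.user/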

Highlight sections that should be checked for manual updates

Need to think of a way to automatically highlight/give a reminder of sections that need to be checked for manual updates, e.g. interpretations of trend analysis. Possibly add information to the documentation which highlights key sections that should be checked?

Add index functionality

HSMR can be calculated monthly, quarterly or annually. There's an index argument on the smr_model() and smr_data() functions that should be used to allow the user to define whether or not they want monthly, quarterly or annual data. For the time being, the argument doesn't do anything and it defaults to quarterly, but we should add this functionality for the future.
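A minimal sketch, assuming nothing about the real function internals, of how the index argument could be validated and then used to pick the aggregation level (the input name smr01 and the grouping variable names are placeholders):

smr_data <- function(smr01, index = c("quarter", "month", "year")) {

  # Default to quarterly, but allow monthly or annual output
  index <- match.arg(index)

  # Hypothetical mapping from the index argument to a grouping variable
  group_var <- switch(index,
                      quarter = "quarter_of_discharge",
                      month   = "month_of_discharge",
                      year    = "year_of_discharge")

  # ... aggregation by group_var would follow here ...
  group_var
}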

Change old_tadm to admission type

This isn't something that needs sorted immediately, but the n_emerg variable (produced in the smr_pmorbs function) uses the old_smr_tadm_code variable to identify emergency admissions. This is because the current hsmr time-period goes back as far as 2010 when admission_type wasn't very well recorded, but when the model changes in August it (hopefully) will only go back as far as 2014/2015 and will jump forward a quarter every time it publishes. So we should switch to the new admission_type variable at some point as we won't need old_tadm anymore.

Mostly just recording this as an issue so we don't forget!

Combine create_smr_data and create_trends_data

There's quite a bit of overlap between these two script files. Other than the population files, they both share the same lookups. Therefore, it makes sense to me to combine the files so that the trends data and SMR data are produced from running the same script. This would also increase the level of automation.

I'm also wondering whether or not we can add the Excel script to this too, although that might be a bit excessive. Worth thinking about though. It is possible to knit an R Markdown document from an external script, so in theory we could run the entire publication, including the Word output, from one script.
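A minimal sketch of that last idea using rmarkdown::render(), with hypothetical .Rmd file names:

library(rmarkdown)

# Hypothetical file names - render each publication document from the one script
render("markdown/hsmr_report.Rmd", output_format = "word_document")
render("markdown/hsmr_technical_document.Rmd", output_format = "word_document")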

Combine pmorbs1 & pmorbs5

The pmorbs1 & pmorbs5 calculations run off separate extracts - pmorbs1 with an extra one year of data at the front and pmorbs5 with an extra five years of data at the front. We should be able to run both from the same extract (the one with five extra years), which may save a bit of time.

Update .gitignore

The .gitignore file should be updated to prevent data from being accidentally pushed to the repository 👍

Fix death30 variable

When death30 is created (lines 130 and 131), any patient who never died is assigned NA. This is because the ifelse() statement checks whether the number of days until death is less than 30, but if the patient did not die then the death date is NA, and R will always return NA if any part of a logical comparison is NA.

The code needs to be altered so that these NAs become 0s.
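A minimal sketch of the kind of fix described above, using a hypothetical days_to_death variable (NA for patients who did not die) rather than the real variable names in the script:

# days_to_death stands in for the variable used on lines 130-131;
# it is NA for patients who never died
days_to_death <- c(5, 45, NA, 12, NA)

# The bare comparison returns NA for those patients, so recode the NAs to 0
death30 <- ifelse(!is.na(days_to_death) & days_to_death < 30, 1, 0)
death30
# [1] 1 0 0 1 0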

Documentation

Need to decide what level to pitch the documentation at, i.e. should it be aimed at:

  • Someone who is totally new to ISD/HSMR
  • Someone familiar with ISD data (e.g. SMR01), but not HSMR
  • Someone familiar with HSMR

Code that may help with comorbidity look-back

Datasets to demonstrate incident cases, plus a Lookback function and a Subsequent event function:

pid  <- c(rep(2, 7), 1)
eid  <- c(seq(1, 7, 1), 1)
year <- c(1981, 1991, 1995, 1996, 2000, 2005, 2012, 2013)
dat  <- data.frame(pid, eid, year)

Lookback <- function(data_frame, event_lbl, lb_years = 5, start_cohort = 1981){

  # Does lookback based on a dataframe of relevant episodes
  # As the same person can have one or more incident events, the
  # dataframe also contains an incident event code
  # Need to supply lookback period and start of cohort

  # data.table must be loaded
  require(data.table)

  # dataframe must have pid, eid and year variables
  if(!all(c("pid", "eid", "year") %in% names(data_frame))) return(NA)

  # If it isn't already a data table, make it a data table
  if(!is.data.table(data_frame)) data_frame <- data.table(data_frame)

  # If it does not already have a key for the relevant variables, create a key
  if(!all(c("pid", "year") %in% key(data_frame))) setkey(data_frame, pid, year)

  # Create lagged time variable within patient to perform lookback
  data_frame[, year_lag := shift(year, 1), by = pid]

  # Calculate difference in years for events within patient
  data_frame[, time_diff := year - year_lag, ]

  # Identify incident cases as ones with a sufficient lookback period and after
  # the cohort start date
  data_frame[, index := ifelse(year >= start_cohort &
                                 (is.na(time_diff) | time_diff >= lb_years),
                               1,
                               0), ]

  # Filter incident cases
  return_df <- data_frame[index == 1, ]

  # As this is a true incident event, set event_seq to 0
  return_df$event_seq <- 0

  # Set event type as this is an incident event
  return_df$event <- event_lbl

  # Create an incident event ID
  return_df$inc_id <- seq_along(return_df$event)

  # Return selected variables
  return_df[, c("pid", "eid", "inc_id", "year", "event", "event_seq")]

}

Lookback(dat, "mi", lb_years = 5)

incs <- Lookback(dat, "mi", lb_years = 5)

# Create events table
events <- data.frame(pid  = 1:2,
                     year = c(1991, 1991, 1991, 1992, 1997, 1999, 2000, 2000,
                              2001, 2002, 2002, 2002, 2003, 2003, 2004, 2005,
                              2006, 2007, 2007, 2009, 2015, 2015, 2015, 2015),
                     eid = 1,
                     event = c("mi", "stroke", "hf", "death"),
                     stringsAsFactors = FALSE)

Subsequent <- function(incident_df, event_df, fu_time = Inf){

  # Takes an incident dataframe and an events dataframe and returns a
  # dataframe with the first recurrent event within the follow-up time period
  # Note the output table is in the SAME format as the output from Lookback

  require(dplyr)

  # Drop unnecessary variables from the incident table and rename year to mark
  # the incident year
  incident_df <- incident_df %>%
    rename(year_inc = year) %>%
    select(-eid, -event)

  # Select first subsequent event and label event sequence
  event_df <- event_df %>%
    inner_join(incident_df, by = "pid") %>%
    filter(year_inc < year, (year_inc + fu_time) >= year) %>%
    arrange(inc_id, year) %>%
    group_by(inc_id) %>%
    slice(1) %>%
    ungroup() %>%
    mutate(event_seq = event_seq + 1) %>%
    select(pid, eid, inc_id, year, event, event_seq)

  # Create label for recurrent event
  event_df
}

events1 <- Subsequent(incs, events)

# Note: can re-run an old events table with a new events table to get more
# subsequent events
events2 <- Subsequent(events1, events)

# Can then bind all together in one long dataframe
cmbn <- bind_rows(incs, events1, events2) %>%
  arrange(inc_id, event_seq, year, event)

Replace .sav lookup files with .rds lookups

Adding everything now so I don't forget! Tina's team has produced .rds versions of all the lookup files, so we should use them instead of the .sav versions. I think if we do this, we can remove haven from the setup file/list of package imports unless it is needed for something else, but I don't think it is.
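A minimal sketch of the swap, with a hypothetical lookup path:

# Before: reading the SPSS version with haven
# lookup <- haven::read_sav("path/to/example_lookup.sav")

# After: reading the .rds version with base R, so haven is no longer needed
lookup <- readRDS("path/to/example_lookup.rds")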

Model Checking

Don't think there'll be time before the next publication, but a new function smr_checking (or possibly multiple functions if need be) should be written in future to do assumption checking of the model output produced in smr_model.
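As a placeholder for discussion, a generic sketch of one kind of check (calibration of predicted against observed outcomes), run on a toy glm() rather than the real smr_model output; none of this is the agreed approach:

# Toy logistic model on built-in data, standing in for the smr_model output
toy_model <- glm(am ~ mpg, data = mtcars, family = binomial)

# Compare observed and predicted outcomes within groups of predicted risk
pred   <- predict(toy_model, type = "response")
groups <- cut(pred, breaks = quantile(pred, probs = seq(0, 1, 0.25)),
              include.lowest = TRUE)

aggregate(cbind(observed = mtcars$am, predicted = pred),
          by = list(risk_group = groups), FUN = mean)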

Run times of scripts

Hi guys - I've been working on an IR and I wanted to see how long it took to run (because I already knew it would be super fast!!) and remembered timing R scripts at uni using 'Sys.time()' to take start and end times and then subtracting the two - turns out my IR took 3.40711 mins to run! Thought it might be good to add something like this to the scripts, even if it's just for the next run, to get a more accurate run time.
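A minimal sketch of that approach wrapped around one of the publication scripts (the source() call is just for illustration):

start_time <- Sys.time()

# Run the script being timed
source("create_smr_data.R")

end_time <- Sys.time()

# The difference between the two timestamps is the run time
end_time - start_time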

Move funnel plot wrangling into smr_data

I didn't want to do this initially because I thought the funnel limits were overkill for the minimal tidy dataset, but the funnel limits are needed for pretty much everything the smr_data output is used for, so I suppose it makes sense to add them to this. Especially since the funnel_text function relies on the data file having the limits, which is a bit careless since the limits are created outwith functions. Alternatively, we could write a separate function, but I don't think that's necessary and it would be more straightforward/sensible to include them in smr_data.

Having said that, I don't think we need to keep all the variables generated as part of the funnel limit calculations. I think we should be fine to drop everything except the limits, so no need to save the standard error, Z scores etc in any of our excel files. This means we'll have to make small adjustments to the Excel template for table 1, but that's ok as we were planning on working on that anyway. May also need to trim down the error handling of funnel_text as it is looking for all the variables when it should only be looking for the few we decide to keep.
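For reference, a generic sketch of one common way funnel limits are calculated for standardised ratios (a normal approximation around 1 with standard error 1/sqrt(expected deaths)); this is illustrative only and not necessarily the calculation used in the existing wrangling:

# Expected deaths per hospital - illustrative numbers only
expected <- c(50, 120, 300, 800)

# Approximate 95% and 99.8% control limits around a standardised ratio of 1
se <- 1 / sqrt(expected)

funnel_limits <- data.frame(expected,
                            lower_95  = 1 - 1.96 * se,
                            upper_95  = 1 + 1.96 * se,
                            lower_998 = 1 - 3.09 * se,
                            upper_998 = 1 + 3.09 * se)
funnel_limits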

Expand funnel_text()

Currently the funnel_text function only provides text for the main points sections of the reports, but it should also provide some text for the paragraph above the funnel plot to say whether or not there have been any outliers.
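A minimal sketch of the kind of sentence the expanded function could return, assuming (hypothetically) columns named smr and upper_998 in the data it receives:

# Hypothetical data: one row per hospital with its SMR and upper control limit
smr_dat <- data.frame(smr = c(0.9, 1.1, 1.3), upper_998 = c(1.2, 1.2, 1.2))

n_outliers <- sum(smr_dat$smr > smr_dat$upper_998, na.rm = TRUE)

# Sentence for the paragraph above the funnel plot
if (n_outliers == 0) {
  "No hospitals were above the upper control limit in the latest period."
} else {
  paste0(n_outliers, " hospital(s) were above the upper control limit in the latest period.")
}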

Unit tests for smr_* functions

This is maybe a pipe dream given the output they produce and the way they interact with one another, but, if possible, it would be nice to find a way of constructing unit tests for the smr_* functions.
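If it does prove possible, a minimal sketch of what one testthat test could look like, using a tiny hand-made input; the column names and the call to smr_data() are assumptions rather than the real interface:

library(testthat)

test_that("smr_data returns one row per hospital with an smr column", {

  # Tiny hand-made input standing in for the real model output
  toy_input <- data.frame(location = c("A", "A", "B"),
                          deaths   = c(1, 0, 1),
                          pred     = c(0.4, 0.3, 0.6))

  result <- smr_data(toy_input)

  expect_true("smr" %in% names(result))
  expect_equal(nrow(result), length(unique(toy_input$location)))
})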

Automatic platform detection

In the setup environment code perhaps you could have the following code to make the platform detection automatic:

# This covers both the old server and the pro one
if (sessionInfo()$platform %in% c("x86_64-redhat-linux-gnu (64-bit)",
                                  "x86_64-pc-linux-gnu (64-bit)")) {
  platform <- c("server")
} else {
  platform <- c("locally")
}

Just in case it's useful
