slfhelper's Introduction

slfhelper

The goal of slfhelper is to provide some easy-to-use functions that make working with the Source Linkage Files as painless and efficient as possible. It is only intended for use by PHS employees and will only work on the PHS R infrastructure.

Installation

The simplest way to install on the PHS Posit Workbench environment is to use the PHS Package Manager. This is the default repository setting, so you can install slfhelper as you would any other package.

install.packages("slfhelper")

If this doesn’t work, you can install it directly from GitHub. There are a number of ways to do this; we recommend the {pak} package.

# Install pak (if needed)
install.packages("pak")

# Use pak to install slfhelper
pak::pak("Public-Health-Scotland/slfhelper")

Usage

Read a file

Note: Reading a full file is quite slow and will use a lot of memory, so we would always recommend doing a column selection to keep only the variables that you need for your analysis. This alone will dramatically speed up the read time.

We provide some data snippets to help with column selection and filtering.

library(slfhelper)

# Get a list of the variables in a file
ep_file_vars
indiv_file_vars

# See a lookup of Partnership names to HSCP_2018 codes
View(partnerships)

# See a list with descriptions for the recids
View(recids)

# See a list of Long term conditions
View(ltc_vars)

# See a list of bedday related variables
View(ep_file_bedday_vars)

# See a list of cost related variables
View(ep_file_cost_vars)
library(slfhelper)

# Read a group of variables e.g. LTCs (arth, asthma, atrialfib etc)
# A nice 'catch all' for reading in all of the LTC variables
ep_1718 <- read_slf_episode("1718", col_select = c("anon_chi", ltc_vars))

# Read in a group of variables e.g. bedday related variables (yearstay, stay, apr_beddays etc)
# A 'catch all' for reading in bedday related variables
ep_1819 <- read_slf_episode("1819", col_select = c("anon_chi", ep_file_bedday_vars))

# Read in a group of variables e.g. cost related variables (cost_total_net, apr_cost)
# A 'catch all' for reading in cost related variables
ep_1920 <- read_slf_episode("1920", col_select = c("anon_chi", ep_file_cost_vars))
library(slfhelper)

# Read certain variables
# It's much faster to choose variables like this
indiv_1718 <- read_slf_individual(year = "1718", col_select = c("anon_chi", "hri_scot"))

# Read multiple years
# This will use dplyr::bind_rows() and return the files added together as a single tibble
episode_data <- read_slf_episode(
  year = c("1516", "1617", "1718", "1819"),
  col_select = c("anon_chi", "yearstay")
)

# Read only data for a certain partnership (HSCP_2018 code)
# This can be a single partnership or multiple by supplying a vector e.g. c(...)
indiv_1718 <- read_slf_individual(
  year = "1718",
  partnerships = "S37000001", # Aberdeen City
  col_select = c("anon_chi", "hri_scot")
)

# Read only data for a certain recid
# This can be a single recid or multiple by supplying a vector e.g. c(...)
ep_1718 <- read_slf_episode("1718", recid = c("01B", "GLS"), col_select = c("anon_chi", "yearstay"))

The above options for reading files can (and should) be combined if required.
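For instance, a single call can combine a multi-year read, a recid filter, a partnership filter and a column selection (a sketch using the arguments shown above; the values chosen are illustrative, and the call only works on the PHS R infrastructure):

```r
library(slfhelper)

# Combine several of the options above in one call:
# two years, acute/geriatric long stay records only, one partnership,
# and a minimal column selection
ep_subset <- read_slf_episode(
  year = c("1718", "1819"),
  recid = c("01B", "GLS"),
  partnerships = "S37000001", # Aberdeen City
  col_select = c("anon_chi", "yearstay", "cost_total_net")
)
```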

Match on CHI numbers to Anon_CHI (or vice versa)

library(slfhelper)

# Add real CHI numbers to a SLF
ep_1718 <- read_slf_episode(c("1718", "1819", "1920"),
  col_select = c("year", "anon_chi", "demographic_cohort")
) %>%
  get_chi()

# Change chi numbers from the data above back to anon_chi
ep_1718_anon <- ep_1718 %>%
  get_anon_chi(chi_var = "chi")

# Add anon_chi to the cohort sample
chi_cohort <- chi_cohort %>%
  get_anon_chi(chi_var = "upi_number")

slfhelper's People

Contributors

dependabot[bot], github-actions[bot], jennit07, lizihao-anu, moohan, shintolamp, swiftysalmon


slfhelper's Issues

Select Partnership

Implement code so that you can select a specific partnership and only return their data.

Adding years

It would be nice if you could specify multiple years and these are returned added together using dplyr::bind_rows

Simplify selection code

Use purrr better by using `list_of_years %>% map(function, common_args)` rather than `complicated_list %>% pmap(...)`
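The simplification can be illustrated with a toy example (the `read_one_year` function here is a hypothetical stand-in for slfhelper's internal reader, not its actual code):

```r
library(purrr)
library(dplyr)

# Hypothetical single-year reader standing in for the real one;
# the column selection is an argument common to every year
read_one_year <- function(year, cols) {
  tibble::tibble(year = year, n_cols = length(cols))
}

years <- c("1718", "1819", "1920")

# Rather than building a complicated list of argument combinations
# and pmap()-ing over it, map over the years and pass the shared
# arguments once
all_years <- years %>%
  map(read_one_year, cols = c("anon_chi", "yearstay")) %>%
  bind_rows()
```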

Include and expose value and variable labels

Have value labels stored somehow, then have functions for people to use this easily. One idea is to store them as JSON, with get functions:

vars <- jsonlite::toJSON(list(
  gender = list(
    description = "Patient's gender",
    values = list("1" = "Female", "2" = "Male", "9" = "Unknown"),
    type = "integer"
  )
))

get_values <- function(variable) {
  json <- jsonlite::fromJSON(vars)
  
  return(json[[variable]][["values"]])
}

get_values("gender")

Could also have functions that would 'swap out' values for labels (to be used for plots etc.), e.g. `mutate(gender = slf_factor_labels(gender))`
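One possible shape for such a helper, using base R only (note that `slf_factor_labels` does not exist in slfhelper; this is a hypothetical sketch of the idea):

```r
# Hypothetical helper: swap coded values for their labels, returning a
# factor suitable for plots and tables
slf_factor_labels <- function(x, values) {
  factor(as.character(x), levels = names(values), labels = unlist(values))
}

gender_values <- list("1" = "Female", "2" = "Male", "9" = "Unknown")

# e.g. inside dplyr::mutate(gender = slf_factor_labels(gender, gender_values))
slf_factor_labels(c("1", "2", "9"), gender_values)
```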

Update to new variables

In the latest update, social care variables were added and generally rearranged. Variable list files need updating.

Force returning some variables in certain cases

If we're returning more than one year's worth of data, we should always return the year variable, even if it wasn't specified in the columns argument.

Other cases would be when the recid or partnership etc. arguments are used: if we are returning more than one value, always return the filtering variable. The current code will extract recid etc. if it's needed for filtering, but then not return it if it wasn't specifically asked for in the columns parameter.

More documentation and examples

Hey, it would be great to have some more documentation in the form of examples (maybe as an article/vignette). I think col_select is emphasised enough on the main README, but I still see people writing code that doesn't use it... The various features (column selection, recid and partnership filtering) may get lost on the main README.

Now that the files are using arrow/parquet, the really powerful feature is to read them as a 'dataset' (using as_dataframe = FALSE), i.e. Working with Arrow Datasets and dplyr. It would be great to have some examples showing that, as it allows people to efficiently do filtering that isn't built in, e.g. filtering to all_population etc.
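The dataset workflow might look something like this (a sketch assuming the as_dataframe = FALSE option mentioned above, with an illustrative filter; it only runs on the PHS infrastructure where the files live):

```r
library(slfhelper)
library(dplyr)

# Open the episode file as an Arrow Dataset instead of reading it all
# into memory
ep_dataset <- read_slf_episode("1920", as_dataframe = FALSE)

# Filters and selections are pushed down to Arrow, so only matching
# rows and columns are read from disk; collect() materialises a tibble
long_stays <- ep_dataset %>%
  filter(yearstay > 0) %>% # illustrative filter, not built into slfhelper
  select(anon_chi, yearstay) %>%
  collect()
```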

I'd be happy to work with someone on this and I'd also look to get some more 'full' examples (i.e. doing some analysis as well, slightly beyond the scope of slfhelper) on the LIST-examples repo where we have 'best-practice' code for common LIST analyses.

Create variable groups

e.g. cost_vars() which would include all cost related variables.

The idea is to make using column selection easier.

Look into chunking on read file

Using fst::metadata_fst to get the file size, then read in chunks. Read and filter each one sequentially to reduce overall memory usage (probably slower than currently, though). Alternatively, read the chunks in parallel to improve read speed?
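A sequential version could be sketched like this (assuming the {fst} package's metadata_fst and read_fst from/to interface; the chunk size and the filtering function are illustrative, and the helper name is made up for this issue):

```r
library(fst)

# Sketch: read a large .fst file in chunks, filtering each chunk as it
# is read so that peak memory stays roughly one chunk's worth
read_fst_chunked <- function(path, chunk_size = 1e6, keep = identity) {
  n_rows <- fst::metadata_fst(path)$nrOfRows
  starts <- seq(1, n_rows, by = chunk_size)

  chunks <- lapply(starts, function(from) {
    to <- min(from + chunk_size - 1, n_rows)
    keep(fst::read_fst(path, from = from, to = to))
  })

  do.call(rbind, chunks)
}
```

Parallelising the lapply (e.g. with a parallel map) would trade the memory saving back for speed.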

Add filter for missing CHI

Could be done like the other filters, i.e. filter(!is.na(anon_chi)), but since all the missing CHIs appear at the start of the file it might be better to have an index of row numbers (which would need to be kept up to date - this could be checked with tests). Then the index could be used to do read_fst(from = <first_row_with_non_missing_chi>)

Add check to year(s)

Provide error message, if a year isn't valid.
Older files won't necessarily work with new files as some variables are missing.
