slfhelper's Introduction

slfhelper

The goal of slfhelper is to provide some easy-to-use functions that make working with the Source Linkage Files as painless and efficient as possible. It is only intended for use by PHS employees and will only work on the PHS R infrastructure.

Installation

The simplest way to install on the PHS Posit Workbench environment is to use the PHS Package Manager. This is the default repository setting, so you can install slfhelper as you would any other package.

install.packages("slfhelper")

If this doesn’t work, you can install it directly from GitHub. There are a number of ways to do this; we recommend the {pak} package.

# Install pak (if needed)
install.packages("pak")

# Use pak to install slfhelper
pak::pak("Public-Health-Scotland/slfhelper")

Usage

Read a file

Note: Reading a full file is quite slow and will use a lot of memory, so we would always recommend doing a column selection to keep only the variables that you need for your analysis. This alone will dramatically speed up the read time.

We provide some data snippets to help with column selection and filtering.

library(slfhelper)

# Get a list of the variables in a file
ep_file_vars
indiv_file_vars

# See a lookup of Partnership names to HSCP_2018 codes
View(partnerships)

# See a list with descriptions for the recids
View(recids)

# See a list of Long term conditions
View(ltc_vars)

# See a list of bedday related variables
View(ep_file_bedday_vars)

# See a list of cost related variables
View(ep_file_cost_vars)
library(slfhelper)

# Read a group of variables e.g. LTCs (arth, asthma, atrialfib etc)
# A nice 'catch all' for reading in all of the LTC variables
ep_1718 <- read_slf_episode("1718", col_select = c("anon_chi", ltc_vars))

# Read in a group of variables e.g. bedday related variables (yearstay, stay, apr_beddays etc)
# A 'catch all' for reading in bedday related variables
ep_1819 <- read_slf_episode("1819", col_select = c("anon_chi", ep_file_bedday_vars))

# Read in a group of variables e.g. cost related variables (cost_total_net, apr_cost)
# A 'catch all' for reading in cost related variables
ep_1920 <- read_slf_episode("1920", col_select = c("anon_chi", ep_file_cost_vars))
library(slfhelper)

# Read certain variables
# It's much faster to choose variables like this
indiv_1718 <- read_slf_individual(year = "1718", col_select = c("anon_chi", "hri_scot"))

# Read multiple years
# This will use dplyr::bind_rows() and return the files added together as a single tibble
episode_data <- read_slf_episode(
  year = c("1516", "1617", "1718", "1819"),
  col_select = c("anon_chi", "yearstay")
)

# Read only data for a certain partnership (HSCP_2018 code)
# This can be a single partnership or multiple by supplying a vector e.g. c(...)
indiv_1718 <- read_slf_individual(
  year = "1718",
  partnerships = "S37000001", # Aberdeen City
  col_select = c("anon_chi", "hri_scot")
)

# Read only data for a certain recid
# This can be a single recid or multiple by supplying a vector e.g. c(...)
ep_1718 <- read_slf_episode("1718", recid = c("01B", "GLS"), col_select = c("anon_chi", "yearstay"))

The above options for reading files can (and should) be combined if required.
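For instance, a single call can combine a multi-year read, a recid filter, a partnership filter and a column selection (a sketch using the arguments shown above; the values chosen are illustrative, and the call only works on the PHS R infrastructure):

```r
library(slfhelper)

# Combine several of the options above in one call:
# two years, acute/geriatric long stay records only, one partnership,
# and a minimal column selection
ep_subset <- read_slf_episode(
  year = c("1718", "1819"),
  recid = c("01B", "GLS"),
  partnerships = "S37000001", # Aberdeen City
  col_select = c("anon_chi", "yearstay", "cost_total_net")
)
```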

Match on CHI numbers to Anon_CHI (or vice versa)

library(slfhelper)

# Add real CHI numbers to a SLF
ep_1718 <- read_slf_episode(c("1718", "1819", "1920"),
  col_select = c("year", "anon_chi", "demographic_cohort")
) %>%
  get_chi()

# Change chi numbers from the data above back to anon_chi
ep_1718_anon <- ep_1718 %>%
  get_anon_chi(chi_var = "chi")

# Add anon_chi to the cohort sample
chi_cohort <- chi_cohort %>%
  get_anon_chi(chi_var = "upi_number")

slfhelper's People

Contributors

dependabot[bot], github-actions[bot], jennit07, lizihao-anu, moohan, shintolamp, swiftysalmon


slfhelper's Issues

Select Partnership

Implement code so that you can select a specific partnership and only return their data.

Adding years

It would be nice if you could specify multiple years and these are returned added together using dplyr::bind_rows

Simplify selection code

Use purrr better by using `list_of_years %>% map(function, common_args)` rather than `complicated_list %>% pmap(...)`
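The simplification can be illustrated with a toy example (the `read_one_year` function here is a hypothetical stand-in for slfhelper's internal reader, not its actual code):

```r
library(purrr)
library(dplyr)

# Hypothetical single-year reader standing in for the real one;
# the column selection is an argument common to every year
read_one_year <- function(year, cols) {
  tibble::tibble(year = year, n_cols = length(cols))
}

years <- c("1718", "1819", "1920")

# Rather than building a complicated list of argument combinations
# and pmap()-ing over it, map over the years and pass the shared
# arguments once
all_years <- years %>%
  map(read_one_year, cols = c("anon_chi", "yearstay")) %>%
  bind_rows()
```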

Include and expose value and variable labels

Have value labels stored somehow, then have functions for people to use this easily. One idea is to store them as JSON, with get functions:

vars <- jsonlite::toJSON(list(
  gender = list(
    description = "Patient's gender",
    values = list("1" = "Female", "2" = "Male", "9" = "Unknown"),
    type = "integer"
  )
))

get_values <- function(variable) {
  json <- jsonlite::fromJSON(vars)
  
  return(json[[variable]][["values"]])
}

get_values("gender")

Could also have functions that would 'swap out' values for labels (to be used for plots etc.), e.g. `mutate(gender = slf_factor_labels(gender))`
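One possible shape for such a helper, using base R only (note that `slf_factor_labels` does not exist in slfhelper; this is a hypothetical sketch of the idea):

```r
# Hypothetical helper: swap coded values for their labels, returning a
# factor suitable for plots and tables
slf_factor_labels <- function(x, values) {
  factor(as.character(x), levels = names(values), labels = unlist(values))
}

gender_values <- list("1" = "Female", "2" = "Male", "9" = "Unknown")

# e.g. inside dplyr::mutate(gender = slf_factor_labels(gender, gender_values))
slf_factor_labels(c("1", "2", "9"), gender_values)
```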

Update to new variables

In the latest update, social care variables were added and generally rearranged. Variable list files need updating.

Force returning some variables in certain cases

If we're returning more than one year's worth of data, we should always return the year variable, even if it wasn't specified in the columns argument.

Other cases would be when the recid or partnership etc. arguments are used: if we are returning more than one value, always return the filtering variable. The current code will extract recid etc. if it's needed for filtering, but then not return it if it wasn't specifically asked for in the columns parameter.

More documentation and examples

Hey, it would be great to have some more documentation in the form of examples (maybe as an article/vignette). I think col_select is emphasised enough on the main README, but I still see people writing code that doesn't use it... The various features (column selection, recid and partnership filtering) may get lost on the main README.

Now that the files are using arrow/parquet, the really powerful feature is to read them as a 'dataset' (using as_dataframe = FALSE), i.e. Working with Arrow Datasets and dplyr. It would be great to have some examples showing that, as it allows people to efficiently do filtering that isn't built in, e.g. filtering to all_population etc.
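The dataset workflow might look something like this (a sketch assuming the as_dataframe = FALSE option mentioned above, with an illustrative filter; it only runs on the PHS infrastructure where the files live):

```r
library(slfhelper)
library(dplyr)

# Open the episode file as an Arrow Dataset instead of reading it all
# into memory
ep_dataset <- read_slf_episode("1920", as_dataframe = FALSE)

# Filters and selections are pushed down to Arrow, so only matching
# rows and columns are read from disk; collect() materialises a tibble
long_stays <- ep_dataset %>%
  filter(yearstay > 0) %>% # illustrative filter, not built into slfhelper
  select(anon_chi, yearstay) %>%
  collect()
```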

I'd be happy to work with someone on this and I'd also look to get some more 'full' examples (i.e. doing some analysis as well, slightly beyond the scope of slfhelper) on the LIST-examples repo where we have 'best-practice' code for common LIST analyses.

Create variable groups

e.g. cost_vars() which would include all cost related variables.

The idea is to make using column selection easier.

Look into chunking on read file

Using fst::metadata_fst to get the file size, then read in chunks. Read and filter each one sequentially to reduce overall memory usage (probably slower than currently, though). Alternatively, read the chunks in parallel to improve read speed?
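A sequential version could be sketched like this (assuming the {fst} package's metadata_fst and read_fst from/to interface; the chunk size and the filtering function are illustrative, and the helper name is made up for this issue):

```r
library(fst)

# Sketch: read a large .fst file in chunks, filtering each chunk as it
# is read so that peak memory stays roughly one chunk's worth
read_fst_chunked <- function(path, chunk_size = 1e6, keep = identity) {
  n_rows <- fst::metadata_fst(path)$nrOfRows
  starts <- seq(1, n_rows, by = chunk_size)

  chunks <- lapply(starts, function(from) {
    to <- min(from + chunk_size - 1, n_rows)
    keep(fst::read_fst(path, from = from, to = to))
  })

  do.call(rbind, chunks)
}
```

Parallelising the lapply (e.g. with a parallel map) would trade the memory saving back for speed.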

Add filter for missing CHI

Could be done like the other filters, i.e. filter(!is.na(anon_chi)), but since all the missing CHIs appear at the start of the file it might be better to have an index of row numbers (which would need to be kept up to date - this could be checked with tests). Then the index could be used to do read_fst(from = <first_row_with_non_missing_chi>)

Add check to year(s)

Provide error message, if a year isn't valid.
Older files won't necessarily work with new files as some variables are missing.
