ropensci-archive / cleanehr

:warning: ARCHIVED :warning: Essential tools and utility functions to facilitate the data processing pipeline, data cleaning, and data analysis of clinical data from CC-HIC

License: GNU General Public License v3.0

Makefile 0.64% R 92.52% C++ 2.63% Shell 0.77% TeX 3.44%
electronic-health-record healthcare big-data critical-care intensive-care r rstats r-package

cleanehr's Introduction

Project Status: Unsupported Peer-review badge

This package has been archived. The former README is now in README-not.

cleanehr's People

Contributors

arfon, docsteveharris, dpshelio, jeroen, katrinleinweber, maelle, sinanshi, spiros

cleanehr's Issues

time data not in pair

Currently, episodes whose time data cannot be paired are removed entirely. That is why there are many NULL patients in the final ccdata object. This is mainly caused by missing data at a given time stamp. The opposite case (a value with no time stamp) may also need handling. Solving this may require a large overhaul of the parser.

  • have time, no value: remove time
  • have value, no time: remove the data field.
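
The two rules above could be sketched as follows. This is a minimal Python illustration rather than the package's R parser: the `clean_field` name and the `(time, value)` tuple layout are hypothetical, and reading "remove the data field" as dropping the whole field for that episode is an assumption.

```python
def clean_field(records):
    """Apply the two pairing rules to one field of one episode.

    records: list of (time, value) tuples, with None marking a missing side.
    Returns the cleaned list of pairs, or None when the field is dropped.
    """
    # Rule "have value, no time" (assumed reading): drop the whole field.
    if any(t is None and v is not None for t, v in records):
        return None
    # Rule "have time, no value": discard the orphan time stamp.
    return [(t, v) for t, v in records if t is not None and v is not None]


heart_rate = [(0, 80), (1, None), (2, 85)]
cleaned = clean_field(heart_rate)
```

With this approach the episode itself survives; only the unpaired entries or the affected field are removed.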

tasks

  • make readItems faster
  • Steve: retrieve data by readItems()
  • Sinan: give Steve the latest version of ccdata
  • SN: delta time.
  • collapse items over time by given frequency, default shall be median.
  • propagate time series. (fill the missing data)
  • add difftime units to ccdata - hours, mins, ...
  • adapt reallocateTimeRecord() to given units.
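
The "collapse" and "propagate" tasks above might look roughly like this. It is a Python sketch under stated assumptions, not the ccdata implementation: the function names, the bucket-index keys and the last-observation-carried-forward fill are all illustrative choices.

```python
import math
from statistics import median


def collapse(times, values, freq=1.0, func=median):
    """Group values into buckets of width `freq` and aggregate each bucket.

    Returns {bucket_index: aggregated_value}; bucket k covers
    [k * freq, (k + 1) * freq). The default aggregate is the median.
    """
    buckets = {}
    for t, v in zip(times, values):
        buckets.setdefault(math.floor(t / freq), []).append(v)
    return {k: func(vs) for k, vs in sorted(buckets.items())}


def propagate(collapsed):
    """Fill empty buckets with the last observed value (assumed fill rule)."""
    if not collapsed:
        return {}
    filled, last = {}, None
    for k in range(min(collapsed), max(collapsed) + 1):
        last = collapsed.get(k, last)
        filled[k] = last
    return filled


hourly = collapse([0.2, 0.7, 1.5, 4.1], [10, 20, 30, 50], freq=1.0)
filled = propagate(hourly)
```

Keying buckets by integer index avoids floating-point drift when `freq` is not a whole number; the time in the chosen unit is `index * freq`.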

drop_field

Now that we have drop_episode and drop_entry, we need drop_field, which removes a single field from one episode.

better display of filter result

  • print out how many entries have been removed when calling filter functions.
  • store the missingness information table in the S5 class instead of computing it every time.
  • entry, hospital, information in show()

Meropenem dose

Some meropenem doses are 1000 and some are 1. Do you think it is just a difference in units?

Reconfirm the files

@nsmaccallum @docsteveharris maybe one of you could let me know if the following files on IDSH are the files that we need:
In total: 14007 episodes.

Cambridge:
NIRH_CC_1_CUH_07042016.xml

GSTT:
NIHRCC_1.5_GSTT_31032016.xml

Imperial:
NIHRCC_8.3.2_042014-082014.xml
NIHRCC_8.3.2_092014-122014.xml
NIHRCC_8.3.2_012015-042015.xml
NIHRCC_8.3.2_052015-082015.xml
NIHRCC_8.3.2_092015-102015.xml
NIHRCC_8.3.2_112015-122015.xml

Oxford:
NIHRCC_8.3.2_Oxford 02042016.xml

UCLH:
20160217a_ReId_cc.xml

typo

admin_icu_time -> adm_icu_time

Data quality report

  • Data missingness report #36
  • First and last date of admission per site ucl-hic/paper-brc#23
  • episode id missingness. #62
  • report structural quality (site, episode, date of admission and so on) ucl-hic/paper-brc#43

SOFA score development

  • xml2ccdata
    • optimise searching algorithm
    • unit test
  • SOFA score
  • deliver a graph that compares mortality density and score density.

PAS number becomes only the first PAS number of an episode.

Since a patient's PAS number can differ between episodes, the data.table ccd@pas_number holds only the PAS number of the first episode. This may lead to confusion, and we should change it in due course.

To change this, see addEpisodeToPatient().

Data missingness report

  • episode missingness - consecutive (David)
  • 1d
    hospital, item, np, ne. per year (take episode id)
  • 2d - mean of episodes and mean of total
  • How long after admission is the first value recorded? Create a table with the columns (episode_id, item_id, site_id, delta_time).
  • provide missingness frequency (Steve)
  • Data quality check for drugs.
  • count number of doses per drug.
  • number of doses per day, for days with >= 1 dose. We need to know how many times the patient takes the drug and the total dose per day.
  • record the interval between doses. Find the gaps (say 24h) in drug intake. Create a table that contains the number of doses/interval/day, the interval time and the gap time.
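
The drug checks in the list above could be prototyped along these lines. This is a hedged Python sketch: the function names, the `(datetime, dose)` input layout and the 24-hour gap threshold taken from the "say 24h" remark are assumptions, not part of the package.

```python
from collections import defaultdict
from datetime import datetime


def daily_dose_summary(administrations):
    """Return {date: (number_of_doses, total_dose)} for days with >= 1 dose."""
    per_day = defaultdict(lambda: [0, 0.0])
    for when, dose in administrations:
        per_day[when.date()][0] += 1
        per_day[when.date()][1] += dose
    return {day: tuple(v) for day, v in sorted(per_day.items())}


def dose_intervals_hours(administrations):
    """Hours between consecutive administrations, in time order."""
    times = sorted(when for when, _ in administrations)
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]


def gaps(intervals, threshold_hours=24):
    """Intervals long enough to count as a gap in drug intake."""
    return [i for i in intervals if i >= threshold_hours]


doses = [
    (datetime(2016, 1, 1, 8), 1000),
    (datetime(2016, 1, 1, 16), 1000),
    (datetime(2016, 1, 3, 8), 1000),
]
summary = daily_dose_summary(doses)
intervals = dose_intervals_hours(doses)
```

Days with no dose simply do not appear in the summary, which matches the ">= 1 dose" condition.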

SQL approach

  • Inject the wide table into the database (in fact it's wide and long) (call it longitudinal table)
  • Inject non-longitudinal table.
  • Episode selection table.
  • define how rules apply to the wide table in the DB. Rules can discard or modify values and episodes.
  • Rules:
    - remove episode when the missingness of a variable is too high.
    - impute data in a given window, with a given function.
  • A table in the database that stores each step's description and the rows and columns it modified.
  • API: to pull out the vectors.
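
One way to picture the plan above is a small in-memory SQLite prototype in Python. The table names (`longitudinal`, `step_log`), the column names and the `drop_high_missingness` helper are all hypothetical; only the idea of a long table, a missingness rule and a step log comes from the list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE longitudinal (episode_id INTEGER, item TEXT, time REAL, value REAL);
    CREATE TABLE step_log (
        step INTEGER PRIMARY KEY AUTOINCREMENT,
        description TEXT,
        rows_modified INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO longitudinal VALUES (?, ?, ?, ?)",
    [
        (1, "hr", 0, 80), (1, "hr", 1, None), (1, "hr", 2, 82),
        (2, "hr", 0, None), (2, "hr", 1, None), (2, "hr", 2, 90),
    ],
)


def drop_high_missingness(conn, item, max_missing):
    """Remove episodes whose share of NULL values for `item` exceeds max_missing."""
    cur = conn.execute(
        """
        DELETE FROM longitudinal
        WHERE episode_id IN (
            SELECT episode_id FROM longitudinal
            WHERE item = ?
            GROUP BY episode_id
            HAVING AVG(value IS NULL) > ?
        )
        """,
        (item, max_missing),
    )
    # Record what the step did, as the plan asks for a per-step log table.
    conn.execute(
        "INSERT INTO step_log (description, rows_modified) VALUES (?, ?)",
        (f"drop episodes with more than {max_missing:.0%} missing {item}", cur.rowcount),
    )
    return cur.rowcount


removed = drop_high_missingness(conn, "hr", 0.5)
remaining = [r[0] for r in conn.execute("SELECT DISTINCT episode_id FROM longitudinal")]
```

Here episode 2 has two of three values missing, so it is removed and one row is written to the step log.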

remove missing admission time episodes when calculating deltaTime

In IDSH we have around 300 such episodes. Without exception, these episodes have no episode id, no site, and none of the other demographic data. All we know is that they come exclusively from two files (GSTT and Oxford).
I'm going to remove NULL episodes when doing deltaTime. Is this OK for you @docsteveharris ?
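
The proposed behaviour could be sketched like this. It is a Python illustration, not the package's deltaTime code: the dict layout and function names are hypothetical, and only the `adm_icu_time` field name is taken from the issues on this page.

```python
from datetime import datetime


def delta_time_hours(episode):
    """Convert absolute time stamps to hours since ICU admission.

    Returns None when the admission time is missing, so the caller can drop
    the episode instead of failing.
    """
    admission = episode.get("adm_icu_time")
    if admission is None:
        return None
    return [((t - admission).total_seconds() / 3600, v) for t, v in episode["data"]]


def to_delta_time(episodes):
    """Apply delta_time_hours to every episode, skipping NULL episodes."""
    converted = (delta_time_hours(e) for e in episodes)
    return [c for c in converted if c is not None]


episodes = [
    {"adm_icu_time": datetime(2016, 1, 1, 8), "data": [(datetime(2016, 1, 1, 10), 80)]},
    {"adm_icu_time": None, "data": [(datetime(2016, 1, 2, 9), 75)]},
]
result = to_delta_time(episodes)
```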

detect duplicated episodes

A check should be provided to detect duplicated episodes. Duplicates can occur when XML files overlap.
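
A duplicate check of this kind might be sketched as below. This is a Python illustration; treating the pair (`site_id`, `episode_id`) as the identity of an episode is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def find_duplicated_episodes(episodes, key_fields=("site_id", "episode_id")):
    """Return the keys that occur more than once, in sorted order."""
    counts = Counter(tuple(e[f] for f in key_fields) for e in episodes)
    return sorted(key for key, n in counts.items() if n > 1)


parsed = [
    {"site_id": "A", "episode_id": "1"},
    {"site_id": "A", "episode_id": "2"},
    {"site_id": "A", "episode_id": "1"},  # same episode seen in an overlapping file
    {"site_id": "B", "episode_id": "1"},
]
duplicates = find_duplicated_episodes(parsed)
```

Including the site in the key keeps episodes from different sites apart when they happen to share an episode id.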

ccdata compiler error on IDSH

When compiling the library using R in the safe haven, it tries to load certain libraries which are not installed.
Different variants of Makevars have been tried, and some of the suggestions found elsewhere have been followed up, but no luck yet.

Compiling it by itself and installing ccdata works.

unix machine visualise result

Ask the IDSH people to open a shared link between the Windows and Unix machines, or to make X11 support available.

  • email sent

data cleaning

  • has item check (for 1D data mainly)
  • include numerical range check for 1D
  • date range check
  • in the YAML configuration, put apply at the end of every filter instead of at the bottom.
  • derived field ucl-hic/paper-brc#38
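
The has-item, numerical range and date range checks listed above could share one small routine. The following Python sketch assumes the YAML configuration has already been parsed into a dict; the `required` and `range` keys, the `age` item and the `check_episode` name are hypothetical.

```python
from datetime import date


def check_episode(episode, rules):
    """Return a list of problems found in one episode's 1D data."""
    problems = []
    for item, rule in rules.items():
        value = episode.get(item)
        if value is None:
            # Has-item check: only complain when the item is mandatory.
            if rule.get("required", False):
                problems.append(f"{item}: missing")
            continue
        # Range check: works for numbers and dates alike.
        low, high = rule.get("range", (None, None))
        if (low is not None and value < low) or (high is not None and value > high):
            problems.append(f"{item}: out of range")
    return problems


rules = {
    "age": {"required": True, "range": (0, 120)},
    "adm_icu_time": {"range": (date(2014, 1, 1), date(2016, 12, 31))},
}
report = check_episode({"age": 150, "adm_icu_time": date(2013, 5, 1)}, rules)
```

Because the comparison operators work on both numbers and dates, the same range rule covers the numerical and the date check.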

Tests:

  • configuration check.
    • what if the missingness filter is given 1D data by mistake?

de-identification approach

Add to the YAML file those fields that must always be removed.
Add a dictionary of fields that, in combination with others, create a small-cell risk.

xml parsing pipeline

Build a pipeline for XML parsing that converts multiple XML files to delta-time ccdata. The process should be:

  1. Split XML files when they are too big (break_into.sh).
  2. Parse individual XML files to RData (extract_data.r).
  3. Combine the RData files (combine.r).
  4. Run some processing pipeline.

We probably need to use either xargs or GNU parallel for parallel job scheduling.
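
The split, parse and combine steps could be prototyped in a single process before wiring up the real scripts. The Python sketch below is only an illustration: it uses a thread pool where the issue suggests xargs or GNU parallel, and the `episode` and `item` element names are placeholders rather than the real CC-HIC XML schema.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def split_episodes(xml_text, chunk_size):
    """Step 1: break one large document into chunks of episode elements."""
    episodes = ET.fromstring(xml_text).findall("episode")
    return [episodes[i:i + chunk_size] for i in range(0, len(episodes), chunk_size)]


def parse_chunk(chunk):
    """Step 2: turn each episode element in a chunk into a plain dict."""
    return [
        {"episode_id": e.get("id"), "n_items": len(e.findall("item"))}
        for e in chunk
    ]


def run_pipeline(xml_text, chunk_size=2, workers=2):
    """Steps 1 to 3: split, parse the chunks in parallel, then combine."""
    chunks = split_episodes(xml_text, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parsed = list(pool.map(parse_chunk, chunks))  # map keeps input order
    return [episode for part in parsed for episode in part]


sample = (
    "<episodes>"
    "<episode id='a'><item/><item/></episode>"
    "<episode id='b'/>"
    "<episode id='c'><item/></episode>"
    "</episodes>"
)
combined = run_pipeline(sample)
```

Since the chunks are independent, the same structure maps onto one job per split file when the work is handed to an external scheduler.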

pipeline - file breaker add extra file

Cambridge 07042016 broke into three files instead of two, with two endings added to the last one. It has been fixed manually for now but needs debugging.
Files saved in ~ucasper/Cambridge

first scratch

  • define a data structure in R
    • clean csv data - labels
  • make csv xml conversion
  • function to select data
  • example plots

data provenance

from Steve:

Probably one for post paper but data will be updated at various times and we'll need to know where it comes from. Could version control the data? Or add a field indicating its source? eg from file X on date Y
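
The "add a field indicating its source" option could be as small as the Python sketch below. The `source_file` and `parsed_on` field names and the `tag_provenance` helper are hypothetical, and the file name in the example is a placeholder.

```python
from datetime import date


def tag_provenance(episodes, source_file, parsed_on):
    """Return copies of the episodes annotated with where and when they came from."""
    return [
        {**episode, "source_file": source_file, "parsed_on": parsed_on.isoformat()}
        for episode in episodes
    ]


tagged = tag_provenance([{"episode_id": "1"}], "file_X.xml", date(2016, 4, 7))
```

Returning copies leaves the parsed episodes untouched, so the tagging step can be rerun when a site delivers an updated file.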
