ropensci-archive / cleanehr

:warning: ARCHIVED :warning: Essential tools and utility functions to facilitate the data processing pipeline, data cleaning and data analysis of clinical data from CC-HIC

License: GNU General Public License v3.0

Makefile 0.64% R 92.52% C++ 2.63% Shell 0.77% TeX 3.44%

electronic-health-record healthcare big-data critical-care intensive-care r rstats r-package

cleanehr's Introduction

This package has been archived. The former README is now in README-not.

cleanehr's Issues

Currently, episodes whose time and value data cannot be paired are removed entirely. That is why we have many NULL patients in the final ccdata object. The main cause is a time stamp that has no data attached to it. We should perhaps also assume that the reverse case occurs. Solving this may require a large overhaul of the parser.

have time, no value: remove time
have value, no time: remove the data field.
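
The two rules above can be sketched in base R as follows. This is only an illustration, not the parser's code: the function name pair_time_value and the representation of an item as two parallel vectors with NA marking a missing entry are assumptions, and "remove the data field" is read here as dropping that single unpaired value.

```r
# Hypothetical helper: keep only the entries where both a time stamp and a
# value are present, instead of discarding the whole episode.
pair_time_value <- function(time, value) {
  stopifnot(length(time) == length(value))
  keep <- !is.na(time) & !is.na(value)
  data.frame(time = time[keep], value = value[keep])
}

# Entry 2 has a time but no value; entry 3 has a value but no time.
item <- pair_time_value(time  = c(0, 1, NA, 3),
                        value = c(80, NA, 75, 90))
print(item)
```

With this approach an episode keeps its paired entries and loses only the unpaired ones.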

tasks

make readItems faster
Steve: retrieve data by readItems()
Sinan: give Steve the latest version of ccdata
SN: delta time.
collapse items over time by given frequency, default shall be median.
propagate time series. (fill the missing data)
add difftime units to ccdata - hours, mins, ...
adapt reallocateTimeRecord() to given units.
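
A minimal base R sketch of the collapse and propagate tasks above. The function names collapse_by_frequency and carry_forward are hypothetical, and filling missing data by carrying the last observation forward is an assumption; the issue does not say which fill rule is intended.

```r
# Hypothetical helper: collapse a series onto a regular grid of width `freq`
# (same unit as `time`), summarising each bin with `fun` (median by default).
collapse_by_frequency <- function(time, value, freq = 1, fun = median) {
  bin <- floor(time / freq) * freq
  out <- aggregate(value, by = list(time = bin), FUN = fun)
  names(out)[2] <- "value"
  out
}

# Hypothetical helper: fill NA values with the last observed value.
carry_forward <- function(x) {
  idx <- cumsum(!is.na(x))
  # idx == 0 marks leading NAs, which stay NA.
  c(NA, x[!is.na(x)])[idx + 1]
}

hr <- collapse_by_frequency(time  = c(0.1, 0.5, 0.9, 1.2, 2.7),
                            value = c(80, 90, 100, 70, 60))
filled <- carry_forward(c(NA, 1, NA, NA, 4, NA))
print(hr)
print(filled)
```

Passing a different `fun` (for example mean or max) changes the summary without touching the binning.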

drop_field

Now we have drop_episode and drop_entry, we need drop_field which allows the removal of the field in one episode.

better display of filter result

print out how many entries have been removed when calling filter functions.
store missingness information table in S5 class instead of computing it every time.
entry, hospital, information in show()

Meropenem dose

Some meropenem doses are 1000 and some are 1. Do you think it is just a difference in units?

Reconfirm the files

@nsmaccallum @docsteveharris maybe one of you could let me know if the following files on IDSH are the files that we need:
In total: 14007 episodes.

Cambridge:
NIRH_CC_1_CUH_07042016.xml

GSTT:
NIHRCC_1.5_GSTT_31032016.xml

Imperial:
NIHRCC_8.3.2_042014-082014.xml
NIHRCC_8.3.2_092014-122014.xml
NIHRCC_8.3.2_012015-042015.xml
NIHRCC_8.3.2_052015-082015.xml
NIHRCC_8.3.2_092015-102015.xml
NIHRCC_8.3.2_112015-122015.xml

Oxford:
NIHRCC_8.3.2_Oxford 02042016.xml

UCLH:
20160217a_ReId_cc.xml

typo

admin_icu_time -> adm_icu_time

modify data structure: array -> vector

SOFA score under current structure

Data quality report

Data missingness report #36
First and last date of admission per site ucl-hic/paper-brc#23
episode id missingness. #62
report structural quality (site, episode, date of admission and so on) ucl-hic/paper-brc#43

how to deal with patient records without any demographic information

Shall we remove the records without NHS number, PAS number, episode id or site id? We have around 60 records like that.

parse new data set for UCLH

SOFA score development

Parse XML file on IDSH by using updated parser.

delta time

PAS number becomes only the first PAS number of an episode.

Since PAS numbers can be different for each patient, the data.table ccd@pas_number means only the PAS number of the first episode. This may lead to confusion. We should change it in due course.

To change them see addEpisodeToPatient()

Data missingness report

episode missingness - consecutive (David)
1d
hospital, item, np, ne. per year (take episode id)
2d - mean of episodes and mean of total
how long after admission time is the first value recorded? create a table containing the columns (episode_id, item_id, site_id, delta_time)
provide missingness frequency (Steve)
Data quality check for drugs.
count number of doses per drug.
number of doses per day, when >= 1 dose on a day. we need to know how many times the patient takes the drug, and the total dose per day.
record the interval between doses. figure out the gap (say 24h) in drug intake. create a table that contains number of doses/interval/day, interval time and gap time.
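
The drug checks above could be prototyped along these lines. The input layout (one row per dose, time in hours since admission) and the 24 hour gap threshold are assumptions made for the example, and the numbers are invented purely for illustration.

```r
# Hypothetical input: one drug in one episode, time in hours since admission.
doses <- data.frame(time = c(2, 10, 18, 26, 50),
                    dose = c(1000, 1000, 1000, 500, 500))
doses$day <- floor(doses$time / 24)

# Number of doses and total dose per day (days without a dose do not appear).
n_per_day     <- tapply(doses$dose, doses$day, length)
total_per_day <- tapply(doses$dose, doses$day, sum)
per_day <- data.frame(day   = as.numeric(names(n_per_day)),
                      n     = as.vector(n_per_day),
                      total = as.vector(total_per_day))

# Interval between consecutive doses, and which intervals count as a gap.
gap_threshold <- 24  # hours; an assumed cut-off
interval <- diff(doses$time)
is_gap   <- interval >= gap_threshold

print(per_day)
print(data.frame(interval = interval, is_gap = is_gap))
```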

consider changing the time format to standard POSIX

Parse Cambridge data on IDSH

so that we can report the total episode.

unit testing of R code

what will be the best data structure in R ?

At the moment we are using a temporary ccdata structure, which presumably will be changed when we have more knowledge about the tasks.

SQL approach

Inject the wide table into the database (in fact it's wide and long) (call it longitudinal table)
Inject non-longitudinal table.
Episode selection table.
define how rules apply to the wide table in the DB. Rules can discard or modify values and episodes.
Rules:
- remove episode when the missingness of a variable is too high.
- impute data in a given window, with a given function.
A table in the database which stores each step's description, modified rows and columns.
API: to pull out the vectors.
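
As a rough illustration of the first rule and of the step log, here is a base R sketch that uses an in-memory data frame in place of the database table. The column names, the function name drop_high_missingness and the 0.5 threshold are all assumptions.

```r
# Hypothetical longitudinal table: one row per episode and time point.
long <- data.frame(
  episode_id = c(1, 1, 1, 2, 2, 2),
  time       = c(0, 1, 2, 0, 1, 2),
  heart_rate = c(80, NA, 85, NA, NA, 90)
)

# Rule: drop episodes whose missingness for `item` exceeds `max_missing`.
drop_high_missingness <- function(tbl, item, max_missing) {
  miss <- tapply(is.na(tbl[[item]]), tbl$episode_id, mean)
  bad  <- names(miss)[miss > max_missing]
  tbl[!as.character(tbl$episode_id) %in% bad, , drop = FALSE]
}

cleaned <- drop_high_missingness(long, "heart_rate", max_missing = 0.5)

# Log table describing the step and how many rows it removed.
steps <- data.frame(step = "drop_high_missingness(heart_rate, 0.5)",
                    rows_removed = nrow(long) - nrow(cleaned))
print(cleaned)
print(steps)
```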

add codecov to travis run

Add this service to the repository

Number of episode not matching

IDSH = 15444(3)
Local = 15677

split 1d and 2d data from select_data

use consistent NHIC label.

remove missing admission time episodes when calculating deltaTime

In IDSH we have around 300 such episodes (mainly from Oxford and GSTT). These episodes are without exception those with no episode id, site, or any other demographic data. All we know is that these data come exclusively from two files (GSTT and Oxford).
I'm going to remove NULL episodes when doing deltaTime. Is this OK with you @docsteveharris ?

shall we exclude episodes with missing discharge time?

discharge time will be used as the endpoint of time frame in data.table.

on the dead_live field they have NULL(1006/2008), D(981/2008) , E(21/2008)

aggregation function in C for delta time table where fields are numeric

done in R during paper-brc sprint but could do with performance improvement

HDF5 approach

detect duplicated episodes

A security check should be provided in order to detect duplicated episodes. Duplicates may occur when XML files overlap.
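
One possible shape for such a check, in base R. The table layout and the choice of site_id, episode_id and admission_time as the identifying key are assumptions, and find_duplicated_episodes is a hypothetical name.

```r
# Hypothetical episode index built from the parsed XML files.
episodes <- data.frame(
  site_id        = c("A", "A", "B", "A"),
  episode_id     = c("1", "2", "1", "1"),
  admission_time = c("2015-01-01", "2015-01-03", "2015-01-01", "2015-01-01"),
  stringsAsFactors = FALSE
)

find_duplicated_episodes <- function(tbl,
                                     key = c("site_id", "episode_id",
                                             "admission_time")) {
  k <- tbl[, key, drop = FALSE]
  # Flag every member of a duplicated group, not only the later copies.
  dup <- duplicated(k) | duplicated(k, fromLast = TRUE)
  which(dup)
}

dup_rows <- find_duplicated_episodes(episodes)
print(episodes[dup_rows, ])
```

Returning row indices rather than dropping rows leaves the decision about which copy to keep to the caller.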

tests on identified data set.

waiting for the data set and access to the data safe haven.

jupyter notebook for R

https://github.com/IRkernel/IRkernel

ccdata compiler error on IDSH

When compiling the library using R in the safe haven, it tries to load certain libraries which are not installed.
Different variants of Makevars have been tried, and some of the suggestions found elsewhere have been followed up, but no luck yet.

Compiling it by itself and installing ccdata works.

unix machine visualise result

asking IDSH people to open a share link between windows and unix machine, or making the X11 support available.

email sent

Use logger

data cleaning

has item check (for 1D data mainly)
include numerical range check for 1D
date range check
in the yaml configuration, put apply at the end of every filter instead of at the bottom.
derived field ucl-hic/paper-brc#38

Tests:

configuration check.
- what if giving the missingness filter a 1d data by mistake?

de-identification approach

add to YAML file those fields that should be removed absolutely
add a dictionary where fields interact with others and create a small cell risk

xml parsing pipeline

Build a pipeline for XML parsing that converts multiple XML files to delta time ccdata. The process should be:

split XML files when they are too big. break_into.sh
parse individual XML files to RData. extract_data.r
combine RData. combine.r
run some processing pipeline

We probably need to use either xargs or GNU parallel for parallel job scheduling.

security check for yaml configuration

should be able to detect typos when writing the configuration file.
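
A small base R sketch of how unknown keys might be caught. The set of recognised keys is hypothetical (only apply is mentioned in the issues), a plain list stands in for the parsed YAML, and check_config_keys is an invented name.

```r
# Hypothetical set of recognised filter keys; the real configuration schema
# is not given here.
known_keys <- c("range", "missingness", "category", "apply")

check_config_keys <- function(config, known = known_keys) {
  unknown <- setdiff(names(config), known)
  # Suggest the closest known key (by edit distance) for each unknown one.
  suggestion <- vapply(unknown, function(k) {
    known[which.min(utils::adist(k, known))]
  }, character(1))
  data.frame(key = unknown, did_you_mean = unname(suggestion),
             stringsAsFactors = FALSE)
}

# A parsed configuration would normally come from a YAML reader.
config   <- list(range = list(), missingnes = list(), apply = "drop_entry")
problems <- check_config_keys(config)
print(problems)
```

An empty result means every key was recognised; a non-empty one can be turned into an error before any filter runs.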

Are demographic data the only 1d data? i.e. time independent

why is the number of episodes different between identifiable and de-identifiable data?

We can exclude the possibility of duplicate injection. By checking for duplicates of the combination of site_id, episode_id, admission time and so on, we didn't find a duplicated injection of an episode in either data set.

Then why could it happen? We should ask @NicolaCooper later on.

BDD?

xml parser load selected items

This is for extracting the identifiable data on DHS.

writing tests of the ccdata package

pipeline - file breaker add extra file

Cambridge 07042016 was broken into three files instead of two, with two endings added to the last one. Solved manually for now, but it needs debugging.
Files saved in ~ucasper/Cambridge

first scratch

always make sure that the patient list is consecutive

we need to avoid stuff like

ccd@npatient
> 100
ccd@patients[[200]] <- something

We should forbid this operation!
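
One way this could be enforced is with an S4 validity function plus a single append helper. The class below is a stand-in written for this example, not the package's actual class: only the slot names npatient and patients come from the snippet above, and ccRecordSketch and add_patient are hypothetical names.

```r
library(methods)

# Hypothetical stand-in for the ccdata record class.
setClass("ccRecordSketch",
         representation(npatient = "numeric", patients = "list"),
         validity = function(object) {
           if (length(object@patients) != object@npatient)
             return("npatient does not match the length of the patient list")
           if (any(vapply(object@patients, is.null, logical(1))))
             return("patient list has gaps (NULL entries)")
           TRUE
         })

# Only allow appending at position npatient + 1, then re-validate.
add_patient <- function(ccd, patient) {
  ccd@patients[[ccd@npatient + 1]] <- patient
  ccd@npatient <- ccd@npatient + 1
  validObject(ccd)
  ccd
}

ccd <- new("ccRecordSketch", npatient = 0, patients = list())
ccd <- add_patient(ccd, list(id = "p1"))

# Writing beyond the end leaves NULL gaps and a length mismatch, which
# validObject() rejects.
bad <- ccd
bad@patients[[5]] <- list(id = "p5")
gap_detected <- inherits(try(validObject(bad), silent = TRUE), "try-error")
print(gap_detected)
```

Direct slot assignment does not run the validity function by itself, so the check only helps if code goes through a helper like add_patient or calls validObject() explicitly.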

data provenance

from Steve:

Probably one for post paper but data will be updated at various times and we'll need to know where it comes from. Could version control the data? Or add a field indicating its source? eg from file X on date Y
