This project forked from louisblankemeier/starr-labeler

Package for generating feature vectors and outcome labels from electronic health records.

License: Apache License 2.0

Python 99.83% Makefile 0.17%

Code for processing STARR electronic health record data

Installation

git clone https://github.com/louisblankemeier/STARR-Labeler
cd STARR-Labeler
conda create --name starr_env python=3.9
conda activate starr_env
pip install -e .

Generating Feature Vectors

First, create a config file in starr_labeler/features/configs. You can use the existing config files as templates. Use the following to generate features:

cd starr_labeler/features/
python main.py --config-name <name_of_config_file>

To run with slurm, edit the command in starr_labeler/features/slurm.py. Then, run:

python slurm.py

The generated features can be used to train a classifier for a specified disease (label). The output file is called features.csv.

Example features config file snippet:

DATA_PATH: /clinical_data/
SAVE_DIR: /STARR-Labeler/starr_labeler/features/results
NUM_PATIENTS: 23547
PREDICTION_DATES: 'CT'
DAYS_AFTER_PREDICTION_DATES: 14
EHR_TYPES:
  DEMOGRAPHICS: 
    USE: True
    FILE_NAME: "demographics.csv"
    USE_COLS: ['Patient Id', 'Gender', 'Race', 'Ethnicity', 'Date of Birth']
    REGEX_TO_FEATURE_NAME: {'Gender': None, 'Race': None, 'Ethnicity': None, 'Date of Birth': None} # Date of Birth here gets converted to age.
    TIME_BINS: 1
    TIME_BIN_DURATION: 1 # in years
    AGGREGATE_ACROSS_TIME: 'mean'
    FILL_NA: 'median'
    SAVE: True
    LOAD: False # whether to load from
  
  LABS: 
    USE: True
    FILE_NAME: "labs.csv"
    USE_COLS: ['Patient Id', 'Value', 'Taken Date', 'Result', 'Reference High', 'Reference Low', 'Units']
    TYPE: 'Result'
    Value: 'Value'
    REGEX_TO_FEATURE_NAME: {'^HDL Cholesterol': 'HDL Cholesterol', 'Cholesterol, Total': 'Cholesterol, Total', 'Creatinine, Ser/Plas': None}
 
  • A patient can have multiple CT scans over the course of the requested time, but NUM_PATIENTS corresponds to the number of unique patients.

  • If the EHR data was not logged on the same date as the CT scan, DAYS_AFTER_PREDICTION_DATES sets how many days after the CT exam EHR data is still accepted.

  • USE_COLS loads only the specified columns of the CSV, which speeds up loading.

  • TIME_BINS sets how many time windows are extracted per subject, and TIME_BIN_DURATION sets the length of each window in years. With a value of 1, a single feature vector is extracted for the time window specified above. A larger value yields one feature vector per window, with the windows determined by the number of bins and the bin duration. The bin number appears as a suffix on the variable names in the output CSV: '_1' for bin 1, '_2' for bin 2, etc.

  • If a subject has multiple entries, specify the aggregation strategy with AGGREGATE_ACROSS_TIME.

  • The keys in REGEX_TO_FEATURE_NAME are regular expressions; all values that match a key are mapped to a single feature named by the corresponding dictionary value. In the example above, all variables whose names start with HDL Cholesterol are matched by the first key.
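
The regex-to-feature mapping described in the last bullet can be sketched in a few lines of Python. This is a minimal illustration, not the package's actual implementation: the function name `map_to_feature_name` is hypothetical, the use of `re.search` is an assumption, and so is the reading that a `None` value keeps the raw name unchanged.

```python
import re

# Mapping in the style of REGEX_TO_FEATURE_NAME from the config snippet above.
REGEX_TO_FEATURE_NAME = {
    r"^HDL Cholesterol": "HDL Cholesterol",
    r"Cholesterol, Total": "Cholesterol, Total",
    r"Creatinine, Ser/Plas": None,
}


def map_to_feature_name(raw_name, mapping):
    """Return the feature name for raw_name, or None if no pattern matches.

    Hypothetical helper: patterns are tried in dictionary order and the
    first match wins.
    """
    for pattern, feature_name in mapping.items():
        if re.search(pattern, raw_name):
            # Assumption: a None value means "keep the raw name as the feature name".
            return feature_name if feature_name is not None else raw_name
    return None


# Names starting with "HDL Cholesterol" collapse into one feature.
hdl = map_to_feature_name("HDL Cholesterol, Direct", REGEX_TO_FEATURE_NAME)
# The "^" anchor keeps "Non-HDL Cholesterol" out of the HDL feature.
non_hdl = map_to_feature_name("Non-HDL Cholesterol", REGEX_TO_FEATURE_NAME)
```

Anchoring a pattern with `^`, as in the HDL example, is what prevents partial matches elsewhere in a lab name from being merged into the same feature.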

Generating Outcome Labels

First, create a config file in starr_labeler/labels/disease_configs. You can use the existing config files as templates. Use the following to generate outcome labels:

cd starr_labeler/labels/
python main.py --config-name <name_of_config_file>

To run with slurm, edit the command in starr_labeler/labels/slurm.py. Then, run:

python slurm.py

The generated labels can be used to train a classifier for a specified outcome (label) using the EHR features discussed above. The output file is called labels.csv.

Example labels config file snippet:

DISEASE_NAME: abdominal_aortic_aneurysm
DAYS_AFTER: 1825
DAYS_BEFORE: 0
CONSIDER_ONLY_FIRST_DIAGNOSIS: True
DATA_PATH: /clinical_data/
SAVE_DIR: /STARR-Labeler/starr_labeler/results
NUM_PATIENTS: 23547
PREDICTION_DATES: 'CT'

EHR_TYPES:
  DIAGNOSES:
    FILE_NAME: "diagnoses.csv"
    USE_COLS: ['Patient Id', 'Date', 'ICD10 Code', 'ICD9 Code']
    ICD10:
      Hierarchical: True
      Codes:
        - I71.0
        - I71.3
        - I71.1
        - I71.4
        - I71.2
    ICD9:
      Hierarchical: True
      Codes:
        - 441.0
        - 441.3
        - 441.1
        - 441.4

Input Definitions

  • DAYS_AFTER defines the end of the time window in days after the CT Scan during which evidence of the disease results in a positive label.
  • DAYS_BEFORE defines the start of the time window in days before the CT Scan during which evidence of the disease results in a positive label.
  • Formally, we may represent relevant points in a patient's timeline / history as the following:
    • $t_0$ : first time point of record in a patient's history.
    • $t_d$ : time point marking the onset of disease.
    • $t_b$ : time point DAYS_BEFORE days before the scan (start of the window).
    • $t_s$ : time point marking the date of the scan.
    • $t_a$ : time point DAYS_AFTER days after the scan (end of the window).
    • $t_h$ : time point marking the last date of record in a patient's history.

Note that $t_a - t_s$, the window after the scan in which we look for the events specified by the experimental design, may be longer than $t_h - t_0$, the period for which the patient has data in the EHR. This is crucial: a patient whose record ends before $t_a$ cannot be definitively ruled out as disease-free (Class 3 below).
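
To make the timeline concrete, the window boundaries can be computed from the scan date with the standard library. This is only a sketch: the dates are made up for illustration, and `label_window` is a hypothetical helper, not a function from the package.

```python
from datetime import date, timedelta

DAYS_BEFORE = 0     # values taken from the example labels config
DAYS_AFTER = 1825


def label_window(scan_date, days_before, days_after):
    """Return (t_b, t_a), the start and end of the label window."""
    t_b = scan_date - timedelta(days=days_before)
    t_a = scan_date + timedelta(days=days_after)
    return t_b, t_a


t_s = date(2015, 6, 1)    # hypothetical scan date
t_h = date(2018, 3, 15)   # hypothetical last date of record
t_b, t_a = label_window(t_s, DAYS_BEFORE, DAYS_AFTER)

# The record ends before the window closes, so the window is only
# partially observed for this patient.
fully_observed = t_h >= t_a
```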

Output Label Interpretation

  • Class 0 → Patient not diagnosed with the specified disease, neither within the requested time window nor earlier in the patient's history.
    • If the patient develops the disease at all, then $t_d > t_a$.
  • Class 1 → Patient diagnosed with the specified disease within the time window between DAYS_BEFORE and DAYS_AFTER.
    • $t_b ≤ t_d ≤ t_a$
  • Class 2 → Patient diagnosed with the specified disease earlier than the DAYS_BEFORE window (e.g. the patient is already "diseased" when entering the cohort).
    • $t_d < t_b$
  • Class 3 → Patient has not been monitored long enough to definitively rule out the disease.
    • $t_h < t_a$
  • Commonly, only patients from Class 0 and Class 1 are included in studies, as the control and positive classes respectively.
  • The codes associated with the specified disease must be listed under ICD10 and ICD9 in the config.
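
The four classes can be expressed as a small function over the time points defined earlier. This is a sketch under stated assumptions rather than the package's code: `assign_class` is a hypothetical name, `t_d` is `None` when no diagnosis is on record, and the order in which the conditions are checked (Class 2, then Class 1, then Class 3, then Class 0) is an assumption.

```python
from datetime import date, timedelta
from typing import Optional


def assign_class(t_s: date, t_h: date, t_d: Optional[date],
                 days_before: int, days_after: int) -> int:
    """Return the outcome class (0-3) for one scan.

    t_s: scan date, t_h: last date of record,
    t_d: first diagnosis date, or None if never diagnosed.
    """
    t_b = t_s - timedelta(days=days_before)
    t_a = t_s + timedelta(days=days_after)

    if t_d is not None:
        if t_d < t_b:
            return 2   # diagnosed before the window opens
        if t_d <= t_a:
            return 1   # diagnosed inside the window: t_b <= t_d <= t_a
    # No diagnosis up to t_a; check whether follow-up covers the window.
    if t_h < t_a:
        return 3       # record ends before the window closes
    return 0           # fully observed, no diagnosis by t_a


scan = date(2015, 6, 1)  # hypothetical scan date
in_window = assign_class(scan, date(2021, 1, 1), date(2016, 1, 1), 0, 1825)
short_follow_up = assign_class(scan, date(2018, 1, 1), None, 0, 1825)
```

A diagnosis after $t_a$ falls through to Class 0 here, matching the note that a Class 0 patient who later develops the disease has $t_d > t_a$.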

Citation

Please cite this work if you use it for your research.
