acadtags / rare-disease-identification

Rare disease identification from free-text clinical notes with ontologies and weak supervision

License: MIT License

Languages: Python 96.38%, Jupyter Notebook 3.56%, Shell 0.06%

Topics: ontology, ordo, weak-supervision, rare-disease, clinical-notes, mimic-iii, discharge-summaries, umls, entity-linking, ontology-matching

Rare-disease-identification

This repository presents an approach that uses ontologies and weak supervision to identify rare diseases from clinical notes. The idea is illustrated below, and the data annotation for rare disease entity linking and ontology matching is available for download.

The latest preprint, Ontology-Driven and Weakly Supervised Rare Disease Identification from Clinical Notes, is available on arXiv and has been accepted for BMC Medical Informatics and Decision Making. It extends the previous work published in IEEE EMBC 2021.

Entity linking and ontology matching

A graphical illustration of the entity linking and ontology matching process:

Weak supervision (WS)

The process to create weakly labelled data with contextual representation is illustrated below:

Rare disease mention annotations

The annotations of rare disease mentions created from this research are available in the folder data annotation.

Implementation sources

Pipeline

Note: This is mainly a research implementation rather than well-engineered software, but we hope that the code, data, and results are useful and provide more detail on this work.

Data and models

The data files and BERT models are placed according to the structure below. The SemEHR outputs for MIMIC-III discharge summaries (mimic-semehr-smp-outputs\outputs) and MIMIC-III radiology reports (mimic-rad-semehr-outputs\outputs) were obtained by running SemEHR.

└───bert-models
|   |   run_get_bluebert.sh
|   |   NCBI_BERT_pubmed_mimic_uncased_L-12_H-768_A-12
|   |   |   ... (model files)
└───data/
|   |   NOTEEVENTS.csv (from MIMIC-III)
|   |   DIAGNOSES_ICD.csv (from MIMIC-III)
|   |   PROCEDURES_ICD.csv (from MIMIC-III)
|   |   mimic-semehr-smp-outputs
|   |   |   outputs
|   |   |   |   ... (SemEHR output files of MIMIC-III DS)
|   |   mimic-rad-semehr-outputs
|   |   |   outputs
|   |   |   |   ... (SemEHR output files of MIMIC-III rad)
└───models/
|   |   ... (phenotype confirmation model `.pik` files)
└───ontology/
|   |   ORDO2UMLS_ICD10_ICD9+titles_final_v3.xlsx (ontology concept matching file)

Key pipeline scripts

  • Weakly supervised data creation: main_scripts/step1_tr_data_creat_ment_disamb.py.
  • Weakly supervised data representation and model training: main_scripts/step3.4 for MIMIC-III discharge summaries, main_scripts/step3.6 for MIMIC-III (and Tayside) radiology reports.
    • static BERT-based encoding is implemented in def encode_data_tuple() in main_scripts/sent_bert_emb_viz_util.py using BERT-as-service;
    • a fine-tuning approach with Huggingface Transformers is in other_scripts/step3.8_fine_tune_bert_with_trainer.py.

If all files are set (MIMIC-III data, SemEHR outputs, BERT models), the main steps of the whole pipeline can be run with python run_main_steps.py.
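Before starting the pipeline, it can help to confirm that the files are in place. The following is a minimal sketch, not part of the repository: the helper name find_missing_paths is hypothetical, and the path list is simply copied from the directory structure above, assuming all paths are relative to the repository root.

```python
from pathlib import Path

# Expected files and folders, copied from the directory structure above.
# Assumption: every path is relative to the repository root.
EXPECTED_PATHS = [
    "bert-models/NCBI_BERT_pubmed_mimic_uncased_L-12_H-768_A-12",
    "data/NOTEEVENTS.csv",
    "data/DIAGNOSES_ICD.csv",
    "data/PROCEDURES_ICD.csv",
    "data/mimic-semehr-smp-outputs/outputs",
    "data/mimic-rad-semehr-outputs/outputs",
    "models",
    "ontology/ORDO2UMLS_ICD10_ICD9+titles_final_v3.xlsx",
]


def find_missing_paths(root="."):
    """Return the expected paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

Calling find_missing_paths() from the repository root returns an empty list when everything listed above is present. The check only tests for existence, so it does not verify that the SemEHR outputs or model files are complete.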

Reproducing results from the paper

Reproducing the results does not require running the pipeline above, as it is based on the prediction scores.

Move all the files inside main_scripts (and other_scripts) up one level, into the parent folder.
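This step can be scripted. The sketch below is an illustration, not code from the repository: flatten_scripts is a hypothetical helper, and by default it copies instead of moving (an assumption made here so the original folders stay intact). Pass move=True to move the files as described above. Existing files in the target folder are never overwritten.

```python
import shutil
from pathlib import Path


def flatten_scripts(root=".", folders=("main_scripts", "other_scripts"), move=False):
    """Place every file from `folders` directly under `root` (hypothetical helper).

    Returns the names of the files that were copied or moved.
    Subdirectories and files that already exist under `root` are skipped.
    """
    root = Path(root)
    placed = []
    for folder in folders:
        src_dir = root / folder
        if not src_dir.is_dir():
            continue  # folder not present, nothing to do
        for src in sorted(src_dir.iterdir()):
            if not src.is_file():
                continue  # only plain files are handled
            dst = root / src.name
            if dst.exists():
                continue  # never overwrite an existing file
            if move:
                shutil.move(str(src), str(dst))
            else:
                shutil.copy2(src, dst)
            placed.append(src.name)
    return placed
```

Run from the repository root, flatten_scripts(move=True) matches the instruction above, while the default call leaves main_scripts and other_scripts untouched.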

Main results: Text-to-UMLS

MIMIC-III discharge summaries: python step4_further_results_from_annotations.py

MIMIC-III radiology reports: python step4.1_further_results_from_annotations_for_rad.py

Error analysis: python error_analysis.py
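The three commands above can also be run in sequence from one script. This is a small sketch under assumptions: the script names are taken from the commands above, the helper names build_commands and run_all are hypothetical, and the scripts are assumed to have already been moved into the current folder.

```python
import subprocess
import sys

# Script names as listed above for the Text-to-UMLS results and error analysis.
RESULT_SCRIPTS = [
    "step4_further_results_from_annotations.py",
    "step4.1_further_results_from_annotations_for_rad.py",
    "error_analysis.py",
]


def build_commands(python=sys.executable):
    """Return one command (as an argument list) per result script."""
    return [[python, script] for script in RESULT_SCRIPTS]


def run_all(cwd="."):
    """Run each result script in order, stopping at the first failure."""
    for cmd in build_commands():
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling run_all() from the folder that contains the scripts runs them one after another. With check=True, a script that exits with a non-zero status raises subprocess.CalledProcessError, so later scripts are not run after a failure.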

Other results: UMLS-to-ORDO, Text-to-ORDO

UMLS-to-ORDO: calculated from results in raw annotations (with model predictions).

Text-to-ORDO, mention-level: see step7 and step7.1 in other_scripts.

Text-to-ORDO, admission-level: see step8 and step8.1 in other_scripts.

Acknowledgement

This work was carried out by members of KnowLab, with thanks to the EdiE-ClinicalNLP research group.

Acknowledgement to the icons used:
