callahantiff / omop2obo

OMOP2OBO: A Python Library for mapping OMOP standardized clinical terminologies to Open Biomedical Ontologies

Home Page: http://tiffanycallahan.com/OMOP2OBO_Dashboard

License: MIT License

Python 3.86% Jupyter Notebook 96.14%

open-biomedical-ontologies omop clinical-terminologies translational-research omop-cdm obofoundry hacktoberfest

omop2obo

What is OMOP2OBO?

omop2obo is a collection of health system-wide, disease-agnostic mappings between standardized clinical terminologies in the Observational Medical Outcomes Partnership (OMOP) common data model and several Open Biomedical Ontologies (OBO) Foundry ontologies.

Motivation

Common data models have solved many challenges of utilizing electronic health records, but have not yet meaningfully integrated clinical and molecular data. Aligning clinical data to Open Biomedical Ontologies (OBOs), which provide semantically computable representations of biological knowledge, requires extensive manual curation and expertise.

Objective

To address these limitations, we have developed OMOP2OBO, the first health system-wide integration and alignment between the Observational Health Data Sciences and Informatics' Observational Medical Outcomes Partnership (OMOP) standardized clinical terminologies and eight OBO biomedical ontologies spanning diseases, phenotypes, anatomical entities, cell types, organisms, chemicals, metabolites, hormones, vaccines, and proteins. To ensure that the mappings are both clinically and biologically meaningful, we performed extensive experiments to verify the accuracy, generalizability, and logical consistency of each released mapping set.

📢 Manuscript preprint is available 👉 https://doi.org/10.48550/arXiv.2209.04732

What Does This Repository Provide?

Through this repository we provide the following:

Mappings: A free set of omop2obo mappings that can be used out of the box (requires no coding) covering OMOP Conditions, Drug Exposures, and Measurements. These mappings are available in several formats including: .txt, .xlsx, and .dump. We also provide a semantic representation of the mappings, integrated with the OBO biomedical ontologies, available as an edge list (.txt) and as an .owl file. See current release for more details.
A Mapping Framework: An algorithm and mapping pipeline that enables users to construct their own set of omop2obo mappings. The figure below provides a high-level overview of the algorithm workflow. The code provided in this repository facilitates all of the automatic steps shown in this figure except for the manual mapping (for now, although we are currently working on a deep learning model to address this).

How do I Learn More?

Join an existing or start a new Discussion
Visit the Project Wiki for more details on the omop2obo mappings, the algorithm, and the experiments we ran to ensure each released mapping set is accurate, generalizable, and consistent!
A Zenodo Community has been established to provide access to software releases, presentations, and preprints related to this project

Releases

All code and mappings for each release are free to download; see the Wiki.
Please see our dashboard to get current stats on available mappings and for links to download them.

Current Release:

v1.0.0 ➞ data and code can be directly downloaded here.

Condition Occurrence Mappings: https://doi.org/10.5281/zenodo.6774363

Drug Exposure Ingredient Mappings: https://doi.org/10.5281/zenodo.6774401

Measurement Mappings: https://doi.org/10.5281/zenodo.6774443

Getting Started

Install Library

This program requires Python version 3.6. To install the library from PyPI, run:

pip install omop2obo

You can also clone the repository directly from GitHub by running:

git clone https://github.com/callahantiff/OMOP2OBO.git

Set-Up Environment

The omop2obo library requires a specific project directory structure. Please make sure that your project directory includes the following sub-directories:

OMOP2OBO/
    |
    |---- resources/
    |         |
    |     clinical_data/
    |         |
    |     mappings/
    |         |
    |     ontologies/

Results will be output to the mappings directory.
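
The directory layout above can be created by hand, or with a few lines of Python. The following is a minimal sketch, assuming the project root is named OMOP2OBO as in the tree above; the helper name create_project_dirs is illustrative and is not part of the omop2obo library.

```python
from pathlib import Path


def create_project_dirs(root="OMOP2OBO"):
    """Create the resources sub-directories shown in the tree above, if missing.

    NOTE: illustrative helper, not part of the omop2obo library.
    """
    resources = Path(root) / "resources"
    created = []
    for name in ("clinical_data", "mappings", "ontologies"):
        sub_dir = resources / name
        # parents=True also creates root/resources; exist_ok=True makes reruns safe
        sub_dir.mkdir(parents=True, exist_ok=True)
        created.append(sub_dir)
    return created
```

Calling create_project_dirs() from the directory that should contain the project creates any missing folders and leaves existing ones untouched.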

Dependencies

APPLICATIONS

This software also relies on OWLTools. If cloning the repository, the owltools library file will automatically be included and placed in the correct directory.
The National Library of Medicine's Unified Medical Language System (UMLS) MRCONSO and MRSTY. Using these data requires a license agreement. Note that in order to get the MRSTY file you will need to download the UMLS Metathesaurus and run MetamorphoSys. Once both data sources are obtained, please place the files in the resources/mappings directory.

DATA

Clinical Data: This repository assumes that the clinical data that needs mapping has been placed in the resources/clinical_data directory. Each data source provided in this directory is assumed to have been extracted from the OMOP CDM. An example of what is expected for this input can be found here.
Ontology Data: Ontology data is automatically downloaded from the user provided input file ontology_source_list.txt (here).
Vocabulary Source Code Mapping: To increase the likelihood of capturing existing database cross-references, omop2obo provides a file, source_code_vocab_map.csv (here), that maps the different clinical vocabulary source code prefixes used by the UMLS, ontologies, and clinical EHR data (e.g. "SNOMED", "SNOMEDCT", "SNOMEDCT_US"). Please note that this file builds off of these UMLS-provided abbreviation mappings. Currently, this file is updated for ontologies released July 2020, clinical data normalized to OMOP_v5.0, and UMLS 2020AA.
Semantic Mapping Representation: In order to create a semantic representation of the omop2obo mappings, an ontological specification for creating classes that span multiple ontologies is provided (resources/mapping_semantics/omop2obo). This document only needs to be altered if you plan to utilize the semantic mapping transformation algorithm and want to use a different knowledge representation. Please see the following README for additional details on these resources.
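
To illustrate why the vocabulary source code mapping matters, the sketch below normalizes the prefix of a source code so that "SNOMED", "SNOMEDCT", and "SNOMEDCT_US" all compare equal. It is only an illustration: the alias table is hard-coded and hypothetical, the real source_code_vocab_map.csv may use different columns and values, and normalize_source_code is not a function of the omop2obo library.

```python
# Hypothetical alias table; the real source_code_vocab_map.csv may differ.
PREFIX_ALIASES = {
    "snomed": "snomed",
    "snomedct": "snomed",
    "snomedct_us": "snomed",
}


def normalize_source_code(code, aliases=PREFIX_ALIASES):
    """Map the vocabulary prefix of a 'prefix:identifier' code to a shared form."""
    prefix, sep, identifier = code.partition(":")
    if not sep:
        return code  # no prefix present, so leave the code unchanged
    key = prefix.strip().lower()
    # unknown prefixes are only lower-cased
    return f"{aliases.get(key, key)}:{identifier.strip()}"
```

With this in place, a UMLS code written as SNOMEDCT_US:80251000119104 and an OMOP code written as snomed:80251000119104 normalize to the same string and can be joined directly.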

Running the omop2obo Library

There are a few ways to run omop2obo. An example workflow is provided below.

import glob
import pandas as pd
import pickle

from datetime import date, datetime

from omop2obo import ConceptAnnotator, OntologyDownloader, OntologyInfoExtractor, SimilarStringFinder


# set some global variables
outfile = 'resources/mappings/OMOP2OBO_MAPPED_'
date_today = '_' + date.today().strftime('%d%b%Y').upper()

# download ontologies
ont = OntologyDownloader('resources/ontology_source_list.txt')
ont.downloads_data_from_url()

# process ontologies
ont_explorer = OntologyInfoExtractor('resources/ontologies', ont.data_files)
ont_explorer.ontology_processor()

# create master dictionary of processed ontologies
ont_explorer.ontology_loader()

# read in ontology data
with open('resources/ontologies/master_ontology_dictionary.pickle', 'rb') as handle:
    ont_data = pickle.load(handle)  # the with block closes the file automatically

# process clinical data
mapper = ConceptAnnotator(clinical_file='resources/clinical_data/omop2obo_conditions_june2020.csv',
                          ontology_dictionary={k: v for k, v in ont_data.items() if k in ['hp', 'mondo']},
                          merge=True,
                          primary_key='CONCEPT_ID',
                          concept_codes=tuple(['CONCEPT_SOURCE_CODE']),
                          concept_strings=tuple(['CONCEPT_LABEL', 'CONCEPT_SYNONYM']),
                          ancestor_codes=tuple(['ANCESTOR_SOURCE_CODE']),
                          ancestor_strings=tuple(['ANCESTOR_LABEL']),
                          # use the first matching UMLS file if one exists, otherwise None
                          umls_mrconso_file=next(iter(glob.glob('resources/mappings/*MRCONSO*')), None),
                          umls_mrsty_file=next(iter(glob.glob('resources/mappings/*MRSTY*')), None))

# settings that main.py receives from the command line (the values here are examples)
tfidf_mapping = 'yes'
clinical_domain = 'conditions'

# perform dbXref and exact string mapping
exact_mappings = mapper.clinical_concept_mapper()
exact_mappings.to_csv(outfile + 'CONDITIONS' + date_today + '.csv', sep=',', index=False, header=True)
# get column names -- used later to organize output
start_cols = [i for i in exact_mappings.columns if not any(j for j in ['STR', 'DBXREF', 'EVIDENCE'] if j in i)]
exact_cols = [i for i in exact_mappings.columns if i not in start_cols]

# perform similarity mapping
if tfidf_mapping is not None:
    sim = SimilarStringFinder(clinical_file=outfile + 'CONDITIONS' + date_today + '.csv',
                              ontology_dictionary={k: v for k, v in ont_data.items() if k in ['hp', 'mondo']},
                              primary_key='CONCEPT_ID',
                              concept_strings=tuple(['CONCEPT_LABEL', 'CONCEPT_SYNONYM']))

    sim_mappings = sim.performs_similarity_search()
    sim_mappings = sim_mappings[['CONCEPT_ID'] + [x for x in sim_mappings.columns if 'SIM' in x]].drop_duplicates()
    # get column names -- used later to organize output
    sim_cols = [i for i in sim_mappings.columns if not any(j for j in start_cols if j in i)]

    # merge dbXref, exact string, and TF-IDF similarity results
    merged_scores = pd.merge(exact_mappings, sim_mappings, how='left', on='CONCEPT_ID')
    # re-order columns and write out data
    merged_scores = merged_scores[start_cols + exact_cols + sim_cols]
    merged_scores.to_csv(outfile + clinical_domain.upper() + date_today + '.csv', sep=',', index=False, header=True)

COMMAND LINE ➞ main.py

python main.py --help
Usage: main.py [OPTIONS]

The OMOP2OBO package provides functionality to assist with mapping OMOP standard clinical terminology
concepts to OBO terms. Successfully running this program requires several input parameters, which are
specified below:


PARAMETERS:
    ont_file: 'resources/ontology_source_list.txt'
    tfidf_mapping: "yes" if you want to perform cosine similarity mapping using a TF-IDF matrix.
    clinical_domain: clinical domain of input data (i.e. "conditions", "drugs", or "measurements").
    merge: A bool specifying whether to merge UMLS SAB codes with OMOP source codes once or twice.
    onts: A comma-separated list of ontology prefixes that matches 'resources/ontology_source_list.txt'.
    clinical_data: The filepath to the clinical data needing mapping.
    primary_key: The name of the column to use as the primary key.
    concept_codes: A comma-separated list of concept-level codes to use for DbXRef mapping.
    concept_strings: A comma-separated list of concept-level strings to use for exact string mapping.
    ancestor_codes: A comma-separated list of ancestor-level codes to use for DbXRef mapping.
    ancestor_strings: A comma-separated list of ancestor-level strings to use for exact string mapping.
    outfile: The filepath for where to write output data to.

Several dependencies must be addressed before running this file. Please see the README for instructions.

Options:
  --ont_file PATH          [required]
  --tfidf_mapping TEXT     [required]
  --clinical_domain TEXT   [required]
  --merge                  [required]
  --onts TEXT              [required]
  --clinical_data PATH     [required]
  --primary_key TEXT       [required]
  --concept_codes TEXT     [required]
  --concept_strings TEXT
  --ancestor_codes TEXT
  --ancestor_strings TEXT
  --outfile TEXT           [required]
  --help                   Show this message and exit.

If you follow the instructions for how to format clinical data (here) and/or use the data that results from running our queries (here), omop2obo can be run with the following call on the command line (with minor updates to the csv filename):

python main.py --clinical_domain condition --onts hp --onts mondo --clinical_data resources/clinical_data/omop2obo_conditions_june2020.csv

JUPYTER NOTEBOOK ➞ omop2obo_notebook.ipynb

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

License

This project is licensed under MIT - see the LICENSE.md file for details.

Citing this Work

@software{callahan_tiffany_j_2020_3902767,
          author     = {Callahan, Tiffany J},
          title      = {OMOP2OBO},
          month      = jun,
          year       = 2020,
          publisher  = {Zenodo},
          version    = {v1.0.0},
          doi        = {10.5281/zenodo.3902767},
          url        = {https://doi.org/10.5281/zenodo.3902767}
}

Contact

We’d love to hear from you! To get in touch with us, please join or start a new Discussion, create an issue or send us an email 💌

omop2obo's Issues

TODO - Pubs+Presentation Task: Deposit versions in Zenodo to get citable DOI

Deposit versions in Zenodo to get citable DOI

Add Jupyter Notebooks

Task: Add Jupyter Notebooks for the following:

Running the mapping pipeline -- take existing content from main.py and organize it in a similar fashion as done for PheKnowLator (we've had good feedback that this approach is user friendly)
Performing coverage studies using Concept Prevalence data
Aligning outside OMOP concepts to the current main mapping file and outputting useful information for next steps

TODO - Coding: Condition annotation verification

Needed Scripts: verify code to create condition code mapping

Map raw SNOMED-CT to CUI and Semantic Types (UMLS API to retrieve both OR download MRCONSO and SEMTYPE files from UMLS)
Code to get synonyms, labels, and defs from ontologies (write code to download data)
Similarity code

Inputs: pandas data frame or results of SQL query
Action: maps clinical concepts to different open biomedical ontologies (OBOs)
Output: csv file with suggested mappings between condition codes and OBOs

Extension: Extend mapping to take advantage of existing UMLS mappings in ontologies

Currently, the mapping code for DbXRefs is designed to map OMOP source codes to UMLS SABs. There is potential to easily extend this to take advantage of existing UMLS mappings provided by the ontologies. This is not a bug that must be fixed, but it is a change to the current pipeline that could capture additional mappings at no extra cost.

Coding: Parallelize tf-idf cosine similarity code

Script: string_similarity.py

Needed Changes:

Parallelize the scores_tfidf() method to run each input ontology ontology_type in parallel.
Aggregate/merge mapping results for each ontology
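
The two changes above (score each ontology in parallel, then merge the results) can be sketched with the standard library. This is only an illustration of the pattern: score_ontology is a hypothetical stand-in that uses token overlap (Jaccard similarity) rather than the TF-IDF cosine similarity computed by scores_tfidf(), and the identifiers in the usage note are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def score_ontology(ontology_type, clinical_labels, ontology_labels):
    """Hypothetical stand-in for a per-ontology scoring step (Jaccard token overlap)."""
    results = []
    for concept_id, label in clinical_labels.items():
        concept_tokens = set(label.lower().split())
        for ont_id, ont_label in ontology_labels.items():
            ont_tokens = set(ont_label.lower().split())
            union = concept_tokens | ont_tokens
            score = len(concept_tokens & ont_tokens) / len(union) if union else 0.0
            if score > 0:
                results.append((concept_id, ontology_type, ont_id, round(score, 3)))
    return results


def score_all(clinical_labels, ontologies, max_workers=4):
    """Score every ontology in its own worker, then merge results in a stable order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {ont: pool.submit(score_ontology, ont, clinical_labels, labels)
                   for ont, labels in ontologies.items()}
        merged = []
        for ont in sorted(futures):  # sorted keys keep the merged output deterministic
            merged.extend(futures[ont].result())
    return merged
```

For example, score_all({1: 'joint swelling'}, {'hp': {'HP_X': 'Joint swelling'}, 'mondo': {'MONDO_X': 'swelling disorder'}}) returns one scored row per ontology. Because the scoring is CPU-bound, a ProcessPoolExecutor is likely a better fit than threads for real workloads; the thread pool is used here only to keep the sketch short.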

Improve string delimiter detection in mapping pipeline

Describe the Bug

An assumption is made that all concept synonyms and ancestor information will be input in an aggregated format with each aggregated concept separated by a | delimiter. That's a brittle assumption that should be improved. Examples of specs for input data can be found here: resources/clinical_data/README.md

EXAMPLE:
Input Data
The CONCEPT_SYNONYM column below displays data in the expected input format

CONCEPT_ID	CONCEPT_SOURCE_CODE	CONCEPT_LABEL	CONCEPT_SOURCE_LABEL	CONCEPT_SYNONYM
37018594	snomed:80251000119104	Complement level below reference range	Complement level below reference range	Complement level below reference range | Complement level below reference range (finding)

Example of Data that Breaks Assumptions:
The CONCEPT_SYNONYM column below displays data in an unexpected input format (i.e. two types of delimiters | and ;)

CONCEPT_ID	CONCEPT_SOURCE_CODE	CONCEPT_LABEL	CONCEPT_SYNONYM
40771573	loinc:69052-9	Flow cytometry specialist review of results	Flow cytometry specialist review of results | Flow cytometry specialist review | Dynamic; Impression; Impression/interpretation of study; Impressions; Interp; Interpretation; Misc; Miscellaneous; Narrative; Other; Point in time; Random; Report; To be specified in another part of the message; Unspecified

Impact Level

LOW - the string similarity mapping step correctly handles all delimiter types, which allows it to recover mappings missed by the exact match step of the pipeline.

Impacted Scripts

omop2obo/clinical_concept_annotator.py

Solution

Add a parameter to pass delimiter type
Improve tests to better vet input formats
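
One possible shape for the proposed delimiter parameter is sketched below. It is an assumption about how the fix could look, not the code in clinical_concept_annotator.py; the function name split_aggregated and its delimiters parameter are hypothetical.

```python
import re


def split_aggregated(value, delimiters=("|",)):
    """Split an aggregated string on one or more delimiters.

    Empty entries and case-insensitive repeats are dropped; order is preserved.
    NOTE: hypothetical helper, not part of the omop2obo library.
    """
    if not value:
        return []
    # build a regex alternation from the escaped delimiters, e.g. "\\||;"
    pattern = "|".join(re.escape(d) for d in delimiters)
    seen, items = set(), []
    for part in re.split(pattern, value):
        item = part.strip()
        if item and item.lower() not in seen:
            seen.add(item.lower())
            items.append(item)
    return items
```

With the default delimiter the function reproduces the current assumption. Passing delimiters=("|", ";") would also split the LOINC synonym string shown in the second example into individual synonyms.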

SQL Verification (OMOP Concepts): Drug_Exposure Concepts

PURPOSE: This query is designed to query the OMOP drug_exposure, concept, concept_synonym, concept_ancestor, and vocabulary tables.

QUERY TYPE: OMOP Concepts
RUNTIME: 26.4 seconds
RESULTS: 11,949 rows; 9,175 unique drugs and 1,697 unique ingredients

TASK: @mgkahn - Anything you think I should change or anything that needs improvement?

WITH 
  drug_concepts
  AS (SELECT
        d.drug_concept_id AS CONCEPT_ID,
        c.concept_code AS CONCEPT_SOURCE_CODE, 
        c.concept_name AS CONCEPT_LABEL,
        c.vocabulary_id AS CONCEPT_VOCAB,
        v.vocabulary_version AS CONCEPT_VOCAB_VERSION
      FROM 
        CHCO_DeID_Oct2018.drug_exposure d 
        JOIN CHCO_DeID_Oct2018.concept c ON d.drug_concept_id = c.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c.vocabulary_id = v.vocabulary_id
      WHERE 
        c.concept_name != "No matching concept" 
        AND c.domain_id = "Drug"
      GROUP BY CONCEPT_ID, CONCEPT_SOURCE_CODE, CONCEPT_LABEL, CONCEPT_VOCAB, CONCEPT_VOCAB_VERSION),
  
  drug_ancestors
  AS (SELECT
        ca.descendant_concept_id AS CONCEPT_ID,
        STRING_AGG(DISTINCT(CAST(c1.concept_id as STRING)), " | ") AS ANCESTOR_CONCEPT_ID,
        STRING_AGG(DISTINCT(c1.concept_code), " | ") AS ANCESTOR_SOURCE_CODE, 
        STRING_AGG(DISTINCT(c1.concept_name), " | ") AS ANCESTOR_LABEL,
        STRING_AGG(DISTINCT(c1.vocabulary_id), " | ") AS ANCESTOR_VOCAB,
        STRING_AGG(DISTINCT(v.vocabulary_version), " | ") AS ANCESTOR_VOCAB_VERSION
      FROM 
        CHCO_DeID_Oct2018.concept_ancestor ca
        JOIN CHCO_DeID_Oct2018.concept c1 ON ca.ancestor_concept_id = c1.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c1.vocabulary_id = v.vocabulary_id
      WHERE 
        ca.descendant_concept_id IN (SELECT CONCEPT_ID FROM drug_concepts)
        AND c1.concept_name != "No matching concept"
        AND c1.concept_id IS NOT NULL
        AND c1.domain_id = "Drug"
      GROUP BY CONCEPT_ID),
  
  drug_synonyms
  AS (SELECT 
        s.concept_id AS CONCEPT_ID,
        STRING_AGG(DISTINCT(s.concept_synonym_name), " | ") AS CONCEPT_SYNONYM
        FROM CHCO_DeID_Oct2018.concept_synonym s 
        WHERE s.concept_id in (SELECT CONCEPT_ID FROM drug_concepts)
        GROUP BY CONCEPT_ID),
        
  ingredient_concepts
  AS (SELECT
        ca.descendant_concept_id AS CONCEPT_ID,
        c1.concept_id AS INGREDIENT_CONCEPT_ID,
        c1.concept_code AS INGREDIENT_SOURCE_CODE, 
        c1.concept_name AS INGREDIENT_LABEL,
        c1.vocabulary_id AS INGREDIENT_VOCAB,
        v.vocabulary_version AS INGREDIENT_VOCAB_VERSION
      FROM 
        CHCO_DeID_Oct2018.concept_ancestor ca
        JOIN CHCO_DeID_Oct2018.concept c1 ON ca.ancestor_concept_id = c1.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c1.vocabulary_id = v.vocabulary_id
      WHERE 
        ca.descendant_concept_id IN (SELECT CONCEPT_ID FROM drug_concepts)
        AND c1.concept_name != "No matching concept"
        AND c1.concept_id IS NOT NULL
        AND c1.domain_id = "Drug"
        AND c1.concept_class_id = "Ingredient"
      GROUP BY CONCEPT_ID, INGREDIENT_CONCEPT_ID, INGREDIENT_SOURCE_CODE, INGREDIENT_LABEL, INGREDIENT_VOCAB, INGREDIENT_VOCAB_VERSION),
  
  ingredient_ancestors
  AS (SELECT
        ca.descendant_concept_id AS CONCEPT_ID,
        STRING_AGG(DISTINCT(CAST(c1.concept_id as STRING)), " | ") AS INGRED_ANCESTOR_CONCEPT_ID,
        STRING_AGG(DISTINCT(c1.concept_code), " | ") AS INGRED_ANCESTOR_SOURCE_CODE, 
        STRING_AGG(DISTINCT(c1.concept_name), " | ") AS INGRED_ANCESTOR_LABEL,
        STRING_AGG(DISTINCT(c1.vocabulary_id), " | ") AS INGRED_ANCESTOR_VOCAB,
        STRING_AGG(DISTINCT(v.vocabulary_version), " | ") AS INGRED_ANCESTOR_VOCAB_VERSION
      FROM 
        CHCO_DeID_Oct2018.concept_ancestor ca
        JOIN CHCO_DeID_Oct2018.concept c1 ON ca.ancestor_concept_id = c1.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c1.vocabulary_id = v.vocabulary_id
      WHERE 
        ca.descendant_concept_id IN (SELECT CONCEPT_ID FROM ingredient_concepts)
        AND c1.concept_name != "No matching concept"
        AND c1.concept_id IS NOT NULL
        AND c1.domain_id = "Drug"
      GROUP BY CONCEPT_ID),
      
  ingredient_synonyms
  AS (SELECT 
        s.concept_id AS INGREDIENT_CONCEPT_ID,
        STRING_AGG(DISTINCT(s.concept_synonym_name), " | ") AS INGREDIENT_SYNONYM
        FROM CHCO_DeID_Oct2018.concept_synonym s 
        WHERE s.concept_id in (SELECT INGREDIENT_CONCEPT_ID FROM ingredient_concepts)
        GROUP BY INGREDIENT_CONCEPT_ID)

SELECT
  d.CONCEPT_ID,
  d.CONCEPT_SOURCE_CODE,
  d.CONCEPT_LABEL,
  d.CONCEPT_VOCAB,
  d.CONCEPT_VOCAB_VERSION,
  s.CONCEPT_SYNONYM,
  a.ANCESTOR_CONCEPT_ID,
  a.ANCESTOR_SOURCE_CODE, 
  a.ANCESTOR_LABEL,
  a.ANCESTOR_VOCAB,
  a.ANCESTOR_VOCAB_VERSION,
  i.INGREDIENT_CONCEPT_ID,
  i.INGREDIENT_SOURCE_CODE,
  i.INGREDIENT_LABEL,
  i.INGREDIENT_VOCAB,
  i.INGREDIENT_VOCAB_VERSION,
  s2.INGREDIENT_SYNONYM,
  a2.INGRED_ANCESTOR_CONCEPT_ID,
  a2.INGRED_ANCESTOR_SOURCE_CODE,
  a2.INGRED_ANCESTOR_LABEL,
  a2.INGRED_ANCESTOR_VOCAB,
  a2.INGRED_ANCESTOR_VOCAB_VERSION
  
FROM drug_concepts d
  FULL JOIN drug_ancestors a ON d.CONCEPT_ID = a.CONCEPT_ID
  FULL JOIN drug_synonyms s ON d.CONCEPT_ID = s.CONCEPT_ID
  FULL JOIN ingredient_concepts i ON d.CONCEPT_ID = i.CONCEPT_ID
  FULL JOIN ingredient_ancestors a2 ON i.CONCEPT_ID = a2.CONCEPT_ID
  FULL JOIN ingredient_synonyms s2 ON i.INGREDIENT_CONCEPT_ID = s2.INGREDIENT_CONCEPT_ID;

TODO - Project Organization: Write project README

Task: Provide a project README

Migrate from TravisCI to GitHub Actions

Task

Migrate from TravisCI to GitHub Actions

Description

TravisCI is no longer free and a new CI framework is needed. GitHub has provided some documentation on how to do this here.

Coding: Creating Mapping Testing Pipeline

GOALS: Create a pipeline that can be used to extend pediatric clinical-concept mappings to a new data source.

Workflow Update: @SteeleRobert has agreed to help with creating this pipeline.

Background:
We have created a large set of mappings from clinical diagnoses (n=29,128), medications (n=9,175 unique medications or 1,693 unique ingredients), and measurements (n=2,703 unique measurement results) to open biomedical ontologies.

TODO:

Build a pipeline that performs multi-label classification.
- Code should take in a set of OMOP codes and output mappings, each with a confidence score, to a specific set of ontologies:
  - Diagnosis codes ➞ Human Phenotype and Human Disease Ontologies
  - Medications ➞ NCBITaxon, Protein Ontology, ChEBI, and Vaccine Ontology
  - Measurement ➞ Human Phenotype
- Can test code on UCHealth adult data set, MIMIC, and PIC

General Guidelines:

Build scripts in an object-oriented framework
- Sketch architecture before building code
- General parsing class with subclasses by clinical type (i.e. conditions, measurements, medications)
Test-driven development
Needs to be written using Keras and TensorFlow
Inputs: a list of clinical codes, a list of ontologies

NEXT STEPS:

Discuss this issue
Agree on roles and authorship
Discuss plan for moving forward, starting with discussions of architecture, prior to beginning coding

@SteeleRobert - are you good with this plan?

Coding: add requirements.txt

Add a requirements.txt to make it easier to create an environment with required libs.

TODO - Mapping: Verify Lab test mapping

Verify the last few lab test results with Dr. Vasilevsky.

Coding: adding helper function to leverage OHDSIAnanke

Needed Scripts: Simple extension added to helper functions to leverage functionality described in OHDSIAnanke. This should be a very simple and straightforward task since we already do almost exactly what this method does, with some very minor extensions.

TODO - Mapping: Finalize medication annotations

Verify medication mappings with Jessica. Also need to update remaining mappings to PRO.

Create YouTube Demo

Task: Create a simple YouTube video to accompany the repo that provides:

A brief overview of the project
Requirements/dependencies
How to run the code in the main Jupyter Notebook
How to access current OMOP2OBO releases

Project Meeting -- 05/28/2020 @ 12:00

Meeting Date: May 28, 2020
Topic: Brief description of meeting topic
Attendees: @mgkahn

Proposed Agenda:

Plan for verifying mapping coverage in additional OMOP Health Systems
Discuss recent COVID-19 work done by OHDSI group and discuss plan to demonstrate impact of adding OMOP2OBO mappings
Integration plan and outreach

TODO - Coding: Update medication mapping code

Needed Scripts: Update medication concept annotation code

Code to map RxNORM strings to DrugBank and return synonyms

Inputs: pandas data frame
Output: csv with maps between medication concepts and OBOs

TODO: Mapping Release Formatting

Due to some of the OMOP clinical vocabularies having licensing restrictions, we need to decide how and in what format the mappings will be publicly released.

Regardless of the above issue, we will be releasing the mappings in at least the following formats:

An analytics set that has been lossy-compressed so that only OMOP concept IDs are shown; users can recover the underlying source codes by running the code in this repo
An RDF version with provenance covering the evidence behind the mappings
All mappings at all levels (i.e. concept and ancestor) as Excel spreadsheets and a normalized database dump

running without a mrconso file has an error

Describe the bug
running without a mrconso file creates an error

To Reproduce
Steps to reproduce the behavior:
(test_env) CARD-CRoeder:OMOP2OBO christopherroeder$ time ./main.py --clinical_domain condition --clinical_data resources/clinical_data/sample_conditions.csv

Expected behavior
no error

Screenshots

*** Annotating Level: concept
Performing UMLS CUI + Semantic Type Annotation
Traceback (most recent call last):
  File "./main.py", line 149, in <module>
    main()
  File "/Users/christopherroeder/work/git_misc/OMOP2OBO/test_env/lib/python3.8/site-packages/click/core.py", line 1025, in __call__
    return self.main(*args, **kwargs)
  File "/Users/christopherroeder/work/git_misc/OMOP2OBO/test_env/lib/python3.8/site-packages/click/core.py", line 955, in main
    rv = self.invoke(ctx)
  File "/Users/christopherroeder/work/git_misc/OMOP2OBO/test_env/lib/python3.8/site-packages/click/core.py", line 1279, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/christopherroeder/work/git_misc/OMOP2OBO/test_env/lib/python3.8/site-packages/click/core.py", line 710, in invoke
    return callback(*args, **kwargs)
  File "./main.py", line 97, in main
    mappings = mapper.clinical_concept_mapper()
  File "/Users/christopherroeder/work/git_misc/OMOP2OBO/omop2obo/clinical_concept_annotator.py", line 373, in clinical_concept_mapper
    if self.umls_cui_data is not None and self.umls_tui_data is not None:
AttributeError: 'ConceptAnnotator' object has no attribute 'umls_cui_data'
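
The traceback suggests that umls_cui_data is only assigned when a UMLS file is supplied, so the later check fails when no file is given. One plausible fix is to always initialize both attributes. The class below is a simplified, hypothetical illustration of that pattern and is not the actual ConceptAnnotator implementation.

```python
import os


class AnnotatorSketch:
    """Hypothetical, simplified class; not the real ConceptAnnotator."""

    def __init__(self, umls_mrconso_file=None, umls_mrsty_file=None):
        # Always define both attributes so later checks never raise AttributeError.
        self.umls_cui_data = None
        self.umls_tui_data = None
        if umls_mrconso_file and os.path.exists(umls_mrconso_file):
            self.umls_cui_data = umls_mrconso_file  # placeholder for loaded MRCONSO data
        if umls_mrsty_file and os.path.exists(umls_mrsty_file):
            self.umls_tui_data = umls_mrsty_file  # placeholder for loaded MRSTY data

    def uses_umls(self):
        """Mirror the check from the traceback; safe even when no files were given."""
        return self.umls_cui_data is not None and self.umls_tui_data is not None
```

With the attributes defined up front, AnnotatorSketch().uses_umls() simply returns False and the UMLS annotation step can be skipped instead of raising an error.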

TODO - Coding: OBO update Watcher

Needed Scripts: Write code for OBOs that monitors issues and pings when a better match has been added or when a term has been deprecated

SQL Verification - OMOP Coverage V2

PURPOSE: This query is designed to query an OMOP instance and return 3 columns. It assumes that other sites cannot share detailed results, so we calculate and return only simple coverage statistics.

COVERED_CONCEPTS - The number of concepts covered by our mappings
TOTAL_CONCEPTS - The total number of mapped concepts
DOMAIN - A string indicating the clinical domain each of the concept codes belong to

QUERY TYPE: OMOP Coverage V2
RUNTIME: 36.4 seconds
RESULTS: 3 rows (i.e. unique CONCEPT_IDs)
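
The coverage statistics described above can be sketched in a few lines of Python. This is an illustration only and makes two assumptions that the issue does not confirm: that TOTAL_CONCEPTS counts the distinct concepts a site uses in each domain, and that a concept is "covered" when its concept ID appears in the released mapping set. The function name and inputs are hypothetical.

```python
from collections import defaultdict


def coverage_by_domain(site_concepts, mapped_concept_ids):
    """Summarize coverage per domain.

    site_concepts: iterable of (concept_id, domain) pairs from a site (assumed input).
    mapped_concept_ids: set of concept IDs that have a mapping (assumed input).
    """
    totals, covered = defaultdict(set), defaultdict(set)
    for concept_id, domain in site_concepts:
        totals[domain].add(concept_id)  # sets ignore repeated concept IDs
        if concept_id in mapped_concept_ids:
            covered[domain].add(concept_id)
    return [
        {"DOMAIN": d, "COVERED_CONCEPTS": len(covered[d]), "TOTAL_CONCEPTS": len(totals[d])}
        for d in sorted(totals)
    ]
```

The function returns one row per domain, which matches the three-row result reported above when the input covers conditions, drugs, and measurements.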

TASK: @mgkahn - What do you think about this?

QUERIES:

SQLRender Query Templates:

Wiki: Broken links on V1.0 release page

Wiki Page:

https://github.com/callahantiff/OMOP2OBO/wiki/V1.0

Suggestions:

"All data used for this release can e downloaded directly from Zenodo (here) http://doi.org/10.5281/zenodo.4247939 " -> DOI not found
Zenodo.org links under Ontologies section point to "404: resource not found"

TODO - Project Organization: write wiki page

Task: project wiki

Description: write a project wiki that describes why this work is being done and what it will be used for

TODO - Mapping: Procedure Annotations

Needed Scripts: verify code to create procedure code mapping

Code to get synonyms, labels, and defs from ontologies (write code to download data)
Similarity code
Generate mappings
Verify mappings

Coding: Add RDF-ization code to convert mappings to RDF

Needed Scripts: Write a script that converts mappings into RDF

@nicolevasilevsky -- thank you for meeting with me a few weeks ago and confirming our approach looks reasonable. I am just documenting this here as an issue since it's work I still need to do.

Planned Approach

NOT()
Details: Only occurs within the HP and only for Measurement and Drug domains
class_IRI: https://github.com/callahantiff/omop2obo/obo/ext/OMOP_4021360
Class_Name: 'Skin appearance normal'
Class Expression Syntax: not('Abnormality of the skin')

New Triples:

omop2obo: <https://github.com/callahantiff/omop2obo/obo/ext/>
obo: <http://purl.obolibrary.org/obo/>
oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
owl: <http://www.w3.org/2002/07/owl#>
rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
rdfs: <http://www.w3.org/2000/01/rdf-schema#>

omop2obo:OMOP_4021360, oboInOwl:hasOBONamespace, OMOP2OBO
omop2obo:OMOP_4021360, oboInOwl:id, OMOP:4021360
omop2obo:OMOP_4021360, rdf:type, owl:Class
omop2obo:OMOP_4021360, rdfs:label, 'Skin appearance normal'

omop2obo:OMOP_4021360, owl:equivalentClass, ec1
ec1, rdf:type, owl:Class
ec1, owl:complementOf, obo:HP_0000951

OR()
Details: Only occurs within DOID and HP and only for the Condition domain
class_IRI: https://github.com/callahantiff/omop2obo/obo/ext/OMOP_434473
Class_Name: 'Longitudinal deficiency of tibia AND/OR fibula'
Class Expression Syntax: ('Abnormality of fibula morphology' or 'Abnormality of tibia morphology')

New Triples:

omop2obo: <https://github.com/callahantiff/omop2obo/obo/ext/>
obo: <http://purl.obolibrary.org/obo/>
oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
owl: <http://www.w3.org/2002/07/owl#>
rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
rdfs: <http://www.w3.org/2000/01/rdf-schema#>

omop2obo:OMOP_434473, oboInOwl:hasOBONamespace, OMOP2OBO
omop2obo:OMOP_434473, oboInOwl:id, OMOP:434473
omop2obo:OMOP_434473, rdfs:label, "Longitudinal deficiency of tibia AND/OR fibula"
omop2obo:OMOP_434473, rdf:type, owl:Class
omop2obo:OMOP_434473, owl:equivalentClass, ec1

ec1, rdf:type, owl:Class
ec1, owl:unionOf, ec1_union1
ec1_union1, rdf:type, rdf:List
ec1_union1, rdf:first, obo:HP_0002991
ec1_union1, rdf:rest, ec1_union2

ec1_union2, rdf:type, rdf:List
ec1_union2, rdf:first, obo:HP_0002992
ec1_union2, rdf:rest, rdf:nil
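
The rdf:first / rdf:rest chain used for the OR() construct follows a regular pattern, so it can be generated rather than written by hand. The sketch below emits the triples as plain tuples of strings; it is an illustration of the pattern under the node-naming scheme used above, not the RDF-ization script itself, and the function names are hypothetical. A real implementation would more likely use an RDF library and blank nodes.

```python
def rdf_list_triples(node_prefix, members):
    """Return (subject, predicate, object) triples encoding members as an RDF list."""
    triples = []
    for i, member in enumerate(members, start=1):
        node = f"{node_prefix}{i}"
        # the last node points to rdf:nil, every other node to its successor
        rest = f"{node_prefix}{i + 1}" if i < len(members) else "rdf:nil"
        triples.append((node, "rdf:type", "rdf:List"))
        triples.append((node, "rdf:first", member))
        triples.append((node, "rdf:rest", rest))
    return triples


def union_class_triples(class_node, members, node_prefix="ec1_union"):
    """Triples for an anonymous class defined as the union (OR) of the given members."""
    if not members:
        raise ValueError("a union needs at least one member")
    head = [(class_node, "rdf:type", "owl:Class"),
            (class_node, "owl:unionOf", f"{node_prefix}1")]
    return head + rdf_list_triples(node_prefix, members)
```

Calling union_class_triples('ec1', ['obo:HP_0002991', 'obo:HP_0002992']) produces the eight ec1 triples of the OR() example. Swapping owl:unionOf for owl:intersectionOf would give the AND() form.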

AND()
Details: Occurs within all ontologies and domains
class_IRI: https://github.com/callahantiff/omop2obo/obo/ext/OMOP_434165
Class_Name: 'Abnormal cervical smear'
Class Expression Syntax: ('Abnormal cell morphology' and 'Abnormality of the uterine cervix')

New Triples:

omop2obo: <https://github.com/callahantiff/omop2obo/obo/ext/>
oboInOwl: <http://www.geneontology.org/formats/oboInOwl>
owl: <http://www.w3.org/2002/07/owl>
rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns>
rdfs: <http://www.w3.org/2000/01/rdf-schema>

omop2obo:OMOP_434165 oboInOwl:hasOBONamespace OMOP2OBO
omop2obo:OMOP_434165 oboInOwl:id OMOP:434165
omop2obo:OMOP_434165, rdfs:label, "Abnormal cervical smear"
omop2obo:OMOP_434165, rdf:type, owl:Class
omop2obo:OMOP_434165, owl:equivalentClass, ec1
 
ec1, rdf:type, owl:Class
ec1, owl:intersectionOf, ec_intersection1
ec_intersection1, rdf:first, obo:HP_0012888
ec_intersection1, rdf:rest, ec_intersection2

ec_intersection2, rdf:first, obo:HP_0025461
ec_intersection2, rdf:rest, rdf:nil

AND()/OR()
Details: Only occurs within DOID and HP and only for the Condition domain
class_IRI: https://github.com/callahantiff/omop2obo/obo/ext/OMOP_77072
Class_Name: 'Joint effusion of ankle AND/OR foot'
Class Expression Syntax:

('Joint swelling' and 'Abnormality of the ankles')
or
('Joint swelling' and 'Abnormality of the foot')

New Triples:

omop2obo: <https://github.com/callahantiff/omop2obo/obo/ext/>
oboInOwl: <http://www.geneontology.org/formats/oboInOwl>
owl: <http://www.w3.org/2002/07/owl>
rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns>
rdfs: <http://www.w3.org/2000/01/rdf-schema>

omop2obo:OMOP_77072 oboInOwl:hasOBONamespace OMOP2OBO
omop2obo:OMOP_77072 oboInOwl:id OMOP:77072
omop2obo:OMOP_77072, rdfs:label, "Joint effusion of ankle AND/OR foot"
omop2obo:OMOP_77072, rdf:type, owl:Class
omop2obo:OMOP_77072, owl:equivalentClass, ec1
 
ec1, rdf:type, owl:Class
ec1, owl:unionOf, ec_union1
ec_union1, rdf:type, rdf:List
 
ec_union1, rdf:first, ec_union_member_1
ec_union_member_1, rdf:type, owl:Class
ec_union_member_1, owl:intersectionOf, ec_intersection1
ec_intersection1, rdf:type, rdf:List
 
ec_intersection1, rdf:first, obo:HP_0001386
ec_intersection1, rdf:rest, ec_intersection1b
ec_intersection1b, rdf:type, rdf:List
ec_intersection1b, rdf:first,  obo:HP_0001760
ec_intersection1b, rdf:rest, rdf:nil
 
ec_union1, rdf:rest, ec_union_2
ec_union_2, rdf:type, rdf:List
ec_union_2, rdf:rest, rdf:nil
 
ec_union_2, rdf:first, ec_union_member_2
ec_union_member_2, rdf:type, owl:Class
ec_union_member_2, owl:intersectionOf, ec_intersection2
ec_intersection2, rdf:type, rdf:List
 
ec_intersection2, rdf:first, obo:HP_0001386
ec_intersection2, rdf:rest, ec_intersection2b
ec_intersection2b, rdf:type, rdf:List
ec_intersection2b, rdf:first, obo:HP_0003028
ec_intersection2b, rdf:rest, rdf:nil

AND()/NOT()
Details: Only occurs within DOID and HP and only for the Condition domain
class_IRI: https://github.com/callahantiff/omop2obo/obo/ext/OMOP_4120313
Class_Name: 'Non-diabetic disorder of endocrine pancreas'
Class Expression Syntax:
'Abnormality of the pancreas' and not('has phenotype' some 'Diabetes mellitus')

New Triples:

omop2obo: <https://github.com/callahantiff/omop2obo/obo/ext/>
oboInOwl: <http://www.geneontology.org/formats/oboInOwl>
owl: <http://www.w3.org/2002/07/owl>
rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns>
rdfs: <http://www.w3.org/2000/01/rdf-schema>

omop2obo:OMOP_4120313 oboInOwl:hasOBONamespace OMOP2OBO
omop2obo:OMOP_4120313 oboInOwl:id OMOP:4120313
omop2obo:OMOP_4120313, rdfs:label, "Non-diabetic disorder of endocrine pancreas"
omop2obo:OMOP_4120313, rdf:type, owl:Class
omop2obo:OMOP_4120313, owl:equivalentClass, ec1
 
ec1, owl:someValuesFrom, ec1_intersection1
ec1_intersection1, rdf:type, owl:Class
ec1_intersection1, owl:intersectionOf,  ec1_intersection_member1
ec1_intersection_member1 , rdf:first, obo:HP_0001732
ec1_intersection_member1 , rdf:type, rdf:List
 
ec1_intersection_member1 , rdf:rest, ec1_intersection_member2
ec1_intersection_member2, rdf:first,  ec1_complement
ec1_intersection_member2, rdf:rest, rdf:nil
ec1_intersection_member2 , rdf:type, rdf:List
ec1_complement, owl:complementOf, obo:HP_0000819

AND()/OR()/NOT()
Details: Only occurs within DOID and HP and only for the Condition domain
class_IRI: https://github.com/callahantiff/omop2obo/obo/ext/OMOP_435352
Class_Name: 'Periostitis without osteomyelitis, of the pelvic region and/or thigh'
Class Expression Syntax:

((Periostitis and 'Abnormality of femur morphology')
    and not(Osteomyelitis and 'Abnormality of femur morphology'))
or
((Periostitis and 'Abnormality of pelvic girdle bone morphology')
    and not(Osteomyelitis and 'Abnormality of pelvic girdle bone morphology'))

New Triples:

omop2obo: <https://github.com/callahantiff/omop2obo/obo/ext/>
oboInOwl: <http://www.geneontology.org/formats/oboInOwl>
owl: <http://www.w3.org/2002/07/owl>
rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns>
rdfs: <http://www.w3.org/2000/01/rdf-schema>

_:pelvic rdfs:label "Periostitis without osteomyelitis, of the pelvic region"
_:pelvic rdf:type owl:Class
_:pelvic owl:equivalentClass _:ecp
_:ecp owl:intersectionOf _:ecp1
_:ecp1 rdf:type rdf:List
_:ecp1 rdf:first obo:HP_0002644 # abnormality of pelvic region
_:ecp1 rdf:rest _:ecp2
_:ecp2 rdf:type rdf:List
_:ecp2 rdf:first obo:HP_0040165 # periostitis
_:ecp2 rdf:rest _:ecp3
_:ecp3 rdf:type rdf:List
_:ecp3 rdf:first _:ecp4
_:ecp4 rdf:type owl:Class
_:ecp4 owl:complementOf obo:HP_0002754 # osteomyelitis
_:ecp3 rdf:rest rdf:nil
 
_:thigh rdfs:label "Periostitis without osteomyelitis, of the thigh"
_:thigh rdf:type owl:Class
_:thigh owl:equivalentClass _:ect
_:ect owl:intersectionOf _:ect1
_:ect1 rdf:type rdf:List
_:ect1 rdf:first obo:HP_0002823 # abnormality of femur morphology
_:ect1 rdf:rest _:ect2
_:ect2 rdf:type rdf:List
_:ect2 rdf:first obo:HP_0040165 # periostitis
_:ect2 rdf:rest _:ect3
_:ect3 rdf:type rdf:List
_:ect3 rdf:first _:ect4
_:ect4 rdf:type owl:Class
_:ect4 owl:complementOf obo:HP_0002754 # osteomyelitis
_:ect3 rdf:rest rdf:nil
 
omop2obo:OMOP_435352 oboInOwl:hasOBONamespace OMOP2OBO
omop2obo:OMOP_435352 oboInOwl:id OMOP:435352
omop2obo:OMOP_435352, rdfs:label, "Periostitis without osteomyelitis, of the pelvic region and/or thigh"
omop2obo:OMOP_435352, rdf:type, owl:Class
omop2obo:OMOP_435352, owl:equivalentClass, ec1
 
ec1 rdf:type owl:Class
ec1, owl:unionOf, ec2
ec2 rdf:type rdf:List
ec2, rdf:first, _:pelvic
ec2 rdf:rest ec3
ec3 rdf:type rdf:List
ec3 rdf:first _:thigh
ec3 rdf:rest rdf:nil

Coding: Add tests

Needed Scripts: need to bring up testing framework for code that helps match terminology codes to ontology concepts.

Add vanity CLI

If main.py were included in the source code hierarchy, the Python entry points in setuptools could be used to make a vanity CLI called omop2obo that is installed along with the code. This would make it much more extensible for others since they wouldn't have to know where the code itself lives. Would you be willing to accept a PR for this?
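
The entry-point mechanism described above could look roughly like this. A minimal sketch, assuming main.py is moved into the package as omop2obo/main.py and exposes a callable named main; both are assumptions for illustration, not the repository's current layout:

```python
# setup.py (sketch): only the entry_points section matters here.
SETUP_KWARGS = {
    "name": "omop2obo",
    "packages": ["omop2obo"],
    "entry_points": {
        # Format: "<command name> = <module path>:<callable>".
        # The module path and callable name are hypothetical.
        "console_scripts": ["omop2obo = omop2obo.main:main"],
    },
}


def run_setup() -> None:
    """Call setuptools with the arguments above; setup.py would invoke this."""
    from setuptools import setup  # imported lazily so the dict can be inspected alone

    setup(**SETUP_KWARGS)
```

After installation, setuptools places an omop2obo executable on the user's PATH that calls the referenced function.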

SQL Verification - OMOP Coverage V1

PURPOSE: This query is designed to query an OMOP instance and return the 4 columns listed below. It assumes that other sites would be willing to return some results to us, rather than calculating coverage statistics locally (that version will be coming next).

CONCEPT_ID - OMOP concept_id from the condition_occurrence, drug_exposure, and measurement tables
VOCABULARY_VERSION - The OMOP vocabulary version used
VISIT_COUNT - The count of unique visit_occurrence_id by concept_id
PATIENT_COUNT - The count of unique person_id by concept_id

QUERY TYPE: OMOP Coverage V1
RUNTIME: 19.4 seconds
RESULTS: 39,910 rows (i.e. unique CONCEPT_IDs)

TASK: @mgkahn - What do you think about this?

WITH 
  condition_concepts
  AS (SELECT
        c.condition_concept_id AS CONCEPT_ID,
        v.vocabulary_version AS VOCABULARY_VERSION,
        COUNT(DISTINCT c.visit_occurrence_id) AS VISIT_COUNT,
        COUNT(DISTINCT c.person_id) AS PATIENT_COUNT
      FROM 
        CHCO_DeID_Oct2018.condition_occurrence c 
        JOIN CHCO_DeID_Oct2018.concept c1 ON c.condition_concept_id = c1.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c1.vocabulary_id = v.vocabulary_id
      WHERE 
        c1.concept_name != "No matching concept" 
        AND c1.domain_id = "Condition"
      GROUP BY CONCEPT_ID, VOCABULARY_VERSION),
  
  measurement_concepts
  AS (SELECT
        m.measurement_concept_id AS CONCEPT_ID,
        v.vocabulary_version AS VOCABULARY_VERSION,
        COUNT(DISTINCT m.visit_occurrence_id) AS VISIT_COUNT,
        COUNT(DISTINCT m.person_id) AS PATIENT_COUNT
      FROM 
        CHCO_DeID_Oct2018.measurement m 
        JOIN CHCO_DeID_Oct2018.concept c ON m.measurement_concept_id = c.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c.vocabulary_id = v.vocabulary_id
      WHERE 
        c.concept_name != "No matching concept" 
        AND c.domain_id = "Measurement"
      GROUP BY CONCEPT_ID, VOCABULARY_VERSION),
  
  drug_concepts
  AS (SELECT
        d.drug_concept_id AS CONCEPT_ID,
        v.vocabulary_version AS VOCABULARY_VERSION,
        COUNT(DISTINCT d.visit_occurrence_id) AS VISIT_COUNT,
        COUNT(DISTINCT d.person_id) AS PATIENT_COUNT
      FROM 
        CHCO_DeID_Oct2018.drug_exposure d 
        JOIN CHCO_DeID_Oct2018.concept c ON d.drug_concept_id = c.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c.vocabulary_id = v.vocabulary_id
      WHERE 
        c.concept_name != "No matching concept" 
        AND c.domain_id = "Drug"
      GROUP BY CONCEPT_ID, VOCABULARY_VERSION)

SELECT *
FROM
  (SELECT CONCEPT_ID, VOCABULARY_VERSION, VISIT_COUNT, PATIENT_COUNT FROM condition_concepts
    UNION DISTINCT
   SELECT CONCEPT_ID, VOCABULARY_VERSION, VISIT_COUNT, PATIENT_COUNT FROM measurement_concepts
    UNION DISTINCT
   SELECT CONCEPT_ID, VOCABULARY_VERSION, VISIT_COUNT, PATIENT_COUNT FROM drug_concepts);

QUERY FILES:

OMOPCoverage_V1.sql

SQLRender Query Templates:

OMOPCoverage_V1_Template.txt
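
For the planned "local" variant, coverage could be computed from the query output without sharing row-level results. A minimal Python sketch, assuming the rows are available as dictionaries keyed by the query's column names and that the mapped OMOP concept ids are available as a set; the function name coverage is hypothetical:

```python
from typing import Dict, Iterable, Set


def coverage(rows: Iterable[Dict[str, int]], mapped_ids: Set[int]) -> Dict[str, float]:
    """Share of concepts (and of visit counts) that have an OMOP2OBO mapping.

    Visit counts are summed per concept, so a visit with several concepts is
    counted once per concept; this is a weighting, not a count of unique visits.
    """
    rows = list(rows)
    covered = [r for r in rows if r["CONCEPT_ID"] in mapped_ids]
    total_visits = sum(r["VISIT_COUNT"] for r in rows)
    return {
        "concept_coverage": len(covered) / len(rows) if rows else 0.0,
        "visit_weighted_coverage": (
            sum(r["VISIT_COUNT"] for r in covered) / total_visits if total_visits else 0.0
        ),
    }


# Made-up counts, for illustration only.
example_rows = [
    {"CONCEPT_ID": 434165, "VISIT_COUNT": 30},
    {"CONCEPT_ID": 77072, "VISIT_COUNT": 10},
]
print(coverage(example_rows, mapped_ids={434165}))
```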

Generate OMOP2OBO Mappings to OHDSI Cohort Definitions

TASK

Create a Jupyter Notebook (or other tool) that takes in an OHDSI Cohort Definition (from here) and returns the OMOP2OBO mappings.

@jmbanda - What do you think about this? If we had a small tool that could take in the code set(s) for a given OHDSI cohort and return the OMOP2OBO mappings for each concept.
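
One way such a tool could work is sketched below in Python. It assumes the cohort definition is the JSON exported by ATLAS, with concept sets stored under ConceptSets / expression / items / concept / CONCEPT_ID, and that the OMOP2OBO mappings have already been loaded into a dictionary keyed by OMOP concept id. The function names are hypothetical, and descendant expansion (includeDescendants) is deliberately ignored:

```python
import json
from typing import Dict, List


def cohort_concept_ids(cohort_json: str) -> List[int]:
    """Collect the distinct concept ids listed in a cohort definition's concept sets."""
    definition = json.loads(cohort_json)
    ids = set()
    for concept_set in definition.get("ConceptSets", []):
        for item in concept_set.get("expression", {}).get("items", []):
            ids.add(item["concept"]["CONCEPT_ID"])
    return sorted(ids)


def cohort_mappings(cohort_json: str, mappings: Dict[int, List[str]]) -> Dict[int, List[str]]:
    """Return the ontology ids mapped to each cohort concept (empty list if unmapped)."""
    return {cid: mappings.get(cid, []) for cid in cohort_concept_ids(cohort_json)}
```

Concepts without a mapping are returned with an empty list so that gaps in coverage stay visible.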

Coding: SSSOM Output

Basic description

Feature request. Allow option for mappings outputs to be in SSSOM format.
Action: Generate mapping outputs.
Output: Rather than the existing output format, this would output mappings in SSSOM.

#30

Additional information

Implementation details

Can add a new CLI option, e.g. --output-format, with options such as 'standard' (which I guess would be what you have now), and 'sssom'.
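
A rough idea of what the 'sssom' branch could emit is sketched below in Python, using only the standard library. It assumes each mapping is already a dictionary with SSSOM slot names; the predicate and justification values in the example row are illustrative choices, not decisions made by the project, and the YAML metadata header that SSSOM files usually carry is omitted:

```python
import csv
import io
from typing import Dict, Iterable

SSSOM_COLUMNS = [
    "subject_id", "subject_label", "predicate_id",
    "object_id", "object_label", "mapping_justification",
]


def to_sssom_tsv(mappings: Iterable[Dict[str, str]]) -> str:
    """Serialize mapping rows as a tab-separated table with SSSOM column names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=SSSOM_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in mappings:
        writer.writerow(row)  # slots missing from a row are written as empty cells
    return buffer.getvalue()


example = {
    "subject_id": "OMOP:434165",
    "subject_label": "Abnormal cervical smear",
    "predicate_id": "skos:relatedMatch",                      # illustrative
    "object_id": "HP:0012888",
    "mapping_justification": "semapv:ManualMappingCuration",  # illustrative
}
print(to_sssom_tsv([example]))
```

Mappings that are complex class expressions (AND/OR/NOT) do not fit a single subject-predicate-object row directly, so how to represent them would still need to be decided.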

Resources

@matentzn: Can you comment on the main difference between codebases (1) and (2)?

https://github.com/mapping-commons/sssom-py - Source code.
https://github.com/mapping-commons/sssom - Source code. And good documentation in the README.md. Good summary of the main 3-6 fields ([subject, predicate, object] x [id, label]).
https://mapping-commons.github.io/sssom/spec/ - Full specification.
https://mapping-commons.github.io/sssom/Mapping/ - Full description of all required and optional fields.

Coding: Parallelize mapping aggregation results

Needed Scripts: Improve the aggregates_mapping_results() in the data_utils.py script

Task: Currently, the function that aggregates and compiles the mapping results generated from running the omop2obo exact match and similarity mapping scripts is slow. An easy first step to improve performance would be to parallelize the aggregation.

Changes Needed By (date/time): Make changes for release V3.0
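
A chunk-and-merge pattern is one way to parallelize an aggregation step. The Python sketch below is not the existing aggregates_mapping_results() implementation, whose signature is not shown here; aggregate_chunk is a hypothetical stand-in that groups (concept_id, ontology_id) pairs. Threads are used by default so the example runs anywhere; for CPU-bound work, passing concurrent.futures.ProcessPoolExecutor as executor_cls is the likelier choice:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, str]  # (OMOP concept id, ontology id); hypothetical row shape


def aggregate_chunk(rows: Sequence[Pair]) -> Dict[int, List[str]]:
    """Group ontology ids by concept id within one chunk."""
    merged: Dict[int, List[str]] = {}
    for concept_id, ontology_id in rows:
        merged.setdefault(concept_id, []).append(ontology_id)
    return merged


def parallel_aggregate(rows: Sequence[Pair], n_chunks: int = 4,
                       executor_cls=ThreadPoolExecutor) -> Dict[int, List[str]]:
    """Split rows into chunks, aggregate each chunk in a worker, merge the results."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    combined: Dict[int, List[str]] = {}
    with executor_cls(max_workers=n_chunks) as pool:
        # map() yields results in chunk order, so the merged lists keep input order.
        for partial in pool.map(aggregate_chunk, chunks):
            for concept_id, ids in partial.items():
                combined.setdefault(concept_id, []).extend(ids)
    return combined
```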

Wiki: Add detail about mappings to wiki

Wiki Page:

https://github.com/callahantiff/BioLater/wiki/Mapping

Suggestions:

Add detailed documentation and preliminary results for each of the mappings that is being performed

TODO - Mapping: Verify condition code annotations

Set-up meeting to get condition code annotations verified.

Work-around for NLTK download issues

test_string_similarity failed for me because the nltk download isn't working. I found a way to get the download to work here:

https://stackoverflow.com/questions/38916452/nltk-download-ssl-certificate-verify-failed

After failing to get the solutions that fix SSL locally, I used the python code that disables SSL for this task.
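
The workaround from that Stack Overflow thread swaps Python's default HTTPS context for an unverified one. A minimal sketch that limits the change to a single download call and restores the original context afterwards; the wrapper name is hypothetical, and the resource name in the usage comment is a placeholder, since the resource required by test_string_similarity is not stated here. Skipping certificate verification removes protection against tampered downloads, so it is best kept this narrow:

```python
import ssl
from typing import Callable


def download_without_ssl_verification(download: Callable[[str], bool], resource: str) -> bool:
    """Run download(resource) with HTTPS certificate verification disabled."""
    original = ssl._create_default_https_context
    ssl._create_default_https_context = ssl._create_unverified_context
    try:
        return download(resource)
    finally:
        # Always restore the verified context, even if the download fails.
        ssl._create_default_https_context = original


# Usage (requires nltk; "<resource>" is a placeholder):
#   import nltk
#   download_without_ssl_verification(nltk.download, "<resource>")
```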

TODO - Coding: Link GitHubGist snippets to repo

Add GitHub Gist posts of the SQL queries for all terms to be mapped, including comments about how ancestors and source codes are returned.

HELP: Error Analysis

@mgkahn - This issue is meant for us to discuss the error analysis that we spoke about today. As a reminder, today I was tasked with figuring out which of the relationship ids we discussed were worth including and how to best categorize them. Details below:

SQL Query

Here is the query that I ended up running:

SELECT
  DISTINCT r.relationship_id,
  c1.concept_id AS SOURCE_CONCEPT_ID,
  c1.concept_name AS SOURCE_CONCEPT_LABEL,
  c2.concept_id AS TARGET_CONCEPT_ID,
  c2.concept_name AS TARGET_CONCEPT_LABEL,
FROM
  sandbox-omop.oct_2020.concept_relationship r
  JOIN sandbox-omop.oct_2020.concept c1 ON c1.concept_id = r.concept_id_1
  JOIN sandbox-omop.oct_2020.concept c2 ON c2.concept_id = r.concept_id_2
WHERE
  r.concept_id_1 IN (SELECT concept_id FROM sandbox-tc.CHCO_DeID_Oct2018.OMOP2OBO_Conditions_Concepts_Merged
                      UNION DISTINCT
                     SELECT ingredient_concept_id FROM sandbox-tc.CHCO_DeID_Oct2018.OMOP2OBO_Medications_Concepts_Merged
                      UNION DISTINCT
                     SELECT concept_id FROM sandbox-tc.CHCO_DeID_Oct2018.OMOP2OBO_Measurements_Concepts_Merged)
  AND r.relationship_id IN ("Concept replaced by", "Maps to", "Concept same_as from", "Concept poss_eq from", "Concept was_a from", "Is a")
  AND (r.valid_start_date > '2018-06-26' AND r.valid_start_date < '2020-10-17')
ORDER BY r.relationship_id;

Relationship IDs

The relationship types that I think we should use are shown grouped by two categories below:

Newly Added Concepts:

Maps to
Concept poss_eq from (synonyms)
Concept same_as from (synonyms)
Concept was_a from (concept type)
Is a (concept type)

Replaced Concepts:

Concept replaced by

Among the Newly Added Concepts, every relationship other than Maps to is meant to help explain the missed concepts.
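
The two categories can be applied to the query results with a simple lookup. A minimal Python sketch, assuming each result row is reduced to (relationship_id, source_concept_id, target_concept_id); the function name and the concept ids in the example are made up for illustration:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

RELATIONSHIP_CATEGORIES = {
    "Maps to": "Newly Added Concepts",
    "Concept poss_eq from": "Newly Added Concepts",
    "Concept same_as from": "Newly Added Concepts",
    "Concept was_a from": "Newly Added Concepts",
    "Is a": "Newly Added Concepts",
    "Concept replaced by": "Replaced Concepts",
}

Row = Tuple[str, int, int]  # (relationship_id, source concept id, target concept id)


def group_by_category(rows: Iterable[Row]) -> Dict[str, List[Row]]:
    """Bucket relationship rows by category; unknown relationship ids are skipped."""
    grouped: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        category = RELATIONSHIP_CATEGORIES.get(row[0])
        if category is not None:
            grouped[category].append(row)
    return dict(grouped)


# Placeholder concept ids, for illustration only.
print(group_by_category([("Maps to", 1, 2), ("Concept replaced by", 3, 4)]))
```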

merge option fails with multiple

Describe the bug
If you run with no arguments (as you very likely would not) you get: "ValueError: 'default' must be a list when 'multiple' is true."

To Reproduce
Steps to reproduce the behavior:

Go to the main directory
type ./main.py at the command line
See error as above

Expected behavior
I'd expect an error relating to the lack of arguments.

Screenshots
n/a

Desktop (please complete the following information):

OS: macos Monterey
bash

Harmonise OMOP mappings with boomer or a boomer-like approach?

Currently, the mappings generated by omop2obo do not respect the semantic constraints of all participating ontologies (which makes some sense because of the significant negative impact on performance).

For example, Malignant melanoma of skin of external auditory canal (disorder) in OMOP is mapped to benign connective and soft tissue neoplasm in MONDO (among more than 1000 others) which is not ideal (unless I made a mistake when reading the omop2obo data), but could be weeded out using approaches from the "ontology merging" community, such as https://github.com/INCATools/boomer.

134294 SubClassOf Nothing

134294 SubClassOf benign connective and soft tissue neoplasm
- benign connective and soft tissue neoplasm SubClassOf musculoskeletal system benign neoplasm
  - musculoskeletal system benign neoplasm SubClassOf benign neoplasm
134294 SubClassOf skin cancer
- skin cancer SubClassOf integumentary system cancer
  - integumentary system cancer SubClassOf cancer
cancer DisjointWith benign neoplasm

Is there any way to guarantee for the OMOP2OBO mappings that:

Applying the mapping does not lead to equivalence cycles involving more than 1 ID from any given ID space in OBO
Merging the mappings with the ontologies does not lead to unsatisfiable classes (like above)
There is a 1:1 mapping table that contains only the "best" mapping between each OMOP id and OBO ontology?

This is a hugely difficult issue.
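
The unsatisfiability in the 134294 example can be detected without a full reasoner by checking whether the mapped classes reach both sides of a disjointness axiom. The Python sketch below is a simplified check over asserted SubClassOf edges only. It is not how boomer works, and the function names are hypothetical:

```python
from typing import Dict, Iterable, List, Set, Tuple


def ancestors(cls: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All classes reachable from cls via SubClassOf edges (cls itself excluded)."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def disjointness_conflicts(mapped_to: Iterable[str], parents: Dict[str, List[str]],
                           disjoint: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Disjoint pairs that a concept falls under once its mappings are merged in."""
    reachable: Set[str] = set()
    for target in mapped_to:
        reachable |= {target} | ancestors(target, parents)
    return [(a, b) for a, b in disjoint if a in reachable and b in reachable]


# Hierarchy and disjointness axiom taken from the explanation above.
parents = {
    "benign connective and soft tissue neoplasm": ["musculoskeletal system benign neoplasm"],
    "musculoskeletal system benign neoplasm": ["benign neoplasm"],
    "skin cancer": ["integumentary system cancer"],
    "integumentary system cancer": ["cancer"],
}
conflicts = disjointness_conflicts(
    ["benign connective and soft tissue neoplasm", "skin cancer"],
    parents,
    [("cancer", "benign neoplasm")],
)
print(conflicts)
```

A non-empty result means the concept would be inferred to be a subclass of two disjoint classes, which is the SubClassOf Nothing situation shown above.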

TODO - Project Organization: Create README for Mapping Directory

Task: Add README docs to each of the clinical domain directories described below.

Description: Create a README to describe the formatting of each mapping file that will be added to this directory

resources/mapping/conditions
resources/mapping/measurements
resources/mapping/medications

SQL Verification (OMOP Concepts): Condition_Occurrence Concepts

PURPOSE: This query is designed to query the OMOP condition_occurrence, concept, concept_synonym, concept_ancestor, and vocabulary tables.

QUERY TYPE: OMOP Concepts
RUNTIME: 20.5 seconds
RESULTS: 29,129 rows (i.e. unique CONCEPT_IDs)

TASK: @mgkahn - Anything you think I should change or anything that needs improvement?

WITH 
  condition_concepts
  AS (SELECT
        c.condition_concept_id AS CONCEPT_ID,
        c1.concept_code AS CONCEPT_SOURCE_CODE, 
        c1.concept_name AS CONCEPT_LABEL,
        c1.vocabulary_id AS CONCEPT_VOCAB,
        v.vocabulary_version AS CONCEPT_VOCAB_VERSION
      FROM 
        CHCO_DeID_Oct2018.condition_occurrence c 
        JOIN CHCO_DeID_Oct2018.concept c1 ON c.condition_concept_id = c1.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c1.vocabulary_id = v.vocabulary_id
      WHERE 
        c1.concept_name != "No matching concept" 
        AND c1.domain_id = "Condition"
      GROUP BY CONCEPT_ID, CONCEPT_SOURCE_CODE, CONCEPT_LABEL, CONCEPT_VOCAB, CONCEPT_VOCAB_VERSION),
  
  condition_ancestors
  AS (SELECT
        ca.descendant_concept_id AS CONCEPT_ID,
        STRING_AGG(DISTINCT(CAST(c1.concept_id as STRING)), " | ") AS ANCESTOR_CONCEPT_ID,
        STRING_AGG(DISTINCT(c1.concept_code), " | ") AS ANCESTOR_SOURCE_CODE, 
        STRING_AGG(DISTINCT(c1.concept_name), " | ") AS ANCESTOR_LABEL,
        STRING_AGG(DISTINCT(c1.vocabulary_id), " | ") AS ANCESTOR_VOCAB,
        STRING_AGG(DISTINCT(v.vocabulary_version), " | ") AS ANCESTOR_VOCAB_VERSION
      FROM 
        CHCO_DeID_Oct2018.concept_ancestor ca
        JOIN CHCO_DeID_Oct2018.concept c1 ON ca.ancestor_concept_id = c1.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c1.vocabulary_id = v.vocabulary_id
      WHERE 
        ca.descendant_concept_id IN (SELECT CONCEPT_ID FROM condition_concepts)
        AND c1.concept_name != "No matching concept"
        AND c1.concept_id IS NOT NULL
        AND c1.domain_id = "Condition"
      GROUP BY CONCEPT_ID),
  
  condition_synonyms
  AS (SELECT 
        s.concept_id AS CONCEPT_ID,
        STRING_AGG(DISTINCT(s.concept_synonym_name), " | ") AS CONCEPT_SYNONYM
        FROM CHCO_DeID_Oct2018.concept_synonym s 
        WHERE s.concept_id in (SELECT CONCEPT_ID FROM condition_concepts)
        GROUP BY CONCEPT_ID)

SELECT
  c.CONCEPT_ID,
  c.CONCEPT_SOURCE_CODE,
  c.CONCEPT_LABEL,
  c.CONCEPT_VOCAB,
  c.CONCEPT_VOCAB_VERSION,
  s.CONCEPT_SYNONYM,
  a.ANCESTOR_CONCEPT_ID,
  a.ANCESTOR_SOURCE_CODE, 
  a.ANCESTOR_LABEL,
  a.ANCESTOR_VOCAB,
  a.ANCESTOR_VOCAB_VERSION
  
FROM condition_concepts c
  FULL JOIN condition_ancestors a ON c.CONCEPT_ID = a.CONCEPT_ID
  FULL JOIN condition_synonyms s ON c.CONCEPT_ID = s.CONCEPT_ID;
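
Downstream code has to split the " | "-delimited columns produced by STRING_AGG. A minimal Python sketch, assuming each result row is available as a dictionary keyed by the output column names; the helper names are hypothetical. Because every column is aggregated with DISTINCT independently, positions in one list (for example ANCESTOR_CONCEPT_ID) need not line up with positions in another (for example ANCESTOR_LABEL):

```python
from typing import Dict, List, Optional

LIST_COLUMNS = [
    "CONCEPT_SYNONYM", "ANCESTOR_CONCEPT_ID", "ANCESTOR_SOURCE_CODE",
    "ANCESTOR_LABEL", "ANCESTOR_VOCAB", "ANCESTOR_VOCAB_VERSION",
]


def split_aggregated(value: Optional[str], sep: str = " | ") -> List[str]:
    """Split a STRING_AGG value; NULL (None) and empty strings give an empty list."""
    if not value:
        return []
    return [part for part in value.split(sep) if part]


def parse_row(row: Dict[str, Optional[str]]) -> Dict[str, object]:
    """Copy a result row, turning the aggregated columns into Python lists."""
    parsed: Dict[str, object] = dict(row)
    for column in LIST_COLUMNS:
        parsed[column] = split_aggregated(row.get(column))
    return parsed
```

The FULL JOINs can leave the synonym or ancestor columns NULL, which is why None is handled explicitly.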

Wiki: Re-organize wiki to support releases

Task:

Overhaul (minor) existing organization of mapping content to enable better documentation and tracking of releases.

How:

Add wiki tab for releases and add new entries for v1.01 and v2.0
Take current content from the Conditions, Medications, and Measurements sub-pages and split it by release into its appropriate pages

New for v2.0:

Describe new ontologies added
Incorporation of existing Loinc2HPO mappings
Alignment and subsumption of Juan's OHDSIananke mapping method
Better leveraging of OMOP concept hierarchies

TODO - Mapping: Extend/incorporate V1.0 mapping feedback into v2.0 release

Mapping Type: What type of data?

Task: Extend and integrate the mapping feedback that domain experts provided on the V1.0 release into V2.0 (publication version)

Measurements
Medications
Conditions

Coding: Leverage OMOP CDM Ancestor Hierarchy Levels

Task: Currently, we are utilizing the entire ancestor hierarchy for a given concept when searching for a match. This means that when matching at the ancestor level we will include potential matches for all levels, which can be very vague (see example below).

Concept: Leukemic infiltration of skin
Map:

AND(abnormality of the skin,
    neoplasm, sarcoma,
    neoplasm of the skin,
    soft tissue sarcoma,
    leukemia)

While this map is correct, it includes very broad concepts like neoplasm and sarcoma. Including hierarchical level information could provide a method for further filtering the results to get the most precise match possible. Applying that logic could convert the map above to:

Map:

AND(abnormality of the skin,  
    leukemia)
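
One possible form of the level-based filtering is sketched below in Python. It assumes each candidate match carries the level of the ancestor it came from (for instance min_levels_of_separation from the OMOP concept_ancestor table) and keeps only matches from the closest level. The function name is hypothetical and the levels in the example are invented for illustration, not taken from the vocabulary:

```python
from typing import List, Tuple


def closest_level_matches(candidates: List[Tuple[str, int]]) -> List[str]:
    """Keep only matches found at the smallest ancestor level."""
    if not candidates:
        return []
    best = min(level for _, level in candidates)
    return [label for label, level in candidates if level == best]


# (ontology label, ancestor level); levels are made up for this example.
candidates = [
    ("abnormality of the skin", 1),
    ("neoplasm", 4),
    ("sarcoma", 3),
    ("neoplasm of the skin", 2),
    ("soft tissue sarcoma", 2),
    ("leukemia", 1),
]
print(closest_level_matches(candidates))
```

Whether a strict minimum or a small level window works better would have to be checked against real mappings.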

When: Held until the next release (v2.0.0)

TODO - Draft the OMOP2OBO Mapping Manuscript

TODO: Draft the manuscript describing the OMOP2OBO mapping process
Journal: Natural Digital Medicine
Target Submission Date: 12/31/2019

Tasks to Complete Manuscript:

Stand up tool to facilitate mapping process
- Extend existing OMOP tool or create something new?
Complete small validation using MIMIC III OMOP data
- Consider adding verification to larger subset of UCHealth codes

@mgkahn - will use this issue to discuss our approach, list TODOs, and link to a draft of the manuscript.

Project Meeting -- 05/28/2020 @ 14:30

Meeting Date: May 28, 2020
Topic: Plan for incorporating MONDO into existing mappings
Attendees: @nicolevasilevsky

Proposed Agenda:

Discuss how to best add MONDO mappings
Understand whether to use MONDO as a replacement for DOID or include both ontologies
Briefly chat about potential challenges to replacing DOID with MONDO in PheKnowLator KR

SQL Verification (OMOP Concepts): Measurement Concepts

PURPOSE: This query is designed to query the OMOP measurement, concept, concept_synonym, concept_ancestor, and vocabulary tables.

QUERY TYPE: OMOP Concepts
RUNTIME: 8.4 seconds
RESULTS: 1,606 rows (i.e. unique CONCEPT_IDs)

TASK: @mgkahn - Anything you think I should change or anything that needs improvement?

WITH 
  measurement_concepts
  AS (SELECT
        m.measurement_concept_id AS CONCEPT_ID,
        c.concept_code AS CONCEPT_SOURCE_CODE, 
        c.concept_name AS CONCEPT_LABEL,
        c.vocabulary_id AS CONCEPT_VOCAB,
        v.vocabulary_version AS CONCEPT_VOCAB_VERSION
      FROM 
        CHCO_DeID_Oct2018.measurement m 
        JOIN CHCO_DeID_Oct2018.concept c ON m.measurement_concept_id = c.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c.vocabulary_id = v.vocabulary_id
      WHERE 
        c.concept_name != "No matching concept" 
        AND c.domain_id = "Measurement"
      GROUP BY CONCEPT_ID, CONCEPT_SOURCE_CODE, CONCEPT_LABEL, CONCEPT_VOCAB, CONCEPT_VOCAB_VERSION),
  
  measurement_ancestors
  AS (SELECT
        ca.descendant_concept_id AS CONCEPT_ID,
        STRING_AGG(DISTINCT(CAST(c1.concept_id as STRING)), " | ") AS ANCESTOR_CONCEPT_ID,
        STRING_AGG(DISTINCT(c1.concept_code), " | ") AS ANCESTOR_SOURCE_CODE, 
        STRING_AGG(DISTINCT(c1.concept_name), " | ") AS ANCESTOR_LABEL,
        STRING_AGG(DISTINCT(c1.vocabulary_id), " | ") AS ANCESTOR_VOCAB,
        STRING_AGG(DISTINCT(v.vocabulary_version), " | ") AS ANCESTOR_VOCAB_VERSION
      FROM 
        CHCO_DeID_Oct2018.concept_ancestor ca
        JOIN CHCO_DeID_Oct2018.concept c1 ON ca.ancestor_concept_id = c1.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c1.vocabulary_id = v.vocabulary_id
      WHERE 
        ca.descendant_concept_id IN (SELECT CONCEPT_ID FROM measurement_concepts)
        AND c1.concept_name != "No matching concept"
        AND c1.concept_id IS NOT NULL
        AND c1.domain_id = "Measurement"
      GROUP BY CONCEPT_ID),
  
  measurement_results
  AS (SELECT 
        measurement_concept_id AS CONCEPT_ID,
        CASE WHEN REGEXP_CONTAINS(STRING_AGG(range_low_source_value, ""), r'(?i)(positive|negative)') IS TRUE THEN "Negative/Positive" 
             WHEN REGEXP_CONTAINS(STRING_AGG(range_high_source_value, ""), r'(?i)(positive|negative)') IS TRUE THEN "Negative/Positive"         
             WHEN REGEXP_CONTAINS(STRING_AGG(range_low_source_value, ""), r'[[:digit:]]') IS TRUE THEN "Normal/Low/High"
             WHEN REGEXP_CONTAINS(STRING_AGG(range_high_source_value, ""), r'[[:digit:]]') IS TRUE THEN "Normal/Low/High"
             ELSE NULL END AS RESULT_TYPE
      FROM CHCO_DeID_Oct2018.measurement
      WHERE measurement_concept_id in (SELECT CONCEPT_ID FROM measurement_concepts)
      GROUP BY CONCEPT_ID),
  
  measurement_scale
  AS (SELECT 
        s.concept_id AS CONCEPT_ID,
        STRING_AGG(DISTINCT(s.concept_synonym_name), " | ") AS CONCEPT_SYNONYM,
        STRING_AGG(s.concept_synonym_name, ""),
        CASE WHEN REGEXP_CONTAINS(STRING_AGG(s.concept_synonym_name, ""), r'(?i)ordinal') IS TRUE THEN "ORD"
             WHEN REGEXP_CONTAINS(STRING_AGG(s.concept_synonym_name, ""), r'(?i)nominal') IS TRUE THEN "NOM"
             WHEN REGEXP_CONTAINS(STRING_AGG(s.concept_synonym_name, ""), r'(?i)quantitative') IS TRUE THEN "QUANT"
             WHEN REGEXP_CONTAINS(STRING_AGG(s.concept_synonym_name, ""), r'(?i)qualitative') IS TRUE THEN "QUAL"
             WHEN REGEXP_CONTAINS(STRING_AGG(s.concept_synonym_name, ""), r'(?i)narrative') IS TRUE THEN "NAR"
             WHEN REGEXP_CONTAINS(STRING_AGG(s.concept_synonym_name, ""), r'(?i)doc') IS TRUE THEN "DOC"
             WHEN REGEXP_CONTAINS(STRING_AGG(s.concept_synonym_name, ""), r'(?i)(panel|pnl|panl)') IS TRUE THEN "PNL"
             ELSE "Unmapped Scale Type" END AS SCALE
        FROM CHCO_DeID_Oct2018.concept_synonym s 
        WHERE s.concept_id in (SELECT CONCEPT_ID FROM measurement_concepts)
        GROUP BY CONCEPT_ID),
  
  measurement_metadata_update
  AS (SELECT
        r.CONCEPT_ID,
        CASE WHEN (r.RESULT_TYPE IS NULL AND s.SCALE = "ORD") AND REGEXP_CONTAINS(s.CONCEPT_SYNONYM, r'(?i)screen') IS TRUE THEN "Negative/Positive"
             WHEN (r.RESULT_TYPE IS NULL AND s.SCALE = "ORD") AND REGEXP_CONTAINS(s.CONCEPT_SYNONYM, r'(?i)presence') IS TRUE THEN "Negative/Positive"
             WHEN r.RESULT_TYPE IS NULL AND s.SCALE = "QUANT" THEN "Normal/Low/High"
             WHEN r.RESULT_TYPE IS NOT NULL THEN r.RESULT_TYPE
             ELSE "Unknown Result Type" END AS RESULT_TYPE,
        CASE WHEN s.SCALE IS NULL THEN "Other"  # for non-LOINC scale types
             ELSE s.SCALE END AS SCALE
        FROM
          (SELECT * FROM measurement_results) r
          FULL JOIN (SELECT * FROM measurement_scale) s ON r.CONCEPT_ID = s.CONCEPT_ID)

SELECT
  m.CONCEPT_ID,
  m.CONCEPT_SOURCE_CODE,
  m.CONCEPT_LABEL,
  m.CONCEPT_VOCAB,
  m.CONCEPT_VOCAB_VERSION,
  s.CONCEPT_SYNONYM,
  a.ANCESTOR_CONCEPT_ID,
  a.ANCESTOR_SOURCE_CODE, 
  a.ANCESTOR_LABEL,
  a.ANCESTOR_VOCAB,
  a.ANCESTOR_VOCAB_VERSION,
  u.SCALE,
  u.RESULT_TYPE
  
FROM measurement_concepts m
  FULL JOIN measurement_ancestors a ON m.CONCEPT_ID = a.CONCEPT_ID
  FULL JOIN measurement_scale s ON m.CONCEPT_ID = s.CONCEPT_ID
  FULL JOIN measurement_metadata_update u ON m.CONCEPT_ID = u.CONCEPT_ID;
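
The scale-type CASE expression in measurement_scale can be mirrored outside the database, which is handy for unit-testing the rules. A minimal Python sketch that follows the same pattern order as the SQL and, like STRING_AGG(..., ""), concatenates the synonyms without a separator before matching; the function name is hypothetical and the synonym strings in the example are made up:

```python
import re
from typing import Iterable

# Checked in order; the first matching pattern wins, as in the SQL CASE expression.
SCALE_PATTERNS = [
    ("ORD", r"ordinal"),
    ("NOM", r"nominal"),
    ("QUANT", r"quantitative"),
    ("QUAL", r"qualitative"),
    ("NAR", r"narrative"),
    ("DOC", r"doc"),
    ("PNL", r"panel|pnl|panl"),
]


def scale_type(synonyms: Iterable[str]) -> str:
    """Classify a measurement concept's scale from its synonym strings."""
    text = "".join(synonyms)
    for code, pattern in SCALE_PATTERNS:
        if re.search(pattern, text, flags=re.IGNORECASE):
            return code
    return "Unmapped Scale Type"


print(scale_type(["Example analyte in Blood; Quantitative"]))
```

Because matching is first-hit, a concept whose synonyms mention both an ordinal and a quantitative scale is labelled ORD.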
