hubmapconsortium / asctb-ct-label-mapper

asctb-ct-label-mapper: A package to recommend controlled vocabulary for annotations of scRNA-seq datasets, and thereby enable cross-dataset or cross-experiment comparison of annotations.

License: MIT License

Python 99.81% Shell 0.19%

bert-model cosine-similarity data-engineering embeddings-similarity natural-language-processing python single-cell-rna-seq web-scraping human-reference-atlas

asctb-ct-label-mapper's Introduction

ASCT+B Cell-Type Label Mapper

asctb_ct_label_mapper is a package to ensure controlled vocabulary for annotations of scRNA-seq datasets. The goal is to enable cross-dataset or cross-experiment comparison of data by aligning annotations to a standard reference point.

Given an annotated scRNA-seq dataset for a specific organ (.h5ad/.rds), you can create a translation file that maps raw labels to the ASCT+B naming convention.

General flow:

1. Create the reference embeddings by fetching the corresponding ASCT+B organ table (latest version):
   - Fetch the ASCT+B dataset from the ASCT+B Master Tables.
   - Parse the data into three wrangled columns: CT-ID, CT-Name, and CT-Label.
   - Fetch the description of each unique CT-ID from the Cell Ontology.
   - Apply NLP preprocessing best practices to the text fields.
   - Use a Sentence-Transformer model hosted on Hugging Face to create embeddings of shape c x 768, where c is the number of unique cell types (CTs) in the ASCT+B Master Table.
2. For each input raw cell-type annotation/cluster label, create its embedding and compare it against the reference embeddings generated in step 1.
3. Identify the best-matching ASCT+B label for the input raw label.
4. You can also visualize the agreement of cross-dataset annotations before and after using ASCTB CT Label Mapper.
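
The parsing step of the general flow can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: it assumes each fetched row carries a `cell_types` list whose entries have `id`, `name`, and `rdfs_label` fields, and those field names, the function name `wrangle_cell_types`, and the placeholder IDs are all hypothetical.

```python
from typing import Iterable


def wrangle_cell_types(rows: Iterable[dict]) -> list[dict]:
    """Flatten rows into unique (CT-ID, CT-Name, CT-Label) records.

    Assumed (hypothetical) row shape:
        {"cell_types": [{"id": ..., "name": ..., "rdfs_label": ...}, ...]}
    """
    seen_ids = set()
    records = []
    for row in rows:
        for ct in row.get("cell_types", []):
            ct_id = (ct.get("id") or "").strip()
            # Skip entries without an ID and IDs that were already collected.
            if not ct_id or ct_id in seen_ids:
                continue
            seen_ids.add(ct_id)
            records.append(
                {
                    "CT-ID": ct_id,
                    "CT-Name": (ct.get("name") or "").strip(),
                    "CT-Label": (ct.get("rdfs_label") or "").strip(),
                }
            )
    return records


# Placeholder rows; the IDs below are illustrative, not real Cell Ontology IDs.
sample_rows = [
    {"cell_types": [{"id": "CL:EXAMPLE1", "name": "Podocyte ", "rdfs_label": "podocyte"}]},
    {"cell_types": [
        {"id": "CL:EXAMPLE1", "name": "podocyte", "rdfs_label": "podocyte"},
        {"id": "CL:EXAMPLE2", "name": "fibroblast", "rdfs_label": "fibroblast"},
        {"id": "", "name": "unlabelled", "rdfs_label": ""},
    ]},
]
records = wrangle_cell_types(sample_rows)
```

Keeping only the first occurrence of each CT-ID yields one row per unique cell type, which is the `c` that determines the number of reference embeddings.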

A walkthrough is available on Google Colab here.

Architecture:

Step 1: Create Reference Embeddings

Step 2: Map input Cell-Type labels to these Reference Embeddings

Output: Top-2 matches from ASCT+B as suggestions for each query cell-type annotation label.

An expert then provides feedback to finalize the translation from the query annotation label to the ASCT+B annotation label.

Cosine Similarity
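
As a rough illustration of how cosine similarity can rank reference labels for a query, the sketch below scores a query vector against a small dictionary of reference vectors and keeps the two best matches. The three-dimensional vectors are toy stand-ins for real sentence embeddings, and the helper names `cosine_similarity` and `top_k_matches` are hypothetical rather than part of the package.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Treat an all-zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_matches(query_vec, reference, k=2):
    """Return the k (label, score) pairs with the highest similarity."""
    scored = [
        (label, cosine_similarity(query_vec, vec))
        for label, vec in reference.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy reference embeddings (real ones would have 768 dimensions).
reference = {
    "podocyte": [1.0, 0.0, 0.0],
    "endothelial cell": [0.0, 1.0, 0.0],
    "fibroblast": [0.0, 0.0, 1.0],
}
query = [0.9, 0.4, 0.1]
matches = top_k_matches(query, reference, k=2)
```

Returning two candidates rather than one leaves room for an expert to pick the correct label when the top score is ambiguous.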

asctb-ct-label-mapper's People

vikrantdeshpande09876

asctb-ct-label-mapper's Issues

Update documentation and set up docs-pages

Add general documentation for package.

Update code-docstring for execute_nlp_pipeline() and other functions in the NLP script.

"""Returns the cleaned version of the annotation label after performing the following steps:

```python
remove_whitespaces()
expand_word_contractions()
replace_special_chars()
convert_number_to_word()
make_lowercase()
get_root_word()
```

Args:
    input_label (str): Input annotation label text.

Returns:
    str: Cleaned version of the annotation label text.
"""
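
To make the docstring's step order concrete, here is a minimal standard-library sketch of such a cleaning pipeline. The step names come from the docstring above, but every function body is a simplified stand-in written for illustration: the contraction table, the digit-by-digit number conversion, and the naive plural stripping in `get_root_word` are assumptions and will differ from whatever the package actually does.

```python
import re

# Small illustrative lookup tables (assumptions, not the package's data).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
NUMBER_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def remove_whitespaces(text):
    # Trim the ends and collapse internal runs of whitespace.
    return " ".join(text.split())


def expand_word_contractions(text):
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text


def replace_special_chars(text):
    # Replace anything that is not an ASCII letter, digit, or space.
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)


def convert_number_to_word(text):
    # Spell out each digit separately, e.g. "8" becomes " eight ".
    return re.sub(r"[0-9]", lambda m: f" {NUMBER_WORDS[m.group()]} ", text)


def make_lowercase(text):
    return text.lower()


def get_root_word(text):
    # Naive stand-in for lemmatization: drop a trailing plural "s".
    words = []
    for word in text.split():
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


def execute_nlp_pipeline(input_label):
    """Apply the cleaning steps in the order listed in the docstring."""
    text = input_label
    for step in (
        remove_whitespaces,
        expand_word_contractions,
        replace_special_chars,
        convert_number_to_word,
        make_lowercase,
        get_root_word,
    ):
        text = step(text)
    return " ".join(text.split())


cleaned = execute_nlp_pipeline("  CD8+ T-cells ")
```

A real implementation would likely use a proper lemmatizer for the last step, since suffix stripping mangles words such as "thymus".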

Include Google-sheet "gid" for ASCT+B API call

Improve get_asctb_data_url() to also pull out gid from the Sheet-Config data on line 59, to make code more modular.
Update fetch_ct_info_from_asctb_google_sheet() line 88 to also include '&gid=0129321849sdkj00329'.
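
One possible way to add the `gid` without hand-concatenating strings is to rebuild the query string with `urllib.parse`. The helper name `with_gid` and the example URL are hypothetical; the real endpoint and parameter layout used by `fetch_ct_info_from_asctb_google_sheet()` are not shown in this issue.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_gid(url: str, gid: str) -> str:
    """Return `url` with its `gid` query parameter set to `gid`.

    Existing query parameters are preserved; an existing gid is replaced.
    """
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query["gid"] = gid
    return urlunparse(parts._replace(query=urlencode(query)))


# Placeholder URL for illustration only.
base_url = "https://example.org/asctb?format=csv"
url_with_gid = with_gid(base_url, "0")
```

Setting the parameter through a dictionary avoids duplicate `gid` entries if the base URL already contains one.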

Enhancing and operationalizing crosswalks for multiple reference datasets

Work completed so far:

Azimuth Kidney --> ASCTB Kidney v1.2

Translations verified by Sanjay Jain and Ellen Q.

Azimuth Lung HLCAv2 --> ASCTB Lung v1.2

Translations verified by Gloria Pryhuber

Azimuth Heart --> ASCTB Heart v1.2

Translations verified by Marc Halushka

Next-steps:

a. Confirm with Katy and Ellen which crosswalks to focus on. The brief discussion covered Azimuth's other reference organs, CellTypist organs, and PopV/Tabula Sapiens organs.
b. Confirm whether all organ datasets from CellTypist and PopV/Tabula Sapiens need to be mapped to ASCTB using this package.
c. Souradeep to operationalize this package into a data pipeline with potential for CI/CD.
d. Future feature request: add logic that also considers gene-expression profiles (biomarkers from the query dataset) mapped to ASCTB canonical markers, to make the cross-dataset translation mapping more reliable.
