hubmapconsortium / asctb-ct-label-mapper

asctb-ct-label-mapper: A package to recommend controlled vocabulary for annotations of scRNA-seq datasets, and thereby enable cross-dataset or cross-experiment comparison of annotations.

License: MIT License

Python 99.81% Shell 0.19%
bert-model cosine-similarity data-engineering embeddings-similarity natural-language-processing python single-cell-rna-seq web-scraping human-reference-atlas

asctb-ct-label-mapper's Introduction

ASCT+B Cell-Type Label Mapper

asctb_ct_label_mapper is a package to ensure controlled vocabulary for annotations of scRNA-seq datasets. The goal is to enable cross-dataset or cross-experiment comparison of data by aligning annotations to a standard reference point.

Given an annotated scRNA-seq dataset (.h5ad/.rds) for a specific organ, you can create a translation file that maps raw labels to the ASCT+B naming convention.


General flow:

  1. Create the reference embeddings by fetching the latest version of the corresponding ASCT+B organ table:
  • Fetch the ASCT+B dataset from the ASCT+B Master Tables.
  • Parse the data into three wrangled columns: CT-ID, CT-Name, CT-Label.
  • Fetch the description of each unique CT-ID from Cell Ontology.
  • Apply NLP-preprocessing best practices to the text fields.
  • Use a Sentence-Transformer model hosted on Hugging Face to create embeddings of shape c × 768 (c is the number of unique CTs in the ASCT+B Master Table).

  2. For each input raw Cell-Type annotation/cluster label, create the embedding and compare it against the embeddings generated in step 1.

  3. Identify the best-matching ASCT+B label for the input raw label.

  4. You can also visualize the agreement of cross-dataset annotations before and after using ASCTB CT Label Mapper.
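
The embed-and-compare flow described in the steps can be sketched as follows. This is a minimal illustration, not the package's implementation: `toy_embed` is a hypothetical stand-in that hashes character trigrams into a 768-wide vector so the example runs without downloading a Sentence-Transformer model, and `build_reference` and `best_match` are illustrative names that do not come from the package.

```python
import zlib

import numpy as np

DIM = 768  # same width as the reference embeddings described in the flow


def toy_embed(text, dim=DIM):
    """Hypothetical stand-in for a Sentence-Transformer: hashed character
    trigrams, L2-normalised to unit length."""
    vec = np.zeros(dim)
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3].encode("utf-8")
        vec[zlib.crc32(trigram) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


def build_reference(ct_labels):
    """Stack one embedding per unique reference label into a (c, dim) matrix."""
    unique = sorted(set(ct_labels))
    return unique, np.vstack([toy_embed(label) for label in unique])


def best_match(raw_label, ref_labels, ref_matrix):
    """Return the reference label with the highest cosine similarity."""
    # Rows and the query are unit vectors, so the dot product is the cosine.
    sims = ref_matrix @ toy_embed(raw_label)
    idx = int(np.argmax(sims))
    return ref_labels[idx], float(sims[idx])


# Toy reference labels; a real run would use the CT-Label column of an ASCT+B table.
ref_labels, ref_matrix = build_reference(
    ["podocyte", "endothelial cell", "fibroblast", "podocyte"]
)
print(ref_matrix.shape)  # (3, 768): c unique labels by 768 dimensions
print(best_match("Podocytes", ref_labels, ref_matrix))
```

Swapping `toy_embed` for a real sentence-embedding model would leave the rest of the sketch unchanged, provided the returned vectors are normalised before the dot product.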


A walkthrough is available on Google Colab.


Architecture:


Step 1: Create Reference Embeddings


Step 2: Map input Cell-Type labels to these Reference Embeddings


Output: The top-2 matches from ASCT+B are returned as suggestions for each query Cell-Type annotation label.

An expert then provides feedback to finalize the translation from each query annotation label to an ASCT+B annotation label.
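
Selecting the top-2 suggestions from a matrix of similarity scores might look like the sketch below. The function name `top_k_suggestions`, the labels, and the scores are all hypothetical and chosen only for illustration; the package's actual output format is not shown here.

```python
import numpy as np


def top_k_suggestions(similarities, query_labels, ref_labels, k=2):
    """For each query row, return the k reference labels with the highest score."""
    sims = np.asarray(similarities, dtype=float)
    # argsort is ascending: keep the last k columns, then reverse to best-first.
    top_idx = np.argsort(sims, axis=1)[:, -k:][:, ::-1]
    return {
        query: [(ref_labels[j], float(sims[i, j])) for j in top_idx[i]]
        for i, query in enumerate(query_labels)
    }


# Hypothetical scores: rows are query labels, columns are reference labels.
ref_labels = ["podocyte", "endothelial cell", "fibroblast"]
query_labels = ["Podo", "Endo"]
similarities = [
    [0.91, 0.20, 0.35],
    [0.15, 0.88, 0.40],
]

suggestions = top_k_suggestions(similarities, query_labels, ref_labels)
for query, matches in suggestions.items():
    print(query, matches)
```

Returning two candidates rather than one leaves the final choice to the reviewing expert, which matches the feedback step described above.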


Cosine Similarity
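
For reference, the cosine similarity between a query embedding q and a reference embedding r is conventionally defined as below. The symbols q, r, and d are notation introduced here; d = 768 for the embeddings described in the general flow.

```latex
\[
\operatorname{cos\_sim}(\mathbf{q}, \mathbf{r})
  = \frac{\mathbf{q} \cdot \mathbf{r}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{r} \rVert}
  = \frac{\sum_{i=1}^{d} q_i r_i}
         {\sqrt{\sum_{i=1}^{d} q_i^{2}} \; \sqrt{\sum_{i=1}^{d} r_i^{2}}}
\]
```

The value lies in [-1, 1], and a higher value indicates a closer match between the two labels.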


asctb-ct-label-mapper's People

Contributors

vikrantdeshpande09876

asctb-ct-label-mapper's Issues

Update documentation and set up docs-pages

Add general documentation for the package.

Update the docstring for execute_nlp_pipeline() and the other functions in the NLP script.

"""Returns the cleaned version of the annotation label after performing the following steps:

```python
remove_whitespaces()
expand_word_contractions()
replace_special_chars()
convert_number_to_word()
make_lowercase()
get_root_word()
```

Args:
    input_label (str): Input annotation label text.

Returns:
    str: Cleaned version of the annotation label text.
"""
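
To make the documented step order concrete, here is a rough sketch of how such a pipeline could be chained. The function names come from the docstring above, but every function body, both lookup tables, and the naive plural-stripping rule are assumptions made for illustration; the package's real implementations are not shown in this issue.

```python
import re

# Hypothetical lookup tables, for illustration only.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "n't": " not"}
NUMBER_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def remove_whitespaces(text):
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def expand_word_contractions(text):
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text


def replace_special_chars(text):
    # Anything that is not an ASCII letter, digit, or space becomes a space.
    return re.sub(r"[^A-Za-z0-9 ]+", " ", text)


def convert_number_to_word(text):
    # Spell out each digit separately, e.g. "4" becomes "four".
    return re.sub(r"[0-9]", lambda m: f" {NUMBER_WORDS[m.group()]} ", text)


def make_lowercase(text):
    return text.lower()


def get_root_word(text):
    # Naive assumption: strip a trailing plural "s" from longer words.
    words = []
    for word in text.split():
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


def execute_nlp_pipeline(input_label):
    """Apply the cleaning steps in the order listed in the docstring."""
    steps = (
        remove_whitespaces, expand_word_contractions, replace_special_chars,
        convert_number_to_word, make_lowercase, get_root_word,
    )
    text = input_label
    for step in steps:
        text = step(text)
    return text


print(execute_nlp_pipeline("  CD4+  T-cells "))  # cd four t cell
```

Keeping each step as its own small function makes it straightforward to give every step a separate docstring, which is what this issue asks for.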

Include Google-sheet "gid" for ASCT+B API call

Improve get_asctb_data_url() to also pull out gid from the Sheet-Config data on line 59, to make code more modular.
Update fetch_ct_info_from_asctb_google_sheet() line 88 to also include '&gid=0129321849sdkj00329'.
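
One way to keep the gid out of hard-coded strings is to build the URL from the sheet configuration, as in the sketch below. It assumes the standard Google Sheets CSV-export URL pattern; whether the package calls this endpoint or a separate ASCT+B API is not stated in the issue. The helper `build_sheet_csv_url`, the `sheet_config` layout, and the placeholder values are hypothetical.

```python
from urllib.parse import urlencode


def build_sheet_csv_url(sheet_id, gid):
    """Build a Google Sheets CSV-export URL for one tab (gid) of a spreadsheet."""
    query = urlencode({"format": "csv", "gid": gid})
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?{query}"


# Hypothetical sheet-config entry; the real config structure is not shown here.
sheet_config = {"name": "kidney", "sheetId": "EXAMPLE_SHEET_ID", "gid": "0"}

url = build_sheet_csv_url(sheet_config["sheetId"], sheet_config["gid"])
print(url)
# https://docs.google.com/spreadsheets/d/EXAMPLE_SHEET_ID/export?format=csv&gid=0
```

Using `urlencode` rather than string concatenation means the gid is escaped correctly and the query string stays valid if more parameters are added later.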

Enhancing and operationalizing crosswalks for multiple reference datasets

Work completed so far:

  1. Azimuth Kidney --> ASCTB Kidney v1.2

     Translations verified by Sanjay Jain and Ellen Q.

  2. Azimuth Lung HLCAv2 --> ASCTB Lung v1.2

     Translations verified by Gloria Pryhuber.

  3. Azimuth Heart --> ASCTB Heart v1.2

     Translations verified by Marc Halushka.

Next steps:

a. Confirm with Katy and Ellen which crosswalks to focus on. The brief discussion covered Azimuth's other reference organs, CellTypist organs, and PopV/Tabula Sapiens organs.
b. Confirm whether all organ datasets from CellTypist and PopV/Tabula Sapiens need to be mapped to ASCTB using this package.
c. Souradeep to operationalize this package into a data pipeline with potential for CI/CD.
d. Future feature request: add logic to also consider gene-expression profiles (biomarkers from the query dataset) mapped to ASCTB canonical markers, in order to make the cross-dataset translation mapping more reliable.
