gene-harmony-analysis

Background

Human gene symbols are regulated and follow guidelines established by the HUGO Gene Nomenclature Committee (HGNC). Each gene is assigned an authoritative symbol (also known as a primary gene symbol), a descriptive name, and an HGNC identification number. Although primary gene symbols are monitored to be unique, alias symbols are not.

Aliases are additional gene symbols and short descriptions that are used synonymously for the gene and/or any associated gene products. Gene symbols are curated from use in databases, experimental results, and literature. Primary gene symbols and aliases play a crucial role in referencing genes across publications, medical records, and data collections.

Our preliminary research uncovered collisions, or instances where a single gene symbol was used for multiple different genes. We have categorized them into two kinds:

a) Alias-primary collisions are gene symbols that are used both as a primary gene symbol and as an alias. For example, KRAS is a primary gene symbol that is also used as an alias.

b) Alias-alias collisions are gene symbols that are used as an alias for multiple genes. For example, VH is an alias for 35 genes in the NCBI database.
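The two collision kinds above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the project's notebook: the `gene_aliases` mapping, its placeholder symbols, and the `find_collisions` helper are all hypothetical, and a real analysis would load symbols from HGNC, Ensembl, or NCBI exports.

```python
from collections import defaultdict

# Hypothetical toy data: primary symbol -> list of aliases.
# The symbols are placeholders, not real database entries.
gene_aliases = {
    "GENE1": ["G1", "SHARED"],
    "GENE2": ["GENE1", "SHARED"],
    "GENE3": ["G3"],
}


def find_collisions(gene_aliases):
    """Return (alias_primary, alias_alias) collision dictionaries.

    alias_primary maps a primary symbol to the other genes that use it as an alias.
    alias_alias maps an alias to every gene that lists it, when more than one does.
    """
    primaries = set(gene_aliases)

    # Invert the mapping: alias -> set of genes that list it.
    alias_to_genes = defaultdict(set)
    for primary, aliases in gene_aliases.items():
        for alias in aliases:
            alias_to_genes[alias].add(primary)

    alias_primary = {}
    alias_alias = {}
    for alias, genes in alias_to_genes.items():
        # Alias-primary: the alias is also some other gene's primary symbol.
        others = genes - {alias}
        if alias in primaries and others:
            alias_primary[alias] = sorted(others)
        # Alias-alias: the same alias is listed by more than one gene.
        if len(genes) > 1:
            alias_alias[alias] = sorted(genes)
    return alias_primary, alias_alias


alias_primary, alias_alias = find_collisions(gene_aliases)
print(alias_primary)
print(alias_alias)
```

In this toy data, `GENE1` is a primary symbol that `GENE2` also lists as an alias, and `SHARED` is an alias listed by two genes. Inverting the mapping once keeps both checks to a single pass over the aliases.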

Out of the 43,164 genes in the HGNC database, 483 (1.12%) had alias-primary collisions and 2,084 (4.83%) had alias-alias collisions. In the Ensembl database, which has 40,353 genes, 218 genes (0.54%) had alias-primary collisions and 3,680 (9.12%) had alias-alias collisions.

The NCBI database, which had the largest number of genes (75,346), had 1,712 genes (2.27%) with alias-primary collisions and 5,670 (7.53%) with alias-alias collisions. These figures illustrate how prevalent ambiguity is and how it challenges the aggregation of genomic knowledge.
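The percentages above follow directly from the reported counts. The short sketch below reproduces that arithmetic; the `counts` dictionary simply restates the numbers from the text, and the `collision_rate` helper is a hypothetical name introduced for illustration.

```python
# Gene and collision counts as reported in the text above.
counts = {
    "HGNC": {"genes": 43164, "alias_primary": 483, "alias_alias": 2084},
    "Ensembl": {"genes": 40353, "alias_primary": 218, "alias_alias": 3680},
    "NCBI": {"genes": 75346, "alias_primary": 1712, "alias_alias": 5670},
}


def collision_rate(colliding, total):
    """Percentage of genes with a collision, rounded to two decimals."""
    return round(100 * colliding / total, 2)


rates = {
    source: {
        kind: collision_rate(c[kind], c["genes"])
        for kind in ("alias_primary", "alias_alias")
    }
    for source, c in counts.items()
}

for source, r in rates.items():
    print(f"{source}: alias-primary {r['alias_primary']}%, alias-alias {r['alias_alias']}%")
```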

The total_alias_overlap Jupyter notebook contains the analysis that produced these values. (The notebook will be condensed and made more efficient.)

[Figure: collision_graphic]

Purpose

The difficulty of resolving ambiguity and ensuring that gene symbols are understood accurately slows clinical decision-making and contributes to confusion when aggregating gene knowledge. The gene nomenclature system would be most effective if it were unambiguous and supported by a tool that takes existing knowledgebase entries as input and resolves the symbols they use.

This curated collection of collision data will be a foundation for disambiguating gene symbols.

How can you help?

Contributing information on collisions that you come across helps in two ways: it identifies the collisions that would be most impactful to resolve, and it increases the data available for developing resolution strategies for downstream tools.

  1. The collision records folders contain completed collision records (which can always be updated) and a blank sample record to use as a template.
  2. The contributing documentation explains the features that are included in a collision record.
  3. To propose an update or add a new collision record, create a personal fork of the repo. New collision records should be YAML files named after the collision and placed in either the alias-alias or the alias-primary collision records folder. Once the record is created, open a pull request to start the review process.

Contact Information

For any feedback, questions, or conversation, please open an issue.

gene-harmony-analysis's Issues

Automatic Fill Information Fields

I wonder if there is a way to help fill out collision records by automatically populating the ENSG ID and chromosomal location fields.

Should there be a field for which source the collision is present in?

Would it add value to include a field in the collision record that describes which source (NCBI, HGNC, or Ensembl) features that collision for that gene? For example, in the VH collision record, only NCBI uses VH as an alias, and therefore the VH collision is present only in NCBI.
