cancervariants / gene-normalization

Services and guidelines for normalizing genes

Home Page: https://gene-normalizer.readthedocs.io/latest/

License: MIT License

Python 99.77% Shell 0.08% Dockerfile 0.15%
bioinformatics biomedical-informatics genetics precision-medicine

gene-normalization's People

Contributors

ahwagner, jsstevenson, korikuzma, nickzoic, ohsu-machineuser

gene-normalization's Issues

Consider switching current partition and sort keys

We currently add a GSI on concept_id (see #97). However, we should check whether we can use concept_id as the partition key and label_and_type as the sort key, which would avoid creating the extra GSI. We did not do this in #97 in the interest of time.
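As a rough sketch of the proposed key layout, the table definition could look like the dictionary below, which follows the shape of a DynamoDB CreateTable request. The attribute names come from the issue text; the table name and billing mode are placeholders, and nothing here reflects the project's actual table setup.

```python
def build_table_definition(table_name: str = "gene_concepts") -> dict:
    """Return a CreateTable-style definition with the swapped keys.

    "gene_concepts" is a hypothetical table name used only for illustration.
    """
    return {
        "TableName": table_name,
        "KeySchema": [
            # concept_id becomes the partition (HASH) key
            {"AttributeName": "concept_id", "KeyType": "HASH"},
            # label_and_type becomes the sort (RANGE) key
            {"AttributeName": "label_and_type", "KeyType": "RANGE"},
        ],
        "AttributeDefinitions": [
            {"AttributeName": "concept_id", "AttributeType": "S"},
            {"AttributeName": "label_and_type", "AttributeType": "S"},
        ],
        "BillingMode": "PAY_PER_REQUEST",  # placeholder choice
    }


definition = build_table_definition()
```

One thing to verify before switching: a DynamoDB query must supply the partition key, so any lookup that currently starts from label_and_type alone would need a different access path under this layout.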

NCBI Source Meta

NCBI uses three different files (history, info, gff). The history and info data are updated daily, but the gff data is versioned by assembly. We currently record the timestamp at which we retrieve the data; this should be fixed to use the timestamp from the FTP site instead. I think we should consider storing metadata for each file. Also, the current source meta does not indicate which files were used and instead points only to the FTP site.
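One possible shape for per-file metadata is sketched below. The class, field names, URLs, and version strings are all hypothetical placeholders rather than the project's schema; the only details taken from the issue are the three file kinds and the idea that each carries its own version.

```python
from dataclasses import asdict, dataclass
from typing import Dict

EXPECTED_FILES = {"history", "info", "gff"}


@dataclass(frozen=True)
class SourceFileMeta:
    """Metadata for a single NCBI file (hypothetical structure)."""

    url: str      # where this specific file was retrieved from
    version: str  # FTP timestamp for history/info, assembly name for gff


def build_ncbi_meta(files: Dict[str, SourceFileMeta]) -> dict:
    """Collect per-file metadata, failing loudly if a file is missing."""
    missing = EXPECTED_FILES - set(files)
    if missing:
        raise ValueError(f"Missing metadata for: {sorted(missing)}")
    return {name: asdict(meta) for name, meta in files.items()}


# Placeholder values only; real URLs and versions would come from the FTP site.
meta = build_ncbi_meta(
    {
        "history": SourceFileMeta("ftp://example.org/history", "ftp-timestamp"),
        "info": SourceFileMeta("ftp://example.org/info", "ftp-timestamp"),
        "gff": SourceFileMeta("ftp://example.org/gff", "assembly-name"),
    }
)
```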

Normalize match type bug

The merged concept for hgnc:37133 has alternate_labels: "A1BGAS", "FLJ23569", "NCRNA00181", "A1BG-AS". Querying these alternate_labels returns different match_type scores, even though they should all return the same score.
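A regression check for this could be sketched as follows. The helper names are hypothetical, and `fake_get_match_type` is a stand-in for a real query whose scores are placeholders, not values observed from the service; only the four labels come from the issue.

```python
from typing import Callable, Dict, Iterable

ALTERNATE_LABELS = ["A1BGAS", "FLJ23569", "NCRNA00181", "A1BG-AS"]


def match_types_by_label(
    get_match_type: Callable[[str], int], labels: Iterable[str]
) -> Dict[str, int]:
    """Query each label once and record the match_type that came back."""
    return {label: get_match_type(label) for label in labels}


def labels_disagree(results: Dict[str, int]) -> bool:
    """True when the labels did not all receive the same match_type."""
    return len(set(results.values())) > 1


def fake_get_match_type(label: str) -> int:
    # Stand-in that mimics the bug; the scores are made up for illustration.
    return 60 if "-" in label else 80


results = match_types_by_label(fake_get_match_type, ALTERNATE_LABELS)
if labels_disagree(results):
    print("Inconsistent match_type values:", results)
```

In a real test, `fake_get_match_type` would be replaced by a call into the normalizer, and the assertion would be that `labels_disagree` returns False.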

Clean Up Repo

  • Better documentation
    • Add type hints
  • DRY
  • Remove unused code
  • Rather than using vrs-python's VRS models, use ga4gh.vrsatile.pydantic models in gene.vrs_locations
  • Check if we can remove EBSampleApp-Python.iml?
  • Add flake8-annotations + double quotes
  • String enums in schemas

Capture previous gene identifiers from NCBI

NCBI has retired gene identifiers in the past, e.g.:

  • ncbigene:401317 now maps to ncbigene:9586. Our normalizer should match the old ID to the current record. This should be treated analogous to the "previous symbols" attribute in HGNC.
  • ncbigene:103344718 is a discontinued gene. We should normalize to concepts like this, but also have a status attribute that makes it clear this is no longer considered a gene. We should also emit a warning in our warnings attribute for each such entry, akin to: ncbigene:103344718 is a discontinued gene concept.
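The two cases above could be handled roughly as in this sketch. The lookup tables, the `Resolution` class, and the status values are hypothetical; in practice the mappings would be loaded from the NCBI history file rather than hard-coded. The identifiers and the warning text are the ones given in this issue.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical in-memory tables standing in for data parsed from NCBI.
REPLACED_IDS = {"ncbigene:401317": "ncbigene:9586"}
DISCONTINUED_IDS = {"ncbigene:103344718"}


@dataclass
class Resolution:
    concept_id: str
    status: str  # "current" or "discontinued" (illustrative values)
    warnings: List[str] = field(default_factory=list)


def resolve_ncbi_id(query: str) -> Resolution:
    """Map a possibly retired NCBI gene ID to the record it should match."""
    concept_id = query.strip().lower()
    if concept_id in REPLACED_IDS:
        # Old identifier: match the current record, like a previous symbol.
        return Resolution(REPLACED_IDS[concept_id], "current")
    if concept_id in DISCONTINUED_IDS:
        # Still normalizable, but flagged as no longer considered a gene.
        warning = f"{concept_id} is a discontinued gene concept"
        return Resolution(concept_id, "discontinued", [warning])
    return Resolution(concept_id, "current")
```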

VRS locations

When specifying locations, we should use VRS Location objects.

ChromosomeLocation for the ISCN-style entries in the HGNC "location" field

SequenceLocation for the Chr/Start/Stop entries from Ensembl.

This should reduce the following attributes:
seqid
start
stop
strand
location

down to:
location: (VRS Location)
strand: enum( '+', '-', Null)
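The reduction could be sketched with plain dataclasses as below. These classes are simplified stand-ins written for illustration, not the VRS models themselves, and their field names are assumptions; the point is only that five flat attributes collapse into one location object plus a strand.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class Strand(str, Enum):
    FORWARD = "+"
    REVERSE = "-"


@dataclass(frozen=True)
class ChromosomeLocation:
    """Simplified stand-in for an ISCN-style (cytoband) location."""

    chr: str
    start: str
    end: str


@dataclass(frozen=True)
class SequenceLocation:
    """Simplified stand-in for a coordinate-based location."""

    sequence_id: str
    start: int
    end: int


@dataclass(frozen=True)
class GeneLocation:
    location: Union[ChromosomeLocation, SequenceLocation]
    strand: Optional[Strand] = None  # None covers the "Null" case


def from_flat_attributes(
    seqid: str, start: int, stop: int, strand: Optional[str]
) -> GeneLocation:
    """Fold the old seqid/start/stop/strand attributes into one object."""
    return GeneLocation(
        location=SequenceLocation(seqid, start, stop),
        strand=Strand(strand) if strand else None,
    )


# Placeholder coordinates, not a real gene.
example = from_flat_attributes("placeholder_seq", 100, 200, "-")
```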

Accessing local files

Add an option to the CLI to use local files rather than downloading them from the source's FTP site.
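A minimal sketch of such an option using the standard library's argparse is shown below. The flag names `--use-local` and `--data-dir`, the default directory, and the helper function are all hypothetical; the project's actual CLI may be built differently.

```python
import argparse
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Load gene source data.")
    parser.add_argument(
        "--use-local",
        action="store_true",
        help="Read source files from --data-dir instead of downloading them.",
    )
    parser.add_argument(
        "--data-dir",
        type=Path,
        default=Path("data"),  # hypothetical default location
        help="Directory holding previously downloaded source files.",
    )
    return parser


def resolve_source_file(args: argparse.Namespace, filename: str) -> Optional[Path]:
    """Return the local path when --use-local is set, else None.

    A None result means the caller should fall back to downloading.
    """
    if not args.use_local:
        return None
    path = args.data_dir / filename
    if not path.exists():
        raise FileNotFoundError(f"--use-local was given but {path} does not exist")
    return path
```

Failing loudly when the local file is missing, instead of silently downloading, keeps the flag's behavior predictable.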

Test data

Consider creating sample data to test the ETL methods. If we don't go this route, we should clean up the current test data.

Add xrefs attribute

Distinguish xrefs that represent gene concepts from those that represent associated concepts.

TPX2 raises TypeError in search and normalize

Apr  7 00:37:15 ip-10-130-14-142 web: File "/var/app/current/gene/main.py", line 114, in normalize
Apr  7 00:37:15 ip-10-130-14-142 web: resp = query_handler.normalize(html.unescape(q))
Apr  7 00:37:15 ip-10-130-14-142 web: File "/var/app/current/gene/query.py", line 483, in normalize
Apr  7 00:37:15 ip-10-130-14-142 web: matching_records.sort(key=self._record_order)
Apr  7 00:37:15 ip-10-130-14-142 web: File "/var/app/current/gene/query.py", line 412, in _record_order
Apr  7 00:37:15 ip-10-130-14-142 web: src = record['src_name'].upper()
Apr  7 00:37:15 ip-10-130-14-142 web: TypeError: 'NoneType' object is not subscriptable
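The traceback shows the sort key indexing into a record that is None. One defensive option, sketched below, is to drop None entries before sorting. The priority table and function names are hypothetical and not copied from gene/query.py, and this guard does not explain why a None record was produced for TPX2 in the first place.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical source ordering used only for this illustration.
SOURCE_PRIORITY = {"HGNC": 1, "ENSEMBL": 2, "NCBI": 3}


def record_order(record: Dict) -> Tuple[int, str]:
    """Sort key: source priority first, then concept ID."""
    src = record["src_name"].upper()
    priority = SOURCE_PRIORITY.get(src, len(SOURCE_PRIORITY) + 1)
    return (priority, record.get("concept_id", ""))


def sort_matching_records(records: List[Optional[Dict]]) -> List[Dict]:
    """Sort records after discarding lookups that returned None."""
    present = [record for record in records if record is not None]
    present.sort(key=record_order)
    return present


records = [
    {"src_name": "NCBI", "concept_id": "ncbigene:placeholder"},
    None,  # a failed lookup, as in the traceback
    {"src_name": "HGNC", "concept_id": "hgnc:placeholder"},
]
ordered = sort_matching_records(records)
```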

Validation Errors during load

We had been using vrs-python models for validation. The validators that were added to the schemas are now causing pydantic validation errors when loading sources.
