The gene-normalization from cancervariants

Improve performance for creating concept groups

Creating concept groups is slow, and it is even slower in the production environment. We should look into speeding this up.

Use latest version of ga4gh.vrs

Use source's latest data version

NCBI
Ensembl

Create normalize endpoint

The normalize endpoint should generate a single, merged concept for search terms.

Add any additional Ensembl and Entrez data used in DGIdb

Ensembl: biotype

Consider switching current partition and sort keys

We currently add a GSI on concept_id in #97. However, we should see whether we can use concept_id as the partition key and label_and_type as the sort key, which would avoid creating the extra GSI. We did not do this in #97 in the interest of time.
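A minimal sketch of what the proposed key layout could look like, written as the keyword arguments DynamoDB's `create_table` accepts. The table name `gene_concepts` and the billing mode are assumptions for illustration; only `concept_id` and `label_and_type` come from the issue.

```python
# Sketch only: builds the table definition as a plain dict and does not call AWS.
# "gene_concepts" and PAY_PER_REQUEST are assumed values, not taken from the repo.
def build_table_definition(table_name: str = "gene_concepts") -> dict:
    return {
        "TableName": table_name,
        "KeySchema": [
            # Partition key: all items for one concept share the same partition.
            {"AttributeName": "concept_id", "KeyType": "HASH"},
            # Sort key: distinguishes the items stored under one concept.
            {"AttributeName": "label_and_type", "KeyType": "RANGE"},
        ],
        "AttributeDefinitions": [
            {"AttributeName": "concept_id", "AttributeType": "S"},
            {"AttributeName": "label_and_type", "AttributeType": "S"},
        ],
        "BillingMode": "PAY_PER_REQUEST",
    }


table_definition = build_table_definition()
```

One thing to check before switching: a DynamoDB query must supply the partition key, so any lookup that only knows label_and_type would need its own index under this layout.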

Add HGNC Alternative Loci Set

We currently use only the non-alternative loci set. We should also include the alternative loci set from the download page.

Fix normalize response

Add lookup by xref

Return all matches for a gene

Not just the strongest match per source

NCBI Source Meta

NCBI uses three different files (history, info, gff). History and info data are updated daily, but gff data is versioned by assembly. We currently record the timestamp at which we retrieve the data; this should be fixed to use the timestamp from the FTP site instead. I think we should consider storing metadata for each file. Also, the current source meta does not indicate which files were used and instead points to the FTP site.

Normalize match type bug

The merged concept for hgnc:37133 has alternate_labels: "A1BGAS", "FLJ23569", "NCRNA00181", "A1BG-AS". Querying these alternate_labels returns different match_type scores, when they theoretically should return the same score.

Clean Up Repo

Better documentation
- Add type hints
DRY
Remove unused code
Rather than using vrs-python's VRS models, use ga4gh.vrsatile.pydantic models in gene.vrs_locations
Check if we can remove EBSampleApp-Python.iml?
Add flake8-annotations + double quotes
String enums in schemas
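As a rough illustration of the "String enums in schemas" item, here is a sketch of a string-valued enum using only the standard library. The class name `SourceName` and its members are assumptions for this example, not necessarily the names used in the repo's schemas.

```python
import json
from enum import Enum


# Hypothetical schema enum; mixing in str makes members behave like plain strings.
class SourceName(str, Enum):
    HGNC = "HGNC"
    ENSEMBL = "Ensembl"
    NCBI = "NCBI"


# Members compare equal to their string values and serialize as plain strings.
payload = json.dumps({"source": SourceName.ENSEMBL})
parsed = SourceName("NCBI")
```

Mixing in `str` keeps responses JSON-friendly while still rejecting values outside the enum when parsing input.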

Import architecture updates from therapy-normalization

Update to latest version of VRS and VRSATILE

Some models/fields have been renamed or deprecated

EB use python 3.8

Our EB currently uses python 3.7. We should upgrade to 3.8.

Implement HGNC Normalizer

https://www.genenames.org/download/statistics-and-files/

Allow SEQREPO_DATA_PATH to be set by env var
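One possible shape for this, as a sketch: read `SEQREPO_DATA_PATH` from the environment and fall back to a default. The default path and the helper name `get_seqrepo_data_path` are assumptions for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Assumed fallback location; the real default may differ.
DEFAULT_SEQREPO_DATA_PATH = "/usr/local/share/seqrepo/latest"


def get_seqrepo_data_path(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return the SeqRepo data directory, preferring the SEQREPO_DATA_PATH env var."""
    env = os.environ if environ is None else environ
    return Path(env.get("SEQREPO_DATA_PATH", DEFAULT_SEQREPO_DATA_PATH))
```

Passing the environment as an optional argument keeps the helper easy to test without mutating `os.environ`.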

Add url to OpenAPI schema

Add env var for seqrepo_data_path

Capture previous gene identifiers from NCBI

NCBI has retired gene identifiers in the past, e.g.:

ncbigene:401317 now maps to ncbigene:9586. Our normalizer should match the old ID to the current record. This should be treated analogous to the "previous symbols" attribute in HGNC.
ncbigene:103344718 is a discontinued gene. We should normalize to concepts like this, but also have a status attribute that makes it clear this is no longer considered a gene. We should also emit a warning in our warnings attribute for each such entry, akin to: ncbigene:103344718 is a discontinued gene concept.
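The two cases above could be handled roughly as follows. This is a sketch over in-memory dictionaries: the record fields, the `status` values "live" and "discontinued", and the function name `lookup` are all assumptions; only the identifiers and the warning wording come from the issue.

```python
from typing import Dict, List, Optional, Tuple

# Toy stand-ins for the database; field names and status values are hypothetical.
RECORDS: Dict[str, dict] = {
    "ncbigene:9586": {"concept_id": "ncbigene:9586", "status": "live"},
    "ncbigene:103344718": {"concept_id": "ncbigene:103344718", "status": "discontinued"},
}
# Retired identifier -> identifier of the record that replaced it.
PREVIOUS_IDS: Dict[str, str] = {"ncbigene:401317": "ncbigene:9586"}


def lookup(concept_id: str) -> Tuple[Optional[dict], List[str]]:
    """Resolve a possibly retired ID to its current record, plus any warnings."""
    warnings: List[str] = []
    current_id = PREVIOUS_IDS.get(concept_id, concept_id)
    record = RECORDS.get(current_id)
    if record is not None and record["status"] == "discontinued":
        warnings.append(f"{current_id} is a discontinued gene concept")
    return record, warnings
```

For example, `lookup("ncbigene:401317")` returns the ncbigene:9586 record with no warnings, while `lookup("ncbigene:103344718")` returns the discontinued record together with the warning.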

VRS locations

When specifying locations, we should use VRS Location objects.

ChromosomeLocation for the ISCN-style entries in the HGNC "location" field

SequenceLocation for the Chr/Start/Stop entries from ensembl.

This should reduce the following attributes:
seqid
start
stop
strand
location

down to:
location: (VRS Location)
strand: enum( '+', '-', Null)
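To make the reduction concrete, here is a sketch using plain dataclasses as stand-ins for the VRS models. `SequenceLocationSketch`, `GeneLocation`, and `reduce_location` are hypothetical names, and the sketch does not perform any coordinate-system conversion that real VRS objects might require.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strand(str, Enum):
    FORWARD = "+"
    REVERSE = "-"


@dataclass
class SequenceLocationSketch:
    """Stand-in for a VRS SequenceLocation; not the real model."""
    sequence_id: str
    start: int
    end: int


@dataclass
class GeneLocation:
    location: SequenceLocationSketch
    strand: Optional[Strand]  # None when the source gives no strand


def reduce_location(seqid: str, start: int, stop: int,
                    strand: Optional[str]) -> GeneLocation:
    """Fold seqid/start/stop into one location object and keep strand separate."""
    return GeneLocation(
        location=SequenceLocationSketch(sequence_id=seqid, start=start, end=stop),
        strand=Strand(strand) if strand is not None else None,
    )
```

An unexpected strand value such as "?" raises `ValueError` here, which makes bad source data visible at load time.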

Add ga4gh.vrsatile.pydantic to setup.cfg

Update validation for schemas

@jarbesfeld's GH Actions in py-gene-fusions are failing due to our schema classes

Add other_identifier match type and DB reference

Update schema examples

Forgot to update schema examples to reflect VRS/VRSATILE updates

Add validator methods for GeneDescriptor, SequenceLocation, and ChromosomeLocation

@jarbesfeld will be using these models in py-gene-fusions

Accessing local files

Add an option to CLI to use local files rather than downloading from the source's FTP site
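A sketch of what such an option might look like, using the standard library's `argparse`. The flag name `--use-local-files` is an assumption, and the project's actual CLI may be built with a different library.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Load gene source data (sketch).")
    # Hypothetical flag: when set, skip the FTP download and read existing files.
    parser.add_argument(
        "--use-local-files",
        action="store_true",
        help="use files already on disk instead of downloading from the source's FTP site",
    )
    return parser


args = build_parser().parse_args(["--use-local-files"])
default_args = build_parser().parse_args([])
```

With `action="store_true"` the option defaults to `False`, so the existing download behavior stays the default.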

Dockerfile

A docker container would be useful

FTP Download

Switch to downloading files from FTP sites

Implement NCBI Normalizer

https://www.ncbi.nlm.nih.gov/home/download/

Rearrange imports

This will help with going serverless

other_id -> xref, xref -> associated_with

Test data

Consider creating sample data to test ETL methods. If we don't go this route, we should clean up the current test data

Get app logs to show in EB

Automate pypi release using GH Actions

Update search and normalize response

Set use_enum_values in pydantic model config
Change response_datetime to str

Apr  7 00:37:15 ip-10-130-14-142 web: File "/var/app/current/gene/main.py", line 114, in normalize
Apr  7 00:37:15 ip-10-130-14-142 web: resp = query_handler.normalize(html.unescape(q))
Apr  7 00:37:15 ip-10-130-14-142 web: File "/var/app/current/gene/query.py", line 483, in normalize
Apr  7 00:37:15 ip-10-130-14-142 web: matching_records.sort(key=self._record_order)
Apr  7 00:37:15 ip-10-130-14-142 web: File "/var/app/current/gene/query.py", line 412, in _record_order
Apr  7 00:37:15 ip-10-130-14-142 web: src = record['src_name'].upper()
Apr  7 00:37:15 ip-10-130-14-142 web: TypeError: 'NoneType' object is not subscriptable
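The traceback suggests that one entry in `matching_records` was `None` when the sort key was computed. One defensive option, sketched below, is to drop missing records before sorting. The source priority order and the standalone function names are assumptions; the real `_record_order` may rank records differently.

```python
from typing import List, Optional, Tuple

# Assumed priority order for illustration only.
SOURCE_PRIORITY = {"HGNC": 0, "ENSEMBL": 1, "NCBI": 2}


def record_order(record: dict) -> Tuple[int, str]:
    """Sort key: known sources first in priority order, then by concept ID."""
    src = record["src_name"].upper()
    return SOURCE_PRIORITY.get(src, len(SOURCE_PRIORITY)), record.get("concept_id", "")


def sort_matching_records(matching_records: List[Optional[dict]]) -> List[dict]:
    """Drop None entries (failed lookups) before sorting, avoiding the TypeError."""
    present = [record for record in matching_records if record is not None]
    present.sort(key=record_order)
    return present
```

Filtering avoids the crash, but it may be worth also logging which lookup returned `None`, since that likely points to a dangling reference in the data.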
