genome-nexus / genome-nexus-importer

Import data into MongoDB for use by https://github.com/genome-nexus/genome-nexus/

License: MIT License


genome-nexus-importer's Introduction

Genome Nexus 🧬

Genome Nexus is a comprehensive one-stop resource for fast, automated and high-throughput annotation and interpretation of genetic variants in cancer. It integrates information from a variety of existing resources, including databases that convert DNA changes to protein changes, predict the functional effects of protein mutations, and contain information about mutation frequencies, gene function, variant effects, and clinical actionability.

Documentation 📖

See the docs

Run 💻

Alternative 1 - run genome-nexus, MongoDB and genome-nexus-vep in Docker containers

First, set environment variables for the Ensembl release, the VEP assembly, the location of the VEP cache, and the species (a mouse installation is also supported). If these are not set, the default values from .env will be used.

The reference genome and Ensembl release must be consistent with a version in genome-nexus-importer/data/, for example grch37_ensembl92, grch38_ensembl92 or grch38_ensembl95:

export REF_ENSEMBL_VERSION=grch38_ensembl92

If you want to set up Genome Nexus for mouse, also set the SPECIES variable to 'mus_musculus'. See the docs on creating a mouse database.

export SPECIES=mus_musculus

If you would like to run VEP annotation locally instead of using the public Ensembl API, uncomment # gn_vep.region.url=http://localhost:6060/vep/human/region/VARIANT in your application.properties. This requires downloading the VEP cache files for the preferred Ensembl release and reference genome; see our documentation on downloading the Genome Nexus VEP cache. The download takes several hours.
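For reference, after uncommenting, the property would look like this (a minimal sketch; the file path is an assumption based on the standard Maven resources layout of the genome-nexus repository):

# in web/src/main/resources/application.properties, remove the leading '# ' so the line reads:
gn_vep.region.url=http://localhost:6060/vep/human/region/VARIANT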

# Set local cache dir
export VEP_CACHE=<local_vep_cache>

# GRCh38 or GRCh37
export VEP_ASSEMBLY=GRCh38

Run docker-compose to create images and containers:

docker-compose up --build -d
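To verify that the stack came up, you can list the services and run a quick smoke test against the annotation endpoint (a sketch; port 8888 assumes the default mapping in the project's docker-compose file, and the variant is just an example in HGVS genomic notation):

# list the services and their status
docker-compose ps

# smoke test against the local API (port 8888 is an assumption)
curl 'http://localhost:8888/annotation/17:g.41242962_41242963insGA'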

Run without recreating images:

docker-compose up -d

Run without Genome Nexus VEP:

# Start both the Web and DB (dependency of Web) containers
docker-compose up -d web

Stop and remove containers:

docker-compose down
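If you also want to discard any volumes declared in the compose file (for example a MongoDB data volume, assuming one is defined), add the -v flag:

docker-compose down -v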

Alternative 2 - run genome-nexus locally, but MongoDB in a Docker container

# the genomenexus/gn-mongo image comes with all the required tables imported
# change latest to a different version if necessary (this only needs to be run once)
docker run --name=gn-mongo --restart=always -p 27017:27017 -d genomenexus/gn-mongo:latest 
mvn  -DskipTests clean install
java -jar web/target/web-*.war
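As a quick sanity check (a sketch; port 8888 assumes the application's default port):

# confirm the seeded MongoDB container is running; the gn-mongo image ships with the data pre-imported
docker logs gn-mongo | tail

# once the Java app has started, query the local API
curl 'http://localhost:8888/annotation/17:g.41242962_41242963insGA'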

Alternative 3 - install MongoDB locally and run with local Java

Install MongoDB manually, then follow the instructions in genome-nexus-importer to initialize the database.

After that, run:

mvn clean install
java -jar web/target/web-*.war
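If your local MongoDB does not listen on the URI the application expects, the connection can be overridden with the standard Spring Boot property (a sketch; the database name annotator is an assumption based on the prebuilt gn-mongo image):

# point the app at a custom MongoDB instance (database name "annotator" is an assumption)
java -Dspring.data.mongodb.uri=mongodb://localhost:27017/annotator -jar web/target/web-*.war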

Test Status 👷‍♀️

master branch: Build Status
rc branch: Build Status

Deploy 🚀

Deploy

genome-nexus-importer's People

Contributors

ao508, averyniceday, inodb, jeffquinn-msk, leexgh, nr23730, onursumer, pieterlukasse, sheridancbio


genome-nexus-importer's Issues

Import Cancer Hotspots Data

Basic workflow:

Feature request to show "germline biallelic" statistics in the portal

Show "# germline homozygous" in gene and variant level pages.

  • In the gene-level page, show a "# germline homozygous" column in the table between the "% Prevalence" and "Cancer type" columns. This data is in the "n_germline_homozygous" column of 'signaldb_all_variants_frequencies.txt'.
  • In the variant-level page, show a "# germline homozygous" column in the table between the "# Carriers" and "% Prevalence" columns. This data is in the "n_germline_homozygous" column of 'signaldb_variants_by_cancertype_summary_statistics.txt'.

In the variant-level page, also show "% germline homozygous" in the "cancer patient prevalence" field. This data is in the "signal.pancancer_somatic_germline_stats.txt" file (see the sketch after the example below for locating these columns). For example:

in https://www.signaldb.org/variant/1:g.45797228C%3ET
cancer patient prevalence: Germline 0.8% (Biallelic: 18.4%, Germline homozygous: 0.02%)
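A small sketch for locating these columns in the files above (it assumes the files are tab-separated with a header row):

# print the header fields with their positions and pick out the germline columns
head -1 signaldb_all_variants_frequencies.txt | tr '\t' '\n' | grep -n germline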

Thank you.

Pfam domain type should be integer

For example, the EGFR Pfam domains from ensembl_biomart_transcripts.json.gz:

  "domains": [
    {
      "pfam_domain_id": "PF14843",
      "pfam_domain_start": 505.0,
      "pfam_domain_end": 636.0
    },
    {
      "pfam_domain_id": "PF01030",
      "pfam_domain_start": 361.0,
      "pfam_domain_end": 480.0
    },
    {
      "pfam_domain_id": "PF01030",
      "pfam_domain_start": 57.0,
      "pfam_domain_end": 167.0
    },
    {
      "pfam_domain_id": "PF07714",
      "pfam_domain_start": 713.0,
      "pfam_domain_end": 965.0
    },
    {
      "pfam_domain_id": "PF00757",
      "pfam_domain_start": 185.0,
      "pfam_domain_end": 338.0
    }
  ],

The Pfam domain start and end values should be integers instead of floats.
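The proper fix probably belongs in the import pipeline that writes this file, but the intended change can be sketched as a jq pass over the export (assumptions: the .json.gz holds one JSON document per line, as consumed by mongoimport, and every domain entry carries numeric start/end values):

# cast the Pfam domain coordinates to integers (assumes newline-delimited JSON)
zcat ensembl_biomart_transcripts.json.gz \
  | jq -c 'if .domains then .domains |= map(.pfam_domain_start |= floor | .pfam_domain_end |= floor) else . end' \
  | gzip > ensembl_biomart_transcripts.fixed.json.gz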

Make a file with unique protein sequences and associated IDs

Maybe we can create a file that has one row per unique protein sequence and its associated IDs? Each column could, for example, be a database, so you get something like:

unique protein sequence ensembl_grch37_vxx_protein ensembl_grch37_vxx_transcript ensembl_grch38_vxx_protein ensembl_grch38_vxx_transcript uniprot
RRRRR ENSPxxx ENSTyyyyy ENSPxxx ENSTyyyyy Pzzzz

We can then reuse this file for the UniProt, OncoKB and hotspot transcript assignments. It also makes it easy to add other protein resources later.

Conda environment for import pipeline

To generate a new dataset for Genome Nexus, the file data/Makefile has to be executed, which has several Python and R dependencies. These could be listed in a Conda dependencies file, or even packaged as a Conda environment.
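A minimal sketch of such an environment file (only python and r-base are listed; the actual package list would have to come from data/Makefile and the scripts it calls):

# environment.yml (sketch) -- fill in the pipeline's actual Python/R packages
name: genome-nexus-importer
channels:
  - conda-forge
dependencies:
  - python
  - r-base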

Transcript analysis (GRCh37/38) - Log

Biomart mapping file: genes that do not have an Entrez gene ID

genes not in cBioPortal

Hugo symbols that do not match cBioPortal

Problem (mismatch) transcripts

problem_transcripts.txt

gene protein length check (genes without a protein length do not have Pfam domains, and vice versa)

OncoKB issues

The good news is that, for both GRCh37 and GRCh38, they are using the same transcript.
But there are still two issues:

  • the transcripts GN uses and the ones OncoKB uses are different (I used the msk-transcript column from GN)
  • some of the Hugo symbols are different
    grch37_mismatch_gn_oncokb.txt

Hotspots data not grch38 compatible

Hotspots data has not been ported to GRCh38 yet.

E.g. these two files still contain exactly the same transcript IDs for both GRCh37 and GRCh38:

We need to update the GRCh38 version to contain the updated transcript IDs.

E.g. for BRAF this would probably be ENST00000646891 instead of ENST00000288602. See also: https://ensembl.org/homo_sapiens/Transcript/Summary?t=ENST00000288602
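A quick way to confirm that two exports still share the same transcript IDs (a sketch; the file paths are placeholders for the two hotspot files referenced above):

# set these to the two hotspot files being compared (placeholders)
GRCH37_HOTSPOTS=<grch37 hotspots file>
GRCH38_HOTSPOTS=<grch38 hotspots file>

# empty output means both builds still list exactly the same transcript IDs
comm -3 <(grep -o 'ENST[0-9]*' "$GRCH37_HOTSPOTS" | sort -u) \
        <(grep -o 'ENST[0-9]*' "$GRCH38_HOTSPOTS" | sort -u)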
