genome-nexus / genome-nexus-importer

Import data into MongoDB for use by https://github.com/genome-nexus/genome-nexus/

License: MIT License


genome-nexus-importer's Introduction

Genome Nexus 🧬

Genome Nexus is a comprehensive one-stop resource for fast, automated and high-throughput annotation and interpretation of genetic variants in cancer. It integrates information from a variety of existing resources, including databases that convert DNA changes to protein changes, predict the functional effects of protein mutations, and contain information about mutation frequencies, gene function, variant effects, and clinical actionability.

Documentation 📖

See the docs

Run 💻

Alternative 1 - run genome-nexus, MongoDB and genome-nexus-vep in Docker containers

First, set environment variables for the Ensembl release, the VEP assembly, the location of the VEP cache, and the species (a mouse installation is also supported). If these are not set, the default values from .env will be used.

The reference genome and Ensembl release must be consistent with a version in genome-nexus-importer/data/, for example grch37_ensembl92, grch38_ensembl92 or grch38_ensembl95:

export REF_ENSEMBL_VERSION=grch38_ensembl92

If you want to set up Genome Nexus for mouse, also set the SPECIES variable to 'mus_musculus'. See the docs on creating a mouse database.

export SPECIES=mus_musculus

If you would like to run VEP annotation locally instead of using the public Ensembl API, uncomment # gn_vep.region.url=http://localhost:6060/vep/human/region/VARIANT in your application.properties. This requires downloading the VEP cache files for the preferred Ensembl release and reference genome; see our documentation on downloading the Genome Nexus VEP cache. The download takes several hours.
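For reference, after uncommenting, the property would look like this (a minimal sketch; the file path is an assumption based on the standard Maven resources layout of the genome-nexus repository):

# in web/src/main/resources/application.properties, remove the leading '# ' so the line reads:
gn_vep.region.url=http://localhost:6060/vep/human/region/VARIANT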

# Set local cache dir
export VEP_CACHE=<local_vep_cache>

# GRCh38 or GRCh37
export VEP_ASSEMBLY=GRCh38

Run docker-compose to create images and containers:

docker-compose up --build -d
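To verify that the stack came up, you can list the services and run a quick smoke test against the annotation endpoint (a sketch; port 8888 assumes the default mapping in the project's docker-compose file, and the variant is just an example in HGVS genomic notation):

# list the services and their status
docker-compose ps

# smoke test against the local API (port 8888 is an assumption)
curl 'http://localhost:8888/annotation/17:g.41242962_41242963insGA'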

Run without recreating images:

docker-compose up -d

Run without Genome Nexus VEP:

# Start both the Web and DB (dependency of Web) containers
docker-compose up -d web

Stop and remove containers:

docker-compose down
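If you also want to discard any volumes declared in the compose file (for example a MongoDB data volume, assuming one is defined), add the -v flag:

docker-compose down -v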

Alternative 2 - run genome-nexus locally, but MongoDB in a Docker container

# the genomenexus/gn-mongo image comes with all the required tables imported
# change latest to a different version if necessary (this only needs to be run once)
docker run --name=gn-mongo --restart=always -p 27017:27017 -d genomenexus/gn-mongo:latest 
mvn  -DskipTests clean install
java -jar web/target/web-*.war
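As a quick sanity check (a sketch; port 8888 assumes the application's default port):

# confirm the seeded MongoDB container is running; the gn-mongo image ships with the data pre-imported
docker logs gn-mongo | tail

# once the Java app has started, query the local API
curl 'http://localhost:8888/annotation/17:g.41242962_41242963insGA'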

Alternative 3 - install MongoDB locally and run with local Java

Install MongoDB manually, then follow the instructions in genome-nexus-importer to initialize the database.

After that, run:

mvn clean install
java -jar web/target/web-*.war
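If your local MongoDB does not listen on the URI the application expects, the connection can be overridden with the standard Spring Boot property (a sketch; the database name annotator is an assumption based on the prebuilt gn-mongo image):

# point the app at a custom MongoDB instance (database name "annotator" is an assumption)
java -Dspring.data.mongodb.uri=mongodb://localhost:27017/annotator -jar web/target/web-*.war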

Test Status 👷‍♀️

master branch: Build Status
rc branch: Build Status

Deploy 🚀

Deploy

genome-nexus-importer's People

Contributors

ao508, averyniceday, inodb, jeffquinn-msk, leexgh, nr23730, onursumer, pieterlukasse, sheridancbio


genome-nexus-importer's Issues

Import Cancer Hotspots Data

Basic workflow:

Feature request to show "germline biallelic" statistics in the portal

Show "# germline homozygous" in gene and variant level pages.

  • In the gene-level page, show a "# germline homozygous" column in the table between the "% Prevalence" and "Cancer type" columns. This data is in the "n_germline_homozygous" column of 'signaldb_all_variants_frequencies.txt'.
  • In the variant-level page, show a "# germline homozygous" column in the table between the "# Carriers" and "% Prevalence" columns. This data is in the "n_germline_homozygous" column of 'signaldb_variants_by_cancertype_summary_statistics.txt'.

In the variant-level page, also show "% germline homozygous" in the "cancer patient prevalence" field. This data is in the "signal.pancancer_somatic_germline_stats.txt" file (see the sketch after the example below for locating these columns). For example:

in https://www.signaldb.org/variant/1:g.45797228C%3ET
cancer patient prevalence: Germline 0.8% (Biallelic: 18.4%, Germline homozygous: 0.02%)
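A small sketch for locating these columns in the files above (it assumes the files are tab-separated with a header row):

# print the header fields with their positions and pick out the germline columns
head -1 signaldb_all_variants_frequencies.txt | tr '\t' '\n' | grep -n germline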

Thank you.

Pfam domain type should be integer

For example, the EGFR Pfam domains from ensembl_biomart_transcripts.json.gz:

  "domains": [
    {
      "pfam_domain_id": "PF14843",
      "pfam_domain_start": 505.0,
      "pfam_domain_end": 636.0
    },
    {
      "pfam_domain_id": "PF01030",
      "pfam_domain_start": 361.0,
      "pfam_domain_end": 480.0
    },
    {
      "pfam_domain_id": "PF01030",
      "pfam_domain_start": 57.0,
      "pfam_domain_end": 167.0
    },
    {
      "pfam_domain_id": "PF07714",
      "pfam_domain_start": 713.0,
      "pfam_domain_end": 965.0
    },
    {
      "pfam_domain_id": "PF00757",
      "pfam_domain_start": 185.0,
      "pfam_domain_end": 338.0
    }
  ],

The Pfam domain start and end values should be integers instead of floats.
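The proper fix probably belongs in the import pipeline that writes this file, but the intended change can be sketched as a jq pass over the export (assumptions: the .json.gz holds one JSON document per line, as consumed by mongoimport, and every domain entry carries numeric start/end values):

# cast the Pfam domain coordinates to integers (assumes newline-delimited JSON)
zcat ensembl_biomart_transcripts.json.gz \
  | jq -c 'if .domains then .domains |= map(.pfam_domain_start |= floor | .pfam_domain_end |= floor) else . end' \
  | gzip > ensembl_biomart_transcripts.fixed.json.gz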

Make a file with unique protein sequences and associated IDs

Maybe we can create a file that has one row per unique protein sequence and its associated IDs? Each column could, for example, be a database, so you get something like:

unique protein sequence ensembl_grch37_vxx_protein ensembl_grch37_vxx_transcript ensembl_grch38_vxx_protein ensembl_grch38_vxx_transcript uniprot
RRRRR ENSPxxx ENSTyyyyy ENSPxxx ENSTyyyyy Pzzzz

We can then reuse this file for the UniProt, OncoKB and hotspot transcript assignments. It also makes it easy to add other protein resources later.

Conda environment for import pipeline

To generate a new dataset for Genome Nexus, the file data/Makefile has to be executed, which has several Python and R dependencies. These could be listed in a Conda dependencies file, or even packaged as a Conda environment.
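A minimal sketch of such an environment file (only python and r-base are listed; the actual package list would have to come from data/Makefile and the scripts it calls):

# environment.yml (sketch) -- fill in the pipeline's actual Python/R packages
name: genome-nexus-importer
channels:
  - conda-forge
dependencies:
  - python
  - r-base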

Transcript analysis (GRCh37/38) - Log

Biomart mapping file: genes that do not have an Entrez gene ID

genes not in cBioPortal

Hugo symbols that do not match cBioPortal

Problem (mismatch) transcripts

problem_transcripts.txt

gene protein length check (genes without a protein length do not have Pfam domains, and vice versa)

OncoKB issues

The good news is that, for both GRCh37 and GRCh38, they are using the same transcript.
But there are still two issues:

  • the transcripts GN uses and the ones OncoKB uses are different (I used the msk-transcript column from GN)
  • some of the Hugo symbols are different
    grch37_mismatch_gn_oncokb.txt

Hotspots data not grch38 compatible

Hotspots data has not been ported to GRCh38 yet.

E.g. these two files still contain exactly the same transcript IDs for both GRCh37 and GRCh38:

We need to update the GRCh38 version to contain the updated transcript IDs.

E.g. for BRAF this would probably be ENST00000646891 instead of ENST00000288602. See also: https://ensembl.org/homo_sapiens/Transcript/Summary?t=ENST00000288602
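A quick way to confirm that two exports still share the same transcript IDs (a sketch; the file paths are placeholders for the two hotspot files referenced above):

# set these to the two hotspot files being compared (placeholders)
GRCH37_HOTSPOTS=<grch37 hotspots file>
GRCH38_HOTSPOTS=<grch38 hotspots file>

# empty output means both builds still list exactly the same transcript IDs
comm -3 <(grep -o 'ENST[0-9]*' "$GRCH37_HOTSPOTS" | sort -u) \
        <(grep -o 'ENST[0-9]*' "$GRCH38_HOTSPOTS" | sort -u)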
