broadinstitute / oncotator

License: Other

Python 99.73% Shell 0.27%

oncotator's Introduction

Oncotator Is No Longer Supported or Maintained

Funcotator is a new Functional Annotation tool (the spiritual successor to Oncotator). It is:
Better: many bugs have been fixed and edge cases have been improved.
Faster: annotate more variants in less time.
Easier to use and deploy: a single jar with no tricky installation or dependencies, and a tool for fetching the datasources
Funcotator is available as part of the GATK toolkit and works out of the box with both Mutect2 and HaplotypeCaller for somatic and germline annotation, respectively.
It also has a Featured Workspace on Terra.
A Funcotator tutorial, as well as a full comparison of Funcotator and Oncotator and other helpful information can be found on the GATK website here:
https://gatk.broadinstitute.org/hc/en-us/articles/360035889931-Funcotator-Information-and-Tutorial
The github repository for GATK and Funcotator can be found here:
https://github.com/broadinstitute/gatk

======================
Oncotator

License

Oncotator is free for non-profit users. Please see the LICENSE file here for more information.

Package Overview

The name of the directory, oncotator, is also the name of the distribution. This distribution contains the oncotator package.

For more information: http://www.broadinstitute.org/cancer/cga/oncotator

This distribution is the standalone version of Oncotator. If you wish to use the web interface: http://www.broadinstitute.org/oncotator

Please note that the web interface uses an older version of Oncotator and older datasources.

All documentation can be found in the Oncotator forums: http://gatkforums.broadinstitute.org/categories/oncotator

Installation

Windows is currently unsupported because a dependency, pysam, does not support Windows.

IMPORTANT: You will need root access to your python interpreter or a python virtual environment. More information about virtual environments can be found on the following site: https://pypi.python.org/pypi/virtualenv

As a reminder, virtualenv.py can be run as a standalone script, thereby bypassing superuser requirements. Please see the above link for more details.

Before installing Oncotator, we recommend installing numpy and pyvcf manually. You may need to prepend each of the following commands with sudo:

$ pip install numpy
$ pip install pyvcf

This distribution is installable through the standard setup.py method. Note that Distribute will be installed as part of the setup process if it isn't already:

$ python setup.py install

Because setup.py specifies an entry point as a console script, oncotator and initializeDatasource will be installed into your Python's bin/ directory.

Unit Test Setup

NOTE: Unit tests require a minimum of 4GB to run.

Before running the unit tests for the first time, please perform the following steps:

Execute the following three lines in the same directory as setup.py:

$ mkdir -p out
$ ln -s test/configs configs
$ ln -s test/testdata testdata

Many unit tests rely on having the standard set of hg19 datasources, which are in a separate download. To point the unit testing framework to your datasources, you must create a personal test config:

$ cp configs/personal-test.config.template configs/personal-test.config
In configs/personal-test.config, replace ```dbDir=MY_DB_DIR/``` with ```dbDir=``` followed by the appropriate path to your Oncotator datasource directory.

Running the Automated Unit Tests (with Virtual Env Creation)

The automated unit tests (run_ci_tests.sh) require 6 GB to run. This can take a fair amount of time (~20 minutes), since a full install into a new virtual environment is performed.

Execute the following line in the same directory as setup.py (provide the appropriate path to the db dir with your datasources):

$ bash run_ci_tests.sh <DB_DIR>

Running the Automated Unit Tests (without Virtual Env Creation)

You can simply run the unit tests in the currently active Python environment, which takes much less time (< 6 minutes) but requires all dependencies to be installed. You must first follow the instructions in Unit Test Setup above (Steps 1 and 2), if you have not already done so. Then run (in the same directory as setup.py):

$ nosetests --all-modules --exe -w test -v --processes=4 --process-timeout=480  --process-restartworker

Please note that there is a known bug with --processes and output to XML. If you alter the above nosetests command to include junit xml (--with-xunit), remove the last three options (`--processes=4 --process-timeout=480 --process-restartworker`). This will cause tests to only run on one core.

Creating a Virtual Environment for Running Oncotator

Follow these steps from the same directory as setup.py. The first command will take several minutes:

bash scripts/create_oncotator_venv.sh <venv_location>
source <venv_location>/bin/activate
python setup.py install

Version Information

Once Oncotator is installed, run it with the -V flag to get version information:

$ Oncotator -V

Git Process Starting with v1.0.0.0 (Developers)

For an overview of the oncotator process for adding features, bugfixes, and general day-to-day branching, please see: http://nvie.com/posts/a-successful-git-branching-model/

Help

Please post questions, issues, and feature requests in the forum at http://gatkforums.broadinstitute.org/categories/oncotator

oncotator's Issues

Contig and alt issue

Contigs and alts in the header are parsed incorrectly, and thus, do not appear in the output file.

tsv file sorting needs to have RAM usage verified -- run on large file

We cannot close out the tsv file sorting in oncotator until a large file has been run and RAM usage monitored.

This may require a test that is not in the oncotator install package.

The maf file in the Kryuokov challenge would be a good one, but may need to have its comments and header trimmed.

This will also allow us to collect some timing information.

TypeError: 'NoneType' object is not iterable

1.0.0rc31
in here: /cga/tcga-gdac/germline/resources/esp6500SIdataV2/debugOncotator/

I run:
oncotator -i VCF -o VCF ESP6500SI-V2.snps_indels.head34.vcf ESP6500SI-V2.snps_indels.head34.oncotated.vcf hg19

and get this:

oncotator -i VCF -o VCF ESP6500SI-V2.snps_indels.head34.vcf ESP6500SI-V2.snps_indels.head34.oncotated.vcf hg19
2013-11-06 13:52:47,864 WARNING [oncotator.utils.ConfigUtils:186] Could not find config file (vcf.in.config). Trying configs/ prepend.
2013-11-06 13:52:47,871 WARNING [oncotator.utils.ConfigUtils:186] Could not find config file (vcf.out.config). Trying configs/ prepend.
Traceback (most recent call last):
File "/xchip/tcga/Tools/oncotator/onco_env/ubin/oncotator", line 9, in <module>
load_entry_point('Oncotator==v1.0.0.0rc31', 'console_scripts', 'oncotator')()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/Oncotator.py", line 224, in main
annotator.annotate()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/Annotator.py", line 236, in annotate
metadata = self._createMetadata()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/Annotator.py", line 195, in _createMetadata
metadata = self._inputCreator.getMetadata()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/input/VcfInputMutationCreator.py", line 460, in getMetadata
metadata = self._createMetadata()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/input/VcfInputMutationCreator.py", line 447, in _createMetadata
self._createConfigTable()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/input/VcfInputMutationCreator.py", line 251, in _createConfigTable
for sample in variant.samples:
TypeError: 'NoneType' object is not iterable

Maf => VCF should create Sample columns

Converting from a maf to a vcf should generate a vcf with a sample column for every sample in the maf.

Sample names would be expected in either the sampleName column or Tumor_Sample_Barcode column.

A configuration file would specify which fields should be treated as FORMAT fields instead of INFO fields.

All mutation lines in the maf which occur at the same position would be aggregated into 1 line in the vcf. Fields specified as format fields would be filled for each mutation in the appropriate sample column.

Example:

Certain fields probably have to have special handling, ex. dbSNP_RS should have special handling for populating the id field.
Genotypes should be derived from the Tumor_Seq_Allele1 and 2 fields.
config file (ignore the syntax if it doesn't match what exists already):

[FORMAT]
NORM: Matched_Norm_Sample_Barcode
DP: ReadDepth

example maf:

Chromosome  Start_position  End_position    Reference_Allele    Tumor_Seq_Allele1   Tumor_Seq_Allele2   dbSNP_RS    Tumor_Sample_Barcode    Matched_Norm_Sample_Barcode ReadDepth 
20  14370   14370   G   G   A   rs6054257   NA0001  NA0001-Normal   10
20  14370   14370   G   A   A   rs6054257   NA0002  NA0002-Normal   8
21  123090  123090  T   T   C       NA0001  NA0001-NORMAL   12

expected output vcf

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA0001  NA0002       
20  14370   rs6054257   G   A   .   PASS    .   GT:DP:NORM  0/1:10:NA0001-Normal    1/1:8:NA0002-Normal
21  123090  .   T   C   .   PASS    .   GT:DP:NORM  0/1:12:NA0001-NORMAL    .:.:.

VCF Output renderer and GLxxxxx.x contigs.

Contigs that start with GL need to be enclosed with < > in a vcf on the output. This needs to be implemented.

DEL encompassing entire exon should be a new variant classification

Currently, if a deletion overlaps a splice site in any way (even if it encompasses the whole splice site), then the variant classification will be Splice_Site.

However, if the deletion deletes one or more exons, entirely, there should be a new variant classification for this. Currently, this would just show up as a splice site.

Adding DNP, TNP, MNP logic back into Oncotator

While technically not correct, we would like Oncotator to fold neighboring SNPs into DNP, TNP, or MNP. Though this should be done in the mutation caller, we can assume that all neighboring SNPs are xNP.

This should perhaps be a post- or pre-processing step, but that is open for discussion.
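A minimal sketch of the folding described above, written for Python 3. The function name `fold_adjacent_snps`, the tuple layout of the input, and the rule that 2, 3, and 4+ adjacent bases map to DNP, TNP, and MNP are all assumptions for illustration; the input is assumed to be sorted and to come from a single sample.

```python
def fold_adjacent_snps(snps):
    """Merge runs of SNPs at consecutive positions into one xNP record.

    snps: list of (chrom, pos, ref, alt) single-base substitutions,
    sorted by chromosome and position (hypothetical input layout).
    """
    labels = {1: "SNP", 2: "DNP", 3: "TNP"}  # anything longer is labeled MNP
    merged = []
    for chrom, pos, ref, alt in snps:
        last = merged[-1] if merged else None
        if last and last["chr"] == chrom and last["end"] + 1 == pos:
            # Directly adjacent to the previous record: extend it.
            last["end"] = pos
            last["ref"] += ref
            last["alt"] += alt
        else:
            merged.append({"chr": chrom, "start": pos, "end": pos,
                           "ref": ref, "alt": alt})
    for record in merged:
        record["type"] = labels.get(len(record["ref"]), "MNP")
    return merged
```

Because the function only looks at the previous record, it works equally well as a pre-processing step on the input or a post-processing step on annotated output.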

initializeDatasource doc updates

The documentation has become stale, especially in the usage examples, given the latest merge. These need to be updated.

Also, examples using indexed tsv and indexed vcf would be appreciated. A new user needs to be able to put in initializeDatasource --help and be on their way.

get_summary_output_string utility function is broken

As it is written, get_summary_output_string function will clobber non-unique input strings. E.g. input of ['1','1','1'] will return '1' instead of '1|1|1'.
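The symptom can be reproduced with a short sketch. The real body of get_summary_output_string is not shown in this issue, so `summarize_unique` below is only a guess at the kind of logic that collapses repeats, and `summarize_all` is one possible fix; both names are hypothetical.

```python
def summarize_unique(values):
    # Hypothetical buggy behaviour: building a set collapses repeated values.
    return "|".join(sorted(set(values)))


def summarize_all(values):
    # Possible fix: keep every value, in input order.
    return "|".join(str(v) for v in values)


print(summarize_unique(["1", "1", "1"]))  # 1
print(summarize_all(["1", "1", "1"]))     # 1|1|1
```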

Default command line values should be read from a config file

Currently, default values for CLI parameters are hardcoded in Oncotator.py. This can cause confusion to external users, who will be unfamiliar with these paths and may get extraneous log messages when these paths are not found.

"too many values to unpack"

1.0.0rc31

running here: /cga/tcga-gdac/germline/callingWith1kg/ver4/calling/data/debugOncotator/

oncotator -v -i VCF -o VCF kiezun_cancer_germline.sites_only.head146.vcf kiezun_cancer_germline.sites_only.head146.oncotated.vcf hg19

I get this:
Verbose mode on
Path:
['/xchip/tcga/Tools/oncotator/onco_env/bin', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/pip-1.2.1-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/distribute-0.6.15-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Cython-0.17.4-py2.7-linux-x86_64.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/biopython-1.60-py2.7-linux-x86_64.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/pandas-0.10.0-py2.7-linux-x86_64.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/pytz-2012j-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/python_dateutil-2.1-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/six-1.2.0-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/SQLAlchemy-0.8.0b2-py2.7-linux-x86_64.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/shove-0.5.6-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/stuf-0.9.4-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/futures-2.1.3-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/parse-1.4.1-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/python_memcached-1.53-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/nose-1.3.0-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/bcbio_gff-0.2-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/pysam-0.7.5-py2.7-linux-x86_64.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg', '/broad/software/free/Linux/redhat_5_x86_64/pkgs/mercurial_1.9.3-python-2.7.1-sqlite3-rtrees/lib/python2.7/site-packages', 
'/broad/software/free/Linux/redhat_5_x86_64/pkgs/matplotlib_1.1.1-python-2.7.1-sqlite3-rtrees/lib/python2.7/site-packages', '/xchip/tcga/Tools/oncotator/onco_env/lib/python27.zip', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/plat-linux2', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/lib-tk', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/lib-old', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/lib-dynload', '/broad/software/free/Linux/redhat_5_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7', '/broad/software/free/Linux/redhat_5_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7/plat-linux2', '/broad/software/free/Linux/redhat_5_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7/lib-tk', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages']

2013-11-06 12:46:27,540 INFO [oncotator.Oncotator:178] Args: Namespace(cache_url=None, dbDir='/xchip/cga/reference/annotation/db/oncotator_v1_ds/', default_cli=[], default_config=None, genome_build='hg19', input_file='kiezun_cancer_germline.sites_only.head146.vcf', input_format='VCF', noMulticore=False, output_file='kiezun_cancer_germline.sites_only.head146.oncotated.vcf', output_format='VCF', override_cli=[], override_config=None, read_only_cache=False, skip_no_alt=False, tx_mode='CANONICAL', verbose=1)
2013-11-06 12:46:27,540 INFO [oncotator.Oncotator:179] Log file: /cga/tcga-gdac/germline/callingWith1kg/ver4/calling/data/debugOncotator/oncotator.log
Traceback (most recent call last):
File "/xchip/tcga/Tools/oncotator/onco_env/ubin/oncotator", line 9, in <module>
load_entry_point('Oncotator==v1.0.0.0rc31', 'console_scripts', 'oncotator')()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/Oncotator.py", line 220, in main
is_skip_no_alts=is_skip_no_alts)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/utils/OncotatorCLIUtils.py", line 328, in create_run_spec
inputCreator = OncotatorCLIUtils.create_input_creator(inputFilename, inputFormat)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/utils/OncotatorCLIUtils.py", line 296, in create_input_creator
inputCreator = inputCreatorDict[inputFormat][0](inputFilename, inputConfig)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/input/VcfInputMutationCreator.py", line 86, in __init__
self.vcf_reader = vcf.Reader(filename=self.filename, strict_whitespace=True)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/vcf/parser.py", line 225, in __init__
self._parse_metainfo()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/vcf/parser.py", line 267, in _parse_metainfo
key, val = parser.read_meta(line)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/vcf/parser.py", line 166, in read_meta
return self.read_meta_hash(meta_string)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/vcf/parser.py", line 161, in read_meta_hash
val = OrderedDict(item.split("=") for item in hashItems)
File "/broad/software/free/Linux/redhat_5_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7/collections.py", line 74, in __init__
self.update(*args, **kwds)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/_abcoll.py", line 499, in update
for key, value in other:
ValueError: too many values to unpack
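The failing line in the traceback builds an OrderedDict from `item.split("=")` pairs. Any item that contains more than one "=" then produces three or more fields, which cannot be unpacked into a key and a value. The offending header line is not shown here, so the sample items below are an assumption; the sketch (Python 3, hypothetical function name `parse_meta_items`) only illustrates the failure mode and one way around it.

```python
from collections import OrderedDict


def parse_meta_items(items):
    """Split each 'key=value' item on the first '=' only."""
    return OrderedDict(item.split("=", 1) for item in items)


# Assumed example of a header fragment whose value itself contains '='.
items = ["ID=AF", 'Description="AF=alt/total"']

try:
    OrderedDict(item.split("=") for item in items)
except ValueError as err:
    print("naive split fails:", err)

print(parse_meta_items(items))
```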

MAF start and end position

In writing VCF to TSV, the end position is smaller than start position for deletions.

VCF --> MAF does not change the ref and alt fields.

For example, a VCF will have TA/T (ref/alt), which should be A/- in a MAF, but it shows up as TA/T in the MAF.

Do not forget that the position must be modified as well.

This should only affect indels, not SNPs.
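The conversion can be sketched as follows (Python 3). The function name `vcf_to_maf_alleles` is hypothetical, and the coordinate convention is an assumption based on common MAF usage: deletions span the deleted bases, and insertions are reported between the two flanking bases. Shared leading bases are stripped first, then shared trailing bases.

```python
def vcf_to_maf_alleles(pos, ref, alt):
    """Return (start, end, ref, alt) in MAF style for one VCF ref/alt pair."""
    if ref == alt:
        raise ValueError("ref and alt are identical")
    # Strip bases shared at the start of both alleles.
    k = 0
    while k < min(len(ref), len(alt)) and ref[k] == alt[k]:
        k += 1
    ref, alt = ref[k:], alt[k:]
    # Strip bases shared at the end of both alleles.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    if not ref:
        # Insertion: reported between the two flanking reference bases.
        start, end = pos + k - 1, pos + k
    else:
        # Deletion or substitution: spans the affected reference bases.
        start = pos + k
        end = start + len(ref) - 1
    return start, end, ref or "-", alt or "-"


print(vcf_to_maf_alleles(100, "TA", "T"))  # (101, 101, 'A', '-')
print(vcf_to_maf_alleles(100, "C", "T"))   # (100, 100, 'C', 'T')
```

SNPs pass through unchanged, which matches the note that only indels are affected.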

VCF to MAF: Seems to be ignoring altAlleleSeen when generating MAF, resulting in extraneous mutations in the MAF

enable renaming of log file

It seems to always use oncotator.log, is that right? I'd like to pick my own name to avoid overwriting existing logs.

GENCODE Support

Make GENCODE the default for Oncotator instead of GAF 3.0 for hg19.

This also includes better support for generic transcript datasources.

VCFIn --> TCGA MAF out ... Start position is higher than end position

Oncotator.py -v -i VCF /bulk/vcf_oncotest/bcgsc.ca.TCGA-BJ-A0YZ.bcgsc.ca.1.0.0.snv.vcf /bulk/vcf_oncotest/bcgsc.ca.TCGA-BJ-A0YZ.bcgsc.ca.1.0.0.snv.out.maf.annotated hg19 --db-dir=/home/lichtens/oncotator_ds

Input file is:

/xchip/cga2/mara/projects/thca/cross_center_comparison/bcgsc.ca_THCA.IlluminaHiSeq_DNASeq.Level_2.1.0.0/bcgsc.ca.TCGA-BJ-A0YZ.bcgsc.ca.1.0.0.snv.vcf

Tabix indexed TSV/VCF issue

Provide option for averaging, pipe delimiter and exact match for mutation.

Incorrect variant annotation

Oncotator annotates the following SNPs (chr8:86126753, chr1:109472596, and chr4:190903677) as Splice even though they lie in the UTR region.

Comments in MAF file break maf -> vcf

Many maf files have comment lines before the header, delineated by pound signs.

These files cause a crash when run in maf -> vcf mode.
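One way to tolerate such files is to drop comment and blank lines before handing the rest to a tab-delimited reader. This is a sketch in Python 3 using only the standard library; `read_maf_rows` is a hypothetical helper, not part of Oncotator, and the sample content is made up.

```python
import csv
import io


def read_maf_rows(handle):
    """Parse a MAF-like TSV, ignoring '#' comment lines and blank lines."""
    data_lines = (
        line for line in handle
        if line.strip() and not line.lstrip().startswith("#")
    )
    return list(csv.DictReader(data_lines, delimiter="\t"))


sample = io.StringIO(
    "#version 2.4\n"
    "## free-text comment\n"
    "\n"
    "Chromosome\tStart_position\n"
    "20\t14370\n"
)
rows = read_maf_rows(sample)
print(rows)
```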

Option so that altAlleleSeen of False will skip annotation

Currently, mutations whose altAlleleSeen annotation is False are not rendered in TCGA MAF output. However, they are still annotated, which can waste a lot of time.

Move the option to skip these mutations into the Annotator.
Add option to the command line (if option is specified, skip the altAlleleSeen==false mutations)
If output is VCF and option is specified, throw a warning.

Defaults attributes in MutationData

Accessing attributes (chr, start, end, ref_allele, alt_allele, and build) for a given MutationData instance mut via lookup (i.e., mut["ref_allele"]) is a BAD idea. This is primarily because I am modifying the default attributes that are used to instantiate mut via the dot operator.

Oncotator hangs when there are no multicore datasources, but multicore was specified.

If there are no multicore datasources, then Oncotator should simply not call the multicore datasource initialization code.

MAFLITE to VCF: 'NoneType' has no attribute group

This error comes up when one (or more) of the comment lines is blank. Solution is to eliminate these from the output VCF.

VcfInputMutationCreator null exception

The VcfInputMutationCreator fails when attempting to parse a vcf file that has no sample names but does have an INFO column.

Caching needs to know about a TranscriptProvider datasource mode.

Since the entire mutation is cached, we need to know whether a given TP ds is in EFFECT, CANONICAL, etc. This has to go into the key.
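A sketch of what such a key could look like (Python 3). The function name `make_cache_key` and the key layout are assumptions; the point is only that the transcript mode is one of the components, so cached results from one mode are never served for another.

```python
def make_cache_key(chrom, start, end, ref_allele, alt_allele, tx_mode):
    """Build a cache key that is distinct per transcript-provider mode."""
    parts = (chrom, start, end, ref_allele, alt_allele, tx_mode)
    return "_".join(str(p) for p in parts)


key_canonical = make_cache_key("12", 25398281, 25398281, "C", "T", "CANONICAL")
key_effect = make_cache_key("12", 25398281, 25398281, "C", "T", "EFFECT")
print(key_canonical)
print(key_effect)
```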

ENSEMBL support

ENSEMBL transcript datasource that can be initialized for an arbitrary genome build. This will mostly be used for mouse.

Oncotator will produce Non-Coding_Transcript variant classification for MAF, though that is invalid

Oncotator should have a variant classification of RNA instead of Non-coding_Transcript for MAF output.

vcf->vcf annotation should add a line to the header about Oncotator

1.0.0 rc31
Running oncotator vcf->vcf should add a line to the header about Oncotator, to track the provenance of the file.

Support GAF 2.1

Due to some rather large issues with GAF 3.0, make sure that Oncotator v1 can support GAF 2.1.

alt_allele1 and alt_allele2 confusion in Maflite IC

Oncotator should choose, among all fields that can be construed as alt_allele, the alternate that differs from the reference, rather than always choosing alt_allele1.
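A minimal sketch of that selection rule (Python 3). `pick_alt_allele` is a hypothetical helper; falling back to the first candidate when none differs from the reference is an assumption, not documented behavior.

```python
def pick_alt_allele(ref_allele, candidates):
    """Return the first candidate allele that differs from the reference."""
    for allele in candidates:
        if allele and allele != ref_allele:
            return allele
    # Nothing differs from the reference: fall back to the first candidate.
    return candidates[0] if candidates else None


print(pick_alt_allele("G", ["G", "A"]))  # A
print(pick_alt_allele("G", ["A", "A"]))  # A
```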

Current master branch version crashes if --default_config and --override_config are not provided

If --default_config and --override_config are not provided, then the default values are None, which causes a TypeError at lines 192 and 200 in Oncotator.py.

Distinguishing between indels in dbSNP and indels overlapping dbSNP sites

A field distinguishing whether an indel is actually IN dbSNP or just overlaps dbSNP sites would be useful.

vcf db annotation for indels

Indels aren't properly annotated from the vcf db, especially in cases where there are multiple alternates.

Incorrect Cosmic annotations due to datasource having negative strand nucleotide alleles

The new Cosmic datasource uses the Generic_GenomicMutation_Datasource class, which matches on genomic position plus reference and alt alleles. The Cosmic datasource reports negative strand nucleotide alleles for genes on the negative strand. Thus, mutations within negative strand genes will fail to annotate, because positive strand alleles are used in the lookup while the datasource holds negative strand alleles. Common KRAS mutations such as p.G12D (chr12:25398281-25398281 C>T) are missing Cosmic annotations because of this.
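One possible remedy is to convert negative strand alleles to the positive strand when the datasource is built, so that lookups by genomic ref/alt succeed. The sketch below (Python 3) uses hypothetical helper names; it assumes the strand of each entry is known and encoded as "+" or "-".

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a nucleotide string; '-' placeholders pass through."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_positive_strand(ref, alt, strand):
    """Express a ref/alt pair on the positive strand."""
    if strand == "-":
        return reverse_complement(ref), reverse_complement(alt)
    return ref, alt


# A negative strand G>A corresponds to C>T on the positive strand,
# the form used in the KRAS example above.
print(to_positive_strand("G", "A", "-"))  # ('C', 'T')
```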

VCF In --> TCGA MAF out ... Double mutations for Strelka VCFs

When a vcf has tumor-normal pairs and no GT information, oncotator is representing the tumor and normal as two separate independent samples. This means that each mutation is repeated twice ... once for the tumor and once for the normal.

The SAMPLE (or TUMOR or NORMAL) header could be used to detect this situation and correct it.

GENCODE/ENSEMBL datasource genome position and gene indices should have values of transcript_id

Currently, those two indices return full blown Transcript instances.

Instead, those should return transcript ids and then the transcript ids can be used to retrieve the Transcripts from the basic transcript index attribute in the EnsemblDatasrouce

Duplicate annotation in dbSNP datasource error.

This happens when the db snp columns are populated in the input already.

$ oncotator -o VCF -i MAFLITE /xchip/cga_home/mara/projects/kdb/test.maf output.vcf hg19

error:
2013-11-21 09:56:19,352 WARNING [oncotator.utils.ConfigUtils:186] Could not find config file (maflite_input.config). Trying configs/ prepend.
2013-11-21 09:56:19,364 WARNING [oncotator.utils.ConfigUtils:186] Could not find config file (vcf.out.config). Trying configs/ prepend.

2013-11-21 09:57:34,622 WARNING [oncotator.MutationData:119] Attempting to create an annotation multiple times, but with the same value: gene : WASH7P
Traceback (most recent call last):
File "/xchip/tcga/Tools/oncotator/onco_env/ubin/oncotator", line 9, in <module>
load_entry_point('Oncotator==v1.0.0.0rc33', 'console_scripts', 'oncotator')()
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/Oncotator.py", line 224, in main
annotator.annotate()
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/Annotator.py", line 238, in annotate
filename = self._outputRenderer.renderMutations(mutations, metadata=metadata, comments=comments)
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/output/VcfOutputRenderer.py", line 248, in renderMutations
for mutation in mutations:
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/Annotator.py", line 247, in _applyManualAnnotations
for m in mutations:
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/Annotator.py", line 255, in _applyDefaultAnnotations
for m in mutations:
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/Annotator.py", line 302, in _annotate_mutations_using_datasources
m = datasource.annotate_mutation(m)
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/datasources.py", line 1019, in annotate_mutation
mutation.createAnnotation(header, set(), self.title)
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/MutationData.py", line 121, in createAnnotation
raise DuplicateAnnotationException('Attempting to create an annotation multiple times (' + annotationName + ') with old, new values of (' + str(self.annotations[annotationName].value) + ", " + str(annotationValue) + ")")
oncotator.DuplicateAnnotationException.DuplicateAnnotationException: 'Attempting to create an annotation multiple times (dbSNP_RS) with old, new values of (, set([]))'

Reference tag in input vcf causes a failure

The reference tag gets rendered as an OrderedDict. Then, the getComments method on the VcfInputCreator is unable to render this as a comment.

##reference=<ID=hg19,Source=http://www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg19/1000genomes/bwa_ind/genome/GRCh37-lite.fa>

File "/bulk/oncotator_git_pycharm/oncotator/Oncotator.py", line 175, in main annotator.annotate()
File "/bulk/oncotator_git_pycharm/oncotator/Annotator.py", line 198, in annotate
comments = self._createComments()
File "/bulk/oncotator_git_pycharm/oncotator/Annotator.py", line 168, in _createComments
comments = self._inputCreator.getComments()
File "/bulk/oncotator_git_pycharm/oncotator/input/VcfInputMutationCreator.py", line 375, in getComments
File "/usr/lib/python2.7/string.py", line 318, in join
return sep.join(words)
TypeError: sequence item 1: expected string, OrderedDict found

As a temporary workaround, the following code was added to the master branch (VcfInputMutationCreator line 375), but (please confirm that) this workaround will make reconstruction of the reference header impossible:
if isinstance(val, dict): val = string.join(map(str, val), ";")
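On the question of reconstruction: iterating over a dict yields only its keys, so the workaround above joins the keys (ID;Source in this example) and discards the values. A sketch of a rendering that keeps both is shown below (Python 3). `render_meta_value` is a hypothetical helper, and the `<key=value,...>` layout simply mirrors the ##reference line quoted above.

```python
from collections import OrderedDict


def render_meta_value(val):
    """Render a parsed header value back into text, keeping keys and values."""
    if isinstance(val, dict):
        pairs = ",".join("%s=%s" % (k, v) for k, v in val.items())
        return "<" + pairs + ">"
    return str(val)


reference = OrderedDict([("ID", "hg19"), ("Source", "http://example.org/genome.fa")])
print(";".join(map(str, reference)))  # ID;Source  (values are lost)
print(render_meta_value(reference))
```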

Set default FORMAT fields for MAF -> VCF

@elephanthunter
@marawr

Here's a few fields that I think should be FORMAT by default. I'm sure there are many others though. People should chime in with their favorite fields. (I just don't know what a number of the fields are, so it's hard to say if they are sample specific or site specific.)

i_tumor_f
i_init_t_lod
i_t_lod_fstar
t_alt_count
t_ref_count
i_judgement

Oncotator will display an extraneous (and incorrect) log message that caching has been disabled even when it is enabled.

This is a minor issue.

The erroneous log message appears, but a later log message will indicate that the caching has been initialized with user parameters.

This is due to the CacheManager constructor initializing a DummyCache.

Fix is to simply move log message into the CacheManager and out of DummyCache.

IGR upstream downstream not rendering properly in GENCODE/ENSEMBL datasource

This is immediately obvious when looking at an IGR mutation.

The port of this code from the GAF datasource is not complete.

No CLI testing in the CI server

There is no testing of the CLI in the CI server (run_ci_tests.sh) and if an error is in Oncotator.py, it will not turn up until the command line is run.

The solution is to create another test script (run_ci_cli_test.sh) that calls some simple Oncotator commands (oncotator --help, included) and make sure that none return a non-zero code. This can then be added to the tasks in the ci server.
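The proposal above is a shell script; the same check is sketched here in Python 3 instead, as an alternative. `failing_commands` is a hypothetical helper that runs each command and reports any that exit non-zero or cannot be started.

```python
import subprocess


def failing_commands(commands):
    """Run each command (a list of arguments) and collect the failures.

    Returns a list of (command, return code) pairs; the return code is
    None when the executable could not be found.
    """
    failures = []
    for cmd in commands:
        try:
            result = subprocess.run(
                cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
        except FileNotFoundError:
            failures.append((cmd, None))
            continue
        if result.returncode != 0:
            failures.append((cmd, result.returncode))
    return failures
```

A CI task could call `failing_commands([["oncotator", "--help"]])` and fail the build whenever the returned list is not empty.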

Quick check for valid maflite and/or vcf input files

Currently, all datasources are loaded before anything is read from the input file. This is annoying when an invalid input file is specified, since the user must wait for the datasource init only to have the oncotator run fail. This has come up a couple of times from beta users and Appistry QA.

At the very least:

1. Check that there is a valid maflite: a proper header exists and the necessary columns (or their aliases) are present.
2. The same as #1, but for VCF.
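A quick pre-flight check for the maflite case might look like the sketch below (Python 3). The required attribute names follow the MutationData attributes mentioned elsewhere in these issues; the alias sets and the helper name `missing_maflite_columns` are assumptions for illustration, not Oncotator's actual alias configuration.

```python
# Assumed aliases; the real alias list lives in Oncotator's config files.
REQUIRED_COLUMNS = {
    "chr": {"chr", "Chromosome"},
    "start": {"start", "Start_position"},
    "end": {"end", "End_position"},
    "ref_allele": {"ref_allele", "Reference_Allele"},
    "alt_allele": {"alt_allele", "Tumor_Seq_Allele1", "Tumor_Seq_Allele2"},
}


def missing_maflite_columns(lines):
    """Return the required columns that the header line does not provide."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines before the header
        header = set(line.rstrip("\r\n").split("\t"))
        return sorted(
            name for name, aliases in REQUIRED_COLUMNS.items()
            if not aliases & header
        )
    return sorted(REQUIRED_COLUMNS)  # no header line found at all
```

Running this before datasource initialization lets an invalid input fail within milliseconds instead of after the full datasource load.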

annotating a big vcf is slow

1.0.0 rc31
Using oncotator on a big vcf is too slow, mostly because operations on genotypes are performed on every genotype (and there are billions of those), while really only the INFO field is of interest (there are many orders of magnitude fewer of those).

So my current "solution" is this:

subset the vcf to sites only (actually due to a bug in rc31 subset to 1 sample)
annotate that
put the annotated vcf back together with a bit of grep cut and paste

Can you make this into an officially supported script, or use this "trick" under the hood to make oncotator work more efficiently on vcf->vcf?

MAF to VCF Output should crash (in VCFOutput) if unable to find preceding_bases nor ref_context annotations

This makes it impossible to render a VCF properly for indels. Oncotator (in VCF Output Renderer) should exit gracefully with an informative error message that Oncotator is misconfigured and the ref_hg datasource should be included if any indels are in the input MAF.

Oncotated files contain non-ascii characters which cause confusing issues downstream

I noticed that oncotator outputs annotations which include accented words like Sjögren syndrome. While it's admirable to accent things properly, I'm not sure it's a good idea. A lot of unix programs don't understand these characters, and it can cause very confusing issues. R for instance happily reads in the file, but silently mangles it.
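If plain ASCII output is wanted, one option is to decompose accented characters and drop the combining marks. This is a sketch using Python 3's standard `unicodedata` module; `to_ascii` is a hypothetical helper, and note that characters with no ASCII decomposition are removed outright rather than transliterated.

```python
import unicodedata


def to_ascii(text):
    """Strip accents by decomposing characters and dropping non-ASCII marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Sjögren syndrome"))  # Sjogren syndrome
```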

use reference for annotating with indexed tsv datasources

Tabix indexed VCF issue

ESP annotations from a tabix indexed VCF are applied on a per-variant basis rather than a per-alt-allele basis.

Incorrect annotation from VCF input

Oncotator is annotating the following vcf input incorrectly:

INPUT:

CHROM POS ID REF ALT

chr1 1645838 rs33938712 CAA CA,AA

OUTPUT:

Chromosome Start_position End_position Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2
1 1645839 1645838 DEL AA AA -
1 1645839 1645838 DEL AA AA -

Correct output should be:

Chromosome Start_position End_position Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2
1 1645840 1645840 DEL A A -
1 1645838 1645838 DEL C C -

This bug is observed even if create_oncotator_venv.sh is used prior to Oncotator installation.