oncotator's Introduction

Oncotator Is No Longer Supported or Maintained
Funcotator is a new Functional Annotation tool (the spiritual successor to Oncotator). It is:
  • Better: many bugs have been fixed and edge cases have been improved.
  • Faster: annotate more variants in less time
  • Easier to use and deploy: a single jar with no tricky installation or dependencies, and a tool for fetching the datasources
Funcotator is available as part of the GATK toolkit and works out of the box with both Mutect2 and HaplotypeCaller for somatic and germline annotation, respectively.

It also has a Featured Workspace on Terra.

A Funcotator tutorial, a full comparison of Funcotator and Oncotator, and other helpful information can be found on the GATK website here:

https://gatk.broadinstitute.org/hc/en-us/articles/360035889931-Funcotator-Information-and-Tutorial

The github repository for GATK and Funcotator can be found here:

https://github.com/broadinstitute/gatk


======================
Oncotator

License

Oncotator is free for non-profit users. Please see the LICENSE file here for more information.

Package Overview

The name of the directory, oncotator, is also the name of the distribution. This distribution contains the oncotator package.

For more information: http://www.broadinstitute.org/cancer/cga/oncotator

This distribution is the standalone version of Oncotator. If you wish to use the web interface: http://www.broadinstitute.org/oncotator

Please note that the web interface uses an older version of Oncotator and older datasources.

All documentation can be found in the Oncotator forums: http://gatkforums.broadinstitute.org/categories/oncotator

Installation

Windows is currently unsupported because a dependency, pysam, does not support Windows.

IMPORTANT: You will need root access to your python interpreter or a python virtual environment. More information about virtual environments can be found on the following site: https://pypi.python.org/pypi/virtualenv

As a reminder, virtualenv.py can be run as a standalone script, thereby bypassing superuser requirements. Please see the above link for more details.

We recommend installing pyvcf and numpy manually before attempting the Oncotator install. You may need to prepend each of the following commands with sudo:

$ pip install numpy
$ pip install pyvcf

This distribution is installable through the standard setup.py method. Note that Distribute will be installed as part of the setup process if it isn't already:

$ python setup.py install

Because setup.py specifies console script entry points, oncotator and initializeDatasource will be installed into your Python's bin/ directory.

Unit Test Setup

NOTE: Unit tests require a minimum of 4GB to run.

Before running the unit tests for the first time, please perform the following steps:

  1. Execute the following three lines in the same directory as setup.py:

    $ mkdir -p out
    $ ln -s test/configs configs
    $ ln -s test/testdata testdata
  2. Many unit tests rely on having the standard set of hg19 datasources, which are in a separate download. To point the unit testing framework to your datasources, you must create a personal test config:

    $ cp configs/personal-test.config.template configs/personal-test.config
    In configs/personal-test.config, replace dbDir=MY_DB_DIR/ with dbDir= followed by the appropriate path to your Oncotator datasource directory.

Running the Automated Unit Tests (with Virtual Env Creation)

The automated unit tests (run_ci_tests.sh) require 6 GB to run. This can take a fair amount of time (~20 minutes), since a full install into a new virtual environment is performed.

Execute the following line in the same directory as setup.py (provide the appropriate path to the db dir with your datasources):

$ bash run_ci_tests.sh <DB_DIR>

Running the Automated Unit Tests (without Virtual Env Creation)

You can simply run the unit tests in the currently active python environment, which takes a lot less time (< 6 minutes) but requires all dependencies to be installed. You must first follow the instructions for Unit Test Setup above (Steps 1 and 2), if not already performed. Then run (in the same directory as setup.py):

$ nosetests --all-modules --exe -w test -v --processes=4 --process-timeout=480  --process-restartworker

Please note that there is a known bug with --processes and output to XML. If you alter the above nosetests command to include junit xml (--with-xunit), remove the last three options (`--processes=4 --process-timeout=480 --process-restartworker`). This will cause tests to only run on one core.

Creating a Virtual Environment for Running Oncotator

Follow these steps from the same directory as setup.py. The first command will take several minutes:

$ bash scripts/create_oncotator_venv.sh <venv_location>
$ source <venv_location>/bin/activate
$ python setup.py install

Version Information

Once Oncotator is installed, run it with the -V flag to get version information:

$ Oncotator -V

Git Process Starting with v1.0.0.0 (Developers)

For an overview of the oncotator process for adding features, bugfixes, and general day-to-day branching, please see: http://nvie.com/posts/a-successful-git-branching-model/

Help

Please post questions, issues, and feature requests in the forum at http://gatkforums.broadinstitute.org/categories/oncotator

oncotator's People

Contributors

alexramos, elephanthunter, jonn-smith, lbergelson, leetl1220, mgupta0704, rgarcia-herrera, scottfrazer

oncotator's Issues

Contig and alt issue

Contigs and alts in the header are parsed incorrectly, and thus, do not appear in the output file.

tsv file sorting needs to have RAM usage verified -- run on large file

We cannot close out the tsv file sorting in oncotator until a large file has been run and RAM usage monitored.

This may require a test that is not in the oncotator install package.

The maf file in the Kryuokov challenge would be a good one, but may need to have its comments and header trimmed.

This will also allow us to collect some timing information.

TypeError: 'NoneType' object is not iterable

1.0.0rc31
in here: /cga/tcga-gdac/germline/resources/esp6500SIdataV2/debugOncotator/

I run:
oncotator -i VCF -o VCF ESP6500SI-V2.snps_indels.head34.vcf ESP6500SI-V2.snps_indels.head34.oncotated.vcf hg19

and get this:

oncotator -i VCF -o VCF ESP6500SI-V2.snps_indels.head34.vcf ESP6500SI-V2.snps_indels.head34.oncotated.vcf hg19
2013-11-06 13:52:47,864 WARNING [oncotator.utils.ConfigUtils:186] Could not find config file (vcf.in.config). Trying configs/ prepend.
2013-11-06 13:52:47,871 WARNING [oncotator.utils.ConfigUtils:186] Could not find config file (vcf.out.config). Trying configs/ prepend.
Traceback (most recent call last):
File "/xchip/tcga/Tools/oncotator/onco_env/ubin/oncotator", line 9, in <module>
load_entry_point('Oncotator==v1.0.0.0rc31', 'console_scripts', 'oncotator')()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/Oncotator.py", line 224, in main
annotator.annotate()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/Annotator.py", line 236, in annotate
metadata = self._createMetadata()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/Annotator.py", line 195, in _createMetadata
metadata = self._inputCreator.getMetadata()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/input/VcfInputMutationCreator.py", line 460, in getMetadata
metadata = self._createMetadata()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/input/VcfInputMutationCreator.py", line 447, in _createMetadata
self._createConfigTable()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/input/VcfInputMutationCreator.py", line 251, in _createConfigTable
for sample in variant.samples:
TypeError: 'NoneType' object is not iterable

Maf => VCF should create Sample columns

Converting from a maf to a vcf should generate a vcf with a sample column for every sample in the maf.

Sample names would be expected in either the sampleName column or Tumor_Sample_Barcode column.

A configuration file would specify which fields should be treated as FORMAT fields instead of INFO fields.

All mutation lines in the maf that occur at the same position would be aggregated into one line in the vcf. Fields specified as FORMAT fields would be filled for each mutation in the appropriate sample column.

Example:

Certain fields probably need special handling; for example, dbSNP_RS should be used to populate the ID field.
Genotypes should be derived from the Tumor_Seq_Allele1 and Tumor_Seq_Allele2 fields.
config file (ignore the syntax if it doesn't match what exists already):

[FORMAT]
NORM: Matched_Norm_Sample_Barcode
DP: ReadDepth

example maf:

Chromosome  Start_position  End_position    Reference_Allele    Tumor_Seq_Allele1   Tumor_Seq_Allele2   dbSNP_RS    Tumor_Sample_Barcode    Matched_Norm_Sample_Barcode ReadDepth 
20  14370   14370   G   G   A   rs6054257   NA0001  NA0001-Normal   10
20  14370   14370   G   A   A   rs6054257   NA0002  NA0002-Normal   8
21  123090  123090  T   T   C       NA0001  NA0001-NORMAL   12

expected output vcf

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA0001  NA0002
20  14370   rs6054257   G   A   .   PASS    .   GT:DP:NORM  0/1:10:NA0001-Normal    1/1:8:NA0002-Normal
21  123090  .   T   C   .   PASS    .   GT:DP:NORM  0/1:12:NA0001-NORMAL    .:.:.
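
The aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Oncotator code: rows are plain dicts keyed by the MAF column names from the example, the function names are hypothetical, the alt allele is taken from Tumor_Seq_Allele2, and INFO is left as ".".

```python
from collections import OrderedDict


def genotype(ref, allele1, allele2):
    """Derive a GT string from the two tumor alleles (0 = ref, 1 = alt)."""
    return "/".join("0" if a == ref else "1" for a in (allele1, allele2))


def aggregate_maf_rows(rows, format_fields):
    """Collapse MAF rows that share a site into one VCF record line each.

    rows: list of dicts keyed by MAF column name.
    format_fields: ordered mapping of FORMAT key -> MAF column name.
    Returns (sample_names, record_lines).
    """
    samples = []
    sites = OrderedDict()
    for row in rows:
        sample = row["Tumor_Sample_Barcode"]
        if sample not in samples:
            samples.append(sample)
        ref = row["Reference_Allele"]
        alt = row["Tumor_Seq_Allele2"]  # assumption: allele 2 carries the alt
        key = (row["Chromosome"], int(row["Start_position"]), ref, alt)
        site = sites.setdefault(
            key, {"id": row.get("dbSNP_RS") or ".", "calls": {}}
        )
        values = [genotype(ref, row["Tumor_Seq_Allele1"], alt)]
        values += [str(row[col]) for col in format_fields.values()]
        site["calls"][sample] = ":".join(values)

    fmt = ":".join(["GT"] + list(format_fields))
    missing = ":".join(["."] * (len(format_fields) + 1))
    lines = []
    for (chrom, pos, ref, alt), site in sites.items():
        calls = [site["calls"].get(s, missing) for s in samples]
        fixed = [chrom, str(pos), site["id"], ref, alt, ".", "PASS", ".", fmt]
        lines.append("\t".join(fixed + calls))
    return samples, lines
```

Samples without a mutation at a site get a fully missing call (".:.:."), which matches the second record in the expected output.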

DEL encompassing entire exon should be a new variant classification

Currently, if a deletion overlaps a splice site in any way (even if it fully encompasses the splice site), the variant classification (vc) will be Splice_Site.

However, if the deletion removes one or more entire exons, there should be a new variant classification for this. Currently, such a deletion just shows up as a splice site.
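
A minimal sketch of the proposed check, assuming 1-based inclusive coordinates and exons given as (start, end) tuples. The classification name "Exon_Deletion", the two-base splice window, and the function names are placeholders for illustration, not existing Oncotator values.

```python
def deleted_exons(del_start, del_end, exons):
    """Return the exons that lie entirely inside the deleted interval."""
    return [(s, e) for (s, e) in exons if del_start <= s and e <= del_end]


def classify_deletion(del_start, del_end, exons, splice_window=2):
    """Classify a deletion against a transcript's exons.

    Whole-exon loss is checked first so that it is no longer reported
    as a plain splice-site hit.
    """
    if deleted_exons(del_start, del_end, exons):
        return "Exon_Deletion"  # hypothetical new classification
    for exon_start, exon_end in exons:
        for boundary in (exon_start, exon_end):
            # Does the deletion touch the window around this exon boundary?
            if (del_start <= boundary + splice_window
                    and del_end >= boundary - splice_window):
                return "Splice_Site"
    return "Other"
```

Checking for whole-exon loss before the splice-site test is the design point: any deletion that removes an exon necessarily overlaps its splice sites, so the order of the checks decides the result.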

Adding DNP, TNP, MNP logic back into Oncotator

While technically not correct, we would like Oncotator to fold neighboring SNPs into DNP, TNP, or MNP. Though this should be done in the mutation caller, we can assume that all neighboring SNPs are xNP.

This should perhaps be a pre- or post-processing step, but that is open for discussion.
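
As a starting point for that discussion, here is a sketch of the folding step as a standalone pre- or post-processing function. It assumes SNPs arrive as (chrom, pos, ref, alt) tuples for a single sample and that every run of adjacent positions should be merged; the function name and tuple layout are illustrative only.

```python
def fold_adjacent_snps(snps):
    """Merge SNPs at consecutive positions into DNP, TNP, or MNP records.

    snps: iterable of (chrom, pos, ref, alt) single-base substitutions.
    Returns a list of (chrom, start, end, ref, alt, variant_type).
    """
    names = {1: "SNP", 2: "DNP", 3: "TNP"}
    merged = []
    for chrom, pos, ref, alt in sorted(snps):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and last[2] + 1 == pos:
            # Extend the current run by one base.
            merged[-1] = [chrom, last[1], pos, last[3] + ref, last[4] + alt]
        else:
            merged.append([chrom, pos, pos, ref, alt])
    # Runs longer than three bases are labeled MNP.
    return [tuple(m) + (names.get(len(m[3]), "MNP"),) for m in merged]
```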

initializeDatasource doc updates

The documentation has become stale, especially in the usage examples, given the latest merge. These need to be updated.

Also, examples using indexed tsv and indexed vcf would be appreciated. A new user needs to be able to put in initializeDatasource --help and be on their way.

"too many values to unpack"

1.0.0rc31

running here: /cga/tcga-gdac/germline/callingWith1kg/ver4/calling/data/debugOncotator/

oncotator -v -i VCF -o VCF kiezun_cancer_germline.sites_only.head146.vcf kiezun_cancer_germline.sites_only.head146.oncotated.vcf hg19

I get this:
Verbose mode on
Path:
['/xchip/tcga/Tools/oncotator/onco_env/bin', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/pip-1.2.1-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/distribute-0.6.15-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Cython-0.17.4-py2.7-linux-x86_64.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/biopython-1.60-py2.7-linux-x86_64.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/pandas-0.10.0-py2.7-linux-x86_64.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/pytz-2012j-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/python_dateutil-2.1-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/six-1.2.0-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/SQLAlchemy-0.8.0b2-py2.7-linux-x86_64.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/shove-0.5.6-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/stuf-0.9.4-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/futures-2.1.3-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/parse-1.4.1-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/python_memcached-1.53-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/nose-1.3.0-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/bcbio_gff-0.2-py2.7.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/pysam-0.7.5-py2.7-linux-x86_64.egg', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg', '/broad/software/free/Linux/redhat_5_x86_64/pkgs/mercurial_1.9.3-python-2.7.1-sqlite3-rtrees/lib/python2.7/site-packages', 
'/broad/software/free/Linux/redhat_5_x86_64/pkgs/matplotlib_1.1.1-python-2.7.1-sqlite3-rtrees/lib/python2.7/site-packages', '/xchip/tcga/Tools/oncotator/onco_env/lib/python27.zip', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/plat-linux2', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/lib-tk', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/lib-old', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/lib-dynload', '/broad/software/free/Linux/redhat_5_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7', '/broad/software/free/Linux/redhat_5_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7/plat-linux2', '/broad/software/free/Linux/redhat_5_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7/lib-tk', '/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages']

2013-11-06 12:46:27,540 INFO [oncotator.Oncotator:178] Args: Namespace(cache_url=None, dbDir='/xchip/cga/reference/annotation/db/oncotator_v1_ds/', default_cli=[], default_config=None, genome_build='hg19', input_file='kiezun_cancer_germline.sites_only.head146.vcf', input_format='VCF', noMulticore=False, output_file='kiezun_cancer_germline.sites_only.head146.oncotated.vcf', output_format='VCF', override_cli=[], override_config=None, read_only_cache=False, skip_no_alt=False, tx_mode='CANONICAL', verbose=1)
2013-11-06 12:46:27,540 INFO [oncotator.Oncotator:179] Log file: /cga/tcga-gdac/germline/callingWith1kg/ver4/calling/data/debugOncotator/oncotator.log
Traceback (most recent call last):
File "/xchip/tcga/Tools/oncotator/onco_env/ubin/oncotator", line 9, in <module>
load_entry_point('Oncotator==v1.0.0.0rc31', 'console_scripts', 'oncotator')()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/Oncotator.py", line 220, in main
is_skip_no_alts=is_skip_no_alts)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/utils/OncotatorCLIUtils.py", line 328, in create_run_spec
inputCreator = OncotatorCLIUtils.create_input_creator(inputFilename, inputFormat)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/utils/OncotatorCLIUtils.py", line 296, in create_input_creator
inputCreator = inputCreatorDict[inputFormat][0](inputFilename, inputConfig)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc31-py2.7.egg/oncotator/input/VcfInputMutationCreator.py", line 86, in __init__
self.vcf_reader = vcf.Reader(filename=self.filename, strict_whitespace=True)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/vcf/parser.py", line 225, in __init__
self._parse_metainfo()
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/vcf/parser.py", line 267, in _parse_metainfo
key, val = parser.read_meta(line)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/vcf/parser.py", line 166, in read_meta
return self.read_meta_hash(meta_string)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/site-packages/vcf/parser.py", line 161, in read_meta_hash
val = OrderedDict(item.split("=") for item in hashItems)
File "/broad/software/free/Linux/redhat_5_x86_64/pkgs/python_2.7.1-sqlite3-rtrees/lib/python2.7/collections.py", line 74, in __init__
self.update(*args, **kwds)
File "/xchip/tcga/Tools/oncotator/onco_env/lib/python2.7/_abcoll.py", line 499, in update
for key, value in other:
ValueError: too many values to unpack
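
The traceback ends in `item.split("=")` feeding key/value pairs to OrderedDict. One plausible cause, offered as an assumption because the offending header line is not shown, is a metadata value that itself contains "=": the split then yields more than two pieces. The sketch below reproduces that failure mode and shows a split limited to the first "="; the function names and sample input are hypothetical.

```python
def parse_meta_items_naive(items):
    """Mimics the failing pattern: every '=' in the item splits it."""
    pairs = []
    for item in items:
        key, val = item.split("=")  # ValueError if the value contains '='
        pairs.append((key, val))
    return pairs


def parse_meta_items(items):
    """Split on the first '=' only, so later '=' stay in the value."""
    pairs = []
    for item in items:
        key, val = item.split("=", 1)
        pairs.append((key, val))
    return pairs
```

For example, `parse_meta_items(['ID=X', 'Description="a=b"'])` keeps `"a=b"` intact, while the naive version raises ValueError on the same input.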

enable renaming of log file

It seems to always use oncotator.log, is that right? I'd like to pick my own name to avoid overwriting existing logs.
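
One way this could look, sketched with the standard argparse and logging modules. The `--log_name` flag, the logger name, and the function names are assumptions made for illustration; they are not taken from the Oncotator command line.

```python
import argparse
import logging


def build_parser():
    """Parser with a hypothetical option for choosing the log file name."""
    parser = argparse.ArgumentParser(prog="oncotator-sketch")
    parser.add_argument(
        "--log_name",
        default="oncotator.log",
        help="Path of the log file (default keeps the current behavior).",
    )
    return parser


def configure_file_logging(log_name):
    """Attach a file handler that writes to the user-chosen log file."""
    logger = logging.getLogger("oncotator_sketch")
    handler = logging.FileHandler(log_name)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s [%(name)s:%(lineno)d] %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Keeping oncotator.log as the default means existing scripts would behave as before.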

GENCODE Support

Make GENCODE the default for Oncotator instead of GAF 3.0 for hg19.

This also includes better support for generic transcript datasources.

VCFIn --> TCGA MAF out ... Start position is higher than end position

Oncotator.py -v -i VCF /bulk/vcf_oncotest/bcgsc.ca.TCGA-BJ-A0YZ.bcgsc.ca.1.0.0.snv.vcf /bulk/vcf_oncotest/bcgsc.ca.TCGA-BJ-A0YZ.bcgsc.ca.1.0.0.snv.out.maf.annotated hg19 --db-dir=/home/lichtens/oncotator_ds

Input file is:

/xchip/cga2/mara/projects/thca/cross_center_comparison/bcgsc.ca_THCA.IlluminaHiSeq_DNASeq.Level_2.1.0.0/bcgsc.ca.TCGA-BJ-A0YZ.bcgsc.ca.1.0.0.snv.vcf

Incorrect variant annotation

Oncotator annotates the following SNPs (chr8:86126753, chr1:109472596, and chr4:190903677) as Splice even though they lie in the UTR region.

Option so that altAlleleSeen of False will skip annotation

Currently, mutations whose altAlleleSeen annotation is False are not rendered in TCGA MAF output. However, they are still annotated, which can waste a lot of time.

  1. Move the option to skip these mutations into the Annotator.
  2. Add option to the command line (if option is specified, skip the altAlleleSeen==false mutations)
  3. If output is VCF and option is specified, throw a warning.
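
A sketch of steps 1 and 3, using plain dicts as stand-ins for MutationData instances. The function names, the `enabled` parameter, and the warning text are hypothetical.

```python
import logging


def skip_alt_allele_not_seen(mutations, enabled=True):
    """Lazily drop mutations with altAlleleSeen == False before annotation."""
    for mut in mutations:
        seen = str(mut.get("altAlleleSeen", "True")).strip().lower()
        if enabled and seen == "false":
            continue  # never reaches the datasources, so no time is spent on it
        yield mut


def warn_if_vcf_output(output_format, enabled):
    """Step 3: warn when the skip option is combined with VCF output."""
    if enabled and output_format.upper() == "VCF":
        logging.warning("Skipping altAlleleSeen==False mutations with VCF output.")
        return True
    return False
```

Because the filter is a generator, it can wrap the mutation stream in front of the annotation loop without loading the whole input.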

Defaults attributes in MutationData

Accessing attributes (chr, start, end, ref_allele, alt_allele, and build) for a given MutationData instance mut via lookup (i.e., mut["ref_allele"]) is a BAD idea. This is primarily because I am modifying the default attributes that are used to instantiate mut via the dot operator.

ENSEMBL support

ENSEMBL transcript datasource that can be initialized for an arbitrary genome build. This will mostly be used for mouse.

Support GAF 2.1

Due to some rather large issues with GAF 3.0, make sure that Oncotator v1 can support GAF 2.1.

Incorrect Cosmic annotations due to datasource having negative strand nucleotide alleles

The new Cosmic datasource uses the Generic_GenomicMutation_Datasource class, which matches on genomic position plus reference and alt alleles. The Cosmic datasource reports negative strand nucleotide alleles for genes on the negative strand. Mutations within negative strand genes therefore fail to annotate, because positive strand alleles are used in the lookup while the datasource holds negative strand alleles. Common KRAS mutations such as p.G12D (chr12:25398281-25398281 C>T) are missing Cosmic annotations because of this.
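
To illustrate the mismatch: the lookup would need to reverse-complement the alleles for genes on the negative strand before matching. The helper names below are hypothetical, and whether the fix belongs in the lookup or in the datasource build is left open.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a nucleotide string; '-' is left unchanged."""
    return seq.translate(_COMPLEMENT)[::-1]


def alleles_for_lookup(ref, alt, gene_strand):
    """Return alleles in the orientation the datasource is assumed to use."""
    if gene_strand == "-":
        return reverse_complement(ref), reverse_complement(alt)
    return ref, alt
```

For the KRAS example, the positive strand C>T becomes G>A on the negative strand, which is what the lookup would have to match.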

VCF In --> TCGA MAF out ... Double mutations for Strelka VCFs

When a vcf has tumor-normal pairs and no GT information, oncotator is representing the tumor and normal as two separate independent samples. This means that each mutation is repeated twice ... once for the tumor and once for the normal.

The SAMPLE (or TUMOR or NORMAL) header could be used to detect this situation and correct it.

Duplicate annotation in dbSNP datasource error.

This happens when the dbSNP columns are already populated in the input.

$ oncotator -o VCF -i MAFLITE /xchip/cga_home/mara/projects/kdb/test.maf output.vcf hg19

error:
2013-11-21 09:56:19,352 WARNING [oncotator.utils.ConfigUtils:186] Could not find config file (maflite_input.config). Trying configs/ prepend.
2013-11-21 09:56:19,364 WARNING [oncotator.utils.ConfigUtils:186] Could not find config file (vcf.out.config). Trying configs/ prepend.

2013-11-21 09:57:34,622 WARNING [oncotator.MutationData:119] Attempting to create an annotation multiple times, but with the same value: gene : WASH7P
Traceback (most recent call last):
File "/xchip/tcga/Tools/oncotator/onco_env/ubin/oncotator", line 9, in <module>
load_entry_point('Oncotator==v1.0.0.0rc33', 'console_scripts', 'oncotator')()
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/Oncotator.py", line 224, in main
annotator.annotate()
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/Annotator.py", line 238, in annotate
filename = self._outputRenderer.renderMutations(mutations, metadata=metadata, comments=comments)
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/output/VcfOutputRenderer.py", line 248, in renderMutations
for mutation in mutations:
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/Annotator.py", line 247, in _applyManualAnnotations
for m in mutations:
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/Annotator.py", line 255, in _applyDefaultAnnotations
for m in mutations:
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/Annotator.py", line 302, in _annotate_mutations_using_datasources
m = datasource.annotate_mutation(m)
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/datasources.py", line 1019, in annotate_mutation
mutation.createAnnotation(header, set(), self.title)
File "/xchip/tcga/Tools/oncotator/onco_env2/lib/python2.7/site-packages/Oncotator-v1.0.0.0rc33-py2.7.egg/oncotator/MutationData.py", line 121, in createAnnotation
raise DuplicateAnnotationException('Attempting to create an annotation multiple times (' + annotationName + ') with old, new values of (' + str(self.annotations[annotationName].value) + ", " + str(annotationValue) + ")")
oncotator.DuplicateAnnotationException.DuplicateAnnotationException: 'Attempting to create an annotation multiple times (dbSNP_RS) with old, new values of (, set([]))'

Reference tag in input vcf causes a failure

The reference tag gets rendered as an OrderedDict. Then, the getComments method on the VcfInputCreator is unable to render this as a comment.

##reference=<ID=hg19,Source=http://www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg19/1000genomes/bwa_ind/genome/GRCh37-lite.fa>

File "/bulk/oncotator_git_pycharm/oncotator/Oncotator.py", line 175, in main annotator.annotate()
File "/bulk/oncotator_git_pycharm/oncotator/Annotator.py", line 198, in annotate
comments = self._createComments()
File "/bulk/oncotator_git_pycharm/oncotator/Annotator.py", line 168, in _createComments
comments = self._inputCreator.getComments()
File "/bulk/oncotator_git_pycharm/oncotator/input/VcfInputMutationCreator.py", line 375, in getComments
File "/usr/lib/python2.7/string.py", line 318, in join
return sep.join(words)
TypeError: sequence item 1: expected string, OrderedDict found

As a temporary workaround, the following code was added to the master branch (VcfInputMutationCreator line 375), but (please confirm that) this workaround will make reconstruction of the reference header impossible:
if isinstance(val, dict): val = string.join(map(str, val), ";")
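
On the question of reconstruction: iterating over a dict yields only its keys, so the workaround above keeps the keys and drops the values. A sketch of a rendering that keeps both follows; the function name is hypothetical, and the `<key=value,...>` layout simply follows the header line quoted above.

```python
from collections import OrderedDict


def render_meta_value(val):
    """Render a parsed header value back into its VCF text form."""
    if isinstance(val, dict):
        body = ",".join("%s=%s" % (k, v) for k, v in val.items())
        return "<" + body + ">"
    return str(val)


# What the temporary workaround produces for the same value:
# ";".join(map(str, OrderedDict([("ID", "hg19"), ("Source", "x")])))
# gives "ID;Source", with the values lost.
```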

Set default FORMAT fields for MAF -> VCF

@elephanthunter
@marawr

Here are a few fields that I think should be FORMAT by default. I'm sure there are many others though. People should chime in with their favorite fields. (I just don't know what a number of the fields are, so it's hard to say whether they are sample specific or site specific.)

i_tumor_f
i_init_t_lod
i_t_lod_fstar
t_alt_count
t_ref_count
i_judgement

No CLI testing in the CI server

There is no testing of the CLI in the CI server (run_ci_tests.sh) and if an error is in Oncotator.py, it will not turn up until the command line is run.

The solution is to create another test script (run_ci_cli_test.sh) that calls some simple Oncotator commands (oncotator --help, included) and make sure that none return a non-zero code. This can then be added to the tasks in the ci server.
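
The issue proposes a shell script; the same check is sketched here in Python so that it can run anywhere the interpreter is available. The function name is hypothetical, and the command list in the usage note assumes oncotator is on the PATH.

```python
import subprocess


def run_smoke_tests(commands):
    """Run each command and collect those that exit with a non-zero code.

    commands: list of argument lists, e.g. [["oncotator", "--help"]].
    Returns a list of (command, return_code) for the failures.
    """
    failures = []
    for cmd in commands:
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if result.returncode != 0:
            failures.append((cmd, result.returncode))
    return failures
```

A CI task could call `run_smoke_tests([["oncotator", "--help"]])` and fail the build if the returned list is not empty.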

Quick check for valid maflite and/or vcf input files

Currently, all datasources are loaded before anything is read from the input file. This is annoying when an invalid input file is specified, since the user must wait for the datasource init only to have the oncotator run fail. This has come up a couple of times from beta users and Appistry QA.

At the very least:

  1. Check that there is a valid maflite with a proper header and necessary columns (or aliases) are present.

  2. #1, except for VCF.
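
A minimal sketch of check 1, meant to run before any datasource is loaded. The required column names are taken from the MutationData attributes mentioned elsewhere on this page (chr, start, end, ref_allele, alt_allele) and are an assumption about maflite; the alias mapping and the function name are hypothetical.

```python
REQUIRED_COLUMNS = ("chr", "start", "end", "ref_allele", "alt_allele")


def missing_maflite_columns(lines, aliases=None):
    """Return the required columns absent from the first header line.

    lines: iterable of text lines (for example an open file).
    aliases: optional mapping of alternative header name -> required name.
    """
    aliases = aliases or {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines before the header
        header = [aliases.get(c, c) for c in line.rstrip("\n").split("\t")]
        return [c for c in REQUIRED_COLUMNS if c not in header]
    return list(REQUIRED_COLUMNS)  # no header line found at all
```

Only the header is read, so an invalid file is rejected in well under a second instead of after datasource initialization.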

annotating a big vcf is slow

1.0.0 rc31
Using oncotator on a big vcf is too slow, mostly because operations on genotypes are performed on every genotype (and there are billions of those), while really only the INFO field is of interest (there are many orders of magnitude fewer of those).

So my current "solution" is this:

  1. subset the vcf to sites only (actually due to a bug in rc31 subset to 1 sample)
  2. annotate that
  3. put the annotated vcf back together with a bit of grep cut and paste

Can you make this into an officially supported script, or use this "trick" under the hood to make oncotator work more efficiently on vcf->vcf?
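
The three steps could be scripted roughly as follows. This is a sketch under a strong assumption: the annotated file contains the same records in the same order as the original, so the genotype columns can be re-attached line by line. The function names are hypothetical.

```python
def to_sites_only(lines):
    """Step 1: keep only the first eight columns (through INFO)."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line
        else:
            yield "\t".join(line.split("\t")[:8])


def reattach_genotypes(annotated, original):
    """Step 3: paste FORMAT and sample columns back onto annotated records."""
    originals = (l.rstrip("\n") for l in original if not l.startswith("##"))
    for line in annotated:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line
            continue
        genotype_cols = next(originals).split("\t")[8:]
        yield "\t".join(line.split("\t")[:8] + genotype_cols)
```

The `#CHROM` header line is handled like a record, so the sample names are restored together with the genotypes.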

Oncotated files contain non-ascii characters which cause confusing issues downstream

I noticed that oncotator outputs annotations which include accented words like Sjögren syndrome. While it's admirable to accent things properly, I'm not sure it's a good idea. A lot of unix programs don't understand these characters, and it can cause very confusing issues. R for instance happily reads in the file, but silently mangles it.
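
If ASCII-only output were wanted, one possible approach is to decompose accented characters and drop the combining marks, as sketched below with the standard unicodedata module. The function name is hypothetical, and characters with no ASCII decomposition (for example "ß") are silently removed, which may or may not be acceptable.

```python
import unicodedata


def to_ascii(text):
    """Strip accents by decomposing characters and dropping non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

With this, "Sjögren syndrome" would be written as "Sjogren syndrome".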

Tabix indexed VCF issue

ESP annotations from a tabix indexed VCF are applied on a per-variant basis rather than a per-alt-allele basis.

Incorrect annotation from VCF input

Oncotator is annotating the following vcf input incorrectly:

INPUT:

CHROM POS ID REF ALT

chr1 1645838 rs33938712 CAA CA,AA

OUTPUT:

Chromosome Start_position End_position Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2
1 1645839 1645838 DEL AA AA -
1 1645839 1645838 DEL AA AA -

Correct output should be:

Chromosome Start_position End_position Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2
1 1645840 1645840 DEL A A -
1 1645838 1645838 DEL C C -

This bug is observed even if create_oncotator_venv.sh is used prior to Oncotator installation.
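
The expected rows above are consistent with trimming each alt allele against the reference independently: shared leading bases first (advancing the position), then shared trailing bases. The sketch below shows that trimming; it is not Oncotator's implementation, the function name is hypothetical, and the insertion coordinates follow an assumed MAF convention (start and end flank the inserted bases).

```python
def trim_alleles(pos, ref, alt):
    """Reduce one REF/ALT pair to its minimal MAF-style representation.

    Returns (start, end, variant_type, ref, alt), with '-' for an empty allele.
    """
    # Drop bases shared at the start, moving the position forward.
    while ref and alt and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    # Drop bases shared at the end; the position does not change.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    if not alt:
        return pos, pos + len(ref) - 1, "DEL", ref, "-"
    if not ref:
        return pos - 1, pos, "INS", "-", alt  # assumed MAF convention
    vtype = "SNP" if len(ref) == 1 and len(alt) == 1 else "OTHER"
    return pos, pos + len(ref) - 1, vtype, ref, alt
```

Applied to the input above, `trim_alleles(1645838, "CAA", "CA")` gives a one-base deletion of A at 1645840, and `trim_alleles(1645838, "CAA", "AA")` gives a one-base deletion of C at 1645838.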
