pcingola / snpeff

License: Other

Shell 2.31% Perl 3.64% HTML 12.49% CSS 0.65% Python 0.83% R 0.33% Makefile 0.09% Java 78.40% FreeMarker 0.63% JavaScript 0.41% CAP CDS 0.22%

snpeff's People

isanwong ebrevdo leipzig heuermh chapmanb awenocur dlorson b-rich chmille4 emreatbina biocoder-ajdash expressionanalysis ousamg mbourgey davidroberson lindenb alenzhao cmkrueger17 shmumer nathanwilk7 wy2160640 avakel hinerm tojojames pkmyt1 bpow vd4mmind smith-chem-wisc xiasanshi mareq captainnemon seb-leb inambioinfo jayon-lihm karenfeng genomenon quanrd sachalau mcassatt wook2014 kurt-mueller-osumc aroobalhumaidy minjinhan clintval konradotto wangpanqiao borisdmly mariusdanner hyphaltip nandr0id adrienlemeur pageneck dapengzheng nh13 tiledb-inc egeza schaudge angeltgc521 pratikkasar19 kkwock hweesze tongyt1225 apeltzer yehw weiydcn barbarian1803 tayabsoomro jing-xinxing sciencecomputing johnatansiani kojix2 alanhoyle bounlu

snpeff's Issues

Build: Set transcript's protein_coding if there is a CDS record in the GFF/GTF file

Instead of assuming that all transcripts are non-coding by default, we could mark a transcript as protein coding if it has a CDS record.

There might be problems with this approach. For instance, in the human genome:

$ zcat genes.gtf.gz | grep -w CDS | cut -f 2 | ~/snpEff/scripts/uniqCount.pl
113 IG_C_gene
64 IG_D_gene
24 IG_J_gene
366 IG_V_gene
21 TR_C_gene
3 TR_D_gene
82 TR_J_gene
296 TR_V_gene
461 non_stop_decay
57770 nonsense_mediated_decay
773 polymorphic_pseudogene
731883 protein_coding

So there are 57K nonsense_mediated_decay transcripts that have CDSs, but are assumed not to be coding. As a workaround, we could add this only in cases where the biotype is unknown (it's a better guess than assuming they are non-
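As an illustration only, the count above and the proposed fallback could be sketched in Python as follows. The function names are hypothetical and not part of SnpEff. Unlike the `grep -w CDS` pipeline, this sketch tests the feature column (column 3) explicitly, and it assumes the biotype is stored in column 2 as in the listing above.

```python
from collections import Counter


def count_cds_by_source(gtf_lines):
    """Count CDS records per value of GTF column 2 (the biotype in this listing)."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        # Column 3 is the feature type, column 2 the source/biotype
        if len(fields) >= 3 and fields[2] == "CDS":
            counts[fields[1]] += 1
    return counts


def guess_protein_coding(biotype, has_cds):
    """Hypothetical rule: trust a known biotype, else fall back to CDS presence."""
    if biotype:
        # Biotype is known: keep the annotation's own call
        return biotype == "protein_coding"
    # Biotype unknown: a CDS record is a better guess than "non-coding"
    return has_cds
```

With this rule, a nonsense_mediated_decay transcript keeps its annotated biotype even though it has CDS records, and only transcripts without a biotype are affected.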

TEST ISSUE

TEST!!!

SnpEff '-h' should show the same help screen as "snpEff" without any command

The current behaviour is inconsistent (and quite confusing):
-Invoking with '-h' shows help for the 'eff' command
-Invoking without any command line option shows all available commands.

INTRAGENIC is added even when a transcript is hit

For this input:

chr1    11017092    C   T

The output is:

chr1    11017092    C   T   .   .   .   EFF=EXON(MODIFIER||||774|C1orf127|protein_coding|CODING|ENST00000520253|7|1|WARNING_TRANSCRIPT_NO_START_CODON+WARNING_REF_DOES_NOT_MATCH_GENOME),EXON(MODIFIER||||823|C1orf127|protein_coding|CODING|ENST00000377004|8|1|WARNING_REF_DOES_NOT_MATCH_GENOME),INTRAGENIC(MODIFIER|||||C1orf127||CODING|||1),INTRON(MODIFIER||||656|C1orf127|protein_coding|CODING|ENST00000377008|7|1),INTRON(MODIFIER||||657|C1orf127|protein_coding|CODING|ENST00000418570|3|1|WARNING_TRANSCRIPT_NO_START_CODON)

Notice that even though 4 transcripts are hit for this gene, an INTRAGENIC result is added.

In Gene.java line 224, hitTranscript is overwritten for each transcript in the gene,
so if the last transcript is not hit, hitTranscript will be false when we exit the loop, and because of that an INTRAGENIC effect is added.

So as far as I understand, maybe:

hitTranscript = tr.seqChangeEffect(seqChange, changeEffects);

has to be changed to

hitTranscript |= tr.seqChangeEffect(seqChange, changeEffects);

but maybe I'm missing something :)

Julien
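The difference between the two assignments can be shown with a small Python sketch. This is not the SnpEff code: the list of booleans stands in for the value returned by tr.seqChangeEffect(...) for each transcript, and both function names are made up for illustration.

```python
def gene_hit_overwrite(transcript_hits):
    """Mimics 'hitTranscript = ...': only the last transcript's result survives."""
    hit = False
    for transcript_hit in transcript_hits:
        hit = transcript_hit
    return hit


def gene_hit_accumulate(transcript_hits):
    """Mimics 'hitTranscript |= ...': true if any transcript was hit."""
    hit = False
    for transcript_hit in transcript_hits:
        hit |= transcript_hit
    return hit


# Four transcripts hit, the last one in the loop not hit
hits = [True, True, True, True, False]
print(gene_hit_overwrite(hits))    # False, so INTRAGENIC would be added
print(gene_hit_accumulate(hits))   # True, so INTRAGENIC would be skipped
```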

Remove support for reverse seqChange

They create all sorts of problems and solve nothing.
Just remove them.

Mixed variants issues

Looks like we are choking on mixed variants:

chrX 134555866 . CGT AGCT
chrX 152864607 . GCGTG GCGCTG,GCGCTA

Select a few transcript variants without having to pass a full list of all transcript variants

Hi Pablo,
One of our scientists noticed that the effect annotation for BRCA1, when using canonical transcripts, does not use the transcript variant most often considered to be "standard" (at least according to some!). He'd like to use NM_007294.3, and I found a way to pass this to snpEff through the -onlyTr option. However, I would effectively have to provide all transcript variants for all genes, otherwise all other genes are annotated as being INTRAGENIC.

Is there any way snpEff could select the longest transcript in all other cases except for the transcripts provided in the text file?
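The requested behaviour could look roughly like the Python sketch below: use a transcript from the user's list when the gene has one, and fall back to the longest transcript otherwise. The data layout and the function name are assumptions made for illustration; this is not how SnpEff stores transcripts.

```python
def select_transcripts(gene_to_transcripts, preferred_ids):
    """Pick one transcript per gene.

    gene_to_transcripts: dict mapping gene name -> list of (transcript_id, length)
    preferred_ids: set of transcript IDs supplied by the user
    """
    selected = {}
    for gene, transcripts in gene_to_transcripts.items():
        preferred = [t for t in transcripts if t[0] in preferred_ids]
        if preferred:
            # A user-supplied transcript wins for this gene
            selected[gene] = preferred[0][0]
        elif transcripts:
            # Fallback: longest transcript of the gene
            selected[gene] = max(transcripts, key=lambda t: t[1])[0]
    return selected
```

Genes that have no entry in the user's list are then still annotated against a transcript instead of being reported as INTRAGENIC.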

Evaluate adding support for MAF files

Some references:

https://www.biostars.org/p/91806/
https://www.biostars.org/p/69222/
https://www.biostars.org/p/86929/
https://www.biostars.org/p/105030/#108482
https://www.biostars.org/p/107744/
https://www.biostars.org/p/74822/
https://www.biostars.org/p/108112/

https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/

https://github.com/ckandoth/vcf2maf

Updates to documentation

Just wondering about what appears to be a new annotation, SPLICE_SITE_REGION, which is currently given a LOW impact. I am assuming this is +/- a certain number of base pairs of known splice site donor/acceptor positions. A quick update to the documentation with the provenance and details of the annotation would be appreciated. Thanks!

Natural ordering of chromosomes
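
One possible reading of "natural ordering" is that numeric parts of chromosome names compare as numbers, so that chr2 sorts before chr10. A minimal Python sketch under that assumption (the key function is hypothetical, not SnpEff's comparator):

```python
import re


def natural_key(name):
    """Split a name into text and number chunks so numbers compare numerically."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so chunks at the same position always have the same type.
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


chromosomes = ["chr10", "chrX", "chr2", "chr1"]
print(sorted(chromosomes, key=natural_key))
# ['chr1', 'chr2', 'chr10', 'chrX']
```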

Build: Add transcript field when we checked against CDS and PROTEIN sequences.

Add a transcript field recording whether the transcript was checked against CDS and PROTEIN sequences.
Then we can emit proper warnings if the sequence does not check.

Check a transcript using CDS and PROTEIN.
These values have to be serialized (saved).

When using a transcript:

i) If both checks are OK, then we are relatively confident that the Reference Genome
annotations are OK.

ii) If it doesn't check: Add a warning.

iii) If no checking was performed: Show an overall warning.
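
The three cases could be encoded along the lines of the Python sketch below. The status strings and the function are hypothetical, and the "PARTIAL" case (only one of the two checks was performed, and it passed) is an assumption the issue does not cover.

```python
def transcript_check_status(cds_ok, protein_ok):
    """Each argument is True (check passed), False (check failed) or None (not performed)."""
    if cds_ok is None and protein_ok is None:
        return "NOT_CHECKED"   # case iii: no checking was performed
    if cds_ok is False or protein_ok is False:
        return "FAILED"        # case ii: a check did not pass, add a warning
    if cds_ok and protein_ok:
        return "OK"            # case i: both checks passed
    return "PARTIAL"           # assumption: one check passed, the other was not run
```

Because the two flags are serialized with the transcript, the status can be recomputed whenever the database is loaded.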

Add effect: Transcript_deleted

HGVS should be "p.0"

Effect impact should incorporate LOF / NMD

what happened to txt output?

// if (outFor.equals("TXT")) outputFormat = OutputFormat.TXT;

Phase correction may produce exons with end < start

This may be the case reported in Maize genome, transcript AC208892.3_FGT005
which leads to errors when creating the summary (genes) file.

Caused by: java.lang.RuntimeException: Interval error: end before start. Start:216323401, End: 216323400
at ca.mcgill.mcb.pcingola.interval.Interval.<init>(Interval.java:32)
at ca.mcgill.mcb.pcingola.interval.Marker.<init>(Marker.java:31)
at ca.mcgill.mcb.pcingola.interval.Markers.merge(Markers.java:221)
at ca.mcgill.mcb.pcingola.interval.Gene.sizeof(Gene.java:385)
at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)

After debugging, this is produced by:

start: 216323401 end: 216323400
2:216323452-216323702 'Exon_2_216323452_216323703', rank: 3, frame: 2, sequence: aggagagggaggtggtggtggagcaccagcaggaggagccgaggaggagggccccg
2:216323401-216323400 'Exon_2_216323400_216323401', rank: 2, frame: 2, sequence:
2:216323283-216323328 'Exon_2_216323284_216323329', rank: 1, frame: 0, sequence: atggcggaggaccagaccaaccccagcggcccagccccagcaagcg
2:216323945-216324424 'Exon_2_216323946_216324425', rank: 4, frame: 0, sequence: caagggtttatttcgaggaatcaaattaacaacgatgtagtaacaggtgcacgagg
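
A check like the following Python sketch could flag such exons right after phase correction, before they reach the summary code. The function is hypothetical; the coordinates are the ones from the debug output above.

```python
def find_inverted_exons(exons):
    """Return the exons whose end coordinate is before their start.

    exons: iterable of (rank, start, end) tuples.
    """
    return [(rank, start, end) for rank, start, end in exons if end < start]


exons = [
    (3, 216323452, 216323702),
    (2, 216323401, 216323400),  # produced by phase correction: end < start
    (1, 216323283, 216323328),
    (4, 216323945, 216324424),
]
print(find_inverted_exons(exons))  # [(2, 216323401, 216323400)]
```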

GRCh38: Modeled centromeres

Modeled centromeres have sequences, but are "modeled" (they should be annotated like that)

Duplicate VCF fields and headers are now banned

Duplicate VCF fields and headers are now banned.
SnpEff & SnpSift behaviour must be modified accordingly.

Add database compatibility check

Many people update the software but forget to update the database.
SnpEff should show a simple and clear error message to avoid confusion.
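
A minimal Python sketch of such a check, assuming that compatibility is decided by comparing the major.minor part of the program version with the version stored in the database. That rule, the function name and the message text are all assumptions; the issue does not say how SnpEff decides compatibility.

```python
import re


def major_minor(version):
    """Extract (major, minor) from a version string such as '4.0e'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Cannot parse version '{version}'")
    return int(match.group(1)), int(match.group(2))


def check_database_version(software_version, database_version):
    """Raise a clear error if the database was built for another major.minor version."""
    if major_minor(software_version) != major_minor(database_version):
        raise RuntimeError(
            f"Database version {database_version} is not compatible with "
            f"program version {software_version}. Please update the database."
        )
```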

zaire ebola database not available

The snpEff databases command lists zaire_ebola as an available database, but the URL leads to a 404.

Sequence Ontology by default

Add command line option "-classic" to override.

SnpSift filter: Switch to ANTLR 4

HGVS compliance: Transcript version

User request:

Can you output the transcript version? For example, transcript ENST00000560659 with its version would be ENST00000560659.3 (based on Ensembl core GRCh37.75). The version number is a must for the tool to be HGVS compliant.

Add cDNA / mRNA position information

User request.

HGVS: Frame shift

From user:
... I was wondering how difficult it would be to get SNPEFF to report the effects of frameshifts according to the "nomenclature for the description of sequence variants?"

For example, right now for a specific mutation at chr9:139811029 where we have a deletion GT -> G, the amino acid change gets annotated with -214. But this causes a frameshift, and should technically be reported as:

V214Gfs*14

Remove TXT support

"N" in ALT being changed to A,C,G,T - undesirable

I'm having a problem where some of the records in my VCF files have a single N in the ALT field. SnpEff/SnpSift keeps changing these to A,C,G,T. Is there any way to turn this feature off in the future? Or, for that matter, a feature that makes SnpEff change nothing in the original VCF record besides adding an INFO field?

Example: original VCF record, VCF record after SnpEff

1 123456    rs123  NTGTATT  N

1 123456    rs123  NTGTATT  A,C,G,T

SnpSift filter: Allow generic expressions

Use the "everything is an expression" concept to allow for more generic expressions.

Remove negative SeqChange

Add protein coding field using protein.fa and cds.fa info

Download genome if not available

By default a genome should be downloaded if it is not available.
Add command line option "-nodownload" to override default behaviour.

Summary page: Add Minor Allele Frequency (MAF) per sample

Stats do not work in multithreading '-t' mode

Stats do not work in '-t' mode. Multi-threading compatible Stats objects should be implemented.

New summary page: Add AlleleCount (AC)

HGVS notation for introns

Finish test cases.

Contigs and scaffolds names (both numbers) are mixed

From user:
... I am working with the Bursaphelenchus xylophilus genome (nematode). The genome consists of contigs and scaffolds, some of them with the same number. SnpEff mixes up a scaffold with a contig if they have the same number; if not, it works fine... Is there a way to resolve this problem?

HGVS notation by default

Add command line option "-classic" to override (same as overriding sequence ontology).

Input format: Remove Pileup support

Summary plot problems: Variation graph is not showing the bar of Exons

User email:
The program runs well for all 3 of these files, but there is one problem: in the snpEff_summary file for File 3, in the "Number of effects by type and region" section, the Variation graph is not showing the bar for Exons, which have a value of 50.226%. It's not a big deal, and we can generate the graph ourselves, but I want to know the possible reasons for this.

NOTE: Unfortunately the user doesn't seem keen on providing data to replicate error conditions.

VCF ALT tags <INS>, <DUP>, <INV>

Add support for VCF ALT tags

Script to download and build genome (NCBI)

The script is invoked (from SnpEff's directory) using the NCBI's ID as parameter

./scripts/buildNcbiDatabase.pl 'NC_001788.1'

Step 1: NCBI's page is downloaded in order to scrape the UID

curl http://www.ncbi.nlm.nih.gov/nuccore/NC_001788.1 > NC_001788.1

Step 2: Scrape UID

...meta name="ncbi_uidlist" content="5835345"...

Step 3: Download GenBank file

curl "http://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?sendto=on&val=5835345" > NC_001788.1.gbk

Step 4: Add config entry

echo NC_001788.1.genome : NC_001788.1 >> snpEff.config

Step 5: Build

java -jar snpEff.jar build NC_001788.1
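
Steps 2 to 4 can be sketched in Python as below. This is not the buildNcbiDatabase.pl script: the function names are hypothetical, the regular expression assumes the meta tag looks exactly like the fragment in Step 2, and whether these NCBI URLs still respond is not something the sketch checks.

```python
import re


def extract_uid(html):
    """Step 2: scrape the UID from the downloaded NCBI page, or return None."""
    match = re.search(r'name="ncbi_uidlist"\s+content="(\d+)"', html)
    return match.group(1) if match else None


def genbank_url(uid):
    """Step 3: URL of the GenBank file for a given UID (pattern taken from the issue)."""
    return "http://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?sendto=on&val=" + uid


def config_line(ncbi_id):
    """Step 4: line to append to snpEff.config."""
    return f"{ncbi_id}.genome : {ncbi_id}"


page = '<meta name="ncbi_uidlist" content="5835345" />'
uid = extract_uid(page)
print(uid)                         # 5835345
print(genbank_url(uid))
print(config_line("NC_001788.1"))  # NC_001788.1.genome : NC_001788.1
```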

Summary page: Add Ts/Tv per sample

Out of frame Start_Gained

Need better Kozak predictions
http://en.wikipedia.org/wiki/Kozak_consensus_sequence

About the snpsift and snpEff annotation of indel and multiple nucleotide polymorphisms

Hi pcingola,
When an indel in a VCF file overlaps a dbSNP record, I observe that the indel will be annotated with the dbSNP record.
Do you require a 50% reciprocal overlap criterion?

How about multiple nucleotide polymorphisms? Would you recommend that I decompose the MNP into isolated SNPs before using snpEff, or the other way round?
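
For reference, a "50% reciprocal overlap" test between two intervals can be written as in the Python sketch below. It only illustrates the criterion being asked about and says nothing about how SnpSift actually matches records; coordinates are assumed to be 1-based and inclusive.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end, min_fraction=0.5):
    """True if the overlap covers at least min_fraction of BOTH intervals."""
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return False  # the intervals do not overlap at all
    len_a = a_end - a_start + 1
    len_b = b_end - b_start + 1
    return overlap / len_a >= min_fraction and overlap / len_b >= min_fraction
```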

test

Improve support for circular genomes

In a circular genome, an Exon with lower coordinates can be after an exon with higher coordinates.
This should be reflected in the "transcript sort" algorithm.

E.g. (GenBank GQ861354): exon [97398..97628] is before exon [68375..68476]

 CDS complement(join(96834..96860,97398..97628,68375..68476))
                 /gene="rps12"
                 /trans_splicing
                 /codon_start=1
                 /transl_table=11
                 /product="ribosomal protein S12"
                 /protein_id="ACY66286.1"
                 /db_xref="GI:262400797"
                 /translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTITPKKPNSA
                 LRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVG
                 VKDRQQGRSQYGVKKPK"

Transcripts having exons in both plus and minus strand

Biological question: ... is it possible to have a transcript that has exons in BOTH positive AND negative strands? E.g. Transcript TR1 has exons Ex1, Ex2, Ex3 and, say Ex1 is in the positive strand while Ex2 and Ex3 are on the negative strand.

...looked at the FlyBase annotation for FBtr0084084 and they are claiming these are an example of trans-splicing from the other strand. I'm not sure I believe it but there is at least one mass-spec protein backing it up.

Papers:
http://www.ncbi.nlm.nih.gov/pubmed/15520256
http://www.ncbi.nlm.nih.gov/pubmed/20615941

Transcripts having exons in both plus and minus strand

It is biologically plausible for transcripts to have exons in both directions (plus and minus strands).

Add promoters database?

"A promoter-level mammalian expression atlas", Nature, 2014-03

Gene names containing spaces cause problems

Sample VCF:
$ cat zzz.vcf
1 551124 . A G 318.2 PASS AC=12

Sample command line:
java -Xmx4g -jar snpEff.jar -v Zv9.74 zzz.vcf

Error:

java.lang.RuntimeException: No white-space, semi-colons, or equals-signs are permitted in INFO field. Name:"LOF" Value:"(ILDR2 (2 of 2)|ENSDARG00000096600|1|1.00)"
at ca.mcgill.mcb.pcingola.vcf.VcfEntry.addInfo(VcfEntry.java:154)
at ca.mcgill.mcb.pcingola.outputFormatter.VcfOutputFormatter.addInfo(VcfOutputFormatter.java:280)
at ca.mcgill.mcb.pcingola.outputFormatter.VcfOutputFormatter.toString(VcfOutputFormatter.java:391)
at ca.mcgill.mcb.pcingola.outputFormatter.OutputFormatter.endSection(OutputFormatter.java:111)
at ca.mcgill.mcb.pcingola.outputFormatter.VcfOutputFormatter.endSection(VcfOutputFormatter.java:327)
at ca.mcgill.mcb.pcingola.outputFormatter.OutputFormatter.printSection(OutputFormatter.java:144)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.iterateVcf(SnpEffCmdEff.java:346)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.runAnalysis(SnpEffCmdEff.java:791)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:711)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:663)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.run(SnpEff.java:734)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.main(SnpEff.java:123)
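
One way to avoid the exception would be to sanitize values before they are added to the INFO field. The Python sketch below replaces the characters named in the error message with underscores; the function name and the choice of replacement character are assumptions, not SnpEff's actual fix.

```python
import re


def sanitize_info_value(value):
    """Replace runs of white-space, semi-colons and equals-signs with '_'."""
    return re.sub(r"[\s;=]+", "_", value.strip())


print(sanitize_info_value("(ILDR2 (2 of 2)|ENSDARG00000096600|1|1.00)"))
# (ILDR2_(2_of_2)|ENSDARG00000096600|1|1.00)
```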

Hom/Het calculations for multiple samples in VCF files

Currently, Hom/Het calculations work only on single-sample VCF files.
I should extend this to multiple samples.
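
A per-sample count could be sketched in Python as follows. This is not SnpEff's Stats code: the function is hypothetical, it expects a tab-separated VCF with a #CHROM header line, and it assumes that homozygous-reference and missing genotypes are not counted.

```python
import re


def hom_het_per_sample(vcf_lines):
    """Return {sample: {'hom': n, 'het': m}} counting non-reference genotypes."""
    names, counts = [], {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            names = fields[9:]  # sample names follow the FORMAT column
            counts = {name: {"hom": 0, "het": 0} for name in names}
            continue
        if len(fields) < 10:
            continue  # no genotype columns
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt_index = keys.index("GT")
        for name, sample in zip(names, fields[9:]):
            parts = sample.split(":")
            if gt_index >= len(parts):
                continue
            alleles = re.split(r"[/|]", parts[gt_index])
            if len(alleles) < 2 or "." in alleles:
                continue  # haploid or missing genotype
            if all(allele == "0" for allele in alleles):
                continue  # assumption: homozygous reference is not counted
            kind = "hom" if len(set(alleles)) == 1 else "het"
            counts[name][kind] += 1
    return counts
```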