snpeff's People

Contributors

alanhoyle, avakel, bpow, chapmanb, emreatbina, heuermh, kurt-mueller-osumc, lix1993, mbourgey, nh13, pcingola, polinabevad, tayabsoomro, tuomastik

snpeff's Issues

Build: Set transcript's protein_coding if there is a CDS record in the GFF/GTF file

Instead of assuming that all transcripts are non-coding by default, we could set as protein coding if the transcript has a CDS.

There might be problems with this approach. For instance, in the human genome:

$ zcat genes.gtf.gz | grep -w CDS | cut -f 2 | ~/snpEff/scripts/uniqCount.pl
113 IG_C_gene
64 IG_D_gene
24 IG_J_gene
366 IG_V_gene
21 TR_C_gene
3 TR_D_gene
82 TR_J_gene
296 TR_V_gene
461 non_stop_decay
57770 nonsense_mediated_decay
773 polymorphic_pseudogene
731883 protein_coding

So there are 57K nonsense_mediated_decay transcripts that have CDSs, but are assumed not to be coding. As a workaround, we could add this only in cases
where the biotype is unknown (it's a better guess than assuming they are non-
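
A minimal sketch of the workaround described above, assuming a hypothetical helper (BiotypeGuess.isProteinCoding is not part of SnpEff): an explicit biotype always wins, and the presence of a CDS is only used as a guess when the biotype is missing. Treating only "protein_coding" as coding is a simplification for illustration.

```java
/** Hypothetical helper, not SnpEff code: guess coding status from biotype and CDS presence. */
final class BiotypeGuess {

    /**
     * @param biotype biotype from the GTF/GFF file, or null/empty if unknown
     * @param hasCds  true if the transcript has at least one CDS record
     */
    static boolean isProteinCoding(String biotype, boolean hasCds) {
        if (biotype == null || biotype.isEmpty()) {
            // Unknown biotype: having a CDS is a better guess than "non-coding"
            return hasCds;
        }
        // Known biotype: trust it, even if CDS records exist (e.g. nonsense_mediated_decay)
        return biotype.equals("protein_coding");
    }
}
```

With this rule, the 57K nonsense_mediated_decay transcripts keep their annotated biotype, while transcripts without any biotype are marked as coding when a CDS is present.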

INTRAGENIC is added even when a transcript is hit

For this input:

chr1    11017092    C   T

The output is:

chr1    11017092    C   T   .   .   .   EFF=EXON(MODIFIER||||774|C1orf127|protein_coding|CODING|ENST00000520253|7|1|WARNING_TRANSCRIPT_NO_START_CODON+WARNING_REF_DOES_NOT_MATCH_GENOME),EXON(MODIFIER||||823|C1orf127|protein_coding|CODING|ENST00000377004|8|1|WARNING_REF_DOES_NOT_MATCH_GENOME),INTRAGENIC(MODIFIER|||||C1orf127||CODING|||1),INTRON(MODIFIER||||656|C1orf127|protein_coding|CODING|ENST00000377008|7|1),INTRON(MODIFIER||||657|C1orf127|protein_coding|CODING|ENST00000418570|3|1|WARNING_TRANSCRIPT_NO_START_CODON)

Notice that even though 4 transcripts are hit for this gene, an INTRAGENIC result is added.

In Gene.java line 224, hitTranscript is overwritten for each transcript in the gene, so if the last transcript is not hit, hitTranscript is false when we exit the loop, and because of that an INTRAGENIC effect is added.

So, as far as I understand, maybe:

hitTranscript = tr.seqChangeEffect(seqChange, changeEffects);

has to be changed to

hitTranscript |= tr.seqChangeEffect(seqChange, changeEffects);

but maybe I'm missing something :)

Julien
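
The difference between the two assignments can be reproduced in isolation. The sketch below does not use SnpEff classes; the list of booleans stands in for the per-transcript return values of tr.seqChangeEffect(...), and the class and method names are made up for illustration.

```java
import java.util.List;

/** Stand-alone illustration of the flag bug; not SnpEff code. */
final class HitTranscriptDemo {

    // Pattern from the report: the flag is overwritten, so only the last transcript counts
    static boolean overwrite(List<Boolean> hits) {
        boolean hitTranscript = false;
        for (boolean hit : hits) hitTranscript = hit;
        return hitTranscript;
    }

    // Proposed pattern: the flag stays true once any transcript has been hit
    static boolean accumulate(List<Boolean> hits) {
        boolean hitTranscript = false;
        for (boolean hit : hits) hitTranscript |= hit;
        return hitTranscript;
    }
}
```

Note that in Java `|=` on booleans does not short-circuit, so the right-hand side is still evaluated for every transcript. Writing `hitTranscript = hitTranscript || ...` instead would skip the call after the first hit.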

Mixed variants issues

Looks like we are choking on mixed variants:

chrX 134555866 . CGT AGCT
chrX 152864607 . GCGTG GCGCTG,GCGCTA
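
For reference, one way to see why these records count as "mixed" is to trim the bases shared by REF and ALT and look at what remains. This is only a sketch under that assumption; the class name and the category labels are hypothetical and this is not how SnpEff necessarily classifies variants.

```java
/** Hypothetical classifier: trims the common prefix and suffix, then compares what is left. */
final class VariantKind {

    static String classify(String ref, String alt) {
        int r = ref.length(), a = alt.length();
        int prefix = 0;
        while (prefix < r && prefix < a && ref.charAt(prefix) == alt.charAt(prefix)) prefix++;
        int suffix = 0;
        while (suffix < r - prefix && suffix < a - prefix
                && ref.charAt(r - 1 - suffix) == alt.charAt(a - 1 - suffix)) suffix++;

        int refLen = r - prefix - suffix; // bases removed from the reference
        int altLen = a - prefix - suffix; // bases added by the alternative allele
        if (refLen == 0 && altLen == 0) return "NO_CHANGE";
        if (refLen == 0) return "INS";
        if (altLen == 0) return "DEL";
        if (refLen == altLen) return refLen == 1 ? "SNP" : "MNP";
        return "MIXED"; // substitution and length change at the same time
    }
}
```

Under this scheme CGT to AGCT and GCGTG to GCGCTA remain mixed after trimming, while GCGTG to GCGCTG reduces to a plain insertion, so a multi-allelic record can combine different kinds of variants.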

Select a few transcript variants without having to pass a full list of all transcript variants

Hi Pablo,
One of our scientists noticed that the effect annotation for BRCA1, when using canonical, is not the transcript variant most often considered to be "standard" (at least according to some!) He'd like to use NM_007294.3 and I found a way to pass this to snpEff through the -onlyTr option. However, I would effectively have to provide all transcript variants for all genes, otherwise all other genes are annotated as being INTRAGENIC.

Is there any way snpEff could select the longest transcript in all other cases except for the transcripts provided in the text file?
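
The requested behaviour could look roughly like the sketch below: a transcript from the user's list wins for its gene, and every other gene falls back to its longest transcript. All class names, transcript lengths and the IDs other than NM_007294.3 are hypothetical; this is not the -onlyTr implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical selection rule: preferred transcript if listed, otherwise the longest one. */
final class TranscriptPicker {

    static final class Transcript {
        final String id;
        final String gene;
        final int length;

        Transcript(String id, String gene, int length) {
            this.id = id;
            this.gene = gene;
            this.length = length;
        }
    }

    /** Returns gene name -> selected transcript ID. */
    static Map<String, String> pick(List<Transcript> all, Set<String> preferredIds) {
        Map<String, Transcript> chosen = new HashMap<>();
        for (Transcript t : all) {
            Transcript current = chosen.get(t.gene);
            if (current == null || isBetter(t, current, preferredIds)) chosen.put(t.gene, t);
        }
        Map<String, String> ids = new HashMap<>();
        for (Transcript t : chosen.values()) ids.put(t.gene, t.id);
        return ids;
    }

    private static boolean isBetter(Transcript candidate, Transcript current, Set<String> preferredIds) {
        boolean candidatePreferred = preferredIds.contains(candidate.id);
        boolean currentPreferred = preferredIds.contains(current.id);
        if (candidatePreferred != currentPreferred) return candidatePreferred; // the user's choice wins
        return candidate.length > current.length; // otherwise fall back to the longest transcript
    }
}
```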

Updates to documentation

Just wondering about what appears to be a new annotation, SPLICE_SITE_REGION, which is currently given a LOW impact. I am assuming this is +/- a certain number of base pairs of known splice site donor/acceptor positions. A quick update to the documentation with the provenance and details of the annotation would be appreciated. Thanks!

Build: Add a transcript field recording whether it was checked against CDS and PROTEIN sequences.

Add a transcript field recording whether it was checked against CDS and PROTEIN sequences.
Then we can emit proper warnings if the sequence does not check.

Check a transcript using CDS and PROTEIN.
These values have to be serialized (saved).

When using a transcript:

i) If both checks are OK, then we are relatively confident that the Reference Genome
annotations are OK.

ii) If it doesn't check: Add a warning.

iii) If no checking was performed: Show an overall warning.
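
The three cases above could be represented as in the sketch below. The enum, the method and the warning names are hypothetical, and the handling of a transcript where only one of the two checks was performed is an assumption, since the proposal does not cover it.

```java
/** Hypothetical representation of the per-transcript check results. */
final class TranscriptCheck {

    enum Status { NOT_CHECKED, OK, ERROR }

    /** Returns a warning name for the transcript, or an empty string if none applies. */
    static String warning(Status cdsCheck, Status proteinCheck) {
        if (cdsCheck == Status.ERROR || proteinCheck == Status.ERROR) {
            return "WARNING_TRANSCRIPT_CHECK_FAILED"; // case ii: a check did not match
        }
        if (cdsCheck == Status.NOT_CHECKED && proteinCheck == Status.NOT_CHECKED) {
            return "WARNING_TRANSCRIPT_NOT_CHECKED"; // case iii: nothing was verified
        }
        return ""; // case i (both OK); a single passing check is also accepted here (assumption)
    }
}
```

Java enums are serializable by default, so a field of this type can be saved with the transcript when the database is built.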

Phase correction may produce exons with end < start

This may be the case reported in the maize genome, transcript AC208892.3_FGT005,
which leads to errors when creating the summary (genes) file.

Caused by: java.lang.RuntimeException: Interval error: end before start. Start:216323401, End: 216323400
at ca.mcgill.mcb.pcingola.interval.Interval.<init>(Interval.java:32)
at ca.mcgill.mcb.pcingola.interval.Marker.<init>(Marker.java:31)
at ca.mcgill.mcb.pcingola.interval.Markers.merge(Markers.java:221)
at ca.mcgill.mcb.pcingola.interval.Gene.sizeof(Gene.java:385)
at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)

After debugging, this is produced by:

start: 216323401 end: 216323400
2:216323452-216323702 'Exon_2_216323452_216323703', rank: 3, frame: 2, sequence: aggagagggaggtggtggtggagcaccagcaggaggagccgaggaggagggccccg
2:216323401-216323400 'Exon_2_216323400_216323401', rank: 2, frame: 2, sequence:
2:216323283-216323328 'Exon_2_216323284_216323329', rank: 1, frame: 0, sequence: atggcggaggaccagaccaaccccagcggcccagccccagcaagcg
2:216323945-216324424 'Exon_2_216323946_216324425', rank: 4, frame: 0, sequence: caagggtttatttcgaggaatcaaattaacaacgatgtagtaacaggtgcacgagg
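
The rank 2 exon looks like a two-base exon whose start was moved past its end. A possible guard is sketched below, assuming the correction trims bases from the exon start and that the exon originally spanned 216323399-216323400 in zero-based coordinates (inferred from its name, not confirmed). The class and method are hypothetical, not SnpEff's phase correction code.

```java
/** Hypothetical guard for frame/phase correction; coordinates are zero-based and inclusive. */
final class PhaseCorrection {

    /**
     * Trims `bases` from the start of an exon.
     * Returns {start, end}, or null when the exon would be left with end < start,
     * so the caller can drop it instead of building an invalid interval.
     */
    static int[] trimStart(int start, int end, int bases) {
        int newStart = start + bases;
        if (newStart > end) return null; // the exon is consumed entirely by the correction
        return new int[] { newStart, end };
    }
}
```

For the exon above, trimStart(216323399, 216323400, 2) returns null rather than the interval 216323401-216323400 that triggers the RuntimeException.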

Add database compatibility check

Many people update the software but forget to update the database.
SnpEff should show a simple and clear error message to avoid confusion.
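
A minimal sketch of such a check, assuming the database stores the version of the program that built it and that databases are compatible when major and minor versions match. Both the rule and the names (DbCompat, the message text, the version strings in the usage note) are assumptions, not SnpEff's actual behaviour.

```java
/** Hypothetical compatibility check between program version and database version. */
final class DbCompat {

    // Keeps only "major.minor" from a version string such as "4.1.2"
    static String majorMinor(String version) {
        String[] parts = version.trim().split("\\.");
        return parts.length >= 2 ? parts[0] + "." + parts[1] : parts[0];
    }

    static boolean isCompatible(String programVersion, String databaseVersion) {
        return majorMinor(programVersion).equals(majorMinor(databaseVersion));
    }

    static String errorMessage(String programVersion, String databaseVersion) {
        return "Database version " + databaseVersion
                + " is not compatible with program version " + programVersion
                + ". Please update the database.";
    }
}
```

For example, a program at version 4.1.2 would accept a database built with 4.1.0 and reject one built with 4.0.5.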

HGVS compliance: Transcript version

User request:

Can you guys output the transcript version? For example, for transcript ENST00000560659 the versioned ID would be ENST00000560659.3 (based on Ensembl core GRCh37.75). The version number is a must for the tool to be HGVS compliant.

HGVS: Frame shift

From user:
... I was wondering how difficult it would be to get SNPEFF to report the effects of frameshifts according to the "nomenclature for the description of sequence variants?"

For example, right now for a specific mutation at chr9:139811029 where we have a deletion GT -> G, the amino acid change gets annotated with -214. But this causes a frameshift, and should technically be reported as:

V214Gfs*14
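
The short frameshift notation is built from the first affected amino acid, its position, the new amino acid at that position, "fs", and the position of the new stop codon counted from the first changed amino acid. A sketch of the formatting step only, assuming the translated tail of the shifted reading frame is already available (the class and method names are hypothetical):

```java
/** Hypothetical formatter for the short protein-level frameshift notation. */
final class HgvsFrameshift {

    /**
     * @param refAa       first amino acid affected in the reference protein
     * @param position    1-based position of that amino acid
     * @param shiftedTail translation of the new reading frame starting at `position`,
     *                    with '*' marking the stop codon (if one is reached)
     */
    static String format(char refAa, int position, String shiftedTail) {
        int stopIndex = shiftedTail.indexOf('*');
        // The first changed amino acid counts as 1; '?' when no stop is found
        String stopPosition = stopIndex >= 0 ? String.valueOf(stopIndex + 1) : "?";
        return "" + refAa + position + shiftedTail.charAt(0) + "fs*" + stopPosition;
    }
}
```

For the example above, a shifted tail that starts with G and reaches a stop at its 14th position yields V214Gfs*14.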

"N" in ALT being changed to A,C,G,T - undesirable

I'm having a problem where some of the records in my VCF files have a single N in the ALT field. SnpEff/SnpSift keeps changing these to A,C,G,T. Is there any way to turn this feature off in the future? Or, for that matter, a feature so that SnpEff doesn't change anything in the original VCF record besides adding an INFO field?

Example: original VCF record, VCF record after SnpEff

1 123456    rs123  NTGTATT  N

1 123456    rs123  NTGTATT  A,C,G,T
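
The requested behaviour amounts to treating every column except INFO as read-only. A sketch of that idea on a raw tab-separated VCF line (the class is hypothetical and not SnpEff's VcfEntry; the usage example assumes a full 8-column record with "." placeholders):

```java
/** Hypothetical annotator that only appends to the INFO column (8th column) of a VCF line. */
final class InfoOnlyAnnotator {

    static String addInfo(String vcfLine, String key, String value) {
        String[] fields = vcfLine.split("\t", -1); // -1 keeps trailing empty columns
        if (fields.length < 8) {
            throw new IllegalArgumentException("Expected at least 8 tab-separated VCF columns");
        }
        String entry = key + "=" + value;
        // An empty INFO column is written as "."; replace it instead of appending
        fields[7] = (fields[7].isEmpty() || fields[7].equals(".")) ? entry : fields[7] + ";" + entry;
        return String.join("\t", fields); // REF, ALT and all other columns are untouched
    }
}
```

With this approach an ALT of N is written back exactly as it was read.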

Download genome if not available

By default a genome should be downloaded if it is not available.
Add command line option "-nodownload" to override default behaviour.

Contigs and scaffolds names (both numbers) are mixed

From user:
... I am working with the Bursaphelenchus xylophilus genome (nematode). The genome consists of contigs and scaffolds, some of them with the same number. SnpEff mixes up a scaffold with a contig if they have the same number; if not, it works fine... Is there a way to resolve this problem?

Summary plot problems: Variation graph is not showing the bar of Exons

User email:
The program runs well for all 3 files, but there is one problem: in the snpEff_summary file for File 3, in the "Number of effects by type and region" section, the Variation graph is not showing the bar for Exons, which have a value of 50.226%. It's not a big deal and we can generate the graph ourselves, but I want to know the possible reasons for this.

NOTE: Unfortunately the user doesn't seem keen on providing data to replicate error conditions.

Script to download and build genome (NCBI)

The script is invoked (from SnpEff's directory) using the NCBI ID as a parameter:

./scripts/buildNcbiDatabase.pl 'NC_001788.1'

Step 1: NCBI's page is downloaded in order to scrape the UID

curl http://www.ncbi.nlm.nih.gov/nuccore/NC_001788.1 > NC_001788.1

Step 2: Scrape the UID

...meta name="ncbi_uidlist" content="5835345"...

Step 3: Download GenBank file

curl "http://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?sendto=on&val=5835345" > NC_001788.1.gbk

Step 4: Add the genome to the config

echo NC_001788.1.genome : NC_001788.1 >> snpEff.config

Step 5: Build

java -jar snpEff.jar build NC_001788.1
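
Steps 2 and 3 are the only ones that need parsing. A sketch of that part in Java, assuming the page contains the meta tag in the form shown in Step 2; the class and method names are hypothetical and the actual script is written in Perl.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for Steps 2 and 3: find the UID in the page and build the download URL. */
final class NcbiUidScraper {

    // Matches: meta name="ncbi_uidlist" content="5835345"
    private static final Pattern UID_PATTERN =
            Pattern.compile("meta\\s+name=\"ncbi_uidlist\"\\s+content=\"(\\d+)\"");

    static Optional<String> findUid(String html) {
        Matcher m = UID_PATTERN.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // URL layout taken from Step 3
    static String genBankUrl(String uid) {
        return "http://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?sendto=on&val=" + uid;
    }
}
```

Returning an Optional makes the failure case explicit when the page layout changes and the meta tag is no longer found.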

Improve support for circular genomes

In a circular genome, an exon with lower coordinates can come after an exon with higher coordinates.
This should be reflected in the "transcript sort" algorithm.

E.g.:
(GenBank GQ861354): exon [97398..97628] is before exon [68375..68476]

 CDS complement(join(96834..96860,97398..97628,68375..68476))
                 /gene="rps12"
                 /trans_splicing
                 /codon_start=1
                 /transl_table=11
                 /product="ribosomal protein S12"
                 /protein_id="ACY66286.1"
                 /db_xref="GI:262400797"
                 /translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTITPKKPNSA
                 LRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVG
                 VKDRQQGRSQYGVKKPK"

Transcripts having exons in both plus and minus strand

Biological question: ... is it possible to have a transcript that has exons in BOTH positive AND negative strands? E.g. Transcript TR1 has exons Ex1, Ex2, Ex3 and, say Ex1 is in the positive strand while Ex2 and Ex3 are on the negative strand.

...looked at the FlyBase annotation for FBtr0084084 and they are claiming these are an example of trans-splicing from the other strand. I'm not sure I believe it but there is at least one mass-spec protein backing it up.

Papers:
http://www.ncbi.nlm.nih.gov/pubmed/15520256
http://www.ncbi.nlm.nih.gov/pubmed/20615941

Gene names containing spaces cause problems

Sample VCF:
$ cat zzz.vcf
1 551124 . A G 318.2 PASS AC=12

Sample command line:
java -Xmx4g -jar snpEff.jar -v Zv9.74 zzz.vcf

Error:

java.lang.RuntimeException: No white-space, semi-colons, or equals-signs are permitted in INFO field. Name:"LOF" Value:"(ILDR2 (2 of 2)|ENSDARG00000096600|1|1.00)"
at ca.mcgill.mcb.pcingola.vcf.VcfEntry.addInfo(VcfEntry.java:154)
at ca.mcgill.mcb.pcingola.outputFormatter.VcfOutputFormatter.addInfo(VcfOutputFormatter.java:280)
at ca.mcgill.mcb.pcingola.outputFormatter.VcfOutputFormatter.toString(VcfOutputFormatter.java:391)
at ca.mcgill.mcb.pcingola.outputFormatter.OutputFormatter.endSection(OutputFormatter.java:111)
at ca.mcgill.mcb.pcingola.outputFormatter.VcfOutputFormatter.endSection(VcfOutputFormatter.java:327)
at ca.mcgill.mcb.pcingola.outputFormatter.OutputFormatter.printSection(OutputFormatter.java:144)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.iterateVcf(SnpEffCmdEff.java:346)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.runAnalysis(SnpEffCmdEff.java:791)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:711)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:663)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.run(SnpEff.java:734)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.main(SnpEff.java:123)
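
One possible fix is to sanitize values before they are added to the INFO field. The sketch below replaces the characters named in the error message with underscores; the class name and the choice of "_" as the replacement are assumptions, not the fix adopted in SnpEff.

```java
/** Hypothetical sanitizer for values written to the VCF INFO field. */
final class InfoValueSanitizer {

    // Replaces each run of white-space, semicolons or equals-signs with one underscore
    static String sanitize(String value) {
        return value.replaceAll("[\\s;=]+", "_");
    }
}
```

Applied to the value from the error above, this gives (ILDR2_(2_of_2)|ENSDARG00000096600|1|1.00).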
