suhrig / arriba
Fast and accurate gene fusion detection from RNA-Seq data
License: Other
The user guide advises the user to execute
source("https://bioconductor.org/biocLite.R")
biocLite("GenomicAlignments")
biocLite("GenomicRanges")
but about a year ago, this was stopped because of security issues. The correct way to install packages now is to install BiocManager from CRAN and then
library(BiocManager)
install("GenomicAlignments")
There is no need to install GenomicRanges separately: it is listed in the Depends field of the DESCRIPTION file of GenomicAlignments, so the install command pulls it in automatically.
Also, there's nothing in the user guide which explains if there's a Google Groups site or how to contact the developer with questions.
Hi!
I am trying to install Arriba and on running make, it gives the error -
gcc -g -Wall -O2 -I. -Ihtslib/htslib -c -o cram/cram_io.o cram/cram_io.c
cram/cram_io.c:61:10: fatal error: 'lzma.h' file not found
#include <lzma.h>
^~~~~~~~
1 error generated.
make[1]: *** [cram/cram_io.o] Error 1
make: *** [htslib/libhts.a] Error 2
Do you have any idea what I could do to resolve this error? I was expecting that all files necessary for the installation would be contained within the arriba folder.
Thanks!
Dear Arriba,
Inspecting events in IGV requires two files, Chimeric.out.sam and Aligned.out.bam, but run_arriba.sh generates only Aligned.out.bam. Is it possible to derive Chimeric.out.sam from Aligned.out.bam, or is there an option to make run_arriba.sh produce Chimeric.out.sam as well?
Hi,
I tried running Arriba using the following command -
/Users/chahat/Documents/DRG/STAR-master/bin/MacOSX_x86_64/STAR --runThreadN 3 --genomeDir /Users/chahat/Documents/DRG/STAR-master/genome --genomeLoad NoSharedMemory --readFilesIn /Volumes/bam/DRG/fastq_50/PhenoInfoAvailable/23T2L.fastq --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outSAMunmapped Within --outBAMcompression 0 --outFilterMultimapNmax 1 --outFilterMismatchNmax 3 --chimSegmentMin 10 --chimOutType WithinBAM SoftClip --chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentReadGapMax 3 | /Users/chahat/Downloads/arriba_v1.2.0/arriba -x /dev/stdin -o fusions.tsv -O fusions.discarded.tsv -a /Volumes/bam/DRG/annotations/hg38.fa -g /Volumes/bam/DRG/annotations/gencode.v32.annotation.gtf.gz -b /Users/chahat/Downloads/arriba_v1.2.0/database/blacklist_hg38_GRCh38_2018-11-04.tsv.gz -T -P
I downloaded the latest hg38 FASTA file from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz and the latest GENCODE GTF annotation from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.annotation.gtf.gz
On running, I got -
[2020-01-15T19:26:31] Loading annotation from '/Volumes/bam/DRG/annotations/gencode.v32.annotation.gtf.gz'
[2020-01-15T19:27:11] Loading assembly from '/Volumes/bam/DRG/annotations/hg38.fa'
ERROR: could not find sequence of contig 'Y'
Any ideas why this error could be occurring?
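One thing worth ruling out for an error like "could not find sequence of contig 'Y'" is a difference in contig naming between the input files (for example chrY in one file and Y in another). The following is a minimal diagnostic sketch, not part of Arriba; the helper names fasta_contigs, gtf_contigs and missing_from_fasta are made up for illustration, and it assumes plain or gzip-compressed FASTA and GTF files.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def fasta_contigs(path):
    """Return the set of contig names found in the FASTA header lines."""
    with open_text(path) as handle:
        return {
            line[1:].split()[0]
            for line in handle
            if line.startswith(">") and line[1:].strip()
        }


def gtf_contigs(path):
    """Return the set of contig names found in column 1 of a GTF file."""
    with open_text(path) as handle:
        return {
            line.split("\t", 1)[0]
            for line in handle
            if line.strip() and not line.startswith("#")
        }


def missing_from_fasta(fasta_path, gtf_path):
    """Return the contigs that occur in the GTF but not in the FASTA."""
    return sorted(gtf_contigs(gtf_path) - fasta_contigs(fasta_path))
```

If missing_from_fasta returns names that differ from the FASTA names only by a chr prefix, the naming conventions of the two files do not match.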
Hi Sebastian,
I really like your tool arriba, it's really fast.
But recently I encountered this weird issue. What could be the reason for it? Is it because of some GTF records, very similar splice junction structures, or something else?
Thanks.
When I try to compile Arriba with the release flag I get the following error:
$ make release
...
make LIBS_SO="" LIBS_A="htslib-1.8/libhts.a " CPPFLAGS="-DHAVE_LIBDEFLATE -Ihtslib-1.8/htslib -I../static_libs_centos6.9"
make[1]: Entering directory '/dev/shm/arriba_v1.1.0'
make -C htslib-1.8 CPPFLAGS="-DHAVE_LIBDEFLATE -Ihtslib-1.8/htslib -I../static_libs_centos6.9" LDFLAGS="" libhts.a
make[2]: Entering directory '/dev/shm/arriba_v1.1.0/htslib-1.8'
gcc -g -Wall -O2 -I. -DHAVE_LIBDEFLATE -Ihtslib-1.8/htslib -I../static_libs_centos6.9 -c -o bgzf.o bgzf.c
bgzf.c:39:10: fatal error: libdeflate.h: No such file or directory
#include <libdeflate.h>
^~~~~~~~~~~~~~
compilation terminated.
Makefile:121: recipe for target 'bgzf.o' failed
make[2]: *** [bgzf.o] Error 1
make[2]: Leaving directory '/dev/shm/arriba_v1.1.0/htslib-1.8'
Makefile:27: recipe for target 'htslib-1.8/libhts.a' failed
make[1]: *** [htslib-1.8/libhts.a] Error 2
make[1]: Leaving directory '/dev/shm/arriba_v1.1.0'
Makefile:34: recipe for target 'release' failed
make: *** [release] Error 2
Is the htslib-1.8 folder supposed to include the libdeflate library, or how is the release flag supposed to work? And can you document this in the installation instructions? Also, what is supposed to be in STATIC_LIBS or static_libs_centos6.9?
Hi,
I have read on your home page that as of the 4th round of the SMC-RNA challenge, Arriba has topped the leaderboard. Can you please share the results from that competition? This would help me avoid duplicating effort in benchmarking this tool against others.
Thanks,
RV
Hi,
What would be the best way to cite Arriba? A bioRxiv preprint would be excellent! Thanks,
Philip
Hi Suhrig,
It looks like the release tarball has the .git directory included in it, is that intentional?
arriba_v1.1.0/.git/objects/82/cd1803664f28f864d8e84e5e65e86c25eff653
arriba_v1.1.0/.git/objects/82/4b29862c4e75ac77ccedaaa7c37d7c2e504f5f
arriba_v1.1.0/.git/objects/82/8dd1cff22901bff9b303d5164ac578e3c2b9ce
arriba_v1.1.0/.git/objects/74/
arriba_v1.1.0/.git/objects/74/60c082bbf9c511783691f0405ce57dd90ca2d9
etc.
Full Error:
dags@bio:~/Desktop/FUSIONS/arriba_v1.1.0$ ./download_references.sh hg19+GENCODE19
Downloading assembly: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/chromFa.tar.gz
Downloading annotation: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
Jul 30 10:43:59 ..... started STAR run
Jul 30 10:43:59 ... starting to generate Genome files
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
./download_references.sh: line 111: 7880 Aborted (core dumped) STAR --runMode genomeGenerate --genomeDir STAR_index_${ASSEMBLY}_${ANNOTATION} --genomeFastaFiles "$ASSEMBLY.fa" --sjdbGTFfile "$ANNOTATION.gtf" --runThreadN "$THREADS" --sjdbOverhang 200
Hi suhrig,
First, thank you so much for making such a wonderful tool, fast and sensitive.
Secondly, I have a suggestion regarding the read_identifiers column in the result file. Could you please separate the IDs of split reads from the IDs of discordant reads and save them in two separate columns instead of a single one? It would also be better to list the IDs of duplicates and mismatches separately, since they are not real supporting evidence.
Thank you.
Z
Writing fusions to file 'fusions.tsv'
*** Error in `arriba': free(): invalid size: 0x00002aab3eab6010 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7c503)[0x2aaaab784503]
arriba[0x409589]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaab729b35]
arriba[0x405779]
======= Memory map: ========
00400000-00515000 r-xp 00000000 00:2e 3054027100 /research/rgs01/home/clusterHome/edavis5/.local/bin/arriba
00715000-00717000 rw-p 00115000 00:2e 3054027100 /research/rgs01/home/clusterHome/edavis5/.local/bin/arriba
00717000-75b4b5000 rw-p 00000000 00:00 0 [heap]
2aaaaaaab000-2aaaaaacb000 r-xp 00000000 08:02 204877411 /usr/lib64/ld-2.17.so
2aaaaaacb000-2aaaaaacd000 r-xp 00000000 00:00 0 [vdso]
2aaaaaacd000-2aaaaaacf000 rw-p 00000000 00:00 0
2aaaaaae5000-2aaaaaaeb000 rw-p 00000000 00:00 0
2aaaaacca000-2aaaaaccb000 r--p 0001f000 08:02 204877411 /usr/lib64/ld-2.17.so
2aaaaaccb000-2aaaaaccc000 rw-p 00020000 08:02 204877411 /usr/lib64/ld-2.17.so
2aaaaaccc000-2aaaaaccd000 rw-p 00000000 00:00 0
2aaaaaccd000-2aaaaadb6000 r-xp 00000000 08:02 206003601 /usr/lib64/libstdc++.so.6.0.19
2aaaaadb6000-2aaaaafb5000 ---p 000e9000 08:02 206003601 /usr/lib64/libstdc++.so.6.0.19
2aaaaafb5000-2aaaaafbd000 r--p 000e8000 08:02 206003601 /usr/lib64/libstdc++.so.6.0.19
2aaaaafbd000-2aaaaafbf000 rw-p 000f0000 08:02 206003601 /usr/lib64/libstdc++.so.6.0.19
2aaaaafbf000-2aaaaafd4000 rw-p 00000000 00:00 0
2aaaaafd4000-2aaaab0d4000 r-xp 00000000 08:02 205237963 /usr/lib64/libm-2.17.so
2aaaab0d4000-2aaaab2d4000 ---p 00100000 08:02 205237963 /usr/lib64/libm-2.17.so
2aaaab2d4000-2aaaab2d5000 r--p 00100000 08:02 205237963 /usr/lib64/libm-2.17.so
2aaaab2d5000-2aaaab2d6000 rw-p 00101000 08:02 205237963 /usr/lib64/libm-2.17.so
2aaaab2d6000-2aaaab2eb000 r-xp 00000000 08:02 206003591 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2aaaab2eb000-2aaaab4ea000 ---p 00015000 08:02 206003591 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2aaaab4ea000-2aaaab4eb000 r--p 00014000 08:02 206003591 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2aaaab4eb000-2aaaab4ec000 rw-p 00015000 08:02 206003591 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2aaaab4ec000-2aaaab503000 r-xp 00000000 08:02 205304005 /usr/lib64/libpthread-2.17.so
2aaaab503000-2aaaab702000 ---p 00017000 08:02 205304005 /usr/lib64/libpthread-2.17.so
2aaaab702000-2aaaab703000 r--p 00016000 08:02 205304005 /usr/lib64/libpthread-2.17.so
2aaaab703000-2aaaab704000 rw-p 00017000 08:02 205304005 /usr/lib64/libpthread-2.17.so
2aaaab704000-2aaaab708000 rw-p 00000000 00:00 0
2aaaab708000-2aaaab8be000 r-xp 00000000 08:02 204940282 /usr/lib64/libc-2.17.so
2aaaab8be000-2aaaababe000 ---p 001b6000 08:02 204940282 /usr/lib64/libc-2.17.so
2aaaababe000-2aaaabac2000 r--p 001b6000 08:02 204940282 /usr/lib64/libc-2.17.so
2aaaabac2000-2aaaabac4000 rw-p 001ba000 08:02 204940282 /usr/lib64/libc-2.17.so
2aaaabac4000-2aaaae0a1000 rw-p 00000000 00:00 0
2aaab00a1000-2aab3c0a1000 rw-p 00000000 00:00 0
2aab3eab6000-2aab41e84000 rw-p 00000000 00:00 0
2aab440a1000-2aaba40a1000 rw-p 00000000 00:00 0
2aabc44a8000-2aabe79be000 rw-p 00000000 00:00 0
2aacac0a1000-2aaccc0a1000 rw-p 00000000 00:00 0
7ffffffdc000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
/lsf_jobspool/1559613367.79853720: line 8: 17716 Aborted (core dumped) arriba -c Chimeric.out.sam -x Aligned.sortedByCoord.out.bam -g /home/edavis5/arriba_v1.1.0/RefSeq_hg19.gtf -o fusions.tsv -b /home/edavis5/arriba_v1.1.0/database/blacklist_hg19_hs37d5_GRCh37_2018-11-04.tsv.gz -a /home/edavis5/arriba_v1.1.0/hg19.fa
The script you provide for downloading the references + annotations assumes a read length of 201bp.
Specifically the final stage of the script builds a STAR index with:
STAR --runMode genomeGenerate --genomeDir STAR_index_${ASSEMBLY}_${ANNOTATION} --genomeFastaFiles "$ASSEMBLY.fa" --sjdbGTFfile "$ANNOTATION.gtf" --runThreadN "$THREADS" --sjdbOverhang 200
However, the STAR documentation defines --sjdbOverhang for index generation as follows:
Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads
Would hardwiring this to a fixed value of 200 therefore be unwise? In my case I have 76 bp reads, so my optimal value of --sjdbOverhang would be 75, assuming the STAR recommendation applies.
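The ReadLength-1 rule quoted above can be applied to a concrete FASTQ file with a few lines of Python. This is only a sketch of the arithmetic; the function names max_read_length and sjdb_overhang and the sample_size parameter are made up for illustration, and the FASTQ is assumed to use the standard four-line record layout.

```python
import gzip
from itertools import islice


def max_read_length(fastq_path, sample_size=100000):
    """Return the longest read among the first `sample_size` FASTQ records."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        # In a four-line FASTQ record the sequence is the 2nd line (index 1).
        sequences = islice(handle, 1, sample_size * 4, 4)
        return max((len(seq.strip()) for seq in sequences), default=0)


def sjdb_overhang(read_length):
    """Apply the ReadLength-1 rule from the STAR documentation."""
    if read_length < 2:
        raise ValueError("read length must be at least 2")
    return read_length - 1


if __name__ == "__main__":
    print(sjdb_overhang(76))   # 76 bp reads, as in the question
    print(sjdb_overhang(201))  # the read length implied by --sjdbOverhang 200
```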
Hello,
I am curious about the strand1 and strand2 columns in arriba output. I read in the documentation that the strand before "/" is the strand of the gene from the gtf and the second strand (that is after "/") is from the assembled fusion.
However I'm not sure how arriba assembles the fusion supporting reads to get the strand information from the STAR BAM file. Can you please explain?
Thanks!
Hi to arriba developers,
I tried using arriba to filter for fusion transcripts between virus and host in my samples. I supplied a custom-made GTF file for the virus, as there is no public GTF available for it. However, I always receive the message WARNING: exon belongs to unknown gene with ID: HIV_vif, and arriba cannot read in the virus gene annotations.
Could you advise on how I can fix the GTF file so that the annotations are read in? I have tried multiple times with various modifications, to no avail.
Here are the lines for the first 3 genes of the virus annotation.
hiv-1_HXB2 RefSeq transcript 456 9636 0 + . transcript_id "vif_01"; gene_name "Vif"; gene_id "HIV_vif";
hiv-1_HXB2 RefSeq exon 456 742 0 + . transcript_id "vif_01"; gene_name "Vif"; gene_id "HIV_vif";
hiv-1_HXB2 RefSeq exon 4912 9636 0 + . transcript_id "vif_01"; gene_name "Vif"; gene_id "HIV_vif";
hiv-1_HXB2 RefSeq CDS 5041 5619 0 + . transcript_id "vif_01"; gene_name "Vif"; gene_id "HIV_vif";
hiv-1_HXB2 RefSeq transcript 456 9636 0 + . transcript_id "vpr_01"; gene_name "Vpr"; gene_id "HIV_vpr";
hiv-1_HXB2 RefSeq exon 456 742 0 + . transcript_id "vpr_01"; gene_name "Vpr"; gene_id "HIV_vpr";
hiv-1_HXB2 RefSeq exon 5389 9636 0 + . transcript_id "vpr_01"; gene_name "Vpr"; gene_id "HIV_vpr";
hiv-1_HXB2 RefSeq CDS 5559 5850 0 + . transcript_id "vpr_01"; gene_name "Vpr"; gene_id "HIV_vpr";
hiv-1_HXB2 RefSeq transcript 456 9636 0 + . transcript_id "vpu_01"; gene_name "Vpu"; gene_id "HIV_vpu";
hiv-1_HXB2 RefSeq exon 456 742 0 + . transcript_id "vpu_01"; gene_name "Vpu"; gene_id "HIV_vpu";
hiv-1_HXB2 RefSeq exon 5975 9636 0 + . transcript_id "vpu_01"; gene_name "Vpu"; gene_id "HIV_vpu";
hiv-1_HXB2 RefSeq CDS 6062 6310 0 + . transcript_id "vpu_01"; gene_name "Vpu"; gene_id "HIV_vpu";
Hope to hear back from you soon. Thank you very much.
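The excerpt above contains transcript, exon and CDS records but no record with the feature type gene. One possible reading of the warning, which is an assumption and not confirmed anywhere in this thread, is that the parser wants a gene record for each gene_id before it accepts the exons. The sketch below adds such records to a tab-separated GTF; the function name add_gene_records is made up for illustration.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def add_gene_records(gtf_lines):
    """Insert a synthetic 'gene' record before the first feature of every
    gene_id that has none. `gtf_lines` is a list of tab-separated GTF lines."""
    spans = {}         # gene_id -> [contig, source, start, end, strand, name]
    annotated = set()  # gene_ids that already come with a 'gene' record
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            continue
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        gene_id = match.group(1)
        if fields[2] == "gene":
            annotated.add(gene_id)
            continue
        name = GENE_NAME.search(fields[8])
        start, end = int(fields[3]), int(fields[4])
        span = spans.setdefault(
            gene_id,
            [fields[0], fields[1], start, end, fields[6],
             name.group(1) if name else gene_id],
        )
        span[2] = min(span[2], start)  # gene spans all of its features
        span[3] = max(span[3], end)

    output, emitted = [], set()
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        match = GENE_ID.search(fields[8]) if len(fields) >= 9 else None
        gene_id = match.group(1) if match else None
        if gene_id in spans and gene_id not in annotated and gene_id not in emitted:
            contig, source, start, end, strand, name = spans[gene_id]
            attributes = f'gene_id "{gene_id}"; gene_name "{name}";'
            output.append("\t".join(
                [contig, source, "gene", str(start), str(end), ".", strand, ".",
                 attributes]) + "\n")
            emitted.add(gene_id)
        output.append(line)
    return output
```

Whether this removes the warning depends on how the annotation is actually parsed, so the result is worth checking on a small test run first.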
I have a data set of read pairs, each read being 150 bases long. The median insert size is 140 bases, so more than half of the read pairs in a sample overlap each other completely (I have done the essential adapter trimming with cutadapt to remove TruSeq adapters from the read ends). It seems like arriba would have problems with this data set:
... and three alignments for split reads (alignments of the first and second read and a supplementary alignment of the clipped segment) ... Fragments with too few or too many alignments are removed.
If two reads that completely overlap each other have a split in them, the split would appear in both reads and so there would be four alignments and it sounds like they would be discarded by arriba. Is it so?
Also, I notice many chimeric read pairs at a low level all across many different genes. I read on SEQanswers that fragmenting the RNA to very short lengths causes short fragments to randomly ligate to each other. Is there a way for arriba to handle this 'background level' of chimeras?
Hi Sebastian,
May I ask if you have performed some comparisons with STAR-Fusion ?
Thanks.
Hi, just wanted to say how awesome this package is. I have found the results to be robust and comprehensive in terms of information provided (compared to any other package I've used).
I had a quick question related to multi-threading: is there any consideration of implementing it as a flag option? It might speed up processing for those of us using WGS/WES BAM files.
Thank you again!
Hello,
When reporting fusion genes, is there a way to output gene IDs as well as names?
I was trying to use htseq-count to count reads mapped to each exon of the fusion genes. I also use STAR-Fusion together with Arriba; STAR-Fusion reports both the gene name and the gene ID.
Thanks.
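Until such an option exists, gene IDs can be looked up from the same GTF that was passed to Arriba. The sketch below assumes the GTF carries both gene_name and gene_id attributes (as GENCODE and Ensembl files do); the function names gene_ids_by_name and annotate are made up for illustration and are not part of Arriba.

```python
import gzip
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gene_ids_by_name(gtf_path):
    """Map each gene name in the GTF to the set of gene IDs it occurs with."""
    opener = gzip.open if gtf_path.endswith(".gz") else open
    ids = {}
    with opener(gtf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            gene_id = GENE_ID.search(line)
            gene_name = GENE_NAME.search(line)
            if gene_id and gene_name:
                ids.setdefault(gene_name.group(1), set()).add(gene_id.group(1))
    return ids


def annotate(gene_name, ids):
    """Return 'NAME(ID1,ID2)' if the name is known, else the bare name."""
    known = sorted(ids.get(gene_name, ()))
    return f"{gene_name}({','.join(known)})" if known else gene_name
```

A name can map to more than one ID, which is why the lookup returns a set rather than a single value.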
I am attempting to use the hg38 annotation from GENCODE to run arriba. However it gave me "ERROR: failed to parse GTF file, please consider using -G". I added the following: "-G gene_name=gene_id gene_id=gene_id" but I got "ERROR: Malformed GTF features: gene_id=gene_id." This is where I got the annotation file from: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.annotation.gtf.gz
Hi,
First of all, thanks for this amazing tool, it has been very useful to predict some of the protein fusions of my samples.
I have used arriba before with a different batch of samples without any issue. However, the last sequencing samples I tried to run with Arriba gave me some warning/errors and the "fusions.tsv" file is empty.
The output:
Loading annotation from '/mnt/lustre/users/k/RNAseq_RD/ReferenceGenome/Homo_sapiens.GRCh38.77.gtf.gz'
Loading assembly from '/mnt/lustre/users/k/RNAseq_RD/ReferenceGenome/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz'
Reading chimeric alignments from '/dev/stdin' (total=86)
Filtering multi-mappers and single mates (remaining=86)
Detecting strandedness (no)
Annotating alignments
Filtering duplicates (remaining=75)
Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y) (remaining=74)
Estimating mate gap distribution
WARNING: not enough chimeric reads to estimate mate gap distribution, using default values
Filtering read-through fragments with a distance <=10000bp (remaining=47)
Filtering inconsistently clipped mates (remaining=41)
Filtering breakpoints adjacent to homopolymers >=6nt (remaining=41)
Filtering fragments with small insert size (remaining=33)
Filtering alignments with long gaps (remaining=33)
Filtering fragments with both mates in the same gene (remaining=33)
Filtering fusions arising from hairpin structures (remaining=33)
Filtering reads with a mismatch p-value <=0.01 (remaining=31)
Filtering reads with low entropy (k-mer content >=60%) (remaining=29)
Finding fusions and counting supporting reads (total=27)
Merging adjacent fusion breakpoints (remaining=27)
Estimating expected number of fusions by random chance (e-value)
Filtering fusions with both breakpoints in adjacent non-coding/intergenic regions (remaining=27)
Filtering intragenic fusions with both breakpoints in exonic regions (remaining=27)
Filtering fusions with <2 supporting reads (remaining=5)
Filtering fusions with an e-value >=0.3 (remaining=5)
Filtering fusions with both breakpoints in intronic/intergenic regions (remaining=5)
Filtering PCR fusions between genes with an expression above the 99.8% quantile (remaining=5)
Searching for fusions with spliced split reads (remaining=5)
Selecting best breakpoints from genes with multiple breakpoints (remaining=3)
Searching for fusions with >=4 spliced events (remaining=3)
Filtering blacklisted fusions in '/users/k1470099/arriba_v1.1.0/database/blacklist_hg38_GRCh38_2018-11-04.tsv.gz' (remaining=1)
Filtering fusions with anchors <=23nt (remaining=1)
Filtering end-to-end fusions with low support (remaining=1)
Filtering fusions with no coverage around the breakpoints (remaining=1)
Indexing gene sequences
Filtering genes with >=30% identity (remaining=0)
Re-aligning chimeric reads to filter fusions with >=80% mis-mappers (remaining=0)
Selecting best breakpoints from genes with multiple breakpoints (remaining=0)
Searching for additional isoforms (remaining=0)
Assigning confidence scores to events
Writing fusions to file '/mnt/lustre/users/k/RNAseq_RD/output/fusions.tsv'
Writing discarded fusions to file '/mnt/lustre/users/k/RNAseq_RD/output/fusions.discarded.tsv'
Are there any parameters I should change? So far I have been using the default settings but as I said they work with previous samples.
The script I am running:
#$ -S /bin/bash
#$ -o /mnt/lustre/users/k
#$ -e /mnt/lustre/users/k
#$ -l h_vmem=40G
module load bioinformatics/STAR/2.7.0f
STAR --runThreadN 8 \
--runMode alignReads \
--genomeDir /mnt/lustre/users/k/RNAseq_RD/ReferenceGenome \
--readFilesIn /mnt/lustre/users/k/RNAseq_RD/R1_001_1.fastq.gz /mnt/lustre/users/k1470099/RNAseq_RD/R2_001_2.fastq.gz --readFilesCommand zcat \
--outStd BAM_Unsorted --outSAMtype BAM Unsorted --outSAMunmapped Within --outBAMcompression 0 \
--outFilterMultimapNmax 1 --outFilterMismatchNmax 3 \
--chimSegmentMin 10 --chimOutType WithinBAM SoftClip --chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentReadGapMax 3 --winBinNbits 15 |
/users/k/arriba_v1.1.0/./arriba -x /dev/stdin \
-g /mnt/lustre/users/k/RNAseq_RD/ReferenceGenome/Homo_sapiens.GRCh38.77.gtf.gz -a /mnt/lustre/users/k/RNAseq_RD/ReferenceGenome/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -b /users/k/arriba_v1.1.0/database/blacklist_hg38_GRCh38_2018-11-04.tsv.gz \
-o /mnt/lustre/users/k/RNAseq_RD/output/tes_fusions.tsv -O /mnt/lustre/users/k/RNAseq_RD/output/tes_fusions.discarded.tsv
Thanks in advance!
Hi, do you know if Arriba will work as intended with STAR 2.7.2a or later? The STAR release notes for that version say that the behaviour with respect to chimeric reads changed:
Chimeric read reporting now requires that the chimeric read alignment score higher than the alternative non-chimeric alignment to the reference genome. The Chimeric.out.junction file now includes the scores of the chimeric alignments and non-chimeric alternative alignments, in addition to the PEmerged bool attribute.
It's specifically the presence of scores for non-chimeric alternatives that concerns me; I'm assuming this is a new feature of STAR? I don't see anything in the release notes for Arriba 1.2.0 about the changes in STAR, and I'm wary that alterations to the Chimeric.out.junction format might result in unwanted behaviour or in non-chimeric alternatives being parsed.
Hey everyone,
Congrats on the great showing in the DREAM challenge! I am looking at what it would take to integrate this in our workflow. I tried installing the package from bioconda, but the pinning of htslib to be 1.8 causes a bunch of other stuff to get removed that we install. Is the pinning to 1.8 necessary? If it is, do you think we could work to get it working against htslib 1.9?
Hi,
I am wondering what the expected file format for -k known_fusions.tsv is.
I recently downloaded the known fusions from the COSMIC Complete Fusion Export, as recommended in the Arriba guide. However, the CosmicFusionExport.tsv.gz file has the following format, which I think is not what Arriba expects:
Sample ID Sample name Primary site Site subtype 1 Site subtype 2 Site subtype 3 Primary histology Histology subtype 1 Histology subtype 2 Histology subtype 3 Fusion ID Translocation Name 5' Chromosome 5' Genome start from 5' Genome start to 5' Genome stop from 5' Genome stop to 5' Strand 3' Chromosome 3' Genome start from 3' Genome start to 3' Genome stop from 3' Genome stop to 3' Strand Fusion type Pubmed_PMID
749711 HCC1187 breast NS NS NS carcinoma ductal_carcinoma NS NS 665 ENST00000360863.10(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452 8 99981937 99981937 100106116 100106116 - 1 114944339 114944339 114995367 114995367 + Inferred Breakpoint 20033038
749711 HCC1187 breast NS NS NS carcinoma ductal_carcinoma NS NS 665 ENST00000360863.10(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452 8 99981937 99981937 100106116 100106116 - 1 114944339 114944339 114995367 114995367 + Observed mRNA 20033038
749711 HCC1187 breast NS NS NS carcinoma ductal_carcinoma NS NS 689 ENST00000324093.8(PLXND1):r.1_2864_ENST00000393238.7(TMCC1):r.918_5992 3 129574336 129574336 129606818 129606818 - 3 129647792 129647792 129671264 129671264 - Inferred Breakpoint 20033038
749711 HCC1187 breast NS NS NS carcinoma ductal_carcinoma NS NS 689 ENST00000324093.8(PLXND1):r.1_2864_ENST00000393238.7(TMCC1):r.918_5992 3 129574336 129574336 129606818 129606818 - 3 129647792 129647792 129671264 129671264 - Observed mRNA 20033038
749711 HCC1187 breast NS NS NS carcinoma ductal_carcinoma NS NS 695 ENST00000285518.10(AGPAT5):r.1_898_ENST00000344683.9(MCPH1):r.2529_8039 8 6708357 6708357 6741751 6741751 + 8 6642994 6642994 6648504 6648504 + Inferred Breakpoint 20033038
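If the -k file is meant to hold one pair of gene names per line in two tab-separated columns (an assumption here, to be checked against the Arriba documentation), the gene symbols can be pulled out of the "Translocation Name" column shown above. The function names gene_pairs and cosmic_to_pairs are made up for illustration.

```python
import csv
import gzip
import re

# Gene symbols appear in parentheses after each transcript ID,
# e.g. ENST00000360863.10(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452
GENE_IN_PARENS = re.compile(r"\(([^)]+)\)")


def gene_pairs(translocation_name):
    """Return consecutive (5' gene, 3' gene) pairs from a translocation name."""
    genes = GENE_IN_PARENS.findall(translocation_name)
    return [(a, b) for a, b in zip(genes, genes[1:]) if a != b]


def cosmic_to_pairs(cosmic_path):
    """Collect the unique gene pairs from a COSMIC fusion export (TSV)."""
    opener = gzip.open if cosmic_path.endswith(".gz") else open
    pairs = set()
    with opener(cosmic_path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            pairs.update(gene_pairs(row.get("Translocation Name") or ""))
    return sorted(pairs)


def write_pairs(pairs, out_path):
    """Write one tab-separated gene pair per line."""
    with open(out_path, "w") as handle:
        for gene1, gene2 in pairs:
            handle.write(f"{gene1}\t{gene2}\n")
```

Rows that differ only in "Fusion type" (Inferred Breakpoint vs. Observed mRNA) collapse into one pair because the pairs are kept in a set.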
Since Arriba is a command-line tool for the detection of gene fusions from RNA-Seq data, I would like to know how to use it with DNA-seq data. Looking forward to hearing from you.
Hello Sebastian,
I am writing to you because I have started using arriba (beautiful tool BTW) and the output of the translation feature of Arriba (called by the -P parameter) is a bit unclear to me.
As per the documentation, If the fusion transcript contains an ellipsis (...), the sequence beyond the ellipsis is trimmed before translation, because the reading frame cannot be determined reliably. A very sensible choice, and for my output it is indeed the case for the part after the fusion. However, the output of the part before the fusion confuses me. As examples, here are two events I find:
CCCTTTGGACCTTTGgCACCAGGCTGGG___AAAAAAGAGTGGATTCAACAGACAGGGTTTACTTTGTGAATCATAACA...GGAAGATCCAAGAACTCAAGG|ATAAAAGATCTGCAGCTATGGAATTCTTCTCCATGACATTTTCACAGAACATACTACTTGTGATTTATATCATGTCCTTACTCAAG___GTAAGGAACTGCAAGTGATCAATATTGC
gives
EDPRTQg|*
And
AAAGAAGACTGGGCCTACAAAGAAGAAAGTGAAAGAACTGAGAATTTTGG...ATCGCATGCCATATGAAGACATAAGAAACGTTATTCTGGAGGTTAATGAAGACATGCTGAGTGAGGCTTTAATTCAG|ATATGTTTAAAGGGTAAGGTGCAC...CACAGCCTCTCACAGACAG___TATGGAAGATTTTTATCCAAATAAAAATCATGGCCCT
gives
RMPYEDIRNVILEVNEDMLSEALIQ|ICLKGKV
I gather that in those cases (ellipsis in the first part of the fusion), arriba only translates from the last ellipsis of the first transcript part onwards, but how? Do you translate a full reconstruction of the CDS from the assembly and the GTF, and then only take the appropriate part (which I'm guessing is the case from the doc: Translation starts [...] when the start codon is encountered in the 5' gene.)? If so, how do you handle alternate start codons? Does this translation method only apply when you are certain of the transcript start / transcript ID?
Additionally, I have events looking like this:
GCGGATCTGGGGCCGTCCTCAG|CATAAGCTGTGGCCATGACTACTGAAGT...GGACTCTAGCCAGTTAGGAACAGATGCAACCAAGGAAAAACCTAAAGAAG
translated like that:
ADLGPSS|a*
Which I'm guessing is the other behavior of the translation tool described in the doc: Translation starts at the start of the assembled fusion transcript
My question here is then: how do you choose which translation method to use? Do you check whether the natural (annotated) 5' start codon is present and translate from it if you can, and from the start of the transcript if not? In addition, translation from the start of the transcript seems to lack some biological relevance. Why not translate from the first encountered start codon?
I'm sorry for the wall of text and the numerous questions but the translation output of arriba is the part I am most interested in, so I figured I would try to clarify my confusion.
Thanks in advance, and thanks for a great tool too !
Bruno
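For readers who want to check the peptides in this post by hand, the sketch below trims a fusion transcript at the ellipses nearest the breakpoint and translates it with the standard genetic code. It is only an illustration and not Arriba's code: the function names are made up, exactly one "|" is assumed, and the reading frame is supplied by the caller, because how Arriba chooses it is precisely the open question here. With the frame picked by hand, frame 1 reproduces the first peptide above (EDPRTQG followed by a stop) and frame 0 reproduces the third (ADLGPSSA followed by a stop).

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, first base slowest.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translatable_part(fusion_transcript):
    """Keep only the sequence between the ellipses nearest the breakpoint.

    Splice markers (___) are removed and mismatches (lowercase) are
    uppercased. Assumes exactly one '|' in the transcript.
    """
    five_prime, _, three_prime = (
        fusion_transcript.upper().replace("_", "").partition("|")
    )
    return five_prime.split("...")[-1] + three_prime.split("...")[0]


def translate(sequence, frame=0):
    """Translate from offset `frame` up to and including the first stop codon."""
    peptide = []
    for i in range(frame, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE.get(sequence[i:i + 3], "X")  # X: unknown codon
        peptide.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(peptide)
```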
Hi,
When I tried to use "draw_fusions.R" to visualize the arriba output, it reported "error : exon coordinates not found in gff3". Is there a way to get output with a "chr" prefix in the exon coordinates?
Thank you,
Xiucz
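If the error comes from the annotation and the fusion calls using different contig naming styles, one workaround is to rewrite the first column of the annotation so that both use the same style. This is a sketch under that assumption; the function names are made up, and the MT/chrM mapping follows the usual Ensembl/UCSC conventions.

```python
def add_chr_prefix(contig):
    """Convert Ensembl-style names (1, X, MT) to UCSC style (chr1, chrX, chrM)."""
    if contig.startswith("chr"):
        return contig
    return "chrM" if contig == "MT" else "chr" + contig


def strip_chr_prefix(contig):
    """Convert UCSC-style names (chr1, chrM) to Ensembl style (1, MT)."""
    if not contig.startswith("chr"):
        return contig
    return "MT" if contig == "chrM" else contig[3:]


def rename_contigs(lines, rename):
    """Apply `rename` to column 1 of every non-comment, tab-separated line."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line
        else:
            contig, separator, rest = line.partition("\t")
            yield rename(contig) + separator + rest
```

For example, rename_contigs(open("annotation.gtf"), add_chr_prefix) yields the lines of a GTF with prefixed contig names; the file name is a placeholder.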
Hi,
Thanks a lot for providing this fantastic, well documented tool.
Would it be possible to add a seed option to enable deterministic behavior of the subsampling routine (-U)?
Also, I have repeatedly run a few samples (v0.12) with identical settings, and most differences can probably be explained by subsampling differences. There is, however, one difference in the confidence of a fusion with only 2 supporting reads that I cannot explain:
Run 1:
WDTC1 TBCEL +/+ +/+ 1:27561443 11:120957487 splice-site splice-site translocation downstream upstream 1 1 0 low . . mismatches(1) GCGCCCCCCcTCCCGGGAGAGGGGCCGCCCCCCCCGGACGGACATGGGCTCCTGAAGTTGCGCCGCTGCCGGTCGGGGGAAGAGACCTGACAG|GTATCATGAACTGATCACTAAATATGGGAAGTTGGAGCCTTTGGCAGAAGTGGACCTAAGACCCCAGAGCAGTGCAAAAGTAGAAGTCCACTTTAACGATCAGGTGGAAGAAATGAGCATTCGTCTGGACCAAACAGTGGCA .
Run 2:
WDTC1 TBCEL +/+ +/+ 1:27561443 11:120957487 splice-site splice-site translocation downstream upstream 1 1 0 medium . . mismatches(1) GCGCCCCCCcTCCCGGGAGAGGGGCCGCCCCCCCCGGACGGACATGGGCTCCTGAAGTTGCGCCGCTGCCGGTCGGGGGAAGAGACCTGACAG|GTATCATGAACTGATCACTAAATATGGGAAGTTGGAGCCTTTGGCAGAAGTGGACCTAAGACCCCAGAGCAGTGCAAAAGTAGAAGTCCACTTTAACGATCAGGTGGAAGAAATGAGCATTCGTCTGGACCAAACAGTGGCA .
Why is the same breakpoint assigned low confidence in one run and medium confidence in the other?
Is there a way to achieve completely reproducible results?
Many thanks!
Hi Sebastian,
We would like to investigate the effect of some large copy number gains (SV, CNV), but also screen some of our patients for unexpected events. Currently we mainly align using HISAT2 and use kallisto for counting. I am not a bioinformatician myself, but have some support.
As far as I understand, neither tool produces output compatible with Arriba. Do you know whether this understanding is correct? And are you considering adding support for other aligners or a file converter?
Best regards, Jasper Saris
Hello,
I ran into a problem with Arriba. Before using it on my samples, I'm testing it and other tools on test datasets (positive, negative and a real breast cancer cell line, paired-end data).
I get a lot of false positives for the negative dataset made with BEERS, which I took from the JAFFA datasets available here:
https://github.com/Oshlack/JAFFA/wiki/Download
For comparison, I ran the analysis on the same data with FusionCatcher, STAR-Fusion and InFusion. For these 3 tools I got 1, 8 and 38 false positives respectively, while I got 196 fusions with Arriba! I tried the following parameters to improve my results, without success:
Max e-value set to 0.05 (instead of the default 0.3): 186 fusions.
Anchor length set to 40 bp (instead of the default 23): 196 fusions.
Any ideas to improve this? I don't understand why there are so many false positives. I would have thought that changing the anchor length to 40 would drastically decrease the number of fusions, but it stays the same...
Thank you in advance for your answer.
Dear arriba developers,
Is there an option to analyze multiple samples at a time?
Thank you,
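The commands shown elsewhere on this page process one sample per invocation, so one way to handle several samples is to loop over them from a small driver script. The sketch below only uses the flags that appear in those commands (-x, -g, -a, -b, -o, -O); the function names, the output naming scheme and the dry_run switch are made up for illustration.

```python
import shlex
import subprocess
from pathlib import Path


def arriba_command(bam, out_dir, annotation, assembly, blacklist, arriba="arriba"):
    """Build the argument list for one sample; outputs are named after the BAM."""
    sample = Path(bam).name.split(".")[0]
    out = Path(out_dir)
    return [
        arriba,
        "-x", str(bam),
        "-g", annotation,
        "-a", assembly,
        "-b", blacklist,
        "-o", str(out / f"{sample}.fusions.tsv"),
        "-O", str(out / f"{sample}.fusions.discarded.tsv"),
    ]


def run_all(bams, out_dir, annotation, assembly, blacklist, dry_run=True):
    """Run (or, with dry_run, just print) one arriba call per BAM file."""
    commands = [
        arriba_command(bam, out_dir, annotation, assembly, blacklist)
        for bam in bams
    ]
    for command in commands:
        if dry_run:
            print(" ".join(shlex.quote(part) for part in command))
        else:
            subprocess.run(command, check=True)  # stop at the first failure
    return commands
```

Printing the commands first (dry_run=True) makes it easy to inspect them or hand them to a cluster scheduler instead of running them serially.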
Hello,
First, thanks for this tool, it's really quick and useful.
I'm working on diagnostic fusion calling. I've been asked to be as sensitive as possible, even at the cost of false positives. To this end, I use other callers as well (STAR-Fusion and FusionCatcher). I was wondering if it is possible to use draw_fusions.R with other inputs?
I know the quality won't be the same, since your fusion calls include a lot of information that the others don't.
Alternatively, maybe the fusions already called could be supplied as a list, a bit like the "-k" argument.
Best,
Would it be possible to wrap the tool into conda package and upload it to bioconda? I am working on a rnafusion pipeline which consists of multiple tools for fusion detection. It would be super nice to implement your tool to the stack 🎉
Hi Sebastian,
I am coming back to you after a few weeks of extended use and integration of arriba (still a great tool :) ). I am running it with the -T -P options.
In one of my samples, I had a series of results that I have been wondering about :
CSNK1D SUZ12P1 -/- +/+ 17:82251379 17:30734897 splice-site splice-site inversion upstream upstream 0 3 14 902 149 high . . duplicates(14),mismatches(1) GCTACCCTT___CCGAATTTGCCACATACCTGAATTTCTGCCGTTCCTTGCGTTTTGACGACAAGCCTGACTACTCGTACCTGCGGCAGCTTTTCCGGAATCTGTTCCATCGCCAGGGCTTCTCCTATGACTACGTGTTCGACTGGAACATGCTCAAATTT|AGCCAACACAGATCTATAGATTTCTTTGAACTCGGAATCTCATAGCA___CCAATATTTTTGCACAGAACTCTTACTTACATGTCTCATCGAAACTCCAGAACAAACATCAAAAG___GAA...AGCTTGTCAGCTCATTTGCAGCTTACATTTTTGGTTTCTT out-of-frame YPSEFATYLNFCRSLRFDDKPDYSYLRQLFRNLFHRQGFSYDYVFDWNMLKF|sqhrsidffelgis* .
CSNK1D SUZ12P1 -/- +/+ 17:82251379 17:30759491 splice-site splice-site inversion upstream upstream 0 1 4 902 278 medium . . duplicates(3),mismatches(1) CATACCTGAATTTCTGCCGTTCCTTGCGTTTTGACGACAAGCCTGACTACTCGTACCTGCGGCAGCTTTTCCG...CATCGCCAGGGCTTCTCCTATGACTACGTGTTCGACTGGAACATGCTCAAATTT|CTTGTCAGCTCATTTGCAGCTTACATTTTTGGTTTCTTCCACAAAAATG___ATAAGCCATCA...AAAATGAACAAAATTCTGTTACCCTGGAAGTCCTGCTTGTGAAAGTTTGC out-of-frame HRQGFSYDYVFDWNMLKF|lvssfaayifgffhkndkp .
CSNK1D SUZ12P1 -/- +/+ 17:82252434 17:30766503 splice-site splice-site inversion upstream upstream 0 1 0 1016 238 low . . duplicates(1) ATCGAAGTGTTGTGTAAAGGCTACCCTT|ATAAGCCATCACCAAACTCAGA...TCCAATAAGGCAAGTTCCCACAGGTAAAAAGCAGGTGCCTTTGAATCCTG out-of-frame IEVLCKGYP|ykpspns .
As you can see, arriba detected several fusions between the same pair of two genes (all of them at different splice sites of SUZ12P1). However, according to the best-select filter description, "If there are multiple breakpoints detected between the same pair of genes, this filter discards all but the most credible one." What is happening in this case ? Does another filter overrule the best-select ?
Thank you in advance for your time,
Best,
Bruno
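To list which gene pairs are reported with more than one breakpoint, a small Python sketch like the one below may help. It assumes a tab-separated table whose header starts with `#gene1` and contains `gene2`, `breakpoint1` and `breakpoint2` columns; the function names are made up for illustration, and the sketch only counts rows, it says nothing about which filter kept them.

```python
import csv
import io
from collections import defaultdict


def breakpoints_by_gene_pair(tsv_text):
    """Group the rows of a fusions.tsv-style table by (gene1, gene2)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # The header line starts with '#', so strip it from the first column name.
    header = [name.lstrip("#") for name in next(reader)]
    col = {name: index for index, name in enumerate(header)}
    groups = defaultdict(list)
    for row in reader:
        pair = (row[col["gene1"]], row[col["gene2"]])
        groups[pair].append((row[col["breakpoint1"]], row[col["breakpoint2"]]))
    return groups


def pairs_with_multiple_breakpoints(tsv_text):
    """Keep only the gene pairs that occur with more than one breakpoint pair."""
    groups = breakpoints_by_gene_pair(tsv_text)
    return {pair: bps for pair, bps in groups.items() if len(bps) > 1}


# Reduced example: only the columns used above, with values taken from the rows quoted in this thread.
example = (
    "#gene1\tgene2\tbreakpoint1\tbreakpoint2\n"
    "CSNK1D\tSUZ12P1\t17:82251379\t17:30734897\n"
    "CSNK1D\tSUZ12P1\t17:82251379\t17:30759491\n"
    "CSNK1D\tSUZ12P1\t17:82252434\t17:30766503\n"
    "FGFR3\tTACC3\t4:1806934\t4:1727977\n"
)

for pair, bps in pairs_with_multiple_breakpoints(example).items():
    print(pair, len(bps))
```

For a real file, the text could be read with `open("fusions.tsv").read()` and passed to the same function.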
I would like to draw the circos plot with the fusions colored differently depending on whether they are interchromosomal or intrachromosomal.
Could you point me in the right direction? I'm trying to understand your R script so that I can customize the circos plot if possible.
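The classification itself is simple and can be sketched independently of the plotting code. The Python snippet below only illustrates the logic (the R script would need an equivalent); it assumes breakpoints are written as `contig:position`, and the function names and colors are placeholders.

```python
def contig_of(breakpoint):
    """Return the contig part of a 'contig:position' breakpoint string."""
    return breakpoint.rsplit(":", 1)[0]


def fusion_color(breakpoint1, breakpoint2, intra="blue", inter="red"):
    """Pick one color for intrachromosomal and another for interchromosomal events."""
    same_contig = contig_of(breakpoint1) == contig_of(breakpoint2)
    return intra if same_contig else inter


# Breakpoints as they appear in fusions.tsv rows quoted elsewhere on this page.
print(fusion_color("17:82251379", "17:30734897"))  # both breakpoints on contig 17
print(fusion_color("8:42968214", "10:43116584"))   # contigs 8 and 10
```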
Hi! I am trying to use arriba, but I get an error:
Loading assembly from 'genome.fna'
ERROR: could not find sequence of contig '10'
This was my Arriba command line:
arriba -c S9Chimeric.out.junction -x S9Aligned.sortedByCoord.out.bam -g genannotation.gtf -a genome.fna -f blacklist -o fusions.tsv -O fusions.discarded.tsv-d S9_sorted.vcf -s auto -V 0.1 -T -P -I
I was wondering whether the problem is the chromosome names in my files, so I tried stating the contigs with the -i option, but without success. Do you have any clue what is going on?
Thanks in Advance!
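One quick way to test the naming hypothesis is to compare the contig names used in the annotation with those in the assembly. The Python sketch below is only an illustration: it treats the first word after `>` in each FASTA header as the contig name and the first GTF column as the contig, and the helper names and sample lines are made up.

```python
def fasta_contigs(fasta_lines):
    """Collect contig names: the first word after '>' on each header line."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_contigs(gtf_lines):
    """Collect the contig names from the first column of a GTF file."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def missing_from_fasta(fasta_lines, gtf_lines):
    """Contigs that the annotation refers to but the assembly does not contain."""
    return sorted(gtf_contigs(gtf_lines) - fasta_contigs(fasta_lines))


# Made-up sample lines: the assembly says 'chr10', the annotation says '10'.
fasta_sample = [">chr10 example header\n", "ACGTACGT\n"]
gtf_sample = ['10\texample\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n']
print(missing_from_fasta(fasta_sample, gtf_sample))
```

With real data, open file handles for the assembly and the annotation can be passed instead of the sample lists. An empty result would suggest the names match and the cause lies elsewhere.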
Hello,
I want to run Arriba directly (without STAR) because I already have the BAM file of my RNA-Seq sample. I pass the necessary files on the command line (./arriba with the .bam file, .gtf file, .fa file, etc.), but every time I run the analysis it says that my GTF file is malformed. I have also tried the GTF file (assembly) from your script "download_references.sh", which does not work either:
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
BTW, I'm running Arriba on a Mac.
Thank you!
Marie
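Before digging deeper, it can be worth checking the basic layout of the GTF file. The Python sketch below looks for the first line that does not have the nine tab-separated fields of the GTF format, integer start/end coordinates and a valid strand. It is an assumption that this is related to what Arriba reports as "malformed"; the function name and the sample record are invented for illustration.

```python
def first_malformed_gtf_line(lines):
    """Return (line_number, reason) for the first suspicious line, or None."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 9:
            return number, f"expected 9 tab-separated fields, found {len(fields)}"
        if not (fields[3].isdigit() and fields[4].isdigit()):
            return number, "start or end is not an integer"
        if fields[6] not in {"+", "-", "."}:
            return number, f"unexpected strand {fields[6]!r}"
    return None


# Invented record for illustration; the second copy has its tabs replaced by spaces.
good = 'chr1\texample\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n'
bad = good.replace("\t", " ")
print(first_malformed_gtf_line(["# comment\n", good, bad]))
```

A gzip-compressed annotation can be checked the same way by iterating over `gzip.open(path, "rt")`.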
Most structural variant callers output results in VCF. To make it easier to use the output of one program as the input to another, it would be desirable to be able to specify a VCF file using -d.
Hi,
We have been observing consistently lower evidence (split_reads) for fusions detected in cell lines and solid tumor samples than with FusionCatcher or Pizzly. Is this expected, or is there a parameter that can be tuned to regulate this behavior?
Thanks,
Prateek
In my circos plot, some of the chromosome labels are partly hidden by the numerous intrachromosomal fusions. What would you suggest to make the chromosome labels stand out?
Is it possible to make the lines connecting the gene label to the chromosomal band longer, or thinner to not hide the chromosome label? Perhaps a different color font for the chromosomal label?
So far I got this parameter to work:
circos.genomicLabels(geneLabels, labels.column=4, side="outside", cex=fontSize,connection_height = convert_height(20, "mm"))
But adding the other parameters gives me 50 warnings:
circos.genomicLabels(geneLabels, labels.column=4, side="outside", cex=fontSize,connection_height = convert_height(20, "mm"),line_col = par(col="gray"), line_lwd = par(lwd=0.8), line_lty = par(lty=4))
Hi,
thanks for the nice code and extensive documentation. Would it be possible for you to make extract_read-through_fusions output an extra @PG line into the resulting BAM file? Thanks a lot!
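For reference, this is roughly what such a header line looks like. The Python sketch below only illustrates the layout of an @PG line as defined in the SAM specification (tab-separated `ID`, `PN`, `PP`, `VN` and `CL` tags); it is not taken from the tool's source. The function name, the `STAR` predecessor ID and the placeholder command line are assumptions.

```python
def pg_header_line(program_id, name, command_line, previous_id=None, version=None):
    """Build a SAM @PG header line; ID must be unique within the header."""
    fields = ["@PG", f"ID:{program_id}", f"PN:{name}"]
    if previous_id is not None:
        fields.append(f"PP:{previous_id}")  # ID of the preceding program in the chain
    if version is not None:
        fields.append(f"VN:{version}")
    fields.append(f"CL:{command_line}")
    return "\t".join(fields)


# Placeholder values: the command line is not spelled out, and 'STAR' is an assumed predecessor ID.
line = pg_header_line(
    program_id="extract_read-through_fusions",
    name="extract_read-through_fusions",
    command_line="extract_read-through_fusions [arguments as invoked]",
    previous_id="STAR",
)
print(line)
```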
Hi @suhrig,
When using the -i option, passing a space-separated list only grabs the first one in the list:
arriba -x ${test_bam} -g ${test_gtf} -a ${test_fasta} -o ${sample}-withi-fusions.tsv -o ${sample}-withi-fusions.discarded.tsv -i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y
Filtering mates which do not map to interesting contigs (1) (remaining=73956)
arriba -x ${TEST_BAM} -g ${TEST_GTF} -a ${TEST_FASTA} -o ${SAMPLE}-withi-fusions.tsv -o ${SAMPLE}-withi-fusions.discarded.tsv -i 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y
Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y) (remaining=1543505)
The documentation says spaces or commas work, though.
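A likely explanation is that an unquoted, space-separated list is split by the shell into separate arguments, so only `1` ends up as the value of `-i`. When the command is assembled by a script, joining the contigs with commas sidesteps this. A minimal Python sketch, where the helper name is made up and the other arriba arguments are omitted:

```python
def contig_argument(contigs):
    """Join contig names into the single comma-separated value passed to -i."""
    return ",".join(str(contig) for contig in contigs)


contigs = list(range(1, 23)) + ["X", "Y"]

# Each list element is one argument, so the whole contig list stays attached to -i.
command = ["arriba", "-i", contig_argument(contigs)]
print(command[-1])
```

Passing the list to `subprocess.run` as a list, together with the remaining arguments, avoids any further splitting by the shell.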
Hi Sebastian,
I had a chat with you about arriba at the PhD poster presentation last month, and it impressed me a lot!
I wrote posts briefly introducing the software on two popular Chinese bioinformatics forums:
Arriba I
Arriba II
Sorry, they are almost entirely in Chinese...
I hope more Chinese researchers will get to know your software and choose to use it!
Good luck with your DREAM SMC RNA Challenge contest,
Wenhu Cao
I cloned this software with git, but I got a compile error (see below) when I ran the make command on CentOS 6.
How can I fix this make error?
$make
make -C source arriba
make[1]: Entering directory `/share/Data01/liwujiao/biosoft/arriba/source'
g++ -c -pthread -std=c++0x -O2 -w -I. -I.. -I../samtools-1.3 -I../samtools-1.3/htslib-1.3 annotation.cpp -lz
g++ -c -pthread -std=c++0x -O2 -w -I. -I.. -I../samtools-1.3 -I../samtools-1.3/htslib-1.3 assembly.cpp -lz
g++ -c -pthread -std=c++0x -O2 -w -I. -I.. -I../samtools-1.3 -I../samtools-1.3/htslib-1.3 options_arriba.cpp -lz
options_arriba.cpp: In function 'void print_usage(const std::string&)':
options_arriba.cpp:141: error: call of overloaded 'to_string(float&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:144: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:147: error: call of overloaded 'to_string(float&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:150: error: call of overloaded 'to_string(float&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:152: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:159: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:162: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:167: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:169: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:172: error: call of overloaded 'to_string(float&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:175: error: call of overloaded 'to_string(float&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:180: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
options_arriba.cpp:185: error: call of overloaded 'to_string(unsigned int&)' is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note: std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note: std::string std::to_string(long double)
make[1]: *** [options_arriba.o] Error 1
make[1]: Leaving directory `/some/dir/arriba/source'
make: *** [arriba] Error 2
Hello suhrig,
Thank you for developing the tool, it's very informative. However, I have a couple of questions that I believe are not covered in the documentation; if they are, please point me to the relevant section.
Thank you,
Krutika
Hi,
I am trying to catch 'em all with the FusionCatcher test dataset and Arriba:
https://github.com/ndaniel/fusioncatcher/tree/master/test
I tried with the following command line arguments:
STAR \
  '--runMode alignReads' \
  --alignIntronMax 1000000 \
  --alignIntronMin 20 \
  --alignMatesGapMax 1000000 \
  --alignSJDBoverhangMin 1 \
  --alignSJoverhangMin 8 \
  --chimJunctionOverhangMin 15 \
  --chimOutType WithinBAM SoftClip \
  --chimSegmentMin 15 \
  --genomeDir /var/lib/cwl/stg12a41017-ebd1-4396-9fdb-ef251725790f/star \
  --genomeLoad NoSharedMemory \
  --limitBAMsortRAM 60000000000 \
  --limitOutSAMoneReadBytes 90000000 \
  --outFilterIntronMotifs RemoveNoncanonical \
  --outFilterMismatchNmax 10 \
  --outFilterMismatchNoverLmax 0.1 \
  --outFilterMultimapNmax 10 \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outSAMmapqUnique 255 \
  --outSAMstrandField intronMotif \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within \
  --outSAMmode Full \
  --readFilesCommand zcat \
  --runThreadN 8 \
  --seedSearchStartLmax 30 \
  --readFilesIn \
    /var/lib/cwl/stgda69f516-ac0d-4ab1-a638-7b3621f260fe/joinedfiles.dat \
    /var/lib/cwl/stg0b23d764-be65-4426-85b3-c6a7ac523b44/joinedfiles.dat
arriba \
  -g /var/lib/cwl/stgc586d5c5-6f47-4507-b3c8-32bc67801d2b/gencode.v32.annotation.gtf \
  -a /var/lib/cwl/stg807396cf-c9a5-41d9-a791-8b4420370377/GRCh38.primary_assembly.genome.fa \
  -b /var/lib/cwl/stga879b704-1ea1-4158-bc0a-761a72af1f02/blacklist_hg38_GRCh38_2018-11-04.tsv.gz \
  -x /var/lib/cwl/stgb4d5992c-53d9-4c45-adee-e4de4f54027c/Aligned.sortedByCoord.out.bam \
  -O fusions.discarded.tsv \
  -o fusions.tsv \
  -P \
  -P
and got ten:
#gene1 gene2 strand1(gene/fusion) strand2(gene/fusion) breakpoint1 breakpo$
FGFR3 TACC3 +/+ +/+ 4:1806934 4:1727977 splice-site CDS $
FIP1L1 PDGFRA +/+ +/+ 4:53425965 4:54274925 splice-site CDS $
HOOK3 RET +/+ +/+ 8:42968214 10:43116584 splice-site splice-$
AKAP9 BRAF +/+ -/- 7:92003235 7:140787584 splice-site splice-$
EWSR1 ATF1 +/+ +/+ 22:29287134 12:50814280 splice-site splice-$
ETV6 NTRK3 +/+ -/- 12:11869969 15:87940753 splice-site splice-$
EML4 ALK +/+ -/- 2:42301394 2:29223584 splice-site 5'UTR $
BRD4 NUTM1 -/- +/+ 19:15254152 15:34347969 splice-site splice-$
GOPC ROS1 -/- -/- 6:117566854 6:117321394 splice-site splice-$
TMPRSS2 ETV1 -/- -/- 21:41494375 7:13935838 CDS CDS translo$
Then I tried changing MaxReads and MaxEValue, which didn't increase the number of fusions. After that I tried disabling a few filters (min_support, many_spliced):
STAR \
  '--runMode alignReads' \
  --alignIntronMax 1000000 \
  --alignIntronMin 20 \
  --alignMatesGapMax 1000000 \
  --alignSJDBoverhangMin 1 \
  --alignSJoverhangMin 8 \
  --chimJunctionOverhangMin 15 \
  --chimOutType WithinBAM SoftClip \
  --chimSegmentMin 15 \
  --genomeDir /var/lib/cwl/stgf7406dee-faff-40b1-9b06-303b961e77c7/star \
  --genomeLoad NoSharedMemory \
  --limitBAMsortRAM 60000000000 \
  --limitOutSAMoneReadBytes 90000000 \
  --outFilterIntronMotifs RemoveNoncanonical \
  --outFilterMismatchNmax 10 \
  --outFilterMismatchNoverLmax 0.1 \
  --outFilterMultimapNmax 10 \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outSAMmapqUnique 255 \
  --outSAMstrandField intronMotif \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within \
  --outSAMmode Full \
  --readFilesCommand zcat \
  --runThreadN 8 \
  --seedSearchStartLmax 30 \
  --readFilesIn \
    /var/lib/cwl/stgdb0d3990-f97d-40cc-9f55-c06d95ba9968/joinedfiles.dat \
    /var/lib/cwl/stg89bb78b7-93ff-43f0-a1e7-214f45ec2bce/joinedfiles.dat
arriba \
  -g /var/lib/cwl/stg4b736520-cb0e-4049-a556-69e167158000/gencode.v32.annotation.gtf \
  -a /var/lib/cwl/stg28d39c0b-6b89-4638-b092-bbcd871a1bc7/GRCh38.primary_assembly.genome.fa \
  -b /var/lib/cwl/stg82c7408b-af40-42df-971d-809ef1387304/blacklist_hg38_GRCh38_2018-11-04.tsv.gz \
  -x /var/lib/cwl/stg6832edbb-f979-4f0d-9128-843c9e69cf6c/Aligned.sortedByCoord.out.bam \
  -f min_support many_spliced \
  -O fusions.discarded.tsv \
  -o fusions.tsv \
  -E 1 \
  -U 50 \
  -P \
  -P
which gave me 15 fusions, but not the ones I was looking for, and I don't think disabling the filters is the right way to expand the set of fusions:
FGFR3 TACC3 +/+ +/+ 4:1806934 4:1727977 splice-site CDS dupl$
FIP1L1 PDGFRA +/+ +/+ 4:53425965 4:54274925 splice-site CDS dele$
HOOK3 RET +/+ +/+ 8:42968214 10:43116584 splice-site splice-site $
AKAP9 BRAF +/+ -/- 7:92003235 7:140787584 splice-site splice-site $
EWSR1 ATF1 +/+ +/+ 22:29287134 12:50814280 splice-site splice-site $
ETV6 NTRK3 +/+ -/- 12:11869969 15:87940753 splice-site splice-site $
EML4 ALK +/+ -/- 2:42301394 2:29223584 splice-site 5'UTR inve$
BRD4 NUTM1 -/- +/+ 19:15254152 15:34347969 splice-site splice-site $
GOPC ROS1 -/- -/- 6:117566854 6:117321394 splice-site splice-site $
TMPRSS2 ETV1 -/- -/- 21:41494375 7:13935838 CDS CDS translocatio$
CD74 ROS1 -/- -/- 5:150404680 6:117324415 splice-site splice-site $
GNAS BRAF +/+ -/- 20:58909359 7:140783038 CDS CDS translocatio$
SEPTIN9 BRAF +/+ -/- 17:77499723 7:140753339 3'UTR CDS translocatio$
EWSR1 FLI1 +/+ +/+ 22:29287134 11:128807180 splice-site splice-site $
NTRK3 ATP2B1 -/- -/- 15:87880322 12:89630643 CDS CDS translocatio$
Any idea how I can catch 'em all?
FULL SET OF FUSIONS:
- FGFR3-TACC3 (short reads from [2]),
- FIP1L1-PDGFRA (short reads from [3]),
- GOPC-ROS1 (short reads from [4]),
- EWS-ATF1 (short reads from [1]),
- TMPRSS2-ETV1 (short reads from [1]),
- EWS-FLI1 (short reads from [1]),
- NTRK3-ETV6 (short reads from [1]),
- CD74-ROS1 (short reads from [1]),
- HOOK3-RET (short reads from [1]),
- EML4-ALK (short reads from [1]),
- AKAP9-BRAF (short reads from [1]),
- BRD4-NUT (short reads from [1]),
- MALT1-IGH (short reads from [5]),
- IGH-CRLF2 (short reads from [6]),
- DUX4-IGH (short reads from [7]),
- NPM1-ALK (short reads from [8]), and
- CIC-DUX4 (short reads from [9]).
Thanks,
-WaO
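To see at a glance which of the expected fusions are still missing, the two lists above can be compared programmatically. The Python sketch below ignores gene order and maps the aliases EWS and NUT to the symbols EWSR1 and NUTM1 used in the Arriba output; the helper names are made up, and the detected list is copied from the second run.

```python
# Aliases used in the expected list, mapped to the symbols in the Arriba output.
ALIASES = {"EWS": "EWSR1", "NUT": "NUTM1"}

EXPECTED = [
    ("FGFR3", "TACC3"), ("FIP1L1", "PDGFRA"), ("GOPC", "ROS1"), ("EWS", "ATF1"),
    ("TMPRSS2", "ETV1"), ("EWS", "FLI1"), ("NTRK3", "ETV6"), ("CD74", "ROS1"),
    ("HOOK3", "RET"), ("EML4", "ALK"), ("AKAP9", "BRAF"), ("BRD4", "NUT"),
    ("MALT1", "IGH"), ("IGH", "CRLF2"), ("DUX4", "IGH"), ("NPM1", "ALK"),
    ("CIC", "DUX4"),
]

DETECTED = [
    ("FGFR3", "TACC3"), ("FIP1L1", "PDGFRA"), ("HOOK3", "RET"), ("AKAP9", "BRAF"),
    ("EWSR1", "ATF1"), ("ETV6", "NTRK3"), ("EML4", "ALK"), ("BRD4", "NUTM1"),
    ("GOPC", "ROS1"), ("TMPRSS2", "ETV1"), ("CD74", "ROS1"), ("GNAS", "BRAF"),
    ("SEPTIN9", "BRAF"), ("EWSR1", "FLI1"), ("NTRK3", "ATP2B1"),
]


def normalize(pair):
    """Order-insensitive key for a gene pair, with aliases resolved."""
    return frozenset(ALIASES.get(gene, gene) for gene in pair)


def missing_fusions(expected, detected):
    """Expected gene pairs that do not appear among the detected pairs."""
    found = {normalize(pair) for pair in detected}
    return [pair for pair in expected if normalize(pair) not in found]


for gene1, gene2 in missing_fusions(EXPECTED, DETECTED):
    print(f"{gene1}-{gene2}")
```

With these lists, the pairs left over are MALT1-IGH, IGH-CRLF2, DUX4-IGH, NPM1-ALK and CIC-DUX4, so 12 of the 17 expected fusions are matched by name in the second run.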
When I git clone this repo, I don't get the database subdirectory with the hg19 blacklist that's present in the release version v0.11.0. The README mentions the blacklist but doesn't include it in the example run and doesn't describe its use. This is a bit confusing.
Hi, no matter what I set -U to (e.g. 1000), I still receive the runtime message:
Finding fusions and counting supporting reads (total=WARNING: Some fusions were subsampled, because they have more than 300 supporting reads 12129)
It would appear that the parameter is not being passed to the relevant subsampling routine.
When next updating the user guide, can you add a paragraph about multi-mapping chimeras? It seems that setting --chimMultimapNmax to a higher number will be possible in a couple of weeks, when the next version of STAR is planned to be released.