Comments (17)
Something does not seem right about this sample. In fact, the situation looks a lot similar to your other issue #228. Since it's public data, can you point me to the source from where you obtained it? Then I can take a look at it myself.
from arriba.
Thank you for your reply.
After receiving data access approval from EGA, I downloaded the RNA data.
It was in BAM file format, which I converted into a FASTQ file, then used STAR-Align followed by Arriba.
The dataset ID is EGAD00001006268 = EGAS00001004289.
Among the dataset, I tried EGAF00004171085 (file name: 01-15563_T1.rna.bam) and EGAF00004171086 (file name: 01-17618_T1.rna.bam) etc.
(If access to the restricted public data is difficult, Is it okay to explain the situation to the data owner and seek permission?)
Thanks a lot.
Sincerely,
DJ. J
from arriba.
Apologies for the slow response - it has been a busy week.
I didn't realize it's restricted data from EGA. I doubt it will be possible to share the data with me. So we will need to resort to remote troubleshooting.
Can you run samtools flagstat
on the problematic BAM file, please? Also, do you have the STAR Log.final.out
file and could share it with me?
Looking at the status messages from Arriba, about 54 million reads were considered as chimeric. That's almost all reads. This could happen when read1 and read2 were accidentally swapped when given as input to STAR. Can you rule this out?
from arriba.
I am very sorry for the late response due to personal reasons. I have tried running samtools flagstat and the outcome is as follows:
108111052 + 6633048 in total (QC-passed reads + QC-failed reads)
108111052 + 6633048 primary
0 + 0 secondary
0 + 0 supplementary
8903170 + 805612 duplicates
8903170 + 805612 primary duplicates
107740389 + 5690274 mapped (99.66% : 85.79%)
107740389 + 5690274 primary mapped (99.66% : 85.79%)
108111052 + 6633048 paired in sequencing
54055526 + 3316524 read1
54055526 + 3316524 read2
92916680 + 4440744 properly paired (85.95% : 66.95%)
107438656 + 5331394 with itself and mate mapped
301733 + 358880 singletons (0.28% : 5.41%)
10155368 + 676362 with mate mapped to a different chr
1489734 + 157793 with mate mapped to a different chr (mapQ>=5)
The STAR log.file.out could not be attached here, so I have attached it via email.
Thanks!
from arriba.
from arriba.
Unfortunately, the log file was also dropped by mail. Can you just copy-paste the content as a reply to this thread?
The flagstats already look weird. The are no supplementary alignments. Can you show me the STAR command, please?
from arriba.
Thank you for your reply.
The STAR_log.final.out file result is as follows:
Started job on | Jan 22 19:51:31
Started mapping on | Jan 22 19:51:44
Finished on | Jan 22 19:58:08
Mapping speed, Million of reads per hour | 537.86
Number of input reads | 57372050
Average input read length | 150
UNIQUE READS:
Uniquely mapped reads number | 4053986
Uniquely mapped reads % | 7.07%
Average mapped length | 148.53
Number of splices: Total | 593449
Number of splices: Annotated (sjdb) | 576252
Number of splices: GT/AG | 577965
Number of splices: GC/AG | 7210
Number of splices: AT/AC | 256
Number of splices: Non-canonical | 8018
Mismatch rate per base, % | 0.51%
Deletion rate per base | 0.01%
Deletion average length | 1.33
Insertion rate per base | 0.03%
Insertion average length | 1.33
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 3707714
% of reads mapped to multiple loci | 6.46%
Number of reads mapped to too many loci | 73616
% of reads mapped to too many loci | 0.13%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 49489970
% of reads unmapped: too short | 86.26%
Number of reads unmapped: other | 39041
% of reads unmapped: other | 0.07%
CHIMERIC READS:
Number of chimeric reads | 40096610
% of chimeric reads | 69.89%
The STAR command is as follows:
Convert BAM to FASTQ
/Users/dajeong/samtools-1.18/samtools fastq /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam -1 /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq.gz -2 /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq.gz
Unzip FASTQ files
gunzip -k /Users/dajeong/STAR-Fusion/data01-15563_T1_R1_fastq
gunzip -k /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq
Run STAR alignment
/Users/dajeong/STAR/bin/MacOSX_x86_64/STAR --runThreadN 5
--genomeDir /Users/dajeong/Arriba_DJ/STAR_index_hg38_GENCODE38
--readFilesIn /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq
--outFileNamePrefix /Users/dajeong/STAR-Fusion/output/01-15563_T1_
--chimSegmentMin 20
--chimOutType WithinBAM
--alignMatesGapMax 100000
--alignIntronMax 100000
--chimJunctionOverhangMin 20
--outSAMtype BAM Unsorted
Rename the output file
mv /Users/dajeong/STAR-Fusion/output/A37987_Aligned.out.bam /Users/dajeong/STAR-Fusion/output/A37987_STAR_Aligned.out.bam
Run Arriba for fusion detection
/Users/dajeong/arriba_v2.4.0/arriba -x /Users/dajeong/STAR-Fusion/output/01-15563_T1_STAR_Aligned.out.bam -g /Users/dajeong/Arriba_DJ/GENCODE38.gtf -a /Users/dajeong/Arriba_DJ/hg38.fa
-b /Users/dajeong/arriba_v2.4.0/database/blacklist_hg38_GRCh38_v2.4.0.tsv.gz -k /Users/dajeong/arriba_v2.4.0/database/known_fusions_hg38_GRCh38_v2.4.0.tsv.gz -p /Users/dajeong/arriba_v2.4.0/database/protein_domains_hg38_GRCh38_v2.4.0.gff3
-o /Users/dajeong/Arriba_DJ/output/01-15563_T1_fusions.tsv
Thanks :)
from arriba.
The problem could be that you haven't collated the paired-end mates before converting to FastQ. The mates will be mixed up if you don't do this.
Would you please try the following command for the conversion?
samtools collate -f -O -u -r 1000000 /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam |
samtools fastq -0 other.fastq -1 read1.fastq -2 read2.fastq -s singletons.fastq
from arriba.
Thank you for your reply.
I tried the command line for the conversion.
I think it gives similar output file with >20000 fusion results.
Also, arriba process took about 5 hours.
When I checked the STAR_log.final.out file, it is like follows:
Started job on | Feb 27 14:53:10
Started mapping on | Feb 27 14:53:31
Finished on | Feb 27 14:59:55
Mapping speed, Million of reads per hour | 537.86
Number of input reads | 57372050
Average input read length | 150
UNIQUE READS:
Uniquely mapped reads number | 4053986
Uniquely mapped reads % | 7.07%
Average mapped length | 148.53
Number of splices: Total | 593449
Number of splices: Annotated (sjdb) | 576252
Number of splices: GT/AG | 577965
Number of splices: GC/AG | 7210
Number of splices: AT/AC | 256
Number of splices: Non-canonical | 8018
Mismatch rate per base, % | 0.51%
Deletion rate per base | 0.01%
Deletion average length | 1.33
Insertion rate per base | 0.03%
Insertion average length | 1.33
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 3707714
% of reads mapped to multiple loci | 6.46%
Number of reads mapped to too many loci | 73616
% of reads mapped to too many loci | 0.13%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 49489970
% of reads unmapped: too short | 86.26%
Number of reads unmapped: other | 39041
% of reads unmapped: other | 0.07%
CHIMERIC READS:
Number of chimeric reads | 40096610
% of chimeric reads | 69.89%
It looks the same as the previous result.
Is there anything more I can try?
Thanks. :)
from arriba.
Clearly, something must be going wrong during the conversion from BAM to FastQ. It is not normal that 70% of the reads are chimeric. This number is typically well below 10%. Also, Arriba usually takes only a few minutes to run and the main output file should only be a few kb in size.
The flagstats you send me are from the original BAM file, not your realigned BAM file, right? Things still look okay in the original file, then. In this case, can you check the following:
Pick a random read pair from the original BAM file. Note down the sequences from both mates. Next, extract the sequences of the read pair with the same name from the converted FastQs, e.g., using grep -w -A1 NAME-OF-READ read1.fastq read2.fastq
. Both sequences should be the same as in the original BAM file or the reverse complement. Can you confirm this with a couple of randomly selected read pairs?
from arriba.
Hi, did you find some time to inspect the read sequences as explained in my previous post? Are my instructions clear enough? I'm pretty sure this has to do with the BAM file not being converted to FastQs properly.
from arriba.
Hi, I am very sorry for the late reply. Thank you for suggesting various solutions to try.
First, the flagstats are indeed from the original BAM file. The command used at that time is as follows.
========================================================================
samtools collate -f -O -u -r 1000000 /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam |
samtools fastq -0 other.fastq -1 read1.fastq -2 read2.fastq -s singletons.fastq
Convert BAM to FASTQ
/Users/dajeong/samtools-1.18/samtools fastq /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam -1 /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq.gz -2 /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq.gz
Unzip FASTQ files
gunzip -k /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq
gunzip -k /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq
Run STAR alignment
/Users/dajeong/STAR/bin/MacOSX_x86_64/STAR --runThreadN 5
--genomeDir /Users/dajeong/Arriba_DJ/STAR_index_hg38_GENCODE38
--readFilesIn /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq
--outFileNamePrefix /Users/dajeong/STAR-Fusion/output/01-15563_T1_
--chimSegmentMin 20
--chimOutType WithinBAM
--alignMatesGapMax 100000
--alignIntronMax 100000
--chimJunctionOverhangMin 20
--outSAMtype BAM Unsorted
==============================================================================
Also, the results from proceeding according to the previous instruction are as follows.
##############################################################################
Trial 1.
samtools view /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam | gshuf -n 1 > random_read_second.txt
HS27_221:4:1213:7746:97787 83 chr7 139877304 60 75M = 139877168 -211 GCATAGCTTCTTGCTTTAAGCGTGATAGGTGCTCGATAAAGGCTCAAGAAATTGAACTGTCTACATTTCTCTTAC E1G@BE>1=<1CDFGGF>FBGGGGFF1DFC<0GGC<@1g>EGFF=:1C1GGF@;GEF@1>GGGGGGGGF1@BBB@ MD:Z:75 PG:Z:MarkDuplicates RG:Z:226820 NM:i:0 AS:i:75 XS:i:0
grep -w -A1 -h "HS27_221:4:1213:7746:97787" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq > extracted_sequence_read1_second.txt
@HS27_221:4:1213:7746:97787
GTAAGAGAAATGTAGACAGTTCAATTTCTTGAGCCTTTATCGAGCACCTATCACGCTTAAAGCAAGAAGCTATGC
grep -w -A1 -h "HS27_221:4:1213:7746:97787" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq > extracted_sequence_read2_second.txt
@HS27_221:4:1213:7746:97787
GGAGCAGGGAGGTGTACTCCTCGGATGGACGAGCGGGGAGGCAAGGCGTGTCTTTAAATAACAAGCAACCCTGCT
##############################################################################
Trial 2.
samtools view /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam | gshuf -n 1 > random_read_third.txt
HS27_221:4:1109:11102:98424 99 chr2 196163272 35 75M = 196163385 8093 CTTTAGATGTAAGTATATAGAAATTATTAAAGTTTTCCATTTTTATTGGAATTTGAGGAGTTGTAGTTAGTAGGC CCCCCEFGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGEGGGGGGGEFGGGGGGGG MD:Z:75 PG:Z:MarkDuplicates RG:Z:226820 NM:i:0 AS:i:75 XS:i:63
grep -w -A1 -h "HS27_221:4:1109:11102:98424" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq > extracted_sequence_read1_third.txt
@HS27_221:4:1109:11102:98424
CTTTAGATGTAAGTATATAGAAATTATTAAAGTTTTCCATTTTTATTGGAATTTGAGGAGTTGTAGTTAGTAGGC
grep -w -A1 -h "HS27_221:4:1109:11102:98424" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq > extracted_sequence_read2_third.txt
@HS27_221:4:1109:11102:98424
CAGTCCCGGGAGAACCTGCGGCGGCCGGAGCGGTAAAAATAAGTGACTAAAGAAGCAGACCTGGGAATCACCTAA
##############################################################################
Does this result suggest a problem in the file conversion process?
I would appreciate your feedback.
Thanks!
from arriba.
Thanks for the additional details. This looks good so far. May I ask you for another piece of information? Can you please run the following command on the BAM file which you have created (i.e., the one after realigning the reads)? The command extracts the randomly selected reads:
samtools view /path/to/new_alignments.bam | grep -wP 'HS27_221:4:1213:7746:97787|HS27_221:4:1109:11102:98424'
Another thing you might want to try is to run Arriba on the original BAM file. Does it also crash here? If not, then this would confirm that the problem must arise during conversion to FastQ or realignment.
from arriba.
Thank you for providing additional guidance.
Firstly, the result of executing the command you provided is as follows. I'm not sure what they mean.
samtools view /Users/dajeong/STAR-Fusion/output/01-15563_T1_STAR_Aligned.out.bam | grep -wE 'HS27_221:4:1213:7746:97787|HS27_221:4:1109:11102:98424'
(base) dajeong@DajeongcBookPro data % samtools view /Users/dajeong/STAR-Fusion/output/01-15563_T1_STAR_Aligned.out.bam | grep -wE 'HS27_221:4:1213:7746:97787|HS27_221:4:1109:11102:98424'
HS27_221:4:1109:11102:98424 97 chr2 196163272 255 75M = 196136014 0 CTTTAGATGTAAGTATATAGAAATTATTAAAGTTTTCCATTTTTATTGGAATTTGAGGAGTTGTAGTTAGTAGGC CCCCCEFGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGEGGGGGGGEFGGGGGGGG NH:i:1 HI:i:1 AS:i:73 nM:i:0 NM:i:0
HS27_221:4:1109:11102:98424 145 chr2 196136014 255 75M = 196163272 0 GTTATTCAAAAAGTTTTAAGTTATTATTAAAGAATTTGGAAACTTACCAAAATTTTGAGAAGTTAAAGGTCTAAA FGCGGGGDGGG1GGGFEGCGFC1GEGCCGGGEGGGGGGGGCEGGGGGGGGGFC1BF1>GEGGGGGGGGGGBBBBB NH:i:1 HI:i:1 AS:i:73 nM:i:0 NM:i:0
HS27_221:4:1213:7746:97787 81 chr7 139877304 255 75M = 138570982 0 GCATAGCTTCTTGCTTTAAGCGTGATAGGTGCTCGATAAAGGCTCAAGAAATTGAACTGTCTACATTTCTCTTAC E1G@BE>1=<1CDFGGF>FBGGGGFF1DFC<0GGC<@1g>EGFF=:1C1GGF@;GEF@1>GGGGGGGGF1@BBB@ NH:i:1 HI:i:1 AS:i:73 nM:i:0 NM:i:0
HS27_221:4:1213:7746:97787 161 chr7 138570982 255 22M2503N53M = 139877304 0 TAATCTTCCCTCTCTTCCGGATATTGACTGTTCAAGTACTATTATGCTGGACAATATTGTGAGGAAAGATACTAA BBBBBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG NH:i:1 HI:i:1 AS:i:74 nM:i:0 NM:i:0
Moreover, running Arriba with the original BAM file took about 5 minutes and yielded 45 low confidence fusion results.
If the problem arises from converting BAM to FastQ files, how could it be resolved?
Thanks!
from arriba.
The output from the grep
command confirms that your realigned BAM file does not have matching paired-end read sequences anymore. Read 2 is different from what it says in the FastQ file. The most likely explanation is that the order of the reads is different in fastq1 and fastq2, which is usually caused by improper conversion from BAM to FastQ.
You can try a different tool for the conversion, for example from the biobambam
package:
conda install bioconda::biobambam
bamtofastq filename=/path/to/original_alignments.bam F=read1.fastq F2=read2.fastq S=/dev/null gz=0 collate=1
from arriba.
Did you have more success with biobambam?
from arriba.
I apologize for the late reply.
I am having difficulty using biobambam in a MacBook environment, so it might take more time.
I will leave an additional message once it is completed later.
Thank you so much.
from arriba.
Related Issues (20)
- Suppressed Sequences included in RefSeq_viral_genomes_v2.4.0.fa.gz HOT 1
- Adding more tools to plot. HOT 1
- Error in merging adjacent breakpoints? HOT 2
- Known canonical fusion reported with zero reads - need help to understand the output HOT 1
- Finding fusions and counting supporting reads zsh: killed HOT 3
- Single-End vs Paired-End behaviour for split1/2, discordant and coverage counts HOT 14
- Problem detection exom skipped HOT 2
- Issue with Dragen BAM encountering std::out_of_range error in version [v2.4.0] HOT 8
- Error occured while I running draw_fusion.R HOT 2
- Error while running draw_fusion.R HOT 8
- Issues with Missing Exon Coordinates Using "draw_fusions.R" HOT 1
- Criteria of selecting specific transcripts HOT 5
- Identifying gene fusions in plant genomes. HOT 3
- The interpretation of the contents of the result file fusions.tsv. HOT 2
- Arriba and STARfuison align HOT 7
- Fusion transcript mismatch between the same breakpoints HOT 2
- custom gtf file HOT 7
- Output fasta for TPM counting HOT 4
- Multimapper filter vs malformed SAM records - how many alignments is "too many"? HOT 11
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from arriba.