Hi, I am using Arriba to analyze RNA fusions. I downloaded publi

Comments (17)

suhrig commented on July 23, 2024

Something does not seem right about this sample. In fact, the situation looks a lot similar to your other issue #228. Since it's public data, can you point me to the source from where you obtained it? Then I can take a look at it myself.

from arriba.

jdjdj0202 commented on July 23, 2024

Thank you for your reply.
After receiving data access approval from EGA, I downloaded the RNA data.
It was in BAM file format, which I converted into a FASTQ file, then used STAR-Align followed by Arriba.
The dataset ID is EGAD00001006268 = EGAS00001004289.
Among the dataset, I tried EGAF00004171085 (file name: 01-15563_T1.rna.bam) and EGAF00004171086 (file name: 01-17618_T1.rna.bam) etc.

(If access to the restricted public data is difficult, Is it okay to explain the situation to the data owner and seek permission?)

Thanks a lot.

Sincerely,
DJ. J

from arriba.

suhrig commented on July 23, 2024

Apologies for the slow response - it has been a busy week.

I didn't realize it's restricted data from EGA. I doubt it will be possible to share the data with me. So we will need to resort to remote troubleshooting.

Can you run samtools flagstat on the problematic BAM file, please? Also, do you have the STAR Log.final.out file and could share it with me?

Looking at the status messages from Arriba, about 54 million reads were considered as chimeric. That's almost all reads. This could happen when read1 and read2 were accidentally swapped when given as input to STAR. Can you rule this out?

from arriba.

jdjdj0202 commented on July 23, 2024

I am very sorry for the late response due to personal reasons. I have tried running samtools flagstat and the outcome is as follows:

108111052 + 6633048 in total (QC-passed reads + QC-failed reads)
108111052 + 6633048 primary
0 + 0 secondary
0 + 0 supplementary
8903170 + 805612 duplicates
8903170 + 805612 primary duplicates
107740389 + 5690274 mapped (99.66% : 85.79%)
107740389 + 5690274 primary mapped (99.66% : 85.79%)
108111052 + 6633048 paired in sequencing
54055526 + 3316524 read1
54055526 + 3316524 read2
92916680 + 4440744 properly paired (85.95% : 66.95%)
107438656 + 5331394 with itself and mate mapped
301733 + 358880 singletons (0.28% : 5.41%)
10155368 + 676362 with mate mapped to a different chr
1489734 + 157793 with mate mapped to a different chr (mapQ>=5)

The STAR log.file.out could not be attached here, so I have attached it via email.

Thanks!

from arriba.

jdjdj0202 commented on July 23, 2024

I am very sorry for the late response due to personal reasons. I have tried running samtools flagstat and the outcome is as follows: 108111052 + 6633048 in total (QC-passed reads + QC-failed reads) 108111052 + 6633048 primary 0 + 0 secondary 0 + 0 supplementary 8903170 + 805612 duplicates 8903170 + 805612 primary duplicates 107740389 + 5690274 mapped (99.66% : 85.79%) 107740389 + 5690274 primary mapped (99.66% : 85.79%) 108111052 + 6633048 paired in sequencing 54055526 + 3316524 read1 54055526 + 3316524 read2 92916680 + 4440744 properly paired (85.95% : 66.95%) 107438656 + 5331394 with itself and mate mapped 301733 + 358880 singletons (0.28% : 5.41%) 10155368 + 676362 with mate mapped to a different chr 1489734 + 157793 with mate mapped to a different chr (mapQ>=5) The STAR log.file.out could not be attached here, so I have attached it via email. Thanks! Sincerely, DJ. J 2024년 2월 10일 (토) 오전 8:48, suhrig ***@***.***>님이 작성:

…

Apologies for the slow response - it has been a busy week. I didn't realize it's restricted data from EGA. I doubt it will be possible to share the data with me. So we will need to resort to remote troubleshooting. Can you run samtools flagstat on the problematic BAM file, please? Also, do you have the STAR Log.final.out file and could share it with me? Looking at the status messages from Arriba, about 54 million reads were considered as chimeric. That's almost all reads. This could happen when read1 and read2 were accidentally swapped when given as input to STAR. Can you rule this out? — Reply to this email directly, view it on GitHub <#229 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQKDW3WASPFXNIKRLSGK4ALYS2YT7AVCNFSM6AAAAABCV3XY76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMZWG42DOMRYGM> . You are receiving this because you authored the thread.Message ID: ***@***.***>

-- DAJEONG JEONG, M.D., Ph.D. Department of Laboratory Medicine Yonsei University College of Medicine, Severance Hospital 50-1, Yonsei-ro, Seodaemun-gu, Seoul, Republic of Korea (03722) Tel: 82-2-2228-2444 Fax: 82-2-2227-8353 Cel: 82-10-4901-9486

from arriba.

suhrig commented on July 23, 2024

Unfortunately, the log file was also dropped by mail. Can you just copy-paste the content as a reply to this thread?

The flagstats already look weird. The are no supplementary alignments. Can you show me the STAR command, please?

from arriba.

jdjdj0202 commented on July 23, 2024

Thank you for your reply.

The STAR_log.final.out file result is as follows:
Started job on | Jan 22 19:51:31
Started mapping on | Jan 22 19:51:44
Finished on | Jan 22 19:58:08
Mapping speed, Million of reads per hour | 537.86

                      Number of input reads |	57372050
                  Average input read length |	150
                                UNIQUE READS:
               Uniquely mapped reads number |	4053986
                    Uniquely mapped reads % |	7.07%
                      Average mapped length |	148.53
                   Number of splices: Total |	593449
        Number of splices: Annotated (sjdb) |	576252
                   Number of splices: GT/AG |	577965
                   Number of splices: GC/AG |	7210
                   Number of splices: AT/AC |	256
           Number of splices: Non-canonical |	8018
                  Mismatch rate per base, % |	0.51%
                     Deletion rate per base |	0.01%
                    Deletion average length |	1.33
                    Insertion rate per base |	0.03%
                   Insertion average length |	1.33
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |	3707714
         % of reads mapped to multiple loci |	6.46%
    Number of reads mapped to too many loci |	73616
         % of reads mapped to too many loci |	0.13%
                              UNMAPPED READS:

Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 49489970
% of reads unmapped: too short | 86.26%
Number of reads unmapped: other | 39041
% of reads unmapped: other | 0.07%
CHIMERIC READS:
Number of chimeric reads | 40096610
% of chimeric reads | 69.89%

The STAR command is as follows:

Convert BAM to FASTQ

/Users/dajeong/samtools-1.18/samtools fastq /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam -1 /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq.gz -2 /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq.gz

Unzip FASTQ files

gunzip -k /Users/dajeong/STAR-Fusion/data01-15563_T1_R1_fastq
gunzip -k /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq

Run STAR alignment

/Users/dajeong/STAR/bin/MacOSX_x86_64/STAR --runThreadN 5
--genomeDir /Users/dajeong/Arriba_DJ/STAR_index_hg38_GENCODE38
--readFilesIn /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq
--outFileNamePrefix /Users/dajeong/STAR-Fusion/output/01-15563_T1_
--chimSegmentMin 20
--chimOutType WithinBAM
--alignMatesGapMax 100000
--alignIntronMax 100000
--chimJunctionOverhangMin 20
--outSAMtype BAM Unsorted

Rename the output file

mv /Users/dajeong/STAR-Fusion/output/A37987_Aligned.out.bam /Users/dajeong/STAR-Fusion/output/A37987_STAR_Aligned.out.bam

Run Arriba for fusion detection

/Users/dajeong/arriba_v2.4.0/arriba -x /Users/dajeong/STAR-Fusion/output/01-15563_T1_STAR_Aligned.out.bam -g /Users/dajeong/Arriba_DJ/GENCODE38.gtf -a /Users/dajeong/Arriba_DJ/hg38.fa
-b /Users/dajeong/arriba_v2.4.0/database/blacklist_hg38_GRCh38_v2.4.0.tsv.gz -k /Users/dajeong/arriba_v2.4.0/database/known_fusions_hg38_GRCh38_v2.4.0.tsv.gz -p /Users/dajeong/arriba_v2.4.0/database/protein_domains_hg38_GRCh38_v2.4.0.gff3
-o /Users/dajeong/Arriba_DJ/output/01-15563_T1_fusions.tsv

Thanks :)

from arriba.

suhrig commented on July 23, 2024

The problem could be that you haven't collated the paired-end mates before converting to FastQ. The mates will be mixed up if you don't do this.

Would you please try the following command for the conversion?

samtools collate -f -O -u -r 1000000  /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam |
samtools fastq -0 other.fastq -1 read1.fastq -2 read2.fastq -s singletons.fastq

from arriba.

jdjdj0202 commented on July 23, 2024

Thank you for your reply.
I tried the command line for the conversion.
I think it gives similar output file with >20000 fusion results.
Also, arriba process took about 5 hours.
When I checked the STAR_log.final.out file, it is like follows:

Started job on | Feb 27 14:53:10
Started mapping on | Feb 27 14:53:31
Finished on | Feb 27 14:59:55
Mapping speed, Million of reads per hour | 537.86

                      Number of input reads |	57372050
                  Average input read length |	150
                                UNIQUE READS:
               Uniquely mapped reads number |	4053986
                    Uniquely mapped reads % |	7.07%
                      Average mapped length |	148.53
                   Number of splices: Total |	593449
        Number of splices: Annotated (sjdb) |	576252
                   Number of splices: GT/AG |	577965
                   Number of splices: GC/AG |	7210
                   Number of splices: AT/AC |	256
           Number of splices: Non-canonical |	8018
                  Mismatch rate per base, % |	0.51%
                     Deletion rate per base |	0.01%
                    Deletion average length |	1.33
                    Insertion rate per base |	0.03%
                   Insertion average length |	1.33
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |	3707714
         % of reads mapped to multiple loci |	6.46%
    Number of reads mapped to too many loci |	73616
         % of reads mapped to too many loci |	0.13%
                              UNMAPPED READS:

It looks the same as the previous result.
Is there anything more I can try?

Thanks. :)

from arriba.

suhrig commented on July 23, 2024

Clearly, something must be going wrong during the conversion from BAM to FastQ. It is not normal that 70% of the reads are chimeric. This number is typically well below 10%. Also, Arriba usually takes only a few minutes to run and the main output file should only be a few kb in size.

The flagstats you send me are from the original BAM file, not your realigned BAM file, right? Things still look okay in the original file, then. In this case, can you check the following:

Pick a random read pair from the original BAM file. Note down the sequences from both mates. Next, extract the sequences of the read pair with the same name from the converted FastQs, e.g., using grep -w -A1 NAME-OF-READ read1.fastq read2.fastq. Both sequences should be the same as in the original BAM file or the reverse complement. Can you confirm this with a couple of randomly selected read pairs?

from arriba.

suhrig commented on July 23, 2024

Hi, did you find some time to inspect the read sequences as explained in my previous post? Are my instructions clear enough? I'm pretty sure this has to do with the BAM file not being converted to FastQs properly.

from arriba.

jdjdj0202 commented on July 23, 2024

Hi, I am very sorry for the late reply. Thank you for suggesting various solutions to try.

First, the flagstats are indeed from the original BAM file. The command used at that time is as follows.

========================================================================
samtools collate -f -O -u -r 1000000 /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam |
samtools fastq -0 other.fastq -1 read1.fastq -2 read2.fastq -s singletons.fastq

Convert BAM to FASTQ

Unzip FASTQ files

gunzip -k /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq
gunzip -k /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq

Run STAR alignment

==============================================================================
Also, the results from proceeding according to the previous instruction are as follows.

##############################################################################
Trial 1.
samtools view /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam | gshuf -n 1 > random_read_second.txt

HS27_221:4:1213:7746:97787 83 chr7 139877304 60 75M = 139877168 -211 GCATAGCTTCTTGCTTTAAGCGTGATAGGTGCTCGATAAAGGCTCAAGAAATTGAACTGTCTACATTTCTCTTAC E1G@BE>1=<1CDFGGF>FBGGGGFF1DFC<0GGC<@1g>EGFF=:1C1GGF@;GEF@1>GGGGGGGGF1@BBB@ MD:Z:75 PG:Z:MarkDuplicates RG:Z:226820 NM:i:0 AS:i:75 XS:i:0

grep -w -A1 -h "HS27_221:4:1213:7746:97787" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq > extracted_sequence_read1_second.txt

@HS27_221:4:1213:7746:97787
GTAAGAGAAATGTAGACAGTTCAATTTCTTGAGCCTTTATCGAGCACCTATCACGCTTAAAGCAAGAAGCTATGC

grep -w -A1 -h "HS27_221:4:1213:7746:97787" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq > extracted_sequence_read2_second.txt

@HS27_221:4:1213:7746:97787
GGAGCAGGGAGGTGTACTCCTCGGATGGACGAGCGGGGAGGCAAGGCGTGTCTTTAAATAACAAGCAACCCTGCT

##############################################################################

Trial 2.

samtools view /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam | gshuf -n 1 > random_read_third.txt

HS27_221:4:1109:11102:98424 99 chr2 196163272 35 75M = 196163385 8093 CTTTAGATGTAAGTATATAGAAATTATTAAAGTTTTCCATTTTTATTGGAATTTGAGGAGTTGTAGTTAGTAGGC CCCCCEFGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGEGGGGGGGEFGGGGGGGG MD:Z:75 PG:Z:MarkDuplicates RG:Z:226820 NM:i:0 AS:i:75 XS:i:63

grep -w -A1 -h "HS27_221:4:1109:11102:98424" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq > extracted_sequence_read1_third.txt

@HS27_221:4:1109:11102:98424
CTTTAGATGTAAGTATATAGAAATTATTAAAGTTTTCCATTTTTATTGGAATTTGAGGAGTTGTAGTTAGTAGGC

grep -w -A1 -h "HS27_221:4:1109:11102:98424" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq > extracted_sequence_read2_third.txt

@HS27_221:4:1109:11102:98424
CAGTCCCGGGAGAACCTGCGGCGGCCGGAGCGGTAAAAATAAGTGACTAAAGAAGCAGACCTGGGAATCACCTAA

##############################################################################

Does this result suggest a problem in the file conversion process?
I would appreciate your feedback.

Thanks!

from arriba.

suhrig commented on July 23, 2024

Thanks for the additional details. This looks good so far. May I ask you for another piece of information? Can you please run the following command on the BAM file which you have created (i.e., the one after realigning the reads)? The command extracts the randomly selected reads:

samtools view /path/to/new_alignments.bam | grep -wP 'HS27_221:4:1213:7746:97787|HS27_221:4:1109:11102:98424'

Another thing you might want to try is to run Arriba on the original BAM file. Does it also crash here? If not, then this would confirm that the problem must arise during conversion to FastQ or realignment.

from arriba.

jdjdj0202 commented on July 23, 2024

Thank you for providing additional guidance.
Firstly, the result of executing the command you provided is as follows. I'm not sure what they mean.

samtools view /Users/dajeong/STAR-Fusion/output/01-15563_T1_STAR_Aligned.out.bam | grep -wE 'HS27_221:4:1213:7746:97787|HS27_221:4:1109:11102:98424'

(base) dajeong@DajeongcBookPro data % samtools view /Users/dajeong/STAR-Fusion/output/01-15563_T1_STAR_Aligned.out.bam | grep -wE 'HS27_221:4:1213:7746:97787|HS27_221:4:1109:11102:98424'
HS27_221:4:1109:11102:98424 97 chr2 196163272 255 75M = 196136014 0 CTTTAGATGTAAGTATATAGAAATTATTAAAGTTTTCCATTTTTATTGGAATTTGAGGAGTTGTAGTTAGTAGGC CCCCCEFGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGEGGGGGGGEFGGGGGGGG NH:i:1 HI:i:1 AS:i:73 nM:i:0 NM:i:0
HS27_221:4:1109:11102:98424 145 chr2 196136014 255 75M = 196163272 0 GTTATTCAAAAAGTTTTAAGTTATTATTAAAGAATTTGGAAACTTACCAAAATTTTGAGAAGTTAAAGGTCTAAA FGCGGGGDGGG1GGGFEGCGFC1GEGCCGGGEGGGGGGGGCEGGGGGGGGGFC1BF1>GEGGGGGGGGGGBBBBB NH:i:1 HI:i:1 AS:i:73 nM:i:0 NM:i:0
HS27_221:4:1213:7746:97787 81 chr7 139877304 255 75M = 138570982 0 GCATAGCTTCTTGCTTTAAGCGTGATAGGTGCTCGATAAAGGCTCAAGAAATTGAACTGTCTACATTTCTCTTAC E1G@BE>1=<1CDFGGF>FBGGGGFF1DFC<0GGC<@1g>EGFF=:1C1GGF@;GEF@1>GGGGGGGGF1@BBB@ NH:i:1 HI:i:1 AS:i:73 nM:i:0 NM:i:0
HS27_221:4:1213:7746:97787 161 chr7 138570982 255 22M2503N53M = 139877304 0 TAATCTTCCCTCTCTTCCGGATATTGACTGTTCAAGTACTATTATGCTGGACAATATTGTGAGGAAAGATACTAA BBBBBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG NH:i:1 HI:i:1 AS:i:74 nM:i:0 NM:i:0

Moreover, running Arriba with the original BAM file took about 5 minutes and yielded 45 low confidence fusion results.

If the problem arises from converting BAM to FastQ files, how could it be resolved?
Thanks!

from arriba.

suhrig commented on July 23, 2024

The output from the grep command confirms that your realigned BAM file does not have matching paired-end read sequences anymore. Read 2 is different from what it says in the FastQ file. The most likely explanation is that the order of the reads is different in fastq1 and fastq2, which is usually caused by improper conversion from BAM to FastQ.

You can try a different tool for the conversion, for example from the biobambam package:

conda install bioconda::biobambam
bamtofastq filename=/path/to/original_alignments.bam F=read1.fastq F2=read2.fastq S=/dev/null gz=0 collate=1

from arriba.

suhrig commented on July 23, 2024

Did you have more success with biobambam?

from arriba.

jdjdj0202 commented on July 23, 2024

I apologize for the late reply.
I am having difficulty using biobambam in a MacBook environment, so it might take more time.
I will leave an additional message once it is completed later.
Thank you so much.

from arriba.

Finding fusions and counting supporting reads zsh: killed about arriba HOT 17 OPEN

Comments (17)

Convert BAM to FASTQ

Unzip FASTQ files

Run STAR alignment

Rename the output file

Run Arriba for fusion detection

Convert BAM to FASTQ

Unzip FASTQ files

Run STAR alignment

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent