
High false positive rate · arriba · 4 comments · closed

suhrig commented on August 26, 2024
High false positive rate


Comments (4)

suhrig commented on August 26, 2024

Almost all of Arriba's predictions on the BEERS dataset are read-through fusions. These are transcripts which, under real-life conditions, arise when the RNA polymerase misses a termination signal and continues transcription beyond the end of the gene, creating a fusion-like transcript between the gene and a neighboring gene or a nearby intergenic region. The second mechanism by which these transcripts can be generated is focal deletions; a common example of this is the GOPC-ROS1 fusion. There is a section on read-through fusions in Arriba's manual.

The BEERS dataset was generated using RefSeq annotation plus a number of additional annotation files. Apart from normal transcripts annotated in RefSeq, it simulates (read-through) transcripts which are not seen in normal tissue and are not observed under real-life conditions (certainly not at the simulated expression levels). Hence, Arriba reports these as aberrant transcripts. This is sensible and even desirable, because in a cancer sample these transcripts would indicate aberrant transcription with a potentially oncogenic effect. The reason other tools do not report them is that they heavily penalize potential read-through transcripts. If you inspect the breakpoints of the transcripts closely, you will notice that almost all of them are only a few dozen kb apart. Most tools do not report fusion transcripts with breakpoints that close, which means they are blind to these aberrant transcripts. You can achieve the same effect with Arriba by increasing the minimum read-through distance, e.g., by passing the parameter -R 200000. I do not recommend this, though, because with real (i.e., non-simulated) sequencing data, Arriba's filters do a decent job of removing common/frequent/benign read-through fusions and there is no need to increase this parameter. Doing so runs the risk of missing fusions arising from focal deletions.
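To illustrate what such a distance cutoff does, here is a minimal Python sketch of a read-through filter. The call records and the GOPC-ROS1 coordinates are illustrative, not Arriba's actual output format; Arriba applies this logic internally via its -R parameter.

```python
# Sketch of a minimum read-through distance filter, mimicking the effect
# of raising Arriba's -R parameter. Records and coordinates are
# illustrative, not real tool output.

MIN_READ_THROUGH_DISTANCE = 200_000  # bp, as in `-R 200000`

def is_potential_read_through(call, min_distance=MIN_READ_THROUGH_DISTANCE):
    """A call looks like a read-through if both breakpoints lie on the
    same chromosome within `min_distance` of each other."""
    chrom1, pos1 = call["breakpoint1"]
    chrom2, pos2 = call["breakpoint2"]
    return chrom1 == chrom2 and abs(pos1 - pos2) < min_distance

calls = [
    # ~35 kb apart -> discarded as a potential read-through
    {"name": "GENE-A/GENE-B", "breakpoint1": ("chr6", 1_000_000),
     "breakpoint2": ("chr6", 1_035_000)},
    # ~240 kb apart -> kept (illustrative coordinates for a focal deletion)
    {"name": "GOPC/ROS1", "breakpoint1": ("chr6", 117_560_000),
     "breakpoint2": ("chr6", 117_320_000)},
]

kept = [c for c in calls if not is_potential_read_through(c)]
```

The sketch also shows the downside described above: any focal-deletion fusion whose breakpoints fall inside the enlarged window would be discarded along with the read-throughs.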

Generally speaking, simulated data are not well suited for benchmarking fusion tools. They do not reflect the artifacts inherent to real sequencing data, and they harbor artifacts that are not seen under real-life conditions (such as the read-through fusions in the BEERS dataset). This was also nicely demonstrated in the DREAM SMC-RNA Challenge, where most tools performed really well on simulated data (rounds 1-3) but suffered when real sequencing data were used (rounds 4-5). Have a look at the Leaderboards: the tools achieved high precision and recall on the simulated data, but the accuracy of all tools deteriorated substantially on real data. Arriba ranked only in the top third on the simulated data, but advanced to the best-performing method on real sequencing data. Instead of a simulated dataset, you should use a real sequencing sample from benign tissue to measure the false positive rate. Here are a few sources for RNA-Seq samples from benign tissue:


suhrig commented on August 26, 2024

Evaluating the sensitivity of fusion detection tools is even harder. Benchmarking data on gene fusions are scarce. There are only a handful of somewhat well-characterized samples available (MCF-7, BT-474, SK-BR-3, ...). Here are two suggestions on how to benchmark the recall rate of fusion detection tools beyond those samples:

  • merge simulated fusions (generated with FUSIM and art_illumina, for example) into a real RNA-Seq sample from benign tissue
  • call gene fusions from RNA-Seq data and structural variants from whole-genome sequencing data and look for correlating events

With the first approach, you can easily generate an arbitrary number of true positives while the background model remains realistic. The disadvantage is that the fusion transcripts are simulated and do not reflect some of the special circumstances observed in real-world data (fusions with intergenic regions are not simulated, for instance).
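The spike-in itself is just FASTQ concatenation. A minimal Python sketch, with hypothetical file names (with paired-end data, both mate files must be spiked consistently):

```python
# Sketch of the spike-in approach: append simulated fusion reads
# (e.g. generated with FUSIM and art_illumina) to a real benign-tissue
# RNA-Seq FASTQ. File names are hypothetical.
import shutil

def spike_in(background_fastq, simulated_fastq, out_fastq):
    """Concatenate the real background reads and the simulated fusion
    reads into one FASTQ. The simulated reads are the known true
    positives against which recall is later measured."""
    with open(out_fastq, "wb") as out:
        for path in (background_fastq, simulated_fastq):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)

def count_records(fastq_path):
    """FASTQ stores four lines per read."""
    with open(fastq_path) as fh:
        return sum(1 for _ in fh) // 4
```

The same byte-level concatenation works for gzipped FASTQ, since gzip streams remain valid when joined back to back.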

The second approach is 100% realistic, but depends on the accuracy of the structural variant caller used to detect SVs in the WGS data. On top of that, WGS data are scarce, too.
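The correlation step of the second approach can be sketched as follows. The tolerance window and all records are hypothetical; a generous window is needed because RNA-level breakpoints fall at exon boundaries while the underlying DNA breakpoints can lie anywhere in the flanking introns.

```python
# Sketch of correlating RNA-level fusion calls with DNA-level structural
# variants from WGS: a fusion counts as supported if some SV has both
# breakpoints within a tolerance window of the fusion breakpoints.
# Records and the window size are hypothetical.

TOLERANCE = 100_000  # bp, allowing for intronic DNA breakpoints

def breakpoints_match(bp_rna, bp_dna, tolerance=TOLERANCE):
    chrom_rna, pos_rna = bp_rna
    chrom_dna, pos_dna = bp_dna
    return chrom_rna == chrom_dna and abs(pos_rna - pos_dna) <= tolerance

def supported_fusions(fusion_calls, sv_calls):
    """Return the fusion calls with at least one correlating SV,
    checking both breakpoint orderings."""
    supported = []
    for fusion in fusion_calls:
        for sv in sv_calls:
            same = (breakpoints_match(fusion["bp1"], sv["bp1"]) and
                    breakpoints_match(fusion["bp2"], sv["bp2"]))
            flipped = (breakpoints_match(fusion["bp1"], sv["bp2"]) and
                       breakpoints_match(fusion["bp2"], sv["bp1"]))
            if same or flipped:
                supported.append(fusion)
                break
    return supported
```

The quality of this truth set is only as good as the SV caller, which is exactly the caveat noted above.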


Evansef commented on August 26, 2024

Thank you very much for your answer, it's very interesting! I understand better now.


suhrig commented on August 26, 2024

Hi Evansef, I'm closing this issue, since your question seems to be answered. Feel free to reopen if you still need help/advice on how to benchmark fusion tools properly. Kind regards, Sebastian

