
High false positive rate · arriba · 4 comments · closed

suhrig commented on August 26, 2024
High false positive rate


Comments (4)

suhrig commented on August 26, 2024

Almost all of Arriba's predictions on the BEERS dataset are read-through fusions. These are transcripts which, under real-life conditions, arise when the RNA polymerase misses a termination signal and continues transcription beyond the end of the gene, creating a fusion-like transcript between the gene and a neighboring gene or a nearby intergenic region. The second mechanism by which these transcripts can be generated is focal deletions; a common example of this is the GOPC-ROS1 fusion. There is a section on read-through fusions in Arriba's manual.

The BEERS dataset was generated using RefSeq annotation plus a number of additional annotation files. Apart from normal transcripts annotated in RefSeq, it simulates (read-through) transcripts which are not seen in normal tissue and are not observed under real-life conditions (certainly not at the simulated expression levels). Hence, Arriba reports these as aberrant transcripts. This is sensible and even desirable, because in a cancer sample these transcripts would indicate aberrant transcription with a potentially oncogenic effect. The reason other tools do not report them is that they heavily penalize potential read-through transcripts. If you inspect the breakpoints of the transcripts closely, you will notice that almost all of them are only a few dozen kb apart. Most tools do not report fusion transcripts with breakpoints that close, which means they are blind to these aberrant transcripts. You can achieve the same effect with Arriba by increasing the minimum read-through distance, e.g., by passing the parameter -R 200000. I do not recommend this, though, because with real (i.e., non-simulated) sequencing data, Arriba's filters do a decent job of removing common/frequent/benign read-through fusions and there is no need to increase this parameter. Doing so runs the risk of missing fusions arising from focal deletions.
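To illustrate what such a distance cutoff does, here is a minimal Python sketch of a read-through filter. The call records and the GOPC-ROS1 coordinates are illustrative, not Arriba's actual output format; Arriba applies this logic internally via its -R parameter.

```python
# Sketch of a minimum read-through distance filter, mimicking the effect
# of raising Arriba's -R parameter. Records and coordinates are
# illustrative, not real tool output.

MIN_READ_THROUGH_DISTANCE = 200_000  # bp, as in `-R 200000`

def is_potential_read_through(call, min_distance=MIN_READ_THROUGH_DISTANCE):
    """A call looks like a read-through if both breakpoints lie on the
    same chromosome within `min_distance` of each other."""
    chrom1, pos1 = call["breakpoint1"]
    chrom2, pos2 = call["breakpoint2"]
    return chrom1 == chrom2 and abs(pos1 - pos2) < min_distance

calls = [
    # ~35 kb apart -> discarded as a potential read-through
    {"name": "GENE-A/GENE-B", "breakpoint1": ("chr6", 1_000_000),
     "breakpoint2": ("chr6", 1_035_000)},
    # ~240 kb apart -> kept (illustrative coordinates for a focal deletion)
    {"name": "GOPC/ROS1", "breakpoint1": ("chr6", 117_560_000),
     "breakpoint2": ("chr6", 117_320_000)},
]

kept = [c for c in calls if not is_potential_read_through(c)]
```

The sketch also shows the downside described above: any focal-deletion fusion whose breakpoints fall inside the enlarged window would be discarded along with the read-throughs.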

Generally speaking, simulated data are not well suited for benchmarking fusion tools. They do not reflect the artifacts inherent to real sequencing data, and they harbor artifacts that are not seen under real-life conditions (such as the read-through fusions in the BEERS dataset). This was also nicely demonstrated in the DREAM SMC-RNA Challenge, where most tools performed really well on simulated data (rounds 1-3) but suffered when real sequencing data were used (rounds 4-5). Have a look at the Leaderboards: the tools achieved high precision and recall on the simulated data, but the accuracy of all tools deteriorated substantially on real data. Arriba ranked only in the top third on the simulated data, but advanced to the best-performing method on real sequencing data. Instead of a simulated dataset, you should use a real sequencing sample from benign tissue to measure the false positive rate. Here are a few sources for RNA-Seq samples from benign tissue:


suhrig commented on August 26, 2024

Evaluating the sensitivity of fusion detection tools is even harder. Benchmarking data on gene fusions are scarce. There are only a handful of somewhat well-characterized samples available (MCF-7, BT-474, SK-BR-3, ...). Here are two suggestions on how to benchmark the recall rate of fusion detection tools beyond those samples:

  • merge simulated fusions (generated with FUSIM and art_illumina, for example) into a real RNA-Seq sample from benign tissue
  • call gene fusions from RNA-Seq data and structural variants from whole-genome sequencing data and look for correlating events

With the first approach, you can easily generate an arbitrary number of true positives while the background model remains realistic. The disadvantage is that the fusion transcripts are simulated and do not reflect some of the special circumstances observed in real-world data (fusions with intergenic regions are not simulated, for instance).
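The spike-in itself is just FASTQ concatenation. A minimal Python sketch, with hypothetical file names (with paired-end data, both mate files must be spiked consistently):

```python
# Sketch of the spike-in approach: append simulated fusion reads
# (e.g. generated with FUSIM and art_illumina) to a real benign-tissue
# RNA-Seq FASTQ. File names are hypothetical.
import shutil

def spike_in(background_fastq, simulated_fastq, out_fastq):
    """Concatenate the real background reads and the simulated fusion
    reads into one FASTQ. The simulated reads are the known true
    positives against which recall is later measured."""
    with open(out_fastq, "wb") as out:
        for path in (background_fastq, simulated_fastq):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)

def count_records(fastq_path):
    """FASTQ stores four lines per read."""
    with open(fastq_path) as fh:
        return sum(1 for _ in fh) // 4
```

The same byte-level concatenation works for gzipped FASTQ, since gzip streams remain valid when joined back to back.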

The second approach is 100% realistic, but depends on the accuracy of the structural variant caller used to detect SVs in the WGS data. On top of that, WGS data are scarce, too.
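The correlation step of the second approach can be sketched as follows. The tolerance window and all records are hypothetical; a generous window is needed because RNA-level breakpoints fall at exon boundaries while the underlying DNA breakpoints can lie anywhere in the flanking introns.

```python
# Sketch of correlating RNA-level fusion calls with DNA-level structural
# variants from WGS: a fusion counts as supported if some SV has both
# breakpoints within a tolerance window of the fusion breakpoints.
# Records and the window size are hypothetical.

TOLERANCE = 100_000  # bp, allowing for intronic DNA breakpoints

def breakpoints_match(bp_rna, bp_dna, tolerance=TOLERANCE):
    chrom_rna, pos_rna = bp_rna
    chrom_dna, pos_dna = bp_dna
    return chrom_rna == chrom_dna and abs(pos_rna - pos_dna) <= tolerance

def supported_fusions(fusion_calls, sv_calls):
    """Return the fusion calls with at least one correlating SV,
    checking both breakpoint orderings."""
    supported = []
    for fusion in fusion_calls:
        for sv in sv_calls:
            same = (breakpoints_match(fusion["bp1"], sv["bp1"]) and
                    breakpoints_match(fusion["bp2"], sv["bp2"]))
            flipped = (breakpoints_match(fusion["bp1"], sv["bp2"]) and
                       breakpoints_match(fusion["bp2"], sv["bp1"]))
            if same or flipped:
                supported.append(fusion)
                break
    return supported
```

The quality of this truth set is only as good as the SV caller, which is exactly the caveat noted above.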


Evansef commented on August 26, 2024

Thank you very much for your answer, it's very interesting! I understand better now.


suhrig commented on August 26, 2024

Hi Evansef, I'm closing this issue, since your question seems to be answered. Feel free to reopen if you still need help/advice on how to benchmark fusion tools properly. Kind regards, Sebastian

