Coder Social home page Coder Social logo

Comments (13)

suhrig avatar suhrig commented on July 23, 2024

Upfront some comments about your command:

  • I'm assuming you try to run Arriba on non-human data. Is that correct? What organism is it, if I may ask? Probably that is why you disable the blacklist and why your files have generic names. Generally speaking that's fine, I'm just curious so that I can better judge the situation. Just beware that you will have more false positives than usual if you don't use a blacklist. Also beware that Arriba only reports fusions if at least one breakpoint is near an annotated exon as specified by the GTF file. You need to disable the filter intronic to have it report fusions with both breakpoints in introns or intergenic regions. This might be relevant in case the gene annotation of your organism is incomplete or in case you expect many breakpoints in introns/intergenic regions.

  • Second, you should not pass the junction file to Arriba. The parameter -c takes alignments as input, i.e. Chimeric.out.sam. If you only have the junction file, then probably you forgot to specify --chimOutType. See this page for some explanation or have a look at the demo script for an example how to run STAR.

  • There should be a blank between fusions.discarded.tsv and -d.

  • The structural variant file passed via -d should not be in VCF format. It is expected to have a simple 4-column TSV format as described in the manual. Out of curiosity, what tool did you use to call the structural variants? I might be able to help you with format conversion.

  • The default value of -s is auto, so you don't necessarily need to specify it.

  • The value of 0.1 for the parameter -V might be to strict. It effectively means that reads of 100nt length are discarded, if they have more that two mismatches. This might be intentional in your case, but the default value permits a few more mismatches, which has proven to work better in practice in my experience.

Now to the actual error:

Your suspicion is probably spot-on that the parameter -i needs to be adapted. This parameter specifies which contigs are reported by Arriba in the main output file. Fusions involving other contigs are discarded. For every contig mentioned in the argument of -i, Arriba checks if it can find the sequence of this contig in the assembly (genome.fna). The default value are all human chromosomes. So if your organism has fewer chromosomes, the given error will be thrown. Here is what you need to do: Find the names of all contigs of your organism. Let's say 1, 2, and 3 for example. Then pass this list as an argument to -i, for example: -i "1 2 3".

Feel free to ask for more details if any of my explanations are not clear. BTW all arguments are documented in the manual, in case you have not found it yet.

from arriba.

lucascaue avatar lucascaue commented on July 23, 2024

Hi! Yes I am using a hybrid Human genome with provirus. The VCF file was made by bcftools mpileup command.
I followed your instructions, I changed the --ChimOutType to --chimOutType WithinBAM and suppressed the -V and -d commands.
Now, I can run arriba with no errors, but 99.9% of the mismatches were filtered in this step:

Filtering duplicates (remaining=102394)
Filtering mates which do not map to interesting contigs (remaining=1)
Estimating mate gap distributionWARNING: not enough chimeric reads to estimate mate gap distribution, using default values

I am pretty sure that are at least 50 chimeric alignments. Do you have any idea what should I do now? Maybe one of the filters of -f should be suppressed?

Thank you so much for your help!

from arriba.

suhrig avatar suhrig commented on July 23, 2024

I understand you are looking for sites of viral integration into the human genome? In that case, you must specify both the human and the viral chromosomes as an argument to -i. In addition, I recommend disabling the intronic filter, otherwise you will miss some events where the virus is integrated into intergenic / intronic loci.

When you used --chimOutType WithinBAM you don't need -c, because the chimeric alignments are stored in the main alignments file passed via -x. I guess you figured this out yourself; I just want to mention this in case you haven't.

from arriba.

suhrig avatar suhrig commented on July 23, 2024

Accidently hit the close issue button. Reopening...

from arriba.

suhrig avatar suhrig commented on July 23, 2024

The output of bcftools mpileup is of no use to Arriba. Mpileup calls SNVs and indels, not structural variants, so it makes sense that you removed this argument.

How many reads map to the human and to the viral chromosomes? Have you checked using samtools idxstats?

from arriba.

lucascaue avatar lucascaue commented on July 23, 2024

Hi! Yes I did that modifications you suggest! I realized that I have to declare the chromosomes with -i CONTIGS but just COMMA separeted list , when I declared something like -i 1 2 3 4 , arriba just recognize the first string and when I typed -i 1,2,3,4 it did recognize all 4 of them. Before this process I had to run STAR with the tag --chimOutType Junctions for some reason I can not use --chimOutType WithinBAM Junction. Then as I got the chimeric alignments file I can identify the best chimeric alignments and where are the breakpoints and annotate them to declare on -i CONTIGS. I Think right now I just have to play a little bit with the filter parameters.

I don't know how to understand the samtools idxstats

Thanks so much for your help suhrig!

from arriba.

suhrig avatar suhrig commented on July 23, 2024

when I declared something like -i 1 2 3 4 , arriba just recognize the first string

Other users have run into the same issue. It is actually not an error of Arriba, but invalid bash syntax. You would run into the same issue, if you tried to dump the content of a file using cat if the file name contains a space, e.g. cat foo bar will issue an error:

cat: foo: No such file or directory
cat: bar: No such file or directory

The proper BASH syntax is to put it in double quotes: cat "foo bar"

To prevent more users from bumping into this obstacle future versions of Arriba will issue a clear error message.

I had to run STAR with the tag --chimOutType Junctions for some reason I can not use --chimOutType WithinBAM Junction

Note that WithinBAM is mandatory, if you want to use Arriba for fusion detection. Specifying Junctions alone will not work. If your version of STAR does not permit specifying Junctions and WithinBAM at the same time, you need to either drop Junctions or add --chimMultimapNmax 0. See this issue from the developer of STAR.

BTW, my version of STAR (2.7.3a) permits the use of WithinBAM and Junctions at the same time. Maybe you made a typo. You write Junction once and Junctions another time.

from arriba.

lucascaue avatar lucascaue commented on July 23, 2024

Exactly! I saw that post before. Now I could do the Arriba Analysis, and then following the script I ran draw_fusions.R but I got this error:

Loading annotation
Loading protein domains
Drawing fusion #1: UBE2E2:tat(4610)
Drawing fusion #2: LOC105371031:tat
Error in if (bands[band, "giemsa"] != "acen") { :
missing value where TRUE/FALSE needed
Calls: drawIdeogram
Execution halted

Do you know What is this about? I did not find that on draws._fusion.R code

from arriba.

suhrig avatar suhrig commented on July 23, 2024

You need to add a line to the cytobands file (tab-separated):

tat	1	SIZE_OF_tat	tat	gneg

where SIZE_OF_tat is the size of the tat chromosome.

The script crashes, because it looks for information about Giemsa staining of the tat chromosome, but can't find any. I should probably make the script robust enough that it does not crash, when this information is missing. Good point, thanks for making me aware of this!

from arriba.

lucascaue avatar lucascaue commented on July 23, 2024

Hi, Thanks for your comments. Definitely I had the identification for my fusions, and now I am just having this problem when I am run draw_fusions.R

draw_fusions.R --annotation=genannotation_mod.gtf --fusions=S9fusions_mod.tsv --output=S9output.pdf --alignments=S9Aligned.sortedByCoord.out.bam --cytobands=cytobands_mod.tsv --proteinDomains=protein_domains_sorted.gff3 --minConfidenceForCircosPlot=low

Loading annotation
Loading protein domains
Drawing fusion #1: LPP:virtual-gene
Error in if (confidenceRank[f$confidence] >= confidenceRank[minConfidenceForCircosPlot] || :
missing value where TRUE/FALSE needed
Calls: drawCircos
Execution halted

Do you know what is happening here? I think this is the last problem, I could not find this error identification on Arriba code. Even when I omitted minConfidenceForCircosPlot=low I got this error

from arriba.

suhrig avatar suhrig commented on July 23, 2024

Did you modify the fusions file? My guess is the confidence column was shifted/accidentally edited and now contains an invalid value. It must contain either high, medium, or low. Any other value will trigger this error. Can you double-check that the fusions file is in correct format?

from arriba.

lucascaue avatar lucascaue commented on July 23, 2024

Hi. I checked that, it was made by a blank row in the final of the file. Now it is working perfectly! Thank you so much for your help!

from arriba.

suhrig avatar suhrig commented on July 23, 2024

A new version of Arriba has just been released. It includes fixes for the issues mentioned in this thread. Specifically, a clear error message is thrown, when a user forgets to wrap arguments with blanks in quotes. Moreover, a warning is issued when the junctions file is passed to Arriba. And the draw_fusions.R script no longer chokes when cytoband information is missing. I'm closing this. Feel free to reopen, if you still have issues.

from arriba.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.