Support for paired-end reads? (rasusa issue, closed)

mbhall88 commented on June 2, 2024
Support for paired-end reads?

from rasusa.

Comments (5)

mbhall88 commented on June 2, 2024

HA, I had an internal bet with myself about whether this would be the first issue raised.

I am happy to support any feature that users are likely to want/need and that stays within the scope of the project. This definitely seems to tick both of those boxes.

I haven't actually worked much with Illumina data previously, so I will list some of the thoughts I had around this. It would be great to hear yours (and anyone else's):

  • Suggesting that the user subsamples one of the read pair files to half the overall desired coverage, and then taking the read IDs in the subset and grabbing those out of the other read pair. (I would obviously provide a code snippet for how to do this).
  • Having a --paired-end flag and asking for a single fastq that has the read pairs combined. One thing to think about is whether this file would need to be interleaved. I guess that depends on the implementation I go with. Would interleaved vs. non-interleaved be a significant issue for users?
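
For the first idea, the "grab the mates" step could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names (read_fastq, read_id, extract_mates) are made up for this example, records are assumed to be plain 4-line FASTQ, and mates are assumed to share a read ID once any trailing /1 or /2 and any header comment are dropped.

```python
def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq, plus, qual = (handle.readline() for _ in range(3))
        yield tuple(line.rstrip("\n") for line in (header, seq, plus, qual))


def read_id(header):
    """Return the read ID without the leading '@', any comment, or a /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def extract_mates(subset_r1, full_r2, out_r2):
    """Write every R2 record whose ID appears in the subsampled R1 file.

    Returns the number of mates written. Output follows the order of
    full_r2, so the pairs stay in sync only if both files share the
    same read order.
    """
    wanted = {read_id(rec[0]) for rec in read_fastq(subset_r1)}
    written = 0
    for rec in read_fastq(full_r2):
        if read_id(rec[0]) in wanted:
            out_r2.write("\n".join(rec) + "\n")
            written += 1
    return written
```

The three arguments are open text handles, so gzipped files can be passed by opening them with gzip.open(path, "rt") and gzip.open(path, "wt").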

Please also throw in any other ideas you have.

tseemann commented on June 2, 2024

Ideally you would take 4 parameters like --in1 --in2 --out1 --out2 because not enough tools support interleaved format.

But for elegance I suspect you don't want that, in which case you could just take interleaved PE (easily created with <(seqtk mergepe R1.fq.gz R2.fq.gz)). The output is usually demerged in two steps (unfortunately): seqtk seq -1 out.fq | gzip > R1.fq.gz and then the same with -2 for R2. I need to find a tool that can write both from an input stream to two files.
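
To make the single-pass demerge concrete, here is a minimal sketch of splitting an interleaved stream into two files in one read. It is not taken from any existing tool: the function name deinterleave is hypothetical, and it assumes strict 4-line FASTQ records with R1 always followed directly by its R2 mate.

```python
from itertools import islice


def deinterleave(interleaved, out1, out2):
    """Split an interleaved FASTQ stream into two outputs in a single pass.

    Reads 8 lines at a time: the first 4 go to out1 (R1), the next 4 to
    out2 (R2). Returns the number of read pairs written.
    """
    pairs = 0
    while True:
        block = list(islice(interleaved, 8))
        if not block:
            return pairs  # clean end of input
        if len(block) != 8:
            raise ValueError("input ended in the middle of a read pair")
        out1.writelines(block[:4])
        out2.writelines(block[4:])
        pairs += 1
```

Because it only iterates over lines, the input can be sys.stdin at the end of a pipe, and the two outputs can be any writable text handles, including ones from gzip.open(path, "wt").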

mbhall88 commented on June 2, 2024

Ok. I will have a think about how best to achieve this then. I am heading on holidays for 5 weeks this weekend though, so it is unlikely I will get around to it before then.

Regarding demerging in one step, pyfastaq can do this.

So I could envision something like this in the future:

rasusa -i interleaved.fq -g 4g -c 20 | fastaq deinterleave - r1.fq r2.fq 

mbhall88 commented on June 2, 2024

Alright. I am finally revisiting this. Sorry for the massive delay. Nothing like COVID-19 to make you revisit all the stuff you should have done ages ago.

On reflection, the interleaved file method is not ideal. I think one of two ideas should work:

  1. Allow --input/-i to be passed twice
  2. Switch --input to be a positional parameter instead and allow passing up to two files

In both cases, if two input files are given, it will be assumed that they are paired Illumina data.
One issue I foresee with option 2 is that it would break any pipelines already using rasusa, so I would have to keep supporting --input as well. The more I type, the more I like the idea of option 1.
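
As a rough illustration of option 1, this is how a repeatable --input/-i flag behaves in a command-line parser. It is a Python argparse sketch for discussion only, not rasusa's actual argument handling; the program name subsample-sketch and the paired attribute are invented for the example.

```python
import argparse


def parse_args(argv):
    """Parse a repeatable -i/--input flag that accepts one or two files."""
    parser = argparse.ArgumentParser(prog="subsample-sketch")  # hypothetical name
    parser.add_argument(
        "-i",
        "--input",
        action="append",  # each use of the flag adds one path to a list
        required=True,
        metavar="FASTQ",
        help="input fastq; pass the flag twice for paired reads",
    )
    args = parser.parse_args(argv)
    if len(args.input) > 2:
        parser.error("--input may be given at most twice")
    # Two inputs are treated as a read pair, one input as single-end.
    args.paired = len(args.input) == 2
    return args
```

Existing single-file invocations keep working unchanged under this scheme, since passing the flag once still yields a one-element list.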

How does that sound?

tseemann commented on June 2, 2024

@mbhall88 drowning in COVID work too, but this looks good. thank you
