Comments (5)
HA, I had an internal bet with myself about whether this would be the first issue raised.
I am happy to support any feature that users are likely to want/need and that stays within the scope of the project. This definitely seems to tick both of those boxes.
I haven't actually worked much with Illumina data previously so I will list some of the thoughts I had around this and it would be great to have your (and anyone else's) thoughts:
- Suggesting that the user subsamples one of the read pair files to half the overall desired coverage, and then taking the read IDs in the subset and grabbing those out of the other read pair. (I would obviously provide a code snippet for how to do this).
- Having a
--paired-end
flag and ask for a single fastq that has the read pairs combined. One thing to think about would be whether I would need this file interleaved or not. I guess that would depend on the implementation I go with. Would interleave vs. non-interleave be a significant issue for users?
Please also throw-in any other ideas you have.
from rasusa.
Ideally you would take 4 parameters like --in1 --in2 --out1 --out2
because not enough tools support interleaved format.
But for elegance i suspect you don't want that, then you could just take interleaved PE (easily created with <(seqtk mergepe R1.fq.gz R2.fq.gz)
. The output is usually demerged with two steps (unfortunately) using seqtk seq -1 out.fq | gzip > R1.fq.gz
and then with -2
for R2
. I need to find a tool that can write both from an input stream to two files.
from rasusa.
Ok. I will have a think about how much to achieve this then. I am heading on holidays for 5 weeks this weekend though so it is unlikely I will get around to it before then.
Regarding demerging in one step, pyfastaq can do this.
So I could envision something like this in the future
rasusa -i interleaved.fq -g 4g -c 20 | fastaq deinterleave - r1.fq r2.fq
from rasusa.
Alright. I am finally revisiting this. Sorry for the massive delay. Nothing like COVID-19 to make you revisit all the stuff you should have done ages ago.
On reflection, the interleaved file method is not ideal. I think one of two ideas should work
- Allow
--input/-i
to be passed twice - Switch
--input
to be a positional parameter instead and allow passing up to two files
In both instances, if two input files are given, it will be assumed it is illumina data.
One issue I do foresee with option 2 is that it will break any pipelines that may be using rasusa so I would have to also support --input
as well. The more I type the more I like the idea of option 1.
How does that sound?
from rasusa.
@mbhall88 drowning in COVID work too, but this looks good. thank you
from rasusa.
Related Issues (20)
- Estimate genome size automatically HOT 8
- Relationship to filtlong HOT 2
- Support for subsampling alignment to uniform coverage HOT 5
- Finer control over coverage HOT 2
- Multi-threading approach implementation HOT 3
- Lower than expected coverage for illumina paired end reads HOT 9
- Suggestion: replace needletail by noodles and niffler HOT 1
- `HashSet` is slow HOT 3
- Compress output when requested HOT 5
- Input parameter for number of bases in addition to coverage and genome size HOT 6
- Calculate genome size from an index file HOT 1
- Add number and fraction options
- Docker image issue HOT 9
- Suggestion: Raise warning when desired level of coverage is not possible HOT 1
- Feature: Min/max coverage threshold HOT 2
- Multi-threading approach HOT 7
- Does rasusa outputs reads that still cover the entire original genome? HOT 1
- Random sampling based on bases for the metagenomic dataset HOT 1
- illumina read input error HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from rasusa.