Comments (9)
When you say you get 240x instead of 250x, is that based on the rasusa logs or some other way of calculating this?
For paired reads, rasusa gathers all of the read lengths from the first fastq file (R1) and randomly selects reads until the total length of the sampled reads is half of the required total read length for the given genome size and coverage. It then outputs those sampled reads and their mates from R2. So there is an implicit assumption here that each mate has the same read length as its R1 read. If you're seeing 240x instead of 250x, that suggests your R2 might have had more bases trimmed than R1...
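To make the effect concrete, here is a minimal Python sketch of the R1-only strategy described above. It is an illustration, not rasusa's actual code (rasusa is written in Rust); the function name `subsample_by_r1`, its parameters, and the choice to stop as soon as the R1 total reaches half the target are all assumptions made for the example.

```python
import random


def subsample_by_r1(r1_lengths, r2_lengths, genome_size, coverage, seed=0):
    """Hypothetical sketch: pick read pairs using only the R1 lengths.

    Pairs are drawn in random order until the R1 bases alone reach half of
    the target (genome_size * coverage). The mates in R2 are then taken
    along without their lengths being counted.
    """
    half_target = genome_size * coverage / 2

    # Shuffle the pair indices so the selection is random but reproducible.
    indices = list(range(len(r1_lengths)))
    random.Random(seed).shuffle(indices)

    chosen = []
    r1_total = 0
    for i in indices:
        if r1_total >= half_target:
            break
        chosen.append(i)
        r1_total += r1_lengths[i]

    # Actual coverage counts the bases in both mates of every chosen pair.
    total_bases = r1_total + sum(r2_lengths[i] for i in chosen)
    return chosen, total_bases / genome_size


# Made-up data: 2000 pairs, R1 reads of 150 bp, R2 trimmed down to 138 bp.
r1 = [150] * 2000
r2 = [138] * 2000
pairs, achieved = subsample_by_r1(r1, r2, genome_size=1000, coverage=250)
print(f"{len(pairs)} pairs, {achieved:.1f}x")
```

With these made-up lengths the sketch selects 834 pairs and ends up at roughly 240x rather than the requested 250x, because the shorter R2 mates contribute fewer bases than the R1 reads they were assumed to match.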
from rasusa.
Hi,
Thank you for your response. It is typical for R2 to have lower quality than R1 and therefore one could expect R2 reads to be shorter after quality trimming. Given your explanation of how rasusa works, I understand the reason for the difference in coverage.
I calculated the 240x (instead of 250x) by taking the total number of base pairs after subsampling with rasusa and dividing by the genome size provided to rasusa. If I remember correctly, this matched the rasusa logs, but I don't have these at hand.
Thanks,
Ilya.
from rasusa.
I would suggest summing the read length of R1 and R2 and randomly selecting read pairs until the total length of those sampled reads matches the required total read length for the given genome size and coverage.
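A minimal Python sketch of this suggestion, for illustration only: the function name `subsample_by_pair_length` and its parameters are hypothetical, and stopping once the running total reaches the target is an assumption rather than a description of how rasusa implements it.

```python
import random


def subsample_by_pair_length(r1_lengths, r2_lengths, genome_size, coverage, seed=0):
    """Hypothetical sketch: pick read pairs using the summed R1 + R2 length.

    Pairs are drawn in random order until the combined bases of the chosen
    pairs reach the full target (genome_size * coverage).
    """
    target_bases = genome_size * coverage

    # Shuffle the pair indices so the selection is random but reproducible.
    indices = list(range(len(r1_lengths)))
    random.Random(seed).shuffle(indices)

    chosen = []
    total_bases = 0
    for i in indices:
        if total_bases >= target_bases:
            break
        chosen.append(i)
        # Count both mates, so a shorter R2 is accounted for directly.
        total_bases += r1_lengths[i] + r2_lengths[i]

    return chosen, total_bases / genome_size


# Made-up data: 2000 pairs, R1 reads of 150 bp, R2 trimmed down to 138 bp.
r1 = [150] * 2000
r2 = [138] * 2000
pairs, achieved = subsample_by_pair_length(r1, r2, genome_size=1000, coverage=250)
print(f"{len(pairs)} pairs, {achieved:.1f}x")
```

Because each pair's true length is counted, the achieved coverage in this example overshoots the 250x target by less than one read pair, even though R2 is shorter than R1.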
from rasusa.
Ok, that makes sense then.
> I would suggest summing the read length of R1 and R2 and randomly selecting read pairs until the total length of those sampled reads matches the required total read length for the given genome size and coverage.
This is definitely a more accurate solution than what I currently do. I'll look at changing the implementation to do this.
from rasusa.
Great. Thank you :)
from rasusa.
Hi,
Any idea when this fix will be implemented and distributed?
Thanks,
Ilya.
from rasusa.
Sorry @ilyavs, I am currently writing up my PhD thesis, so I don't have a huge amount of spare time. However, it is still high on my list of things to do, so hopefully I can get around to it soon.
from rasusa.
Hi @ilyavs, would you be able to do me a favour and test out b72405e to see if it resolves this? You can either build from source at that commit or, if you are using a container, use the following image:
# for docker
quay.io/mbhall88/rasusa:b72405e
# for singularity
docker://quay.io/mbhall88/rasusa:b72405e
from rasusa.
This should be sorted in version 0.4.0. Let me know if there are still problems.
from rasusa.