
rasusa

License: MIT

Randomly subsample sequencing reads or alignments.

Hall, M. B., (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941, https://doi.org/10.21105/joss.03941


Motivation

I couldn't find a tool for subsampling fastq reads that met my requirements. All the strategies I could find fell short: they either took a fixed number or percentage of reads to subsample to or, if they did subsample to a coverage, they assumed all reads were the same length (i.e., Illumina). As I mostly work with long-read data, this posed a problem when I wanted to subsample a file to a certain coverage, as read lengths were never taken into account. rasusa addresses this shortcoming.

A workaround I had been using for a while was filtlong. It was simple enough: I just work out the number of bases I need to achieve a (theoretical) coverage for my sample. Say I have a fastq from an E. coli sample with 5 million reads and I want to subsample it to 50x coverage. I just multiply the expected size of the sample's genome, 4.6 million base pairs, by the coverage I want to get my target bases: 230 million base pairs. In filtlong, I can do the following

# target bases = genome size * desired coverage: 4,600,000 * 50 = 230,000,000
target=230000000
filtlong --target_bases "$target" reads.fq > reads.50x.fq

However, this is technically not the intended use of filtlong; it is a quality filtering tool. What you get in the end is a subset of the "highest scoring" reads at a (theoretical) coverage of 50x. Depending on your circumstances, this might be what you want. However, it biases you towards the best/longest reads in the dataset, which is not a fair representation of your dataset as a whole. There is also the possibility of favouring regions of the genome that produce longer/higher-quality reads. De Maio et al. even found that randomly subsampling nanopore reads gives better genome assemblies than filtering.

So, depending on your circumstances, an unbiased subsample of your reads might be what you need. And if this is the case, rasusa has you covered.

Install

tl;dr: precompiled binary

curl -sSL rasusa.mbh.sh | sh
# or with wget
wget -nv -O - rasusa.mbh.sh | sh

You can also pass options to the script like so

$ curl -sSL rasusa.mbh.sh | sh -s -- --help
install.sh [option]

Fetch and install the latest version of rasusa, if rasusa is already
installed it will be updated to the latest version.

Options
        -V, --verbose
                Enable verbose output for the installer

        -f, -y, --force, --yes
                Skip the confirmation prompt during installation

        -p, --platform
                Override the platform identified by the installer [default: apple-darwin]

        -b, --bin-dir
                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch
                Override the architecture identified by the installer [default: x86_64]

        -B, --base-url
                Override the base URL used for downloading releases [default: https://github.com/mbhall88/ssubmit/releases]

        -h, --help
                Display this help message

cargo


Prerequisite: rust toolchain (min. v1.74.1)

cargo install rasusa

conda


Prerequisite: conda (and bioconda channel correctly set up)

conda install rasusa

Thank you to Devon Ryan (@dpryan79) for help debugging the bioconda recipe.

Container

Docker images are hosted at quay.io. For versions 0.3.0 and earlier, the images were hosted on Docker Hub.

singularity

Prerequisite: singularity

URI="docker://quay.io/mbhall88/rasusa"
singularity exec "$URI" rasusa --help

The above will use the latest version. If you want to specify a version then use a tag (or commit) like so.

VERSION="0.8.0"
URI="docker://quay.io/mbhall88/rasusa:${VERSION}"

docker


Prerequisite: docker

docker pull quay.io/mbhall88/rasusa
docker run quay.io/mbhall88/rasusa --help

You can find all the available tags on the quay.io repository. Note: versions prior to 0.4.0 were housed on Docker Hub.

Build locally

Prerequisite: rust toolchain

git clone https://github.com/mbhall88/rasusa.git
cd rasusa
cargo build --release
target/release/rasusa --help
# if you want to check everything is working ok
cargo test --all

Usage

Basic usage - reads

Subsample fastq reads

rasusa reads --coverage 30 --genome-size 4.6mb in.fq

The above command will output the subsampled file to stdout.

Or, if you have paired Illumina reads

rasusa reads --coverage 30 --genome-size 4g -o out.r1.fq -o out.r2.fq r1.fq r2.fq

For more details on the above options, and additional options, see below.

Basic usage - alignments

Subsample alignments

rasusa aln --coverage 30 in.bam | samtools sort -o out.bam

This will subsample each position in the alignment to 30x coverage.

Required parameters

There are three required parameters for rasusa reads.

Input

This positional argument specifies the file(s) containing the reads or alignments you would like to subsample. The file(s) must be in valid fasta or fastq format for the reads command and can be compressed (with a tool such as gzip). For the aln command, the file must be a valid, indexed SAM/BAM file.
If two files are passed to reads, rasusa will assume they are paired-end reads.

Bash wizard tip 🧙: let globs do the work for you, e.g., r*.fq
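
For example (a sketch with hypothetical file names), a glob expands alphabetically, so R1 sorts before R2:

# equivalent to: rasusa reads -c 30 -g 4.6mb -o out.r1.fq -o out.r2.fq r1.fq r2.fq
rasusa reads -c 30 -g 4.6mb -o out.r1.fq -o out.r2.fq r*.fq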

Coverage

-c, --coverage

Not required if --bases is present for reads

This option is used to determine the minimum coverage to subsample the reads to. For the reads command, it can be specified as an integer (100), a decimal/float (100.7), or either of the previous suffixed with an 'x' (100x). For the aln command, it is an integer only.

Due to the method for determining how many bases are required to achieve the desired coverage in the reads command, the actual coverage in the end could be slightly higher than requested; for example, if the last included read is very long. The log messages will inform you of the actual final coverage.

For the aln command, the coverage is the minimum number of reads that should be present at each position in the alignment. If a position has fewer than the requested number of reads, all reads at that position will be included. In addition, there will be (small) regions with more than the requested number of reads - usually localised to where the alignment of a read ends. This is because when the alignment of a selected read ends, the next read is selected based on it spanning the end of the previous alignment. When selecting this next alignment, we prefer alignments whose start is closest to the end of the previous alignment, ensuring minimal overlap. See the screenshot from IGV below for a visual example.

IGV screenshot
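
If you want to sanity-check the resulting depth yourself, here is a minimal sketch (assuming samtools is installed; in.bam is a hypothetical indexed alignment):

rasusa aln --coverage 30 in.bam | samtools sort -o subsampled.bam
# mean depth across all positions; expect the requested coverage or slightly above
samtools depth -a subsampled.bam | awk '{ sum += $3 } END { print "mean depth:", sum / NR }'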

Genome size

-g, --genome-size

Not valid for aln

Not required if --bases is present for reads

The genome size of the input is also required. It is used to determine how many bases are necessary to achieve the desired coverage. This can, of course, be as precise or rough as you like.
Genome size can be passed in many ways: as a plain old integer (1600), or with a metric suffix (1.6kb). All metric suffixes can have an optional 'b' suffix and be lower, upper, or mixed case. So 'Kb', 'kb' and 'k' would all be inferred as 'kilo'. Valid metric suffixes include:

  • Base (b) - multiplies by 1
  • Kilo (k) - multiplies by 1,000
  • Mega (m) - multiplies by 1,000,000
  • Giga (g) - multiplies by 1,000,000,000
  • Tera (t) - multiplies by 1,000,000,000,000

Alternatively, a FASTA/Q index file can be given and the genome size will be set to the sum of all reference sequences in it.
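
For example, a sketch using an index generated with samtools (ref.fa and in.fq are hypothetical files):

samtools faidx ref.fa  # creates ref.fa.fai
rasusa reads --coverage 30 --genome-size ref.fa.fai in.fq > out.fq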

Optional parameters

Output

-o, --output

reads

NOTE: This parameter is required if passing paired Illumina data to reads.

By default, rasusa will output the subsampled file to stdout (if one file is given). If you would prefer to specify an output file path, then use this option.

Output for Illumina paired files must be specified using --output twice - -o out.r1.fq -o out.r2.fq

The ordering of the output files is assumed to be the same as the input.
Note: The output will always be in the same format as the input. You cannot pass fastq as input and ask for fasta as output.

rasusa reads will also attempt to automatically infer whether compression of the output file(s) is required. It does this by detecting any of the supported extensions:

  • .gz: will compress the output with gzip
  • .bz or .bz2: will compress the output with bzip2
  • .lzma: will compress the output with the xz LZMA algorithm
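
For instance (hypothetical file names), the .gz extension alone is enough to trigger gzip-compressed output:

rasusa reads -c 30 -g 4.6mb -o out.30x.fq.gz in.fq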

aln

For the aln command, the output file format will be the same as the input if writing to stdout, otherwise it will be inferred from the file extension.

Note: the output alignment will most likely not be sorted. You can use samtools sort to sort the output. e.g.,

rasusa aln -c 5 in.bam | samtools sort -o out.bam

Output compression/format

-O, --output-type

reads

Use this option to manually set the compression algorithm to use for the output file(s). It will override any format automatically detected from the output path.

Valid options are:

  • u: uncompressed
  • b: Bzip2
  • g: Gzip
  • l: Lzma
  • x: Xz (LZMA)
  • z: Zstd

aln

Use this option to manually set the output file format. By default, the same format as the input will be used, or the format will be guessed from the --output path extension if given. Valid options are:

  • b or bam: BAM
  • c or cram: CRAM
  • s or sam: SAM

Note: all values to this option are case insensitive.

Compression level

-l, --compress-level

Compression level to use if compressing the output. By default this is set to the default for the compression type being output.
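
As a sketch (hypothetical file names), forcing gzip output at gzip's maximum level of 9, regardless of the output extension:

rasusa reads -c 30 -g 4.6mb -O g -l 9 -o out.fq.gz in.fq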

Target number of bases

-b, --bases

reads only

Explicitly set the number of bases required in the subsample. This option takes the number in the same format as genome size.

Note: if this option is given, genome size and coverage are not required, and they are ignored if provided.
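
For example (in.fq is a hypothetical input), to take roughly the 230 million bases corresponding to the 50x E. coli target from the Motivation section:

rasusa reads --bases 230mb in.fq > reads.50x.fq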

Number of reads

-n, --num

reads only

Explicitly set the number of reads in the subsample. This option takes the number in the same format as genome size.

When providing paired reads as input, this option will sample this many total read pairs. For example, when passing -n 20 r1.fq r2.fq, the two output files will have 20 reads each, and the read ids will be the same in both.

Note: if this option is given, genome size and coverage are not required.
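
For example (hypothetical file names), taking the 20 read pairs described above:

# each output file will contain 20 reads, with matching read ids in both
rasusa reads -n 20 -o out.r1.fq -o out.r2.fq r1.fq r2.fq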

Fraction of reads

-f, --frac

reads only

Explicitly set the fraction of total reads in the subsample. The value given to this option can be a float or a percentage - i.e., -f 0.5 and -f 50 will both take half of the reads.

Note: if this option is given, genome size and coverage are not required.
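
For example (in.fq is a hypothetical input), these two commands are equivalent and both keep half of the reads:

rasusa reads -f 0.5 in.fq > half.fq
rasusa reads -f 50 in.fq > half.fq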

Random seed

-s, --seed

This option allows you to specify the random seed used by the random subsampler. By explicitly setting this parameter, you make the subsample for the input reproducible. The seed is an integer, and by default it is not set, meaning the operating system will seed the random subsampler. You should only pass this parameter if you are likely to want to subsample the same input file again in the future and want the same subset of reads.
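
For example (a sketch with a hypothetical input file), re-running with the same seed reproduces the subsample exactly:

rasusa reads -c 30 -g 4.6mb -s 88 in.fq > sub1.fq
rasusa reads -c 30 -g 4.6mb -s 88 in.fq > sub2.fq
# sub1.fq and sub2.fq should contain the same reads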

Verbosity

-v

Adding this optional flag will make the logging more verbose. By default, logging produces messages considered "info" or above. If verbosity is switched on, you will additionally get "debug"-level logging messages.
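
For example (a sketch with a hypothetical input file; rasusa writes its log messages to stderr, so they should not mix with the subsampled reads on stdout):

rasusa reads -v -c 30 -g 4.6mb in.fq > out.fq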

Full usage

$ rasusa --help
Randomly subsample reads or alignments

Usage: rasusa [OPTIONS] <COMMAND>

Commands:
  reads  Randomly subsample reads
  aln    Randomly subsample alignments to a specified depth of coverage
  cite   Get a bibtex formatted citation for this package
  help   Print this message or the help of the given subcommand(s)

Options:
  -v             Switch on verbosity
  -h, --help     Print help
  -V, --version  Print version

reads command

$ rasusa reads --help
Randomly subsample reads

Usage: rasusa reads [OPTIONS] <FILE(S)>...

Arguments:
  <FILE(S)>...
          The fast{a,q} file(s) to subsample.

          For paired Illumina, the order matters. i.e., R1 then R2.

Options:
  -o, --output <OUTPUT>
          Output filepath(s); stdout if not present.

          For paired Illumina pass this flag twice `-o o1.fq -o o2.fq`

          NOTE: The order of the pairs is assumed to be the same as the input - e.g., R1 then R2. This option is required for paired input.
          
  -g, --genome-size <size|faidx>
          Genome size to calculate coverage with respect to. e.g., 4.3kb, 7Tb, 9000, 4.1MB

          Alternatively, a FASTA/Q index file can be provided and the genome size will be set to the sum of all reference sequences.

          If --bases is not provided, this option and --coverage are required

  -c, --coverage <FLOAT>
          The desired depth of coverage to subsample the reads to

          If --bases is not provided, this option and --genome-size are required

  -b, --bases <bases>
          Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB

          If this option is given, --coverage and --genome-size are ignored

  -n, --num <INT>
          Subsample to a specific number of reads

          If paired-end reads are passed, this is the number of (matched) reads from EACH file. This option accepts the same format as genome size - e.g., 1k will take 1000 reads

  -f, --frac <FLOAT>
          Subsample to a fraction of the reads - e.g., 0.5 samples half the reads

          Values >1 and <=100 will be automatically converted - e.g., 25 => 0.25

  -s, --seed <INT>
          Random seed to use

  -v
          Switch on verbosity

  -O, --output-type <u|b|g|l|x|z>
          u: uncompressed; b: Bzip2; g: Gzip; l: Lzma; x: Xz (Lzma); z: Zstd

          Rasusa will attempt to infer the output compression format automatically from the filename extension. This option is used to override that. If writing to stdout, the default is uncompressed

  -l, --compress-level <1-21>
          Compression level to use if compressing output. Uses the default level for the format if not specified

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

aln command

$ rasusa aln --help
Randomly subsample alignments to a specified depth of coverage

Usage: rasusa aln [OPTIONS] --coverage <INT> <FILE>

Arguments:
  <FILE>
          Path to the indexed alignment file (SAM/BAM/CRAM) to subsample

Options:
  -o, --output <FILE>
          Path to the output subsampled alignment file. Defaults to stdout (same format as input)

          The output is not guaranteed to be sorted. We recommend piping the output to `samtools sort`

  -O, --output-type <FMT>
          Output format. Rasusa will attempt to infer the format from the output file extension if not provided

  -c, --coverage <INT>
          The desired depth of coverage to subsample the alignment to

  -s, --seed <INT>
          Random seed to use

      --step-size <INT>
          When a region has less than the desired coverage, the step size to move along the chromosome to find more reads.

          The lowest of the step and the minimum end coordinate of the reads in the region will be used. This parameter can have a significant impact on the runtime of the subsampling process.

          [default: 100]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Benchmark

“Time flies like an arrow; fruit flies like a banana.”
― Anthony G. Oettinger

The real question is: will rasusa just needlessly eat away at your precious time on earth?

To find out, I am going to use hyperfine.

The data I used comes from

Bainomugisa, Arnold, et al. "A complete high-quality MinION nanopore assembly of an extensively drug-resistant Mycobacterium tuberculosis Beijing lineage strain identifies novel variation in repetitive PE/PPE gene regions." Microbial genomics 4.7 (2018).

Note: these benchmarks are for reads only, as there is no other tool that replicates the functionality of aln.

Single long read input

Download and rename the fastq

URL="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR649/008/SRR6490088/SRR6490088_1.fastq.gz"
wget "$URL" -O - | gzip -d -c > tb.fq

The file size is 2.9G, and it has 379,547 reads.
We benchmark against filtlong, using the same strategy outlined in the Motivation section.

TB_GENOME_SIZE=4411532
COVG=50
TARGET_BASES=$(( TB_GENOME_SIZE * COVG ))
FILTLONG_CMD="filtlong --target_bases $TARGET_BASES tb.fq"
RASUSA_CMD="rasusa reads tb.fq -c $COVG -g $TB_GENOME_SIZE -s 1"
hyperfine --warmup 3 --runs 10 --export-markdown results-single.md \
     "$FILTLONG_CMD" "$RASUSA_CMD" 

Results

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| filtlong --target_bases 220576600 tb.fq | 21.685 ± 0.055 | 21.622 | 21.787 | 21.77 ± 0.29 |
| rasusa reads tb.fq -c 50 -g 4411532 -s 1 | 0.996 ± 0.013 | 0.983 | 1.023 | 1.00 |

Summary: rasusa ran 21.77 ± 0.29 times faster than filtlong.

Paired-end input

Download and then deinterleave the fastq with pyfastaq

URL="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR648/008/SRR6488968/SRR6488968.fastq.gz"
wget "$URL" -O - | gzip -d -c - | fastaq deinterleave - r1.fq r2.fq

Each file is 179M in size and contains 283,590 reads.
For this benchmark, we will use seqtk. We will also test seqtk's 2-pass mode as this is analogous to rasusa reads.

NUM_READS=140000
SEQTK_CMD_1="seqtk sample -s 1 r1.fq $NUM_READS > /tmp/r1.fq; seqtk sample -s 1 r2.fq $NUM_READS > /tmp/r2.fq;"
SEQTK_CMD_2="seqtk sample -2 -s 1 r1.fq $NUM_READS > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq $NUM_READS > /tmp/r2.fq;"
RASUSA_CMD="rasusa reads r1.fq r2.fq -n $NUM_READS -s 1 -o /tmp/r1.fq -o /tmp/r2.fq"
hyperfine --warmup 10 --runs 100 --export-markdown results-paired.md \
     "$SEQTK_CMD_1" "$SEQTK_CMD_2" "$RASUSA_CMD"

Results

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| seqtk sample -s 1 r1.fq 140000 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 140000 > /tmp/r2.fq | 907.7 ± 23.6 | 875.4 | 997.8 | 1.84 ± 0.62 |
| seqtk sample -2 -s 1 r1.fq 140000 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 140000 > /tmp/r2.fq | 870.8 ± 54.9 | 818.2 | 1219.8 | 1.77 ± 0.61 |
| rasusa reads r1.fq r2.fq -n 140000 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq | 492.2 ± 165.4 | 327.4 | 887.4 | 1.00 |

Summary: rasusa reads ran 1.84 times faster than seqtk (1-pass) and 1.77 times faster than seqtk (2-pass).

So, rasusa reads is faster than seqtk but doesn't require a fixed number of reads - allowing you to avoid doing maths to determine how many reads you need to downsample to a specific coverage. 🤓

Contributing

If you would like to help improve rasusa you are very welcome!

For changes to be accepted, they must pass the CI and coverage checks. These include:

  • Code is formatted with rustfmt. This can be done by running cargo fmt in the project directory.
  • There are no compiler errors/warnings. You can check this by running cargo clippy --all-features --all-targets -- -D warnings
  • Code coverage has not decreased. If you want to check coverage before pushing changes, I use kcov.

Citing

If you use rasusa in your research, it would be very much appreciated if you could cite it.


Hall, M. B., (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941, https://doi.org/10.21105/joss.03941

Bibtex

You can get the following citation by running rasusa cite

@article{Hall2022,
  doi = {10.21105/joss.03941},
  url = {https://doi.org/10.21105/joss.03941},
  year = {2022},
  publisher = {The Open Journal},
  volume = {7},
  number = {69},
  pages = {3941},
  author = {Michael B. Hall},
  title = {Rasusa: Randomly subsample sequencing reads to a specified coverage},
  journal = {Journal of Open Source Software}
}

rasusa's People

Contributors

dependabot[bot], mbhall88, natir


rasusa's Issues

Calculate target total length

This is the total number of bases we want to output in the subsampled sequence file. Equates to coverage * genome size

Add number and fraction options

This will allow rasusa to mirror seqtk functionality.

Number refers to a specific number of reads in the output; fraction is a given proportion of the read file (or possibly a separate option for a fraction of the total bases).

Multi-threading approach

Hi,
This is a very useful program, but it is taking a long time to subsample from a large fastq file. I am running it on a server and would like to run it using multi-threading, but I am a novice at programming and not sure how to do that. Any help please?
Thanks,

Iterate over input file and output required reads

Re-iterate over the input file and write reads to the output file if they are in our subsampled list.

This subsampled list should be something like a set or bitvector to allow for constant lookup time, as we will do this check for every read in the input file, which could be up to tens of millions.

add docker option

Add a recipe for Docker and also a deploy step in the CI to Docker Hub

Finer control over coverage

Hi,

Any float input to coverage is automatically rounded down to the nearest integer. We are testing in a coverage regime where we want to inspect the finer effects of coverage. Is it possible to have a coverage of exactly (or as close as possible to) what we input with -c?

Thanks!

Compress output when requested

Hey @natir.

Yes, I only just switched the parsing to needletail, so I probably won't get around to switching it again anytime soon. Also, compile time isn't a major concern for me, especially since I distribute pre-compiled binaries and provide a bunch of other installation methods that mean users don't need to compile the project. I'm happy to review a PR with updated benchmarks though.

Regarding niffler, you've made me realise that somewhere along the line I have lost the compressed-output functionality of this tool... Originally, rasusa would infer the desired output compression from the path. I'll have to fix that.

Originally posted by @mbhall88 in #25 (comment)

It might also be a good idea to add a flag that lets the user set the compression level.

Feature: Min/max coverage threshold

Hi, thank you for the great tool!

Sometimes, we need to subsample long reads (e.g HiFi) to retain only reads that are at least x times present in the reads set.
Let's say I need to keep only reads with at least 40x coverage. This is particularly crucial to automatically exclude all erroneous reads/k-mers with potential contaminations. Because errors and contaminations do not usually have deep coverage in the genome. That is why erroneous k-mers are always at the beginning of the k-mers histograms with a low depth, usually < 20. Having an option to subsample only reads with a min coverage would maybe exclude most erroneous reads and contaminations, hence improving the assembly for instance.

Error when using --output twice

hello,

I think there is a CLI argument-parsing bug in v1.0.0:

$ rasusa reads -n 10 -o sample1.fastq -o sample2.fastq read1.fastq read2.fastq

yields the following error:

Error: Bad combination of input and output files: Got more than 2 files for output.

n.b. specifying -o sample1.fastq sample2.fastq seems to work as expected

(Thank you for your work on rasusa!)

Generating only half of the target depth

I used rasusa to subsample nanopore reads (200 to 600 bp) by depth, but I noticed that the observed average depths are usually half of the target depth. I used samtools depth to get the mean depth. Would you have an idea as to why this happens?

#sample
for i in 19; do
    #depth
    for l in 5 10 20 30 100 400; do
        ./rasusa -i $datadir/barcode${i}_duplex.fastq.gz \
            --coverage ${l} --genome-size 243724 -s 100 \
            -o $outdir/BC${i}_${l}x.fq.gz

        ./rasusa -i $datadir/BC${i}_duplex_CMVmapped.fastq.gz \
            --coverage ${l} --genome-size 243724 -s 100 \
            -o $outdir/BC${i}_${l}x_mapped.fq.gz
    done
done

| Target depth | Observed mean depth |
|---:|---:|
| 5 | 2.92247 |
| 10 | 5.05997 |
| 20 | 9.52989 |
| 30 | 14.0369 |
| 100 | 46.1916 |
| 400 | 183.38 |

Appreciate your help.

Relationship to filtlong

FYI - a comment from a colleague:

You can still use filtlong with the below settings to 
focus on quality only and to more or less ignore length in the scoring metric

--min_length 500
--mean_q_weight 10
--length_weight 1
--target_bases $((DEPTH * GENOMESIZE))

Support for subsampling alignment to uniform coverage

Hey,

Great tool.

Are there any plans to support Bam files (which would then ideally output a downsampled bam file)?

At the moment, if I want to do this, I would need to:

  1. convert BAM to fasta (using samtools fasta -F 4)
  2. downsample with Rasusa
  3. use the read ids in the downsampled fasta to filter my BAM

As such, it would be a lot easier if rasusa could support BAM files :)

`HashSet` is slow

Hello,

While reading the rasusa code I spotted a possible speedup, but before making a pull request I would prefer to discuss it, because we have multiple choices with different drawbacks.

Rasusa uses a HashSet to store the indices of selected reads; we could find a faster solution.

By default, the Rust hashing algorithm is resistant against HashDoS attacks. This could be useful in some situations, but I don't think it is for rasusa. This resistance has a cost in computation time; the standard HashSet could be replaced with rustc_hash or fxhash, crates that provide faster drop-in replacement Set structs.

In fact, we could replace the HashSet with a Vec<bool>, with true or false indicating whether the given index is selected. This has a higher memory cost (1 byte per index), but I'm pretty sure it is faster. To reduce memory usage to 1 bit per index, we could use the bitvec crate.

What is your opinion on this issue and the possible solutions? If you want, I can write benchmarks to compare them (I like benchmarks).

Best

Docker image issue

It appears that there is an issue with the docker container. When I run the command set from the README, I get:

docker run quay.io/mbhall88/rasusa:joss_paper
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "/bin/bash": stat /bin/bash: no such file or directory: unknown.
ERRO[0008] error waiting for container: context canceled

Random sampling based on bases for the metagenomic dataset

Dear Developers,

I am working on a comparative study for which we are using short-read and long-read metagenomic samples. We want to subsample our long-read dataset based on bases. For example, if our randomly subsampled short-read metagenomic sample has 10,000 reads making up 1,500,000 bases in total, we want to subsample our long-read dataset to 1,500,000 bases. For the short reads, we used reformat.sh (to randomly subsample reads).

I searched for tools for subsampling long reads to a required number of bases, but none of them use random sampling; I might have missed some tools that can do it.

My question is: would it be possible to do this with rasusa for a metagenomic dataset?

Thanks,
Anupam

Error: unable to gather read lengths for the first input file

Hi. I'm trying to subsample by depth and I'm getting this error:

./rasusa -i barcode05_duplex.fastq.gz --coverage 400 --genome-size 243724 -s 100 -o BC05_400x.fq.gz
[2023-10-02][13:20:16][rasusa][INFO] Target number of bases to subsample to is: 97489600
[2023-10-02][13:20:16][rasusa][INFO] Gathering read lengths...
Error: unable to gather read lengths for the first input file

Caused by:
0: Failed to parse record
1: Sequence length is 373 but quality length is 120 (record '5ad0b9e9-94a7-477c-b47f-0963e639d159' at line 1544857)

Thank you for your help.

Does rasusa output reads that still cover the entire original genome?

Hi,
When using rasusa for downsampling, do we get reads that still cover the whole genome but at a reduced depth, or is it possible that the reads are only sampled from certain regions (i.e., there is no guarantee that the original genome is well represented in the output)?
Thanks

illumina read input error

Hello, I'm trying to subsample Illumina paired-end data but keep getting an error. I also only get one of the output files (sub_RHB02_R1_001.fastq), and it is empty (0 bytes). The rasusa version used is 0.6.1, and the command used, along with the error, is below. I'm not sure what exactly the issue is or how to proceed.

$ rasusa -i RHB02_R1_001.fastq.gz RHB02_R2_001.fastq.gz -b 651412654 -s 1 -o sub_RHB02_R1_001.fastq -o sub_RHB02_R2_001.fastq
[2022-11-16][13:15:02][rasusa][INFO] Two input files given. Assuming paired Illumina...
[2022-11-16][13:15:02][rasusa][INFO] Target number of bases to subsample to is: 651412654
[2022-11-16][13:15:02][rasusa][INFO] Gathering read lengths...
[2022-11-16][13:16:51][rasusa][INFO] Gathering read lengths for second input file...
[2022-11-16][13:18:19][rasusa][ERROR] First input has 92613254 reads, but the second has 92613254 reads. Paired Illumina files are assumed to have the same number of reads. The results of this subsample may not be as expected now.

Lower than expected coverage for illumina paired end reads

Hello,
I am using rasusa extensively in my Illumina paired-end pipelines. I noticed that I sometimes get significantly lower coverage than the value I gave as a parameter. For example, I subsample to 250x and get 240x. This is somewhat of a problem for my applications.
From an anecdotal check, I see that this might have something to do with poor-quality reads that I am quality-trimming to minimum length with cutadapt. This results in a highly skewed read-length distribution. Here is an example:

[attached: read-length distribution plot]

Can this issue be resolved within the scope of this project?
Thanks,
Ilya.

won't detect input file

I'm trying to use rasusa to downsample some Illumina short reads to a smaller coverage using this command here:

docker run quay.io/mbhall88/rasusa -i P2B_R1.fastq.gz P2B_R2.fastq.gz --coverage 5 --genome-size 12m -o P2B_5x_r1.fq.gz P2B_5x_r2.fq.gz

However, I keep getting this error message:

error: Invalid value "P2B_R1.fastq.gz" for '--input <INPUT>...': "P2B_R1.fastq.gz" does not exist

I installed the latest version of rasusa through Docker, following the commands listed on the GitHub page. I used a path at first to point to the input files, but it gave me that error message, so I moved the files to my home directory, where I run docker images, but that didn't work either. If someone could help me figure out how to fix this error, that would be great!

Multi-threading approach implementation

Hi @mbhall88,

Thanks for working on {rasusa}, it's been very helpful. I was playing with it to generate multiple subsamples from one fq file and wondered if there is a way to implement a multi-threading approach for working locally or on a cluster. Maybe using {parallel}?

Cheers,
Camilo.

Estimate genome size automatically

In my assembler shovill, I estimate the genome size from k-mer frequencies and use that to subsample the reads to a fixed coverage (100x), much like rasusa does:

https://github.com/tseemann/shovill/blob/master/bin/shovill#L145-L175

Would you consider adding --genome-size auto to rasusa?

For nanopore data one would want k <= 15, and for Illumina higher is better, say 24-32.
For Illumina, the number of k-mers with freq > 2 is normally a good estimate.
Not sure about nanopore.

Suggestion: replace needletail by noodles and niffler

Hello, very nice work.

Needletail is a very nice crate, but if I'm not mistaken you only use it for fastx parsing and none of its other functionality.

Noodles is a crate providing many bioinformatics parsers, with a feature system so you pull in only what you need. By switching to noodles you could reduce the number of rasusa's dependencies and speed up compilation time. I haven't done a full benchmark, but noodles and needletail perform almost the same.

If you want to keep similar functionality (compression support), you would also need to add niffler, which provides simple and transparent support for compressed files.

Again, very nice work.
