stjude-rust-labs / fq
Command line utility for manipulating Illumina-generated FASTQ files.
License: MIT License
I'd love to write a MultiQC module for this, but in order to facilitate this, would it be possible to write outputs in a JSON format for easier parsing?
e.g. FASTP does this, so we can directly load the metrics to show them in a MultiQC report. I'm working on a small validator workflow that includes fq_lint and it works nicely, but making the metrics directly accessible in a larger table would be really cool.
https://github.com/MultiQC/MultiQC/blob/main/multiqc/modules/fastp/fastp.py as an example of what the JSON could look like.
https://github.com/MultiQC/test-data/tree/main/data/modules/fastp
I'm not too familiar with Rust, otherwise I'd have opened a PR already ;-)
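Until structured output exists, a wrapper could turn text diagnostics into JSON. The sketch below assumes a diagnostic line shaped like `path:line:col: [CODE] Validator: message`; the JSON field names are made up for illustration and are not an fq or MultiQC schema.

```python
import json
import re

# Assumed diagnostic shape: "<path>:<line>:<col>: [<CODE>] <Validator>: <message>".
# The JSON field names below are hypothetical, not an fq or MultiQC schema.
DIAGNOSTIC = re.compile(
    r"^(?P<path>.+?):(?P<line>\d+):(?P<col>\d+): "
    r"\[(?P<code>[A-Z]\d{3})\] (?P<validator>\w+): (?P<message>.*)$"
)


def diagnostic_to_json(text):
    """Return a JSON string for one diagnostic line, or None if it does not match."""
    match = DIAGNOSTIC.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Line and column are numeric positions, so store them as integers.
    fields["line"] = int(fields["line"])
    fields["col"] = int(fields["col"])
    return json.dumps(fields, sort_keys=True)
```

Lines that are not diagnostics (for example timestamped INFO messages) return None and can simply be skipped by the caller.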
Hi,
It seems fq lint fails on large FASTQ files with the error below:
fq lint XXX_1.merged.fastq.gz XXX_2.merged.fastq.gz > XXX.fq_lint.log.txt
Command error:
Error: I/O error
Caused by:
corrupt deflate stream
Each file is larger than 50 GB.
See fastq file format:
https://en.wikipedia.org/wiki/FASTQ_format
INFO : Single Read Validators: 4
DEBUG: - S001 => PluslineValidator
DEBUG: - S002 => AlphabetValidator
DEBUG: - S003 => ReadnameValidator
DEBUG: - S004 => CompleteReadValidator
INFO : Paired Read Validators: 1
DEBUG: - P001 => PairedReadnameValidator
INFO : Started reading from files...
Traceback (most recent call last):
File "./bin/fqlint", line 91, in <module>
for readno, (read_r1, read_r2) in enumerate(pair, 1):
File "fqlib/fastq.pyx", line 245, in fqlib.fastq.PairedFastQFiles.__next__
result = self.next_readpair()
File "fqlib/fastq.pyx", line 271, in fqlib.fastq.PairedFastQFiles.next_readpair
raise PairedReadValidationError(
fqlib.error.PairedReadValidationError: Read Pair Number: 0
Read 1
- File: R1.fastq
- Line Number: 0
- Readname: b'@K00202:217:HTLHYBBXX:2:1101:2169:998 1:N:0:NTCATTCC'
Read 2
- File: R2.fastq
- Line Number: 0
- Readname: b'@K00202:217:HTLHYBBXX:2:1101:2169:998 2:N:0:NTCATTCC'
---
Read names do not match.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./bin/fqlint", line 96, in <module>
break
TypeError: __exit__() takes exactly one argument (4 given)
Hi,
I'm wondering what you think about adding an option to subsample a FastQ file to a specific number of reads? This comes in handy when subsampling many FastQ files to a lowest common number of reads, for example. seqtk offers this option.
The reason I'm asking here is that the memory consumption of seqtk can be quite high, and I was wondering whether fq has a smaller footprint.
Thank you for your input.
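For context on the memory question, one way to draw an exact number of records without holding any reads in memory is a two-pass scheme: count the records, choose which indices to keep, then stream the file again. The sketch below illustrates that idea only; it is not a description of how fq or seqtk work internally, and the function names are hypothetical.

```python
import random
from itertools import islice


def read_records(lines):
    """Yield FASTQ records as 4-line tuples from an iterable of lines."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if len(record) < 4:
            return
        yield record


def subsample_exact(open_source, n, seed):
    """Yield exactly min(n, total) records, chosen uniformly at random.

    open_source is a zero-argument callable that returns a fresh
    iterable of lines each time it is called (one call per pass).
    """
    # Pass 1: count records without keeping them.
    total = sum(1 for _ in read_records(open_source()))
    # Choose the record indices to keep; memory grows with n, not with read length.
    rng = random.Random(seed)
    keep = set(rng.sample(range(total), min(n, total)))
    # Pass 2: stream again and emit only the chosen records.
    for index, record in enumerate(read_records(open_source())):
        if index in keep:
            yield record
```

For paired files, the same set of indices would be applied to both inputs so that mates stay together.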
Using the version installed via cargo install --git https://github.com/stjude/fqlib.git --tag v0.7.0, I cannot get the program to report its version. Running fq lint -V, I only get fq-lint as output, with no version info. I'm on macOS Big Sur 11.6.
-r, --record-count <RECORD_COUNT>
should be -n, --record-count <RECORD_COUNT>, right?
Elsewhere in the README, -n is mentioned.
Hi again
I was wondering if a conversion utility from FASTQ to (unaligned) BAM/SAM/CRAM is something you had thought about and is in scope for this project? The alternative tools that I know for doing this (picard and fgbio) are both big collections with huge dependencies so it would be nice to have a slim Rust-based tool for doing so. (I realize that this request might make fq less slim and add more dependencies.)
Thanks for your thoughts.
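To make the request concrete, here is a rough sketch of what the conversion amounts to for SAM text output. It assumes unpaired reads (SAM flag 4, "segment unmapped") and drops any comment after the read name; the function name is hypothetical and header lines, pairing flags, and BAM/CRAM encoding are left out.

```python
def fastq_to_unaligned_sam(record, flag=4):
    """Format one FASTQ record (name, seq, plus, qual) as an unmapped SAM line."""
    name, seq, _, qual = (field.rstrip("\r\n") for field in record)
    # Drop the leading '@' and any comment after the first whitespace.
    qname = name[1:].split()[0]
    # Columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
    columns = [qname, str(flag), "*", "0", "0", "*", "*", "0", "0", seq, qual]
    return "\t".join(columns)
```

For paired reads, the flags conventionally used for unaligned mates are 77 for the first read and 141 for the second.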
I would love to be able to filter based on reads e.g. keep only reads starting with "GC".
Would be cool to be able to do this by regex.
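As an illustration of the requested behaviour (not an existing fq feature), a regex filter over the sequence line could look like the sketch below. The function name is hypothetical.

```python
import re
from itertools import islice


def filter_by_sequence(lines, pattern):
    """Yield 4-line FASTQ records whose sequence line matches `pattern`."""
    regex = re.compile(pattern)
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return
        # record[1] is the sequence line; strip the line ending before matching.
        if regex.search(record[1].rstrip("\r\n")):
            yield record
```

With this sketch, keeping only reads that start with "GC" would be `filter_by_sequence(handle, r"^GC")`.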
Thank you for your work. I am a user of fq lint and fq subsample.
$ python setup.py install
Traceback (most recent call last):
File "setup.py", line 4, in <module>
from Cython.Build import cythonize
ImportError: No module named Cython.Build
When I try to install on macOS, I get errors.
error: failed to parse manifest at `/Users/rorynolan/Downloads/fq/Cargo.toml`
Caused by:
feature `edition2021` is required
The package requires the Cargo feature called `edition2021`, but that feature is not stabilized in this version of Cargo (1.55.0).
Consider adding `cargo-features = ["edition2021"]` to the top of Cargo.toml (above the [package] table) to tell Cargo you are opting in to use this unstable feature.
See https://doc.rust-lang.org/nightly/cargo/reference/unstable.html#edition-2021 for more information about the status of this feature.
When I follow those instructions, I get a different error:
error: failed to compile `fq v0.8.0 (/Users/rorynolan/Downloads/fq)`, intermediate artifacts can be found at `/Users/rorynolan/Downloads/fq/target`
Caused by:
failed to select a version for the requirement `tracing-subscriber = "^0.3.0"`
candidate versions found which didn't match: 0.2.25, 0.2.24, 0.2.23, ...
location searched: crates.io index
required by package `fq v0.8.0 (/Users/rorynolan/Downloads/fq)`
Currently, the filter command only works on one input and writes a single output to stdout. It should also work with multiple segments and write multiple filtered outputs.
This was suggested by @rorynolan in #29 (comment).
$ pip3 install fqlib
$ fqlint
Traceback (most recent call last):
File "/usr/local/bin/fqlint", line 7, in <module>
from fqlib import PairedFastQFiles, SingleReadValidationError, PairedReadValidationError
File "/Users/edavis5/workspace/fqlib/fqlib/__init__.py", line 2, in <module>
from .fastq import *
ModuleNotFoundError: No module named 'fqlib.fastq'
Hi friends!
This is more of a feature request than an issue.
Have you considered implementing fastq generation from a reference genome? I think it would be a useful feature with a lot of extensibility. It might be out of scope, but I think there is a lot someone could do with this (I can imagine bed / vcf support, etc).
Given the discussions in our last development meeting, I think we need a higher-performing strategy than C++ strings but a little more functionality than plain C strings. In particular, we want a strategy that caches the length and offers some of the utility methods included in the C++ string implementation. string_view seems like a great solution: it was added in C++17, but I think that's fair game here.
I have read pairs in FASTQ files with ~36 M reads. When I use fq to subsample the decompressed version, everything works as expected (it takes about 1 m 40 s for 9.5 GB pairs on my system).
time fq subsample -s 100 -n 10000000 --r1-dst sample_S1_1.fastq --r2-dst sample_S1_2.fastq sample_1.fastq sample_2.fastq
2022-02-14T13:34:25.823293Z INFO fq::commands::subsample: fq-subsample start
2022-02-14T13:34:25.823313Z INFO fq::commands::subsample: initializing rng from seed = 100
2022-02-14T13:34:25.823316Z INFO fq::commands::subsample: counting records
2022-02-14T13:35:07.795301Z INFO fq::commands::subsample: r1_src record count = 36903621
2022-02-14T13:35:07.846999Z INFO fq::commands::subsample: building filter
2022-02-14T13:35:07.902868Z INFO fq::commands::subsample: sampling paired end reads
2022-02-14T13:36:07.533471Z INFO fq::commands::subsample: sampled 10000000/36903621 (27.1%) records
2022-02-14T13:36:07.533736Z INFO fq::commands::subsample: fq-subsample end
real 1m41.727s
user 0m5.255s
sys 0m16.055s
However, if I operate directly on the gzip-compressed versions of the files, the record count is wrong and it takes forever. For the same files as above, just compressed, I terminated the process after 20 h.
time fq subsample -s 100 -n 10000000 --r1-dst sample_S1_1.fastq.gz --r2-dst sample_S1_2.fastq.gz sample_1.fastq.gz sample_2.fastq.gz
2022-02-14T13:40:03.015235Z INFO fq::commands::subsample: fq-subsample start
2022-02-14T13:40:03.015258Z INFO fq::commands::subsample: initializing rng from seed = 100
2022-02-14T13:40:03.015263Z INFO fq::commands::subsample: counting records
2022-02-14T13:40:06.577680Z INFO fq::commands::subsample: r1_src record count = 1898334
2022-02-14T13:40:06.580380Z INFO fq::commands::subsample: building filter
Terminated
real 1193m1.514s
user 1192m26.595s
sys 0m0.419s
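The logs do not say why the compressed count (1898334) differs from the uncompressed one (36903621). One possibility worth ruling out, and it is only a guess, is that the .gz files consist of several concatenated gzip members and the reader stops after the first one. The sketch below uses Python's standard library to show how the two reading strategies differ on such a file; it does not reproduce fq's code.

```python
import gzip
import zlib


def make_member(names):
    """Compress one FASTQ record per name into a single gzip member."""
    text = "".join(f"@{name}\nACGT\n+\nIIII\n" for name in names)
    return gzip.compress(text.encode("ascii"))


def count_first_member(data):
    """Count records while decoding only the first gzip member."""
    decoder = zlib.decompressobj(wbits=31)  # 31 selects the gzip container
    text = decoder.decompress(data)
    # Anything after the first member is left in decoder.unused_data.
    return text.count(b"\n") // 4


def count_all_members(data):
    """Count records across every concatenated gzip member."""
    return gzip.decompress(data).count(b"\n") // 4


# Two members concatenated, as produced by e.g. `cat a.gz b.gz > merged.gz`.
data = make_member(["r1", "r2"]) + make_member(["r3", "r4", "r5"])
print(count_first_member(data), count_all_members(data))
```

If the counts disagree on a real file, recompressing it as a single member would be a quick way to test the hypothesis.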
Hi,
I get an unexpected error from NamesValidator:
$ fq lint Auto_C1_1_val_1.fq.gz_unmapped_reads_1.fq.gz Auto_C1_2_val_2.fq.gz_unmapped_reads_2.fq.gz
2023-12-07T03:16:57.574994Z INFO fq::commands::lint: fq-lint start
2023-12-07T03:16:57.575069Z INFO fq::commands::lint: validating paired end reads
2023-12-07T03:16:57.575108Z INFO fq::validators: disabled validators: []
2023-12-07T03:16:57.575116Z INFO fq::validators: enabled single read validators: ["[S003] NameValidator", "[S004] CompleteValidator", "[S002] AlphabetValidator", "[S001] PlusLineValidator", "[S005] ConsistentSeqQualValidator", "[S006] QualityStringValidator"]
2023-12-07T03:16:57.575134Z INFO fq::validators: enabled paired read validators: ["[P001] NamesValidator"]
2023-12-07T03:16:57.575149Z INFO fq::commands::lint: enabled special validators: ["[S007] DuplicateNameValidator"]
2023-12-07T03:16:57.575154Z INFO fq::commands::lint: starting validation (pass 1)
Auto_C1_1_val_1.fq.gz_unmapped_reads_1.fq.gz:1:1: [P001] NamesValidator: names mismatch: expected '@A01726:43:HTWT7DSX7:3:1101:12735:1000_2:N:0:ACACGGTT+TGGTTCGA', got '@A01726:43:HTWT7DSX7:3:1101:12735:1000_1:N:0:ACACGGTT+TGGTTCGA'
This is because the underscore _ in the read name is not recognized. It is added by Bismark during processing, as explained here. The validator parses the read name up to the first space; in this case the space has been replaced with an underscore, so the names don't match between R1 and R2.
I believe the code needs to be fixed to handle such cases.
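One possible way to compare such names is to strip the trailing read-number comment whether it is joined by a space or an underscore. The sketch below assumes the comment follows the Illumina style seen in the names above (for example `1:N:0:ACACGGTT+TGGTTCGA`); the helper names are hypothetical and this is not how fq currently parses names.

```python
import re

# Assumption: the comment looks like "1:N:0:ACACGGTT+TGGTTCGA" and is
# joined to the read name by a single space or a single underscore.
COMMENT = re.compile(r"[ _][12]:[YN]:\d+:\S*$")


def base_name(name_line):
    """Strip the trailing read-number comment from a FASTQ name line."""
    return COMMENT.sub("", name_line.rstrip("\r\n"))


def names_match(r1_name, r2_name):
    """Return True if both name lines share the same base read name."""
    return base_name(r1_name) == base_name(r2_name)
```

Names without a recognizable comment are compared unchanged, so stricter inputs behave as before.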
Hey guys,
thx for this cool tool!
I just stumbled over a weird error (using a Julia program).
ERROR: LoadError: ArgumentError: malformed FASTQ file
Stacktrace:
[1] read!(rdr::FASTX.FASTQ.Reader{TranscodingStreams.NoopStream{BufferedStreams.BufferedInputStream{IOStream}}}, rec::FASTX.FASTQ.Record)
...
Hence, I used your tool to validate my FASTQ files. However, your tool also reported that my FASTQ files were valid.
After some time I figured out that the final newline character was missing from my FASTQ files.
My suggestion is therefore to have this tool emit a warning or error in such cases.
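As a stopgap, the missing final newline can be detected outside the tool. The sketch below checks the last byte of an uncompressed file; the function name is hypothetical, and gzip-compressed input would need to be decompressed first.

```python
import os


def ends_with_newline(path):
    """Return True if the file is empty or its last byte is a line feed."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() == 0:
            return True
        # Step back one byte from the end and read it.
        handle.seek(-1, os.SEEK_END)
        return handle.read(1) == b"\n"
```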
object.__exit__(self, exc_type, exc_value, traceback)
Suggest changing __exit__ of Timer to def __exit__(self, *args):
(py36) [lding@L170687 fqlib]$ ./bin/fqlint -s high ./example/03_duplicate_reads/duplicate_in_R1/R1.fastq ./example/03_duplicate_reads/duplicate_in_R1/R2.fastq
INFO : Single Read Validators: 4
DEBUG: - S001 => PluslineValidator
DEBUG: - S002 => AlphabetValidator
DEBUG: - S003 => ReadnameValidator
DEBUG: - S004 => CompleteReadValidator
INFO : Paired Read Validators: 1
DEBUG: - P001 => PairedReadnameValidator
INFO : Started reading from files...
Traceback (most recent call last):
File "./bin/fqlint", line 91, in <module>
for readno, (read_r1, read_r2) in enumerate(pair, 1):
File "fqlib/fastq.pyx", line 245, in fqlib.fastq.PairedFastQFiles.__next__
result = self.next_readpair()
File "fqlib/fastq.pyx", line 271, in fqlib.fastq.PairedFastQFiles.next_readpair
raise PairedReadValidationError(
fqlib.error.PairedReadValidationError: Read Pair Number: 0
Read 1
Read names do not match.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./bin/fqlint", line 96, in <module>
break
TypeError: __exit__() takes exactly one argument (4 given)
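The suggested signature change can be sketched with a minimal context manager. The Timer class below is a hypothetical stand-in, not the fqlib implementation; it only shows that accepting the three exception arguments lets the original error propagate instead of being masked by a TypeError.

```python
import time


class Timer:
    """Hypothetical stand-in for the Timer mentioned above."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        # Python passes (exc_type, exc_value, traceback) here. Returning
        # None lets any in-flight exception propagate unchanged.
        self.elapsed = time.perf_counter() - self.start
```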
Both build/ and dist/ do not necessarily exist when make is run, which sets their respective variables BUILDDIR and DISTDIR to empty strings. When the clean target is invoked, it effectively executes rm -rf /*.
$ cd /tmp
$ git clone https://github.com/stjude/fqlib.git
$ cd fqlib
$ make --dry-run clean
rm -rf /tmp/fqlib/fqlib/*.c /tmp/fqlib/fqlib/*.cpp /tmp/fqlib/fqlib/*.html /tmp/fqlib/fqlib/*.so /tmp/fqlib/fqlib/__pycache__ /* /*
Note the final two arguments in the output.
I recently noticed a misspelled flag in the fq lint README:
https://github.com/stjude-rust-labs/fq#validators
The Validators section of the README states:
"Validate includes a set of validators that run on single or paired records. By default, records are validated with all rules, but validators can be disabled using --disable-valdiator CODE, where CODE is one of validators listed below."
Here, the --disable-valdiator flag is misspelled and needs to be corrected to --disable-validator to avoid inconvenience to new users.
Cheers!