stjude-rust-labs / fq

Command line utility for manipulating Illumina-generated FASTQ files.

License: MIT License

Rust 99.61% Dockerfile 0.39%
fastq bioinformatics genomics illumina next-generation-sequencing fastq-files rust

fq's Issues

JSON Metrics output

I'd love to write a MultiQC module for this, but to facilitate that, would it be possible to write the output in a JSON format for easier parsing?

fastp does this, for example, so we can load the metrics directly into a MultiQC report. I'm working on a small validator workflow that includes fq_lint, and it works nicely, but making the metrics directly accessible in a larger table would be really cool.

https://github.com/MultiQC/MultiQC/blob/main/multiqc/modules/fastp/fastp.py is an example of what the JSON could look like.

https://github.com/MultiQC/test-data/tree/main/data/modules/fastp

Not too familiar with Rust, otherwise I'd have opened a PR already ;-)
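
A minimal sketch, assuming hypothetical field names, of how such a report could be serialized to JSON from Rust with serde (derive feature) and serde_json; this is purely illustrative and does not reflect any existing fq output:

use serde::Serialize;
use std::collections::HashMap;

// Hypothetical metrics for a lint run; field names are illustrative only.
#[derive(Serialize)]
struct LintReport {
    record_count: u64,
    // validator code (e.g. "S002") -> number of failures
    failures: HashMap<String, u64>,
}

fn main() -> serde_json::Result<()> {
    let mut failures = HashMap::new();
    failures.insert("S002".to_string(), 0);

    let report = LintReport { record_count: 36_903_621, failures };
    // Pretty-printed JSON that a MultiQC module could parse.
    println!("{}", serde_json::to_string_pretty(&report)?);
    Ok(())
}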

`fq lint` fails on large files

Hi,

It seems fq lint fails on large FASTQ files with the error below:

fq lint XXX_1.merged.fastq.gz XXX_2.merged.fastq.gz > XXX.fq_lint.log.txt

Command error:
  Error: I/O error
  Caused by:
      corrupt deflate stream

The files are more than 50 GB each.
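
For what it's worth, a "corrupt deflate stream" error usually points at the compressed input itself. One way to check the stream independently of fq is to decompress it end to end; a minimal Rust sketch using the flate2 crate (the path is a placeholder for the file that fq lint rejects):

use flate2::read::MultiGzDecoder;
use std::{fs::File, io};

fn main() -> io::Result<()> {
    // Placeholder path: the file that fq lint rejects.
    let file = File::open("XXX_1.merged.fastq.gz")?;
    let mut decoder = MultiGzDecoder::new(file);

    // Decompress the whole stream, discarding the output; a corrupt
    // gzip member surfaces here as an I/O error.
    let n = io::copy(&mut decoder, &mut io::sink())?;
    println!("decompressed {} bytes without error", n);
    Ok(())
}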

Read name should exclude or ignore "the member of a pair"

See the FASTQ file format:
https://en.wikipedia.org/wiki/FASTQ_format

INFO : Single Read Validators: 4
DEBUG:   - S001 => PluslineValidator
DEBUG:   - S002 => AlphabetValidator
DEBUG:   - S003 => ReadnameValidator
DEBUG:   - S004 => CompleteReadValidator
INFO : Paired Read Validators: 1
DEBUG:   - P001 => PairedReadnameValidator
INFO : Started reading from files...
Traceback (most recent call last):
  File "./bin/fqlint", line 91, in <module>
    for readno, (read_r1, read_r2) in enumerate(pair, 1):
  File "fqlib/fastq.pyx", line 245, in fqlib.fastq.PairedFastQFiles.__next__
    result = self.next_readpair()
  File "fqlib/fastq.pyx", line 271, in fqlib.fastq.PairedFastQFiles.next_readpair
    raise PairedReadValidationError(
fqlib.error.PairedReadValidationError: Read Pair Number: 0
Read 1
   - File: R1.fastq
   - Line Number: 0
   - Readname: b'@K00202:217:HTLHYBBXX:2:1101:2169:998 1:N:0:NTCATTCC'
Read 2
   - File: R2.fastq
   - Line Number: 0
   - Readname: b'@K00202:217:HTLHYBBXX:2:1101:2169:998 2:N:0:NTCATTCC'
---
Read names do not match.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bin/fqlint", line 96, in <module>
    break
TypeError: __exit__() takes exactly one argument (4 given)

Subsample to a fixed number of reads

Hi,

I'm wondering what you think about adding an option to subsample a FASTQ file to a specific number of reads. This comes in handy when subsampling many FASTQ files down to the lowest common number of reads, for example. seqtk offers this option.

The reason I'm asking here is that the memory consumption of seqtk can be quite high, and I was wondering whether fq has a smaller footprint.

Thank you for your input.
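
For reference, later on this page the subsample command is shown accepting a fixed record count via -n/--record-count together with a seed, so an invocation along these lines (file names are placeholders) appears to do what is asked here:

fq subsample -s 100 -n 1000000 --r1-dst out_1.fastq --r2-dst out_2.fastq in_1.fastq in_2.fastq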

Version reporting not working

Using the version installed via cargo install --git https://github.com/stjude/fqlib.git --tag v0.7.0, I cannot get the program to report its version. Running fq lint -V, I only get fq-lint as output with no version info. I'm on macOS Big Sur 11.6.

Docs typo

-r, --record-count <RECORD_COUNT>
should be -n, --record-count <RECORD_COUNT>, right?
Elsewhere in the README, -n is mentioned.

FASTQ to (unaligned) BAM/SAM/CRAM

Hi again 😄

I was wondering whether a conversion utility from FASTQ to (unaligned) BAM/SAM/CRAM is something you have thought about, and whether it is in scope for this project. The alternative tools that I know of for doing this (Picard and fgbio) are both big collections with huge dependencies, so it would be nice to have a slim Rust-based tool for it. (I realize this request might make fq less slim and add more dependencies.)

Thanks for your thoughts.

Feature request: Filtering based on reads

I would love to be able to filter based on reads, e.g. keep only reads starting with "GC".
It would be cool to be able to do this with a regex.

Thank you for your work. I am a user of fq lint and fq subsample.
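
A minimal sketch of what the requested predicate could look like using the regex crate; keep_read is a hypothetical helper, not part of fq:

use regex::Regex;

// Hypothetical predicate: keep a read if its sequence matches the pattern.
fn keep_read(sequence: &str, pattern: &Regex) -> bool {
    pattern.is_match(sequence)
}

fn main() {
    // Keep only reads whose sequence starts with "GC".
    let pattern = Regex::new("^GC").unwrap();
    assert!(keep_read("GCATTGACT", &pattern));
    assert!(!keep_read("ATGCGCTTA", &pattern));
}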

ImportError

$ python setup.py install
Traceback (most recent call last):
  File "setup.py", line 4, in <module>
    from Cython.Build import cythonize
ImportError: No module named Cython.Build
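
This usually just means Cython is missing from the build environment; installing it before running setup.py should resolve the import error, e.g.:

$ pip install cython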

Install on Mac not working with the 2021 edition

When I try to install on a Mac, I get errors.

error: failed to parse manifest at `/Users/rorynolan/Downloads/fq/Cargo.toml`

Caused by:
  feature `edition2021` is required

  The package requires the Cargo feature called `edition2021`, but that feature is not stabilized in this version of Cargo (1.55.0).
  Consider adding `cargo-features = ["edition2021"]` to the top of Cargo.toml (above the [package] table) to tell Cargo you are opting in to use this unstable feature.
  See https://doc.rust-lang.org/nightly/cargo/reference/unstable.html#edition-2021 for more information about the status of this feature.

When I follow those instructions, I get a different error:

error: failed to compile `fq v0.8.0 (/Users/rorynolan/Downloads/fq)`, intermediate artifacts can be found at `/Users/rorynolan/Downloads/fq/target`

Caused by:
  failed to select a version for the requirement `tracing-subscriber = "^0.3.0"`
  candidate versions found which didn't match: 0.2.25, 0.2.24, 0.2.23, ...
  location searched: crates.io index
  required by package `fq v0.8.0 (/Users/rorynolan/Downloads/fq)`
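
For context, the 2021 edition was stabilized in Rust 1.56, so Cargo 1.55 will always hit the first error. Rather than opting into the unstable cargo-features flag, updating the toolchain is likely the simpler fix, e.g.:

$ rustup update stable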

ModuleNotFoundError

$ pip3 install fqlib
$ fqlint
Traceback (most recent call last):
  File "/usr/local/bin/fqlint", line 7, in <module>
    from fqlib import PairedFastQFiles, SingleReadValidationError, PairedReadValidationError
  File "/Users/edavis5/workspace/fqlib/fqlib/__init__.py", line 2, in <module>
    from .fastq import *
ModuleNotFoundError: No module named 'fqlib.fastq'

Reference genome support for fastq generation

Hi friends!

This is more of a feature request than an issue.

Have you considered implementing FASTQ generation from a reference genome? I think it would be a useful feature with a lot of extensibility. It might be out of scope, but I think there is a lot someone could do with this (I can imagine BED/VCF support, etc.).

Use string_view instead of char*

Given the discussions in our last development meeting, I think we need a higher-performing strategy than C++ strings but a little more functionality than plain C strings. In particular, we want a strategy that caches the length and provides some of the utility methods included in the C++ string implementation. string_view seems like a great solution: it was added in C++17, but I think that's fair game here.

Subsampling paired, compressed reads seems problematic

I have read pairs in FASTQ files with ~36 M reads each. When I use fq to subsample the decompressed versions, everything works as expected (it takes about 1 m 40 s for the 9.5 GB pair on my system).

time fq subsample -s 100 -n 10000000 --r1-dst sample_S1_1.fastq --r2-dst sample_S1_2.fastq sample_1.fastq sample_2.fastq
2022-02-14T13:34:25.823293Z  INFO fq::commands::subsample: fq-subsample start
2022-02-14T13:34:25.823313Z  INFO fq::commands::subsample: initializing rng from seed = 100
2022-02-14T13:34:25.823316Z  INFO fq::commands::subsample: counting records
2022-02-14T13:35:07.795301Z  INFO fq::commands::subsample: r1_src record count = 36903621
2022-02-14T13:35:07.846999Z  INFO fq::commands::subsample: building filter
2022-02-14T13:35:07.902868Z  INFO fq::commands::subsample: sampling paired end reads
2022-02-14T13:36:07.533471Z  INFO fq::commands::subsample: sampled 10000000/36903621 (27.1%) records
2022-02-14T13:36:07.533736Z  INFO fq::commands::subsample: fq-subsample end

real    1m41.727s
user    0m5.255s
sys     0m16.055s

However, if I operate directly on gzip-compressed versions of the files, the record count is wrong and it takes forever. For the same files as above, just compressed, I terminated the process after 20 h.

time fq subsample -s 100 -n 10000000 --r1-dst sample_S1_1.fastq.gz --r2-dst sample_S1_2.fastq.gz sample_1.fastq.gz sample_2.fastq.gz
2022-02-14T13:40:03.015235Z  INFO fq::commands::subsample: fq-subsample start
2022-02-14T13:40:03.015258Z  INFO fq::commands::subsample: initializing rng from seed = 100
2022-02-14T13:40:03.015263Z  INFO fq::commands::subsample: counting records
2022-02-14T13:40:06.577680Z  INFO fq::commands::subsample: r1_src record count = 1898334
2022-02-14T13:40:06.580380Z  INFO fq::commands::subsample: building filter
Terminated

real    1193m1.514s
user    1192m26.595s
sys     0m0.419s

`NamesValidator` issue

Hi,

I get an unexpected error for NamesValidator:

$ fq lint Auto_C1_1_val_1.fq.gz_unmapped_reads_1.fq.gz Auto_C1_2_val_2.fq.gz_unmapped_reads_2.fq.gz
2023-12-07T03:16:57.574994Z  INFO fq::commands::lint: fq-lint start
2023-12-07T03:16:57.575069Z  INFO fq::commands::lint: validating paired end reads
2023-12-07T03:16:57.575108Z  INFO fq::validators: disabled validators: []
2023-12-07T03:16:57.575116Z  INFO fq::validators: enabled single read validators: ["[S003] NameValidator", "[S004] CompleteValidator", "[S002] AlphabetValidator", "[S001] PlusLineValidator", "[S005] ConsistentSeqQualValidator", "[S006] QualityStringValidator"]
2023-12-07T03:16:57.575134Z  INFO fq::validators: enabled paired read validators: ["[P001] NamesValidator"]
2023-12-07T03:16:57.575149Z  INFO fq::commands::lint: enabled special validators: ["[S007] DuplicateNameValidator"]
2023-12-07T03:16:57.575154Z  INFO fq::commands::lint: starting validation (pass 1)
Auto_C1_1_val_1.fq.gz_unmapped_reads_1.fq.gz:1:1: [P001] NamesValidator: names mismatch: expected '@A01726:43:HTWT7DSX7:3:1101:12735:1000_2:N:0:ACACGGTT+TGGTTCGA', got '@A01726:43:HTWT7DSX7:3:1101:12735:1000_1:N:0:ACACGGTT+TGGTTCGA'

This is because the underscore (_) in the read name, which is added by Bismark during processing as explained here, is not recognized. The validator parses the read name up to the first space, and in this case the space is replaced with an underscore, so the names don't match between R1 and R2.

I believe the code needs to be fixed to handle such cases.
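
A minimal sketch of how the comparison could tolerate this case, treating both a space and a Bismark-style underscore before the member-of-pair field as the boundary; base_name is a hypothetical helper, not fq's actual implementation:

// Hypothetical helper: return the portion of a read name that precedes
// the "member of a pair" field, whether it is separated by a space
// (" 1:N:0:...") or, as Bismark writes it, by an underscore ("_1:N:0:...").
fn base_name(name: &str) -> &str {
    let cut = [" 1:", " 2:", "_1:", "_2:"]
        .iter()
        .filter_map(|sep| name.find(*sep))
        .min()
        .unwrap_or(name.len());
    &name[..cut]
}

fn main() {
    let r1 = "@A01726:43:HTWT7DSX7:3:1101:12735:1000_1:N:0:ACACGGTT+TGGTTCGA";
    let r2 = "@A01726:43:HTWT7DSX7:3:1101:12735:1000_2:N:0:ACACGGTT+TGGTTCGA";
    assert_eq!(base_name(r1), base_name(r2));
}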

Missing newlines are not caught

Hey guys,
thanks for this cool tool!

I just stumbled over a weird error (using a Julia program).

ERROR: LoadError: ArgumentError: malformed FASTQ file
Stacktrace:
[1] read!(rdr::FASTX.FASTQ.Reader{TranscodingStreams.NoopStream{BufferedStreams.BufferedInputStream{IOStream}}}, rec::FASTX.FASTQ.Record)
 ...

Hence, I used your tool to validate my FASTQ files. However, your tool also assured me that my FASTQ files were valid.

After some time, I figured out that a final newline character was missing from my FASTQ files.

Thus, my suggestion is that this tool also emit a warning/error for such cases.
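
A minimal sketch of such a check for an uncompressed FASTQ file (a gzipped file would have to be decompressed first); ends_with_newline is a hypothetical helper, not part of fq:

use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

// Hypothetical check: does the file end with a newline character?
fn ends_with_newline(path: &str) -> io::Result<bool> {
    let mut file = File::open(path)?;
    let len = file.seek(SeekFrom::End(0))?;
    if len == 0 {
        return Ok(false);
    }
    file.seek(SeekFrom::End(-1))?;
    let mut last = [0u8; 1];
    file.read_exact(&mut last)?;
    Ok(last[0] == b'\n')
}

fn main() -> io::Result<()> {
    // Placeholder path.
    if !ends_with_newline("reads.fastq")? {
        eprintln!("warning: file does not end with a newline");
    }
    Ok(())
}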

__exit__ takes 4 arguments

object.__exit__(self, exc_type, exc_value, traceback)

Suggest changing __exit__ of Timer to def __exit__(self, *args):

(py36) [lding@L170687 fqlib]$ ./bin/fqlint -s high ./example/03_duplicate_reads/duplicate_in_R1/R1.fastq ./example/03_duplicate_reads/duplicate_in_R1/R2.fastq
INFO : Single Read Validators: 4
DEBUG:   - S001 => PluslineValidator
DEBUG:   - S002 => AlphabetValidator
DEBUG:   - S003 => ReadnameValidator
DEBUG:   - S004 => CompleteReadValidator
INFO : Paired Read Validators: 1
DEBUG:   - P001 => PairedReadnameValidator
INFO : Started reading from files...
Traceback (most recent call last):
  File "./bin/fqlint", line 91, in <module>
    for readno, (read_r1, read_r2) in enumerate(pair, 1):
  File "fqlib/fastq.pyx", line 245, in fqlib.fastq.PairedFastQFiles.__next__
    result = self.next_readpair()
  File "fqlib/fastq.pyx", line 271, in fqlib.fastq.PairedFastQFiles.next_readpair
    raise PairedReadValidationError(
fqlib.error.PairedReadValidationError: Read Pair Number: 0
Read 1
   - File: R1.fastq
   - Line Number: 0
   - Readname: b'@K00202:217:HTLHYBBXX:2:1101:2169:998 1:N:0:NTCATTCC'
Read 2
   - File: R2.fastq
   - Line Number: 0
   - Readname: b'@K00202:217:HTLHYBBXX:2:1101:2169:998 2:N:0:NTCATTCC'
---
Read names do not match.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bin/fqlint", line 96, in <module>
    break
TypeError: __exit__() takes exactly one argument (4 given)

Avoid nuking the root directory when running `make clean`

Neither build/ nor dist/ necessarily exists when make is run, so their respective variables, BUILDDIR and DISTDIR, are set to empty strings. Invoking the clean target then effectively executes rm -rf /*.

Steps to reproduce

$ cd /tmp
$ git clone https://github.com/stjude/fqlib.git
$ cd fqlib
$ make --dry-run clean
rm -rf /tmp/fqlib/fqlib/*.c /tmp/fqlib/fqlib/*.cpp /tmp/fqlib/fqlib/*.html /tmp/fqlib/fqlib/*.so /tmp/fqlib/fqlib/__pycache__ /* /*

Note the final two arguments in the output.

Minor Change in fq lint README

I recently noticed a misspelled flag in the fq lint README.

https://github.com/stjude-rust-labs/fq#validators
The Validators section of the README states:

"Validate includes a set of validators that run on single or paired records. By default, records are validated with all rules, but validators can be disabled using --disable-valdiator CODE, where CODE is one of validators listed below."

Here, the --disable-valdiator flag is misspelled and needs to be corrected to --disable-validator to avoid inconvenience to new users.

Cheers!
