alastair-droop / fqtools

An efficient FASTQ manipulation suite

License: GNU General Public License v3.0

Makefile 0.57% C 83.89% Shell 0.58% Python 13.41% Objective-C 0.64% C++ 0.91%
fastq fastq-files next-generation-sequencing bioinformatics

Introduction

fqtools is a software suite for fast processing of FASTQ files. Various file manipulations are supported; see below for a full list of the available subcommands and a brief description of each. Most subcommands take either a single file or a pair of files as input. If no input file is specified, fqtools attempts to read data from stdin; in this case, it is advisable to specify the format of the data provided. Subcommands that generate FASTQ data write either a single file or a pair of files. If no -o argument is provided, single files are written to stdout.

Citation

If you use fqtools in published work, please include a reference to my Bioinformatics paper:

  • Droop, A. P. (2016). fqtools: An efficient software suite for modern FASTQ file manipulation. Bioinformatics (Oxford, England). [DOI:10.1093/bioinformatics/btw088]

Installation

fqtools requires building against both the zlib and htslib libraries:

  • zlib is required for processing compressed (.gz) data. The code relies on several recent zlib file IO functions, so zlib must be version 1.2.3.5 or later.
  • htslib is required for reading BAM files. If htslib is not installed, download and compile htslib. Then, alter the HTSDIR path in the fqtools Makefile to point to the htslib source directory.

If zlib is already installed, fqtools can be built with commands similar to the following:

git clone https://github.com/alastair-droop/fqtools
cd fqtools/
git clone https://github.com/samtools/htslib
cd htslib/
autoheader
autoconf 
./configure
make
make install
cd ..
make

You might need to run make install as sudo make install. The htslib library must be installed in a location where the built fqtools program can find it, because the fqtools executable is dynamically linked against htslib. If you cannot (or do not want to) install htslib, you must add the directory containing the libhts.so file to your LD_LIBRARY_PATH variable.

Licence

fqtools is released under the GNU General Public License version 3.

Subcommands

The fqtools suite contains the following subcommands:

  • view View FASTQ files
  • head View the first reads in FASTQ files
  • count Count FASTQ file reads
  • header View FASTQ file header data
  • sequence View FASTQ file sequence data
  • quality View FASTQ file quality data
  • header2 View FASTQ file secondary header data
  • fasta Convert FASTQ files to FASTA format
  • basetab Tabulate FASTQ base frequencies
  • qualtab Tabulate FASTQ quality character frequencies
  • type Attempt to guess the FASTQ quality encoding type
  • validate Validate FASTQ files
  • find Find FASTQ reads containing specific sequences
  • trim Trim reads in a FASTQ file
  • qualmap Translate quality values using a mapping file

Each subcommand has its own set of arguments. The global arguments are:

  • -h Show this help message and exit.
  • -v Show the program version and exit.
  • -d Allow DNA sequence bases (ACGTN)
  • -r Allow RNA sequence bases (ACGUN)
  • -a Allow ambiguous sequence bases (RYKMSWBDHV)
  • -m Allow mask sequence base (X)
  • -u Allow uppercase sequence bases
  • -l Allow lowercase sequence bases
  • -p CHR Set the pair replacement character (default "%")
  • -b BUFSIZE Set the input buffer size
  • -B BUFSIZE Set the output buffer size
  • -q QUALTYPE Set the quality score encoding
  • -f FORMAT Set the input file format
  • -F FORMAT Set the output file format
  • -i Read interleaved input file pairs
  • -I Write interleaved output file pairs

CHR

This character will be replaced by the pair value when writing paired files.
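
To illustrate the substitution described above, here is a minimal Python sketch. It is not taken from the fqtools source: the helper name paired_names is hypothetical, and the pair values "1" and "2" are an assumption made for the example.

```python
# Hypothetical helper, not part of fqtools.
# Assumption: the two pair values are "1" and "2".
def paired_names(template, pair_char="%"):
    """Return the two file names obtained by substituting the pair value."""
    if pair_char not in template:
        raise ValueError(f"template {template!r} lacks {pair_char!r}")
    return tuple(template.replace(pair_char, str(pair)) for pair in (1, 2))
```

Under these assumptions, paired_names("sample_R%.fastq.gz") gives one name per mate.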

BUFSIZE

Possible suffixes are [bkMG]. If no suffix is given, the value is in bytes.
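
As a rough illustration of how such a size string might be interpreted, the Python sketch below converts a BUFSIZE value into a byte count. The function name parse_bufsize is hypothetical, and treating k, M and G as powers of 1024 is an assumption; the fqtools source is the authority on the actual multipliers.

```python
# Hypothetical helper, not part of fqtools.
# Assumption: k, M and G are binary (1024-based) multiples.
_SUFFIXES = {"b": 1, "k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def parse_bufsize(text):
    """Convert a string such as '4k' or '512' into a number of bytes."""
    if not text:
        raise ValueError("empty buffer size")
    if text[-1] in _SUFFIXES:
        number, factor = text[:-1], _SUFFIXES[text[-1]]
    else:
        number, factor = text, 1  # no suffix: the value is in bytes
    if not number.isdigit():
        raise ValueError(f"invalid buffer size: {text!r}")
    return int(number) * factor
```

The suffixes are case-sensitive here, matching the [bkMG] list above.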

QUALTYPE

  • u Do not assume a specific quality score encoding
  • s Interpret quality scores as Sanger encoded
  • o Interpret quality scores as Solexa encoded
  • i Interpret quality scores as Illumina encoded
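
The three named encodings differ in their ASCII offset and, for Solexa, in the score scale. The Python sketch below shows the standard conventions (Sanger is Phred+33, Illumina 1.3+ is Phred+64, Solexa is Solexa-scaled scores +64). It is an illustration only: decode_quality is a hypothetical name and the code is not taken from fqtools.

```python
import math

# ASCII offsets for the s, o and i QUALTYPE codes. These are the usual
# FASTQ conventions, assumed here rather than read from the fqtools source.
OFFSETS = {"s": 33, "o": 64, "i": 64}

def decode_quality(qualities, qualtype):
    """Return Phred scores for a quality string under the given QUALTYPE code."""
    offset = OFFSETS[qualtype]
    scores = [ord(ch) - offset for ch in qualities]
    if qualtype == "o":
        # Solexa scores use a different scale; convert them to Phred.
        scores = [10 * math.log10(10 ** (q / 10) + 1) for q in scores]
    return scores
```

The same character therefore means different things under different encodings: "I" is Phred 40 under Sanger but only Phred 9 under Illumina.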

FORMAT

  • F uncompressed FASTQ format (.fastq)
  • f compressed FASTQ format (.fastq.gz)
  • b unaligned BAM format (.bam)
  • u attempt to infer format from file extension (default .fastq.gz)
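
The inference performed by the u option can be pictured with the following Python sketch. It covers only the three extensions listed above and falls back to the documented default; the function name infer_format and the case-insensitive matching are assumptions, and fqtools itself may behave differently.

```python
# Hypothetical helper, not part of fqtools.
def infer_format(filename):
    """Map a file name to one of the FORMAT codes F, f or b."""
    name = filename.lower()  # assumption: matching ignores case
    if name.endswith(".bam"):
        return "b"
    if name.endswith(".fastq"):
        return "F"
    # '.fastq.gz' and anything unrecognised use the documented default.
    return "f"
```

This fallback is why specifying -f explicitly is advisable when reading from stdin, where no extension is available.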

Contributors

alastair-droop, kloetzl

fqtools's Issues

Add flag to fqtools validate to output sequence count

Would there be interest in adding a flag to fqtools validate to optionally output the total count of sequences if the records were all valid? I can work on submitting a pull request, but would like a bit of direction on whether you would prefer to have the changes go in fqtools validate or fqtools count (and add an option to validate) or neither 🥲 .

The motivation is that I need to count sequences and validate records as fast as possible. It seems to me like this information could be available from fqtools validate since we have to traverse the file regardless.
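
The idea of validating and counting in one traversal can be sketched in Python as follows. This is only an illustration of the request, not how fqtools validate works: validate_and_count is a hypothetical name, records are assumed to span exactly four lines, and only three basic checks are made.

```python
import io

def validate_and_count(handle):
    """Return the number of records, or raise ValueError on the first bad one."""
    count = 0
    while True:
        header = handle.readline()
        if not header:
            return count  # clean end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"record {count + 1}: header must start with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"record {count + 1}: missing '+' separator")
        if len(sequence) != len(quality):
            raise ValueError(f"record {count + 1}: sequence/quality length mismatch")
        count += 1
```

For example, validate_and_count(io.StringIO(text)) either returns the record count or stops at the first invalid record, so the count comes at no extra cost once the file is being read anyway.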

Does not build per instructions, missing sam.h file.

Following the instructions on the main page:
git clone https://github.com/alastair-droop/fqtools
cd fqtools/
git clone https://github.com/samtools/htslib
cd htslib/
autoheader
autoconf
./configure
make
make install
cd ..
make

Fails at the last step with:
In file included from src/fqprocess_view.c:14:0:
src/fqheader.h:22:10: fatal error: sam.h: No such file or directory
 #include <sam.h>

I do not find a sam.h definition in htslib. I do, however, find a sam.h definition within the separate samtools project. However, if I modify the Makefile to use that location as well, I receive an error about too many arguments:

In file included from ../samtools-1.10/sam.h:29:0,
                 from src/fqheader.h:22,
                 from src/fqfile.c:14:
src/fqfile.c: In function ‘fqfile_open_read_file_bam’:
../samtools-1.10/bam.h:209:22: error: too many arguments to function ‘samtools_sam_open’
 #define sam_open samtools_sam_open

Could this be because samtools, htslib, etc. are now separately maintained projects?

Is there support for interleaved FASTQ?

The test set indicated in the paper appears to have become inaccessible. Can you confirm whether interleaved FASTQ is processed correctly (i.e. that fqtools checks that adjacent records are from the same read pair)?
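
The kind of check being asked about can be sketched in Python as below. This is not a description of what fqtools does. The names read_id and check_interleaved are hypothetical, and two assumptions are made: records span exactly four lines, and mates share a read name once any trailing /1 or /2 and any comment after whitespace are removed.

```python
import io

def read_id(header):
    """Strip '@', any comment after whitespace, and a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name

def check_interleaved(handle):
    """Return the number of pairs, raising ValueError if mates do not match."""
    pairs = 0
    previous = None
    for index, line in enumerate(handle):
        if index % 4 != 0:
            continue  # only header lines matter here
        current = read_id(line.rstrip("\n"))
        if previous is None:
            previous = current  # first mate of a pair
        elif previous == current:
            pairs += 1
            previous = None
        else:
            raise ValueError(f"mates differ: {previous!r} vs {current!r}")
    if previous is not None:
        raise ValueError(f"unpaired final read: {previous!r}")
    return pairs
```

A file with an odd number of records, or with adjacent records whose names differ, is rejected.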

Lowercase option not working

Hi,
First, I would like to thank you for this awesome tool!
I recently started using FASTQ files with mixed uppercase and lowercase bases.
I'm using fqtools 2.3 2019-05-08 (zlib 1.2.8; htslib 1.8).

I created a test file named test_lowercase.fastq, which contains:

@A00740:65:HNY73DSXX:3:1103:7925:25645 1:N:0:GTCCTTGA+TAATCTTA
agccatgcactctgtaatgaagagttcacAATCTTCAACAGAGTAGATATTTCAAGAAGTCAACTGATAGATGAATTGGCAGATAAATTTAACCGGCTTCTTGAAGATTTTCTGCAAGAGGTATATATTATAACTATTACAAGTATTTTGTCAGTTgagcccctctactgcaggaa
+
FFFFFFFFFFFFFFFFFFFF:FFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFFFFFFFFFFFFFFFFFFFFFFF

The command I'm using:
fqtools -l count test_lowercase.fastq
And the error is:
ERROR [line 2]: invalid sequence character (a)

I also tried:
fqtools -l -F count test_lowercase.fastq
Which outputs a different error:

ERROR: unknown command: "test_lowercase.fastq"
usage: fqtools [-hvdramuli] [-b BUFSIZE] [-B BUFSIZE] [-q QUALTYPE] [-f FORMAT] [-F FORMAT] COMMAND [...] [FILE] [FILE]

I should say that I first used an older version which I already had:
fqtools 2.1 2016-10-04 (zlib 1.2.7; htslib 1.8) gave the same results,
so I decided to clone the repository again.

Am I using the command in the wrong way or is it a bug?

Thank you

Compiler warnings

gcc --version
gcc (Homebrew gcc 5.3.0) 5.3.0
In file included from src/fqheader.h:21:0,
                 from src/fqfile.c:14:
/usr/include/zlib.h:1324:21: note: expected 'gzFile {aka struct gzFile_s *}' but argument is of type 'struct gzFile_s *'
 ZEXTERN int ZEXPORT gzwrite OF((gzFile file,
                     ^
src/fqfile.c: In function 'fqfile_eof_file_fastq_compressed':
src/fqfile.c:285:18: warning: passing argument 1 of 'gzeof' from incompatible pointer type [-Wincompatible-pointer-types]
     return gzeof((gzFile*)(((fqfile*)f)->handle));
                  ^
In file included from src/fqheader.h:21:0,
                 from src/fqfile.c:14:
/usr/include/zlib.h:1458:21: note: expected 'gzFile {aka struct gzFile_s *}' but argument is of type 'struct gzFile_s *'
 ZEXTERN int ZEXPORT gzeof OF((gzFile file));
                     ^
src/fqfile.c: In function 'fqfile_flush_file_fastq_compressed':
src/fqfile.c:302:13: warning: passing argument 1 of 'gzflush' from incompatible pointer type [-Wincompatible-pointer-types]
     gzflush((gzFile*)(((fqfile*)f)->handle), 0);
             ^
In file included from src/fqheader.h:21:0,
                 from src/fqfile.c:14:
/usr/include/zlib.h:1395:21: note: expected 'gzFile {aka struct gzFile_s *}' but argument is of type 'struct gzFile_s *'
 ZEXTERN int ZEXPORT gzflush OF((gzFile file, int flush));
                     ^
cc -O2 -g -Wall -Wextra -Wno-unused-parameter -I/bio/linuxbrew/opt/htslib/include  -c -o src/fqfsin.o src/fqfsin.c
cc -O2 -g -Wall -Wextra -Wno-unused-parameter -I/bio/linuxbrew/opt/htslib/include  -c -o src/fqfsout.o src/fqfsout.c
src/fqfsout.c: In function 'fqfsout_writechar':
src/fqfsout.c:165:14: warning: variable 'result' set but not used [-Wunused-but-set-variable]
     fqstatus result;
              ^
src/fqfsout.c: In function 'fqfsout_write':
src/fqfsout.c:178:14: warning: variable 'result' set but not used [-Wunused-but-set-variable]
     fqstatus result;
              ^
cc -O2 -g -Wall -Wextra -Wno-unused-parameter -I/bio/linuxbrew/opt/htslib/include  -c -o src/fqfileprep.o src/fqfileprep.c
src/fqfileprep.c: In function 'prepare_filesets':
src/fqfileprep.c:86:15: warning: 'informat_2' may be used uninitialized in this function [-Wmaybe-uninitialized]
             if((outformat_2 == FQ_FORMAT_UNKNOWN) && (options.input_interleaving == FQ_INTERLEAVED)) outformat_2 = info
               ^

`fqtools type` incorrectly typing fastq

It appears that the current algorithm for fqtools type can get the fastq quality format wrong. Here's a reproducible example:

fastq-dump --split-files ERR719681
# Read 300355 spots for ERR719681
# Written 300355 spots for ERR719681
fqtools type ERR719681_1.fastq
# fastq-illumina
fqtools type ERR719681_2.fastq
# fastq-sanger

If Bio.SeqIO is then used to read these fastq files with the "type" specified by fqtools type, then the following error occurs:

  File "/ebio/abt3_projects/software/dev/llmgqc/.snakemake/conda/a289c738/lib/python3.6/site-packages/Bio/SeqIO/__init__.py", line 611, in parse
    for r in i:
  File "/ebio/abt3_projects/software/dev/llmgqc/.snakemake/conda/a289c738/lib/python3.6/site-packages/Bio/SeqIO/QualityIO.py", line 1255, in FastqIlluminaIterator
    raise ValueError("Invalid character in quality string")

Maybe using the min & max of qual values (the full range) for all sequences in the fastq file would help prevent these mis-calls?
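
The suggestion of scanning the full range can be sketched in Python as follows. This is not the algorithm used by fqtools type. The function name guess_encoding is hypothetical, the thresholds are the commonly quoted ASCII ranges (Sanger from 33, Solexa from 59, Illumina 1.3+ from 64), and the returned labels follow the Bio.SeqIO format names.

```python
import io

def guess_encoding(handle):
    """Guess the quality encoding from the lowest and highest character seen."""
    lowest, highest = 255, 0
    for index, line in enumerate(handle):
        if index % 4 != 3:
            continue  # assumption: four-line records, quality on line 4
        codes = [ord(ch) for ch in line.rstrip("\n")]
        if codes:
            lowest = min(lowest, min(codes))
            highest = max(highest, max(codes))
    if lowest > highest:
        raise ValueError("no quality characters found")
    if lowest < 33 or highest > 126:
        raise ValueError("quality characters outside the printable range")
    if lowest < 59:
        return "fastq-sanger"
    if lowest < 64:
        return "fastq-solexa"
    return "fastq-illumina"
```

Even a full scan remains a heuristic: a Sanger-encoded file containing only high-quality bases has no character below 64 and would still be labelled fastq-illumina.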

make test error

make tests
mkdir -p bin
cc -O2 -g -L/bio/linuxbrew/opt/htslib -o./bin/fqtools src/fqprocess_view.o src/fqprocess_head.o src/fqprocess_count.o src/fqprocess_blockview.o src/fqprocess_fasta.o src/fqprocess_basetab.o src/fqprocess_qualtab.o src/fqprocess_lengthtab.o src/fqprocess_type.o src/fqprocess_validate.o src/fqprocess_find.o src/fqprocess_trim.o src/fqprocess_qualmap.o src/fqbuffer.o src/fqfile.o src/fqfsin.o src/fqfsout.o src/fqfileprep.o src/fqparser.o src/fqgenerics.o src/fqhelp.o src/fqtools.o -lz -lhts -lm
cc -O2 -g -L./src -i/bio/linuxbrew/opt/htslib -o./tests/test-fqbuffer fqtools tests/test-fqbuffer.c -lz -lhts -lm
cc: error: fqtools: No such file or directory
cc: error: unrecognized command line option '-i/bio/linuxbrew/opt/htslib'

htslib failed to install but this fixed it

Hi, during the "./configure" step for htslib I got the following error: "config.status: error: cannot find input file: `config.h.in'"
I fixed this by also running "autoheader" before configure.

add a new feature 'split fq'

Hi,

I have some large fq.gz files which take a long time to align, so I tried splitting them into smaller files. This worked, but my script consumes a lot of memory. Would you consider adding a new "split fq" feature to your tools?
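
Splitting does not need much memory if the input is streamed one line at a time. The Python sketch below illustrates this. It is not an fqtools feature: the function name split_fastq and the output naming scheme are hypothetical, and records are assumed to span exactly four lines.

```python
import gzip

def split_fastq(path, reads_per_file, prefix):
    """Split a gzipped FASTQ file into chunks; return the output paths."""
    outputs = []
    target = None
    with gzip.open(path, "rt") as source:
        for index, line in enumerate(source):
            # Start a new output file every reads_per_file records.
            if index % (4 * reads_per_file) == 0:
                if target is not None:
                    target.close()
                out_path = f"{prefix}.{len(outputs):04d}.fastq.gz"
                target = gzip.open(out_path, "wt")
                outputs.append(out_path)
            target.write(line)
    if target is not None:
        target.close()
    return outputs
```

Only one line is held at a time, so memory use does not grow with the size of the input or of the chunks.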

Any support for fq fix?

Hi,

I am using fqtools to validate FASTQ files for our pipeline. I find fqtools validate very useful and fast. I would like to know whether there is any support for automatic error fixing (at least for some types of error, e.g. unpaired reads). Thanks a lot!

PS: I had a problem linking htslib, so my solution was to use bioconda to install fqtools, which is completely automatic.

-hh1985

find fastq reads from specific sequences

Hello,

I have lists of sequences, and I would like to find the FASTQ reads that contain them.

Would it be possible to use the fqtools find subcommand to do this?

My list of sequences looks like the following:

GATAAAAAAAAAAAAAAAC
GATAAAAAAAAAAAAAACC
GATAAAAAAAAAAAAAATC
GATAAAAAAAAAAAAAAGC
GATAAAAAAAAAAAAACAC
GATAAAAAAAAAAAAACCC
GATAAAAAAAAAAAAACTC
GATAAAAAAAAAAAAATAC
GATAAAAAAAAAAAAATCC
GATAAAAAAAAAAAAATGC
GATAAAAAAAAAAAAAGAC
GATAAAAAAAAAAAAAGCC
GATAAAAAAAAAAAAAGGC
GATAAAAAAAAAAAACAAC
GATAAAAAAAAAAAACACC
GATAAAAAAAAAAAACCAC
GATAAAAAAAAAAAACCCC
GATAAAAAAAAAAAACCTC
GATAAAAAAAAAAAATAAC
GATAAAAAAAAAAAATCAC
GATAAAAAAAAAAAATTAC
GATAAAAAAAAAAAAGAAC
GATAAAAAAAAAAAAGACC
GATAAAAAAAAAAACAAAC
GATAAAAAAAAAAACCCCC
GATAAAAAAAAAAATAAAC
GATAAAAAAAAAAAGAAAC
GATAAAAAAAAAACAAAAC
.
.
.
.

I have used grep to do this one by one but it's taking too long
grep -A 2 -B 1 "CTCAAAAAAAAACAAAGGA" input.fastq |grep -v "^\-\-$" > output.fastq
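
Running one search per sequence rereads the whole file each time. A single pass that tests every read against the full list can be sketched in Python as below. It illustrates the approach only and says nothing about what fqtools find supports. The function name find_reads is hypothetical, and records are assumed to span exactly four lines with non-empty headers.

```python
import io

def find_reads(handle, patterns):
    """Yield (header, sequence, plus, quality) for reads containing any pattern."""
    # Group patterns by length so each read window needs one set lookup.
    by_length = {}
    for pattern in patterns:
        by_length.setdefault(len(pattern), set()).add(pattern)
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:
            break  # end of input
        sequence = record[1]
        if any(
            sequence[start:start + size] in wanted
            for size, wanted in by_length.items()
            for start in range(len(sequence) - size + 1)
        ):
            yield tuple(record)
```

Because the patterns are held in sets, the cost per read depends on the read length and the number of distinct pattern lengths rather than on the number of patterns.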

an empty sequence line still passes

The second entry of this FASTQ file still passes the validator. Is this intentional?

@LD5V2:07687:11026
CGGGGGTCTTAGCTTTGGCTCTCCTTGCAAAGTTATTTCTAGTTAATTCATTATGCAGAAGGTATAGGGGTTAGTCCTTGCTTATATTATGCTTGGTTATAATTTTTCATCTTTCCCTTGCGGTACTATATCTATTGCGACCA
+
35977*6772999990959:;:<6<.53::19;39891845..*-36159<6;::::;;6;6>:95333)52577957*...*/6774999894726268858=>=-99:99:2;>3>5:::5:;;<;;::;7/,*,3***+,
@LD5V2:07687:11043

+

@LD5V2:07688:11020
AAAATTTAACACCCATAGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTTCAAGCCCAACACCCACTACCTAAAAAATCCCAAACATATAACTGAACT
+
:::0:?4>7==<<4<;;;=<=7;78888.9;@><5;;4:4:4:4:;;2;<<<6<<6<<=4>6;;<=2;;;;;5565533'5::/;<2<?ABD?7<<<;5<=
