sanger-pathogens / fastaq

Python3 scripts to manipulate FASTA and FASTQ files

fastaq's Introduction

Fastaq

Manipulate FASTA and FASTQ files

License: GPL v3

Introduction

A Python3 script to manipulate FASTA and FASTQ (and other format) files, plus an API for developers.

Installation

There are a number of ways to install Fastaq and details are provided below. If you encounter an issue when installing Fastaq please contact your local system administrator. If you encounter a bug please log it here or email us at [email protected].

Using pip3

pip3 install pyfastaq

From source

Download the latest release from this GitHub repository, or clone the repository. Then run the tests:

python3 setup.py test

If the tests all pass, install:

python3 setup.py install

Running the tests

The tests can be run from the top-level directory:

python3 setup.py test

Runtime dependencies

These must be available in your path at run time:

  • samtools 0.1.19
  • gzip
  • gunzip

Usage

The installation will put a single script called fastaq in your path. The usage is:

fastaq <command> [options]

Key points:

  • To list the available commands and brief descriptions, just run fastaq
  • Use fastaq <command> -h or fastaq <command> --help to get a longer description and the usage of that command.
  • The type of input file is automatically detected. Currently supported: FASTA, FASTQ, GFF3, EMBL, GBK, Phylip.
  • fastaq only manipulates sequences (and quality scores if present), so annotation is ignored where present in the input.
  • Input and output files can be gzipped. An input file is assumed to be gzipped if its name ends with .gz. To gzip an output file, just name it with .gz at the end.
  • You can use a minus sign for a filename to use stdin or stdout, so commands can be piped together. See the example below.
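
The file naming conventions in the key points above can be sketched in a few lines of Python. This is only an illustration of the convention, not code taken from pyfastaq: the helper name open_by_name is hypothetical, and it assumes text-mode access is wanted.

```python
import gzip
import sys


def open_by_name(filename, mode="r"):
    """Open a file following the naming conventions described above.

    Hypothetical helper for illustration only (not part of pyfastaq):
    "-" means stdin or stdout, and a name ending in ".gz" is treated
    as gzipped.
    """
    if mode not in ("r", "w"):
        raise ValueError("mode must be 'r' or 'w'")
    if filename == "-":
        # A minus sign selects the standard streams, so commands can be piped
        return sys.stdin if mode == "r" else sys.stdout
    if filename.endswith(".gz"):
        # "rt" / "wt" give text-mode access to the compressed file
        return gzip.open(filename, mode + "t")
    return open(filename, mode)
```

With a helper like this, writing to a name such as out.fasta.gz produces gzipped output, and passing - as the name reads from stdin or writes to stdout.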

Examples

Reverse complement all sequences in a file:

fastaq reverse_complement in.fastq out.fastq

Reverse complement all sequences in a gzipped file, then translate each sequence:

fastaq reverse_complement in.fastq.gz - | fastaq translate - out.fasta
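
For readers unfamiliar with the operation, here is a minimal sketch of what reverse complementing a sequence means. It is not fastaq's implementation: the function names are made up for this example, and it only handles the letters A, C, G, T and N (upper or lower case), leaving any other character unchanged.

```python
# Complement table for unambiguous bases and N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    # Complement every base, then reverse the order of the bases
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_quality(qual):
    """Reverse a FASTQ quality string so it stays aligned with the bases."""
    return qual[::-1]
```

For a FASTQ record, the quality string has to be reversed along with the bases so that each score still refers to the same position.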

Available commands

Command                 Description
acgtn_only              Replace every non acgtnACGTN with an N
add_indels              Deletes or inserts bases at given position(s)
caf_to_fastq            Converts a CAF file to FASTQ format
capillary_to_pairs      Converts file of capillary reads to paired and unpaired files
chunker                 Splits sequences into equal sized chunks
count_sequences         Counts the sequences in input file
deinterleave            Splits interleaved paired file into two separate files
enumerate_names         Renames sequences in a file, calling them 1, 2, 3, etc.
expand_nucleotides      Makes every combination of degenerate nucleotides
fasta_to_fastq          Convert FASTA and .qual to FASTQ
filter                  Filter sequences to get a subset of them
get_ids                 Get the ID of each sequence
get_seq_flanking_gaps   Gets the sequences flanking gaps
interleave              Interleaves two files, output is alternating between fwd/rev reads
make_random_contigs     Make contigs of random sequence
merge                   Converts multi sequence file to a single sequence
replace_bases           Replaces all occurrences of one letter with another
reverse_complement      Reverse complement all sequences
scaffolds_to_contigs    Creates a file of contigs from a file of scaffolds
search_for_seq          Find all exact matches to a string (and its reverse complement)
sequence_trim           Trim exact matches to a given string off the start of every sequence
sort_by_name            Sorts sequences in lexicographical (name) order
sort_by_size            Sorts sequences in length order
split_by_base_count     Split multi sequence file into separate files
strip_illumina_suffix   Strips /1 or /2 off the end of every read name
to_fake_qual            Make fake quality scores file
to_fasta                Converts a variety of input formats to nicely formatted FASTA format
to_mira_xml             Create an xml file from a file of reads, for use with Mira assembler
to_orfs_gff             Writes a GFF file of open reading frames
to_perfect_reads        Make perfect paired reads from reference
to_random_subset        Make a random sample of sequences (and optionally mates as well)
to_tiling_bam           Make a BAM file of reads uniformly spread across the input reference
to_unique_by_id         Remove duplicate sequences, based on their names. Keep longest seqs
translate               Translate all sequences in input nucleotide sequences
trim_Ns_at_end          Trims all Ns at the start/end of all sequences
trim_contigs            Trims a set number of bases off the end of every contig
trim_ends               Trim fixed number of bases off start and/or end of every sequence
version                 Print version number and exit

For developers

Here is a template for counting the sequences in a FASTA or FASTQ file:

from pyfastaq import sequences

infile = 'in.fastq'  # path to a FASTA or FASTQ file (may be gzipped)
seq_reader = sequences.file_reader(infile)
count = 0
for seq in seq_reader:
    count += 1
print(count)

Hopefully you get the idea; there are plenty of examples in tasks.py. Detection of the input file type, and of whether the file is gzipped, is automatic. See help(sequences) for the various methods already defined in the classes Fasta and Fastq.
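
The same pattern extends to other per-sequence statistics. The sketch below counts sequences and total bases. The function names are invented for this example, and it assumes each record yielded by the reader exposes its bases as a string attribute called seq. Check help(sequences) to confirm this for your installed version.

```python
def summarise(records):
    """Return (number of sequences, total bases) for an iterable of records.

    Assumption: each record has a string attribute `seq` holding its bases.
    """
    count = 0
    total_bases = 0
    for record in records:
        count += 1
        total_bases += len(record.seq)
    return count, total_bases


def summarise_file(infile):
    """Apply summarise() to a FASTA/FASTQ file read through pyfastaq."""
    # Imported here so that summarise() itself can be used without pyfastaq
    from pyfastaq import sequences

    return summarise(sequences.file_reader(infile))
```

Keeping the counting logic separate from the file reading means it can be tried out on any iterable of record-like objects.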

License

Fastaq is free software, licensed under GPLv3.

Feedback/Issues

Please report any issues to the issues page or email [email protected].

fastaq's People

Contributors

andrewjpage, aslett1, bewt85, js21, jssoares, martinghunt, mbhall88, satta, ssjunnebo, trstickland, vaofford

fastaq's Issues

Split input files into batches

Hi,
I realized that a fastaq batch command would be handy, to split input files into batches of N entries per output file. For example, to split an input FASTA file into many files with, say, 1000 entries each.
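
As a rough illustration of the batching being requested, the grouping step could look like the sketch below. This is not an existing fastaq command or API: the function name batches is hypothetical, and writing each batch to its own output file is left to the caller.

```python
from itertools import islice


def batches(records, batch_size):
    """Yield lists of at most batch_size consecutive records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # Take the next batch_size records; the last batch may be shorter
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because it works on any iterable, it could be fed from a sequence reader without loading the whole input into memory.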

[python 2.7 issues]

I can only use Python 2.7 (as my server administrator insists).

So will it work under Python 2.7?

Ambiguous nucleotides treated literally by sequence_trim

Ambiguous nucleotides are treated literally by sequence_trim. For example, specifying AAAN to be trimmed will trim "AAAN" from sequence ends, but not "AAAA", "AAAC", etc. That is probably fair enough, but it might warrant a warning message on the terminal in case the user is expecting ambiguity codes to be interpreted.

For sequence_trim it is unclear what the format for trim_seqs should be

Apologies if this is extremely basic, I'm new to sequence data/file manipulations at this level.

I've tried formatting the trim_seqs input file as FASTA and as plain text, but nothing works. Unless I'm missing something or doing something painfully stupid in my current state of ignorance...

Any help would be appreciated.

fastaq: ValueError: empty range for randrange() (191,191, 0)

Hi,
it seems fastaq breaks on some lines. Second, could there be an option to keep the remainder sequence, instead of discarding it?

...
Warning, sequence  1632257 191 25609  too short.  Skipping it...
Warning, sequence  1632262 318 5425  too short.  Skipping it...
Warning, sequence  1632263 187 200  too short.  Skipping it...
Warning, sequence  1632264 282 990  too short.  Skipping it...
Warning, sequence  1632268 319 3229  too short.  Skipping it...
Warning, sequence  1632275 326 2232  too short.  Skipping it...
Warning, sequence  1632278 319 2078  too short.  Skipping it...
Warning, sequence  1632279 311 1726  too short.  Skipping it...
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.5/fastaq", line 71, in <module>
    exec('pyfastaq.runners.' + task + '.run("' + tasks[task] + '")')
  File "<string>", line 1, in <module>
  File "/usr/lib64/python3.5/site-packages/pyfastaq/runners/to_perfect_reads.py", line 51, in run
    middle_pos = random.randint(ceil(0.5 *isize), floor(len(ref) - 0.5 * isize))
  File "/usr/lib64/python3.5/random.py", line 227, in randint
    return self.randrange(a, b+1)
  File "/usr/lib64/python3.5/random.py", line 205, in randrange
    raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (191,191, 0)
$ fastaq version
3.17.0
$
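
The failing call in the traceback is random.randint(lo, hi) with lo greater than hi. The sketch below shows one possible guard. It is not a patch from the project: the function name and the choice to return None are assumptions made for illustration, with isize and the reference length taken from the line shown in the traceback.

```python
import random
from math import ceil, floor


def pick_middle_pos(ref_length, isize, rng=random):
    """Pick a random fragment midpoint, or None if no position is valid.

    Mirrors the bounds used in the traceback line, but checks them first.
    """
    lo = ceil(0.5 * isize)
    hi = floor(ref_length - 0.5 * isize)
    if lo > hi:
        # randint(lo, hi) would raise ValueError here, so skip instead
        return None
    return rng.randint(lo, hi)
```

For instance, a reference of length 381 with an insert size of 381 gives bounds of 191 and 190, which matches the empty range (191, 191) reported by randrange.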

Tests pass with `pytest`

Nose, which the package depends on, will no longer be maintained in the future, and has had issues with newer Python versions.

I switched the Gentoo package of Fastaq from nose to pytest, and it appears to work, with all tests passing: https://ppb.chymera.eu/a23c8b.log

I would recommend making the switch here as well, though for our part, it's fully handled downstream.

fastq reverse?

Hello,
I see that among the commands you can reverse complement your sequences. I was wondering if there is a way to only reverse your sequences.

Thank you!!
