raymondkiu / sequence-stats

Generate statistics from FASTQ and FASTA files. Also manipulate sequences such as renaming contigs and converting FASTQ to FASTA. Written in Bash. All in one place.

License: GNU General Public License v3.0

sequence-stats's Introduction

sequence-stats

A fast and beginner-friendly program, written in AWK and Bash, that generates statistics from FASTQ and FASTA files, e.g. genome assembly size and GC content (%). It can also manipulate sequences: renaming contigs, extracting contigs and converting FASTQ to FASTA. No specific dependencies are required, and it should run without problems on Linux. All in one place. It has been tested on microbial/prokaryotic sequences only and might not handle very large files (>10GB), which may take too long to process.

If you are new to bioinformatics and not a Python/Perl/C/C++ programmer, and would like Linux-based software with no complex installation, special libraries or dependencies, this is the right tool for you. For small microbial genomes, its efficiency is comparable to Perl/Python-based tools, and it should produce results within a few seconds per file.

Installation

Simply download the package and run the program in a Unix environment. The source code is in the src directory.

Usage

Please note that sequence-stats does not handle gzipped inputs. Extensions such as .fasta, .fna or .fastq are not required.
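
Since gzipped inputs are not handled, one option is to decompress to a temporary plain-text file first. The helper below is a minimal sketch; `decompress_to_tmp` and the file name `reads.fastq.gz` are hypothetical and not part of sequence-stats.

```shell
# Sketch: decompress a gzipped FASTA/FASTQ into a temporary plain-text file
# and print the path of that file. Hypothetical helper, not part of the tool.
decompress_to_tmp() {
    gz_file=$1
    tmp_file=$(mktemp) || return 1
    # gzip -dc writes the decompressed data to stdout and leaves the .gz untouched
    gzip -dc "$gz_file" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }
    printf '%s\n' "$tmp_file"
}

# Example usage (reads.fastq.gz is a placeholder file name):
#   plain=$(decompress_to_tmp reads.fastq.gz)
#   ./sequence-stats -q "$plain"
#   rm -f "$plain"
```

The Sample field in the output would likely show the temporary file's name rather than the original one.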

Sequence-stats generates statistics from FASTQ reads or FASTA assemblies

For user manual please go to: https://github.com/raymondkiu/sequence-stats

Usage: sequence-stats [options] FASTA/FASTQ

Options:
 -a Print FASTA stats
 -q Print FASTQ stats
 -t Convert FASTQ to FASTA. Usage: ./sequence-stats -t FASTQ > NEWFILENAME
 -c Print individual contig's stats (FASTA)
 -d Dereplicate contigs in (multi)FASTA. Usage: ./sequence-stats -d FASTA > NEWFILENAME
 -n Rename contigs. Usage: ./sequence-stats -n FASTA PREFIX > NEWFILENAME
 -b Print FASTA stats in tabular format
 -r Print FASTQ stats in tabular format
 -e Extract contig(s) from FASTA sequences. Usage: ./sequence-stats -e CONTIG-IDs.txt FASTA > NEWFILENAME
 -s Print summary of FASTA tabular stats of multiple files using common suffix. Usage: ./sequence-stats -s SUFFIX > NEWFILENAME
 -f Filter FASTA sequences by length. Usage: ./sequence-stats -f FASTA 500 > NEWFILENAME
 -h Print usage and exit
 -v Print version and exit
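
To illustrate the kind of transformation option -t performs, here is a minimal AWK sketch of a FASTQ-to-FASTA conversion. It is not the tool's own code; the function name `fastq_to_fasta` is hypothetical, and the sketch assumes strict four-line FASTQ records (no wrapped sequence lines).

```shell
# Sketch of a FASTQ-to-FASTA conversion; assumes 4 lines per record:
# @header, sequence, +, quality.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header line: swap leading @ for >
         NR % 4 == 2 { print }                     # sequence line: copy unchanged
        ' "$1"
}

# Example usage (file names are placeholders):
#   fastq_to_fasta reads.fastq > reads.fasta
```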

Example 1: FASTA stats

Use option -a to generate stats for a FASTA file, such as a genome assembly. The output is grep-friendly; for a tabular format, use option -b (see the next example). Tested on FASTA genomes larger than 10MB.

$ ./sequence-stats -a CA.fna 
Sample: CA-20.fna
Genome(bp): 2220029
Contigs: 17
GC(%): 62.76
A(%): 18.45
T(%): 18.78
G(%): 31.02
C(%): 31.73
N50: 360689
Max Contig: 898564
Min Contig: 792
N count: 10
N(%): .0004
Gap(-): 5
Uncertain(bp): 17
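
For readers curious how figures like these can be derived, the following is a minimal AWK sketch that computes genome size, contig count and GC content from a FASTA file. It is not the tool's implementation: `fasta_basic_stats` is a hypothetical name, and taking GC(%) over all characters (including N and gaps) is an assumption.

```shell
# Sketch: genome size, contig count and GC% from a (multi)FASTA file.
fasta_basic_stats() {
    awk '/^>/ { contigs++; next }               # each header line starts a contig
         {
             seq = toupper($0)                  # treat soft-masked bases like upper case
             total += length(seq)
             gc += gsub(/[GC]/, "", seq)        # gsub returns the number of matches replaced
         }
         END {
             if (total == 0) exit 1             # no sequence data found
             printf "Genome(bp): %d\nContigs: %d\nGC(%%): %.2f\n", total, contigs, 100 * gc / total
         }' "$1"
}

# Example usage: fasta_basic_stats CA.fna
```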

Example 2: FASTA stats in tabular format

$ ./sequence-stats -b CA.fna 
SampleID	Genome	Contigs	GC(%)	A(%)	T(%)	G(%)	C(%)	N50	Max	Min	Ncount	N(%)	Gap(-)	Uncertain(bp)	
CA-20.fna	2220029	17	62.76	18.45	18.78	31.02	31.73	360689	898564	792	10	.0004	5	17
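
The tabular output is tab-separated with a header row, so a column can be selected by name rather than by position. A small sketch, where `pick_column` is a hypothetical helper and the column name must match the header text exactly:

```shell
# Sketch: print the first column (SampleID) plus one named column
# from tab-separated output with a header row, read from stdin.
pick_column() {
    awk -F '\t' -v want="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == want) col = i; next }
        col     { print $1 "\t" $col }          # only prints if the column was found
    '
}

# Example usage: ./sequence-stats -b CA.fna | pick_column N50
```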

Example 3: FASTQ stats

sequence-stats generates basic FASTQ stats, mainly so you can determine the total read count, total bases and read lengths. Also available in tabular format (option -r). Tested on FASTQ files larger than 300MB.

$ ./sequence-stats -q V17.fastq 
Sample: V17.fastq
File size: 95M
Total bases: 37574330
Reads: 125198
Max read length: 301
Min read length: 47
Mean read length: 300.119
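
The mean read length above is simply total bases divided by reads (37574330 / 125198 ≈ 300.119). As an illustration, the following AWK sketch derives the same kind of figures from a FASTQ file. It is not the tool's own code: `fastq_stats` is a hypothetical name, and strict four-line FASTQ records are assumed.

```shell
# Sketch: read count, total bases and read-length summary from a FASTQ file.
fastq_stats() {
    awk 'NR % 4 == 2 {                          # line 2 of each record is the sequence
             n = length($0)
             reads++
             bases += n
             if (reads == 1 || n < min) min = n
             if (n > max) max = n
         }
         END {
             if (reads == 0) exit 1             # no reads found
             printf "Total bases: %d\nReads: %d\nMax read length: %d\nMin read length: %d\nMean read length: %.3f\n", bases, reads, max, min, bases / reads
         }' "$1"
}

# Example usage: fastq_stats V17.fastq
```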

Example 4: Individual contig's stats

Option -c prints each contig's ID followed by its length.

$ ./sequence-stats -c CA.fna 

NODE_1_length_898552_cov_25.252293	898564
NODE_2_length_360689_cov_22.619544	360689
NODE_3_length_310889_cov_25.950533	310889
NODE_4_length_236964_cov_27.060105	236964
NODE_5_length_137416_cov_24.247934	137416
NODE_6_length_82085_cov_21.715723	82085
NODE_7_length_74446_cov_30.811911	74446
NODE_8_length_43698_cov_36.907430	43698
NODE_9_length_38558_cov_20.173728	38558
NODE_10_length_22311_cov_54.678330	22311
NODE_11_length_5573_cov_99.941594	5573
NODE_12_length_2561_cov_142.282609	2561
NODE_13_length_1727_cov_27.903636	1727
NODE_14_length_1526_cov_48.884748	1526
NODE_15_length_1181_cov_28.868659	1181
NODE_16_length_1049_cov_51.163580	1049
NODE_17_length_792_cov_66.374825	792
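
These per-contig lengths are enough to reproduce the N50 reported in Example 1: sorted from longest to shortest, the first two contigs (898564 + 360689 = 1259253) already cover at least half of the 2220029 bp total, so N50 is 360689. A minimal sketch of that calculation, assuming input lines of the form "ID length" on stdin; `n50_from_lengths` is a hypothetical helper, not part of sequence-stats.

```shell
# Sketch: N50 from "ID <whitespace> length" lines, such as the -c output.
n50_from_lengths() {
    awk 'NF >= 2 { print $NF }' |               # keep the length field, skip blank lines
    sort -rn |                                  # longest contig first
    awk '{ len[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 running += len[i]
                 # N50 is the contig length at which the running sum reaches half the total
                 if (2 * running >= total) { print len[i]; exit }
             }
         }'
}

# Example usage: ./sequence-stats -c CA.fna | n50_from_lengths
```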

Issues

Please report any issues to the issues page.

Citation

If you use sequence-stats for results in your publication, please cite:

  • Kiu R. sequence-stats: generate sequence statistics from FASTA and FASTQ files. GitHub: https://github.com/raymondkiu/sequence-stats. DOI: https://doi.org/10.6084/m9.figshare.16950775

License

sequence-stats is free software licensed under GPLv3.

Author

Raymond Kiu | [email protected] | @raymond_kiu

sequence-stats's Issues

-a option shows incorrect data.

When computing the stats for a multifasta file, the field "Min Contig:" shows the length of the last contig, not the smallest one.

~/bin/sequence-stats -a assembly_MAIOR_QUE_100K.fasta
Sample: assembly_MAIOR_QUE_100K.fasta
Genome(bp): 248306121
Contigs: 1104
GC(%): 45.64
A(%): 27.18
T(%): 27.17
G(%): 22.81
C(%): 22.83
N50: 261916
Max Contig: 402137
Min Contig: 349401
N count: 0
N(%): 0
Gap(-): 0
Uncertain(bp): 0

~/bin/sequence-stats -c assembly_MAIOR_QUE_100K.fasta | tail
contig_9766 117494
contig_9805 199014
contig_9816 106145
contig_9878 116378
contig_9886 250030
contig_991 101688
contig_9941 381455
contig_9951 145570
contig_9955 343603
contig_9982 349401
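
As a cross-check for the reported value, the smallest contig can be computed directly from the -c output. This is a sketch of a possible workaround, not a fix to the tool; `min_contig` is a hypothetical helper that reads "ID length" lines on stdin.

```shell
# Sketch: smallest contig length from "ID <whitespace> length" lines.
min_contig() {
    awk 'NF >= 2 {
             n = $NF + 0                        # force numeric comparison
             if (!seen || n < min) { min = n; seen = 1 }
         }
         END { if (seen) print min }'
}

# Example usage: ~/bin/sequence-stats -c assembly_MAIOR_QUE_100K.fasta | min_contig
```

Applied to just the ten contigs listed above, this prints 101688 (contig_991), not 349401.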

some commands not found in sequence-stats

Hi, I used git clone to install sequence-stats, gave it executable permission and ran it on my file. Can you please help with the "command not found" issues? See below. Sincerely.

~$ sequence-stats/src/sequence-stats -a Tiero.fna
sequence-stats/src/sequence-stats: line 53: bc: command not found
sequence-stats/src/sequence-stats: line 54: bc: command not found
sequence-stats/src/sequence-stats: line 55: bc: command not found
sequence-stats/src/sequence-stats: line 57: bc: command not found
sequence-stats/src/sequence-stats: line 59: bc: command not found
sequence-stats/src/sequence-stats: line 62: bc: command not found
sequence-stats/src/sequence-stats: line 68: bc: command not found
Sample: Tiero.fna
Genome(bp): 4267658
Contigs: 1
GC(%):
A(%):
T(%):
G(%):
C(%):
N50: 4267658
Max Contig: 4267658
Min Contig: 4267658
N count: 0
N(%):
Gap(-): 0
Uncertain(bp):
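
The errors above show that the script calls bc, which some minimal Linux installations do not include; the percentage fields stay empty when it is missing. A small pre-flight check can be sketched as follows. `check_deps` is a hypothetical helper, and checking only awk and bc is an assumption, since the README does not list the utilities the script uses.

```shell
# Sketch: report any commands from the argument list that are not on PATH.
# Returns non-zero if at least one is missing.
check_deps() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage:
#   check_deps awk bc || echo "install the missing tools before running sequence-stats"
```

On Debian or Ubuntu, bc is available as the bc package.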
