lh3 / bioawk Goto Github PK

View Code? Open in Web Editor NEW

590.0 590.0 120.0 130 KB

BWK awk modified for biological data

C 85.25% Makefile 1.13% Roff 6.30% Yacc 7.32%

bioinformatics sequence-analysis

bioawk's People

Contributors

Stargazers

Watchers

Forkers

arq5x brentp vsbuffalo sjackman hensonmw ctskennerton zhilongjia xtmgah snewhouse preciserobot guochangjiang al3n70rn vd4mmind aseetharam b-rich zeeshanali fw1121 srividya22 biocyberman mgalland joshuashen zengfengbo rosaak salviadr homopolymer zzygyx9119 rnayoubing snashraf conerade67 bioinfolover conradstack b1234561 realityjunky tarah28 alenzhao swatikaushik tomdeman-bio langys noahpieta scchess dpryan79 tgjoen yixf-self haroon123 variousreferences conchoecia junhuili tianyunwang tankmermaid rnandety gavinwinner biterbilen fernandalpcosta antoniogps healthvivo basesloaded hrk2109 ipstone pespindola luhuifang asavelar mikecormier inambioinfo anandksrao lshapiro31 maxlcummins neptuneyt weihongx conskier guohaoga baicalin hyinli aspirincode fengyq jessica-2019 jgbaldwinbrown mza0150 yepl rheannna xjyx donalbonny pythseq hiruna72 hzm220 yaxche-io hnnd001 imxpm qishi12345410 xh123-xyq tw7649116 wook2014 mmp3 xyl012 moonchangin ksajan wnq13579 liupfskygre jing-xinxing evgeni-nikolaev kim-fehl

bioawk's Issues

revcomp bug?

Hi,
May be I am missing something. But below seemed weird to me:

bioawk 'BEGIN{a="TTTT" ; print a , revcomp(a) }'
TTTT AAAA

bioawk 'BEGIN{a="TTTT" ; b=revcomp(a); print a,b }'
AAAA AAAA

I am using a Dell PE2900 server, running CentOS 6.

Bernardo

recipe for target 'ytab.o' failed

Hi
I want to install biohawk on the latest ubuntu (18.04) but I am getting this error after running make;

~/bioawk$ make
yacc -d awkgram.y
make: yacc: Command not found
Makefile:50: recipe for target 'ytab.o' failed
make: *** [ytab.o] Error 127

Any thoughts? (sorry if it is an easy fix)
Thanks

Tag a stable release version

Please tag a stable release version. Thanks!

compilation fatal error - missing zlib.h

Hi,

I tried to compile bioawk after cloning it from this repo and installing byacc (using sudo apt install byacc on elementary OS Hera / Ubuntu 18.04), but I got fatal error:

jena@cloudbook:~/bin/bioawk$ make
yacc -d awkgram.y
yacc: 43 shift/reduce conflicts, 85 reduce/reduce conflicts.
mv y.tab.c ytab.c
mv y.tab.h ytab.h
gcc -g -Wall -O2 -c ytab.c
gcc -g -Wall -O2   -c -o b.o b.c
gcc -g -Wall -O2   -c -o main.o main.c
gcc -g -Wall -O2   -c -o parse.o parse.c
gcc -g -Wall -O2 maketab.c -o maketab
./maketab >proctab.c
gcc -g -Wall -O2   -c -o proctab.o proctab.c
gcc -g -Wall -O2   -c -o tran.o tran.c
gcc -g -Wall -O2   -c -o lib.o lib.c
gcc -g -Wall -O2   -c -o run.o run.c
gcc -g -Wall -O2   -c -o lex.o lex.c
gcc -g -Wall -O2   -c -o addon.o addon.c
addon.c:241:10: fatal error: zlib.h: Adresář nebo soubor neexistuje
 #include <zlib.h> /* FIXME: it would be better to drop this dependency... */
          ^~~~~~~~
compilation terminated.
make: *** [<vestavěný>: addon.o] Chyba 1

Any help would be welcome.

Installation issue

I installed it as below...

sudo apt-get install bison
git clone https://github.com/lh3/bioawk.git
cd bioawk
make

Everything works well, but when I typed "bioawk" even under bioawk folder, got a error message "bioawk: command not found". How can I fix this?

How to cite bioawk?

Hi,
How can I cite bioawk? I am using it for my analysis.
Best Regards
Zillur

Segmentation fault (core dumped)

When i run a test order:
bioawk -c fastx 'END{print NR}' chr17_fragm.fasta
it says:
Segmentation fault (core dumped)
How should i do? Thanks very much for every help.

Hej,
thanks a lot for the tool, I really love it! But I have one question I couldn't find a proper way to solve it:
How can I work on the third line in a fastq file (the +-comment line)? When manipulating fastx one can access the 4 variables 1:name 2:seq 3:qual 4:comment, of which the $comment-variable does not mean the comment line but everything behind the name in the read-header. Is it possible to access the values in the comment-line though?

Thanks a lot and best regards!
Philipp

bioawk trimming the protein sequences in the start

Hello,

I am trying to convert the .faa format protein sequences into OrthoMCL readable format (organism_ID|protein_ID) using the bioawk -c fastx '{ print ">GMI1000|"$name; print $seq }'. I am only getting the results with around 900 sequences out of 4000. I found that bioawk is not reading the sequences from first 3000 proteins in the .faa format.
Is there any way to solve this problem?

Thank you very much in advance!!

Manual in rst format

Hi Heng,

what do you think of also adding the manual in restructured text format like this:

https://github.com/ialbert/bioawk/blob/master/README.bio.rst

as you can see GitHub recognizes it and automatically generates html webpage from it. It makes easy to document the project.

header does not respect FS flag in BEGIN

this will demonstrate the problem:

Given his text file

echo $'a space\tb\n1\t2\n1\t2' > t.txt

and this code:

./awk 'BEGIN{FS="\t"}{ print $b }' t.txt

I expect:

b
2
2

but it instead gives nothing because it splits the headers on space sets them as are a, space,b
even though the separator is '\t'. This is because the headers are parsed before the BEGIN block
sets FS.

bioawk vs awk give different results (locale-specific)

Hi,

I'm trying to check my VCF filtering and I wanted to play around with bioawk but I noticed it gives different results than regular awk present on my system (it's a server that should have OpenSUSE according to admin).

See this example:

grep -v '#' r123_newrbmcl100_fb.vcf | awk '$6>20 {print $1,$2,$6}' | head
E57432_L179 11 7368.02
E57432_L179 19 8461.42
E57432_L179 35 12953.8
E57432_L179 43 14922.7
E57432_L179 51 19613.4
E57432_L179 62 20413.3
E57432_L179 70 20282.5
E57432_L179 108 25482.3
E57432_L179 118 27017.1
E57432_L179 149 24639.4

compared to:

grep -v '#' r123_newrbmcl100_fb.vcf | bioawk '$6>20 {print $1,$2,$6}' | head
E57432_L179 11 7368.02
E57432_L179 19 8461.42
E57432_L179 62 20413.3
E57432_L179 70 20282.5
E57432_L179 108 25482.3
E57432_L179 118 27017.1
E57432_L179 124 9.09527e-05
E57432_L179 128 4.50905e-05
E57432_L179 129 3.73425e-05
E57432_L179 138 4.61921e-05

Note the only difference between the commands is replacing awk with bioawk - no special features of bioawk are called. However I noticed this originally when I did use the features (namely -tc hdr and calling variables with colnames) and I got the same wrong result. I then tested it as above.

For some reason bioawk doesn't print some lines that pass the $6>20 filter (like positions 35, 43, and 51) while it prints possitions that should not pass the filter (positions 124-138). System awk prints it correctly.

I installed bioawk using conda manager and bioconda channel. The version is 1.0 and build h84994c4_4.

Thanks for any input on what could go wrong.

bioawk does not stop parsing a file on `nextfile`

Scenario: I have 4 FASTQ files in my current directory, and I want to print the first 10 seq names of each file. Ideally, the following command should work very quickly:

bioawk -c fastx 'FNR<11 {print $name} FNR==11{nextfile}' *.fastq

For example, with regular awk, the following works extremely fast:

awk 'FNR==1{print} FNR>1{nextfile}' *.fastq

But bioawk does not stop processing the file when it encounters nextfile. Instead, it seems to continue parsing the whole file, which is crazy when dealing with gigantic files. Plus, it seems to skip a file when running into the nextfile command, which is weird.

bioawk -c fastx sets OFS=\\t with no way to change it.

Hello,

When using bioawk with the -c fastx option the OFS parameter is set to a tab (\t) with no way to change it.

Reproduced by building from source.

Test:

bioawk -c fastx -v OFS=" " '{print OFS}' some.fasta | head -1 | od -bc

Will output:

0000000 011 012
         \t  \n
0000002

Note: some.fasta can be anything (if at least it is a fasta file), but not providing any file will not trigger this.

header: multiple input files causes multiple headers

Multiple input files causes the header to be printed multiple times in the output. The desired output would have a single header. Ideally, bioawk would check that the headers of each input file are identical.

bioawk -c header 1 foo bar

match with regexp does not work for me

Dear,

I try to extract fastq reads that start with a 7-base UMI followed by a primer sequence.

They should match the regexp /^[ATGC]{7}GCCCAGGTGCAGCTGCAG/ but it does not work when I add the UMI part on the left as a regexp.

It seems that the match with // fails for regexp expression.
Is it me or do I need to write this differently?
Thanks for help

# fails
bioawk -c fastx 'match($seq, /^[ATGC]{7}GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq

# fails
bioawk -c fastx 'match($seq, /^[:alpha:]{7}GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq

# fails
bioawk -c fastx 'match($seq, /^.{7}GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq

while 

# works
bioawk -c fastx 'match($seq, /GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq

# also works 
# but I would prefer to use a true regexp for more readability
bioawk -c fastx 'match($seq, /^.......GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq

sampleR1.fq

@M00270:128:000000000-JV5LR:1:1101:21126:1000 1:N:0:1
NGGCTGGGCCCAGGTGCAGCTGCAGGAGTCTGGGGGAGGATTGGTACAGGCTGGGGGCTCTCTGAGACTCTCCTGTGTAGCCTCTGGAGGCACCTTCAGTACCTATGCCATGGGCTGGTTCCGCCAGGCTCCAGGGAAGGAGCGTGAGTTTGTAACACGTATTAACCGGAGTAATGATCGCACATTGTATTCGGACTCCGTGAAGGGCCGATTCACCATTTTGAGAGACAACGCCAAGAGCACAGTGTATCTACAAATGAACAGCCTGAAAACTGAGGACACGGCCGTTTATTATTGTTC
+
!8@BCFEGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGG:FGGGGCFGGFGGGGGGGGGFGGGGGGGGGGGGGGGGGEGGGGGGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDFGGGGGGGGFFGGGGGFGFD@FEFFGGGDFGGG7F:BG,@B9DCCDGGGGGDBDFGFGGECG?*<BCCFEFGACCECCFEFF6BDCGGGGGCE6FF60CGGCFE7CG6CC7+<FG?8CGFCFG+CFFFGF0<1>E3D>EECF@7CFGFBF*)

FR: add asorti function

Dear,

if possible, could asort and asorti be added to the bioawk source, these functions would be very interesting when parsing large biodata files and producing hash during the process to finally output the sorted by value hash in the END{} block.

thanks in advance

Segmentation fault on empty files

bioawk segfaults when asked to parse an empty files

$ touch test.fastq
$ gzip test.fastq 
$ bioawk -c fastx '{print}' test.fastq.gz 
Segmentation fault

Actually, it also segfaults on non-gzipped input:

$ touch test.fastq
$ bioawk -c fastx '{print}' test.fastq
Segmentation fault

Is it something easily fixable?

make fails

make
yacc -d awkgram.y
make: yacc: Command not found
Makefile:50: recipe for target 'ytab.o' failed
make: *** [ytab.o] Error 127

no ytab.o file while Makefile is including one

solved after installing bison (not present under ubuntu by default

Maybe nice to add to the readme.

Thanks for the great tool

mv .gitigore .gitignore

Self-explanatory, I guess.

bioawk -cfastx has a silent bug when parsing FASTA files with empty lines

Here's an example:

 echo ">a\\nATAGA\\n\\n>b\\nAGATTAG\\n>c\\nAGATAG"| bioawk -cfastx '{print $name}'

I checked and kseq.h does handle this correctly.

Put out new release tag since new functionality.

v1.1?

[notbug] unexpected termination on fasta.gz

Unexpected termination returns

2 �
9590 >

by the command:

~/miniconda3/bin/bioawk '/^>/ {count[substr($0,1,1)]++}END{for(j in count) print count[j],j}' SILVA_128_SSURef_tax_silva_full_align_trunc.fasta.gz

Reference gzip and mawk commands (doing the same count) are OK:

time zcat *fasta.gz|./bioawk '/^>/ {count[substr($0,1,1)]++}END{for(j in count) print count[j],j}'
1922213 >

real	47m44.564s
user	46m22.997s
sys	2m48.813s

time zcat *fasta.gz|mawk '/^>/ {count[substr($0,1,1)]++}END{for(j in count) print count[j],j}'
1922213 >

real	10m47.346s
user	11m31.835s
sys	1m44.985s

Is it similar to zgrep, which can search both in compressed and not compressed, or -c fastx should be explicit?

What is the license of bioawk?

I wonder what is the license of bioawk?

ARGIND not support by bioawk?

Hi?

ARGIND is not support by bioawk?

If we support ARGIND , it will be a easy way to process one file based on another file.

such as finding the intersect part of them.

Best Regards

help for gff format

Hi- First, thanks for developing bioawk!
I noticed that the description of the gff format from bioawk -c help has 10 fields, last two being "group" and "attribute".

bioawk -c help
...
gff:
    1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute 
...

Shouldn't there be 9 fields, where the 9th is either called "attribute" or "group"? For reference: https://www.sanger.ac.uk/resources/software/gff/spec.html#t_2 or http://genome.ucsc.edu/FAQ/FAQformat.html#format3

Dario

bioawk

how to solve this problem when trying to run make in bioawk file?
yacc -d awkgram.y
awkgram.y: warning: 43 shift/reduce conflicts [-Wconflicts-sr]
awkgram.y: warning: 85 reduce/reduce conflicts [-Wconflicts-rr]
mv y.tab.c ytab.c
mv y.tab.h ytab.h
gcc -g -Wall -O2 -c ytab.c
make: gcc: Command not found
Makefile:50: recipe for target 'ytab.o' failed
make: *** [ytab.o] Error 127

Make array of VCF info and GFF group field.

GTF file:

$ zcat ./test.gtf.gz
chr21   hg19_knownGene  exon    33026871    33027740    0.000000    -   .   gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
chr21   hg19_knownGene  exon    33030247    33030540    0.000000    -   .   gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
chr21   hg19_knownGene  exon    33031710    33031813    0.000000    -   .   gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
chr21   hg19_knownGene  start_codon 33032083    33032085    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33032083    33032154    0.000000    +   0   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33031935    33032154    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33036103    33036199    0.000000    +   0   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33036103    33036199    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33038762    33038831    0.000000    +   2   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33038762    33038831    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33039571    33039688    0.000000    +   1   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33039571    33039688    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33040784    33040888    0.000000    +   0   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  stop_codon  33040889    33040891    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33040784    33041243    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";

List some fields of GTF with bioawk:

$ ./bioawk -c gff '{ print $feature,$start,$group }' ./test.gtf.gz
exon    33026871    gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
exon    33030247    gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
exon    33031710    gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
start_codon 33032083    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33032083    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33031935    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33036103    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33036103    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33038762    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33038762    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33039571    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33039571    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33040784    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
stop_codon  33040889    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33040784    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";

It would be nice if for the GFF group field and the VCF info field, $group and $info,
is an array which used each subfeature as key:

$ ./bioawk -c gff '{ print $feature,$start,$group[gene_id],$group[transcript_id] }' ./test.gtf.gz
exon    33026871    uc002yoz.1  uc002yoz.1
exon    33030247    uc002yoz.1  uc002yoz.1
exon    33031710    uc002yoz.1  uc002yoz.1
start_codon 33032083    uc002ypa.3  uc002ypa.3
CDS 33032083    uc002ypa.3  uc002ypa.3
exon    33031935    uc002ypa.3  uc002ypa.3
CDS 33036103    uc002ypa.3  uc002ypa.3
exon    33036103    uc002ypa.3  uc002ypa.3
CDS 33038762    uc002ypa.3  uc002ypa.3
exon    33038762    uc002ypa.3  uc002ypa.3
CDS 33039571    uc002ypa.3  uc002ypa.3
exon    33039571    uc002ypa.3  uc002ypa.3
CDS 33040784    uc002ypa.3  uc002ypa.3
stop_codon  33040889    uc002ypa.3  uc002ypa.3
exon    33040784    uc002ypa.3  uc002ypa.3

At the moment it is not supported and I get the following error:

./bioawk: can't assign to group; it's an array name.
 source line number 1

VCF header causes NR mismatch

I noticed that NR is off by the size of a VCF header, as if NR was set before header is removed (the header normally doesn't print with -c vcf flag).

This means if my VCF has header of 10 lines, the first variant is at NR==11, which is not expected.

addon.c:274:3: warning: implicit declaration of function ‘fileno’ ==> POSIX extension

I get some warnings when compiling bioawk:

$ make
yacc -d awkgram.y
conflicts: 43 shift/reduce, 85 reduce/reduce
mv y.tab.c ytab.c
mv y.tab.h ytab.h
gcc -g -Wall -O2 -ansi -c ytab.c
gcc -g -Wall -O2 -ansi   -c -o b.o b.c
gcc -g -Wall -O2 -ansi   -c -o main.o main.c
gcc -g -Wall -O2 -ansi   -c -o parse.o parse.c
gcc -g -Wall -O2 -ansi maketab.c -o maketab
./maketab >proctab.c
gcc -g -Wall -O2 -ansi   -c -o proctab.o proctab.c
gcc -g -Wall -O2 -ansi   -c -o tran.o tran.c
gcc -g -Wall -O2 -ansi   -c -o lib.o lib.c
gcc -g -Wall -O2 -ansi   -c -o run.o run.c
gcc -g -Wall -O2 -ansi   -c -o lex.o lex.c
lex.c: In function ‘gettok’:
lex.c:157:9: warning: ignoring return value of ‘strtod’, declared with attribute warn_unused_result
gcc -g -Wall -O2 -ansi   -c -o addon.o addon.c
addon.c: In function ‘bio_getrec’:
addon.c:274:3: warning: implicit declaration of function ‘fileno’
gcc -g -Wall -O2 -ansi ytab.o b.o main.o parse.o proctab.o tran.o lib.o run.o lex.o addon.o  -o awk -lm -l

fileno is a POSIX extension:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37771
http://stackoverflow.com/questions/9427145/how-do-i-remove-the-following-implicit-declaration-of-function-warnings

Compiling bioawk with the following flags (without -ansi, doesn't give the addon.c:274:3: warning: implicit declaration of function ‘fileno’ error.

make CFLAGS='-g -Wall -O2'

Hex literals are misinterpreted

% awk 'BEGIN{print 0x4}'
4

% bioawk 'BEGIN{print 0x4}'
0

bioawk -c fastx -c2 sam 'NR==FNR{...}NR!=FNR{...}' test.fq test.sam