lh3 / bioawk Goto Github PK
View Code? Open in Web Editor NEWBWK awk modified for biological data
BWK awk modified for biological data
Hi,
May be I am missing something. But below seemed weird to me:
bioawk 'BEGIN{a="TTTT" ; print a , revcomp(a) }'
TTTT AAAA
bioawk 'BEGIN{a="TTTT" ; b=revcomp(a); print a,b }'
AAAA AAAA
I am using a Dell PE2900 server, running CentOS 6.
Bernardo
Hi
I want to install biohawk on the latest ubuntu (18.04) but I am getting this error after running make;
~/bioawk$ make
yacc -d awkgram.y
make: yacc: Command not found
Makefile:50: recipe for target 'ytab.o' failed
make: *** [ytab.o] Error 127
Any thoughts? (sorry if it is an easy fix)
Thanks
Please tag a stable release version. Thanks!
Hi,
I tried to compile bioawk after cloning it from this repo and installing byacc
(using sudo apt install byacc
on elementary OS Hera / Ubuntu 18.04), but I got fatal error:
jena@cloudbook:~/bin/bioawk$ make
yacc -d awkgram.y
yacc: 43 shift/reduce conflicts, 85 reduce/reduce conflicts.
mv y.tab.c ytab.c
mv y.tab.h ytab.h
gcc -g -Wall -O2 -c ytab.c
gcc -g -Wall -O2 -c -o b.o b.c
gcc -g -Wall -O2 -c -o main.o main.c
gcc -g -Wall -O2 -c -o parse.o parse.c
gcc -g -Wall -O2 maketab.c -o maketab
./maketab >proctab.c
gcc -g -Wall -O2 -c -o proctab.o proctab.c
gcc -g -Wall -O2 -c -o tran.o tran.c
gcc -g -Wall -O2 -c -o lib.o lib.c
gcc -g -Wall -O2 -c -o run.o run.c
gcc -g -Wall -O2 -c -o lex.o lex.c
gcc -g -Wall -O2 -c -o addon.o addon.c
addon.c:241:10: fatal error: zlib.h: Adresář nebo soubor neexistuje
#include <zlib.h> /* FIXME: it would be better to drop this dependency... */
^~~~~~~~
compilation terminated.
make: *** [<vestavěný>: addon.o] Chyba 1
Any help would be welcome.
I installed it as below...
Everything works well, but when I typed "bioawk" even under bioawk folder, got a error message "bioawk: command not found". How can I fix this?
Hi,
How can I cite bioawk? I am using it for my analysis.
Best Regards
Zillur
When i run a test order:
bioawk -c fastx 'END{print NR}' chr17_fragm.fasta
it says:
Segmentation fault (core dumped)
How should i do? Thanks very much for every help.
Hej,
thanks a lot for the tool, I really love it! But I have one question I couldn't find a proper way to solve it:
How can I work on the third line in a fastq file (the +-comment line)? When manipulating fastx one can access the 4 variables 1:name 2:seq 3:qual 4:comment, of which the $comment-variable does not mean the comment line but everything behind the name in the read-header. Is it possible to access the values in the comment-line though?
Thanks a lot and best regards!
Philipp
Hello,
I am trying to convert the .faa format protein sequences into OrthoMCL readable format (organism_ID|protein_ID) using the bioawk -c fastx '{ print ">GMI1000|"$name; print $seq }'. I am only getting the results with around 900 sequences out of 4000. I found that bioawk is not reading the sequences from first 3000 proteins in the .faa format.
Is there any way to solve this problem?
Thank you very much in advance!!
Hi Heng,
what do you think of also adding the manual in restructured text format like this:
https://github.com/ialbert/bioawk/blob/master/README.bio.rst
as you can see GitHub recognizes it and automatically generates html webpage from it. It makes easy to document the project.
this will demonstrate the problem:
Given his text file
echo $'a space\tb\n1\t2\n1\t2' > t.txt
and this code:
./awk 'BEGIN{FS="\t"}{ print $b }' t.txt
I expect:
b
2
2
but it instead gives nothing because it splits the headers on space sets them as are a
, space
,b
even though the separator is '\t'. This is because the headers are parsed before the BEGIN block
sets FS.
Hi,
I'm trying to check my VCF filtering and I wanted to play around with bioawk but I noticed it gives different results than regular awk present on my system (it's a server that should have OpenSUSE according to admin).
See this example:
grep -v '#' r123_newrbmcl100_fb.vcf | awk '$6>20 {print $1,$2,$6}' | head
E57432_L179 11 7368.02
E57432_L179 19 8461.42
E57432_L179 35 12953.8
E57432_L179 43 14922.7
E57432_L179 51 19613.4
E57432_L179 62 20413.3
E57432_L179 70 20282.5
E57432_L179 108 25482.3
E57432_L179 118 27017.1
E57432_L179 149 24639.4
compared to:
grep -v '#' r123_newrbmcl100_fb.vcf | bioawk '$6>20 {print $1,$2,$6}' | head
E57432_L179 11 7368.02
E57432_L179 19 8461.42
E57432_L179 62 20413.3
E57432_L179 70 20282.5
E57432_L179 108 25482.3
E57432_L179 118 27017.1
E57432_L179 124 9.09527e-05
E57432_L179 128 4.50905e-05
E57432_L179 129 3.73425e-05
E57432_L179 138 4.61921e-05
Note the only difference between the commands is replacing awk with bioawk - no special features of bioawk are called. However I noticed this originally when I did use the features (namely -tc hdr
and calling variables with colnames) and I got the same wrong result. I then tested it as above.
For some reason bioawk doesn't print some lines that pass the $6>20
filter (like positions 35, 43, and 51) while it prints possitions that should not pass the filter (positions 124-138). System awk prints it correctly.
I installed bioawk using conda manager and bioconda channel. The version is 1.0 and build h84994c4_4.
Thanks for any input on what could go wrong.
Scenario: I have 4 FASTQ files in my current directory, and I want to print the first 10 seq names of each file. Ideally, the following command should work very quickly:
bioawk -c fastx 'FNR<11 {print $name} FNR==11{nextfile}' *.fastq
For example, with regular awk, the following works extremely fast:
awk 'FNR==1{print} FNR>1{nextfile}' *.fastq
But bioawk does not stop processing the file when it encounters nextfile
. Instead, it seems to continue parsing the whole file, which is crazy when dealing with gigantic files. Plus, it seems to skip a file when running into the nextfile
command, which is weird.
Hello,
When using bioawk with the -c fastx option the OFS parameter is set to a tab (\t) with no way to change it.
Reproduced by building from source.
Test:
bioawk -c fastx -v OFS=" " '{print OFS}' some.fasta | head -1 | od -bc
Will output:
0000000 011 012
\t \n
0000002
Note: some.fasta can be anything (if at least it is a fasta file), but not providing any file will not trigger this.
Multiple input files causes the header to be printed multiple times in the output. The desired output would have a single header. Ideally, bioawk would check that the headers of each input file are identical.
bioawk -c header 1 foo bar
Dear,
I try to extract fastq reads that start with a 7-base UMI followed by a primer sequence.
They should match the regexp /^[ATGC]{7}GCCCAGGTGCAGCTGCAG/ but it does not work when I add the UMI part on the left as a regexp.
It seems that the match with // fails for regexp expression.
Is it me or do I need to write this differently?
Thanks for help
# fails
bioawk -c fastx 'match($seq, /^[ATGC]{7}GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq
# fails
bioawk -c fastx 'match($seq, /^[:alpha:]{7}GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq
# fails
bioawk -c fastx 'match($seq, /^.{7}GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq
while
# works
bioawk -c fastx 'match($seq, /GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq
# also works
# but I would prefer to use a true regexp for more readability
bioawk -c fastx 'match($seq, /^.......GCCCAGGTGCAGCTGCAG/) {print "@"$name" "$comment"\n"$seq"\n+\n"$qual}' sampleR1.fq
sampleR1.fq
@M00270:128:000000000-JV5LR:1:1101:21126:1000 1:N:0:1
NGGCTGGGCCCAGGTGCAGCTGCAGGAGTCTGGGGGAGGATTGGTACAGGCTGGGGGCTCTCTGAGACTCTCCTGTGTAGCCTCTGGAGGCACCTTCAGTACCTATGCCATGGGCTGGTTCCGCCAGGCTCCAGGGAAGGAGCGTGAGTTTGTAACACGTATTAACCGGAGTAATGATCGCACATTGTATTCGGACTCCGTGAAGGGCCGATTCACCATTTTGAGAGACAACGCCAAGAGCACAGTGTATCTACAAATGAACAGCCTGAAAACTGAGGACACGGCCGTTTATTATTGTTC
+
!8@BCFEGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGG:FGGGGCFGGFGGGGGGGGGFGGGGGGGGGGGGGGGGGEGGGGGGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDFGGGGGGGGFFGGGGGFGFD@FEFFGGGDFGGG7F:BG,@B9DCCDGGGGGDBDFGFGGECG?*<BCCFEFGACCECCFEFF6BDCGGGGGCE6FF60CGGCFE7CG6CC7+<FG?8CGFCFG+CFFFGF0<1>E3D>EECF@7CFGFBF*)
Dear,
if possible, could asort and asorti be added to the bioawk source, these functions would be very interesting when parsing large biodata files and producing hash during the process to finally output the sorted by value hash in the END{} block.
thanks in advance
bioawk
segfaults when asked to parse an empty files
$ touch test.fastq
$ gzip test.fastq
$ bioawk -c fastx '{print}' test.fastq.gz
Segmentation fault
Actually, it also segfaults on non-gzipped input:
$ touch test.fastq
$ bioawk -c fastx '{print}' test.fastq
Segmentation fault
Is it something easily fixable?
make
yacc -d awkgram.y
make: yacc: Command not found
Makefile:50: recipe for target 'ytab.o' failed
make: *** [ytab.o] Error 127
no ytab.o file while Makefile is including one
solved after installing bison (not present under ubuntu by default
Maybe nice to add to the readme.
Thanks for the great tool
Self-explanatory, I guess.
Here's an example:
echo ">a\\nATAGA\\n\\n>b\\nAGATTAG\\n>c\\nAGATAG"| bioawk -cfastx '{print $name}'
I checked and kseq.h
does handle this correctly.
v1.1?
Unexpected termination returns
2 �
9590 >
by the command:
~/miniconda3/bin/bioawk '/^>/ {count[substr($0,1,1)]++}END{for(j in count) print count[j],j}'
SILVA_128_SSURef_tax_silva_full_align_trunc.fasta.gz
Reference gzip and mawk commands (doing the same count) are OK:
time zcat *fasta.gz|./bioawk '/^>/ {count[substr($0,1,1)]++}END{for(j in count) print count[j],j}'
1922213 >
real 47m44.564s
user 46m22.997s
sys 2m48.813s
time zcat *fasta.gz|mawk '/^>/ {count[substr($0,1,1)]++}END{for(j in count) print count[j],j}'
1922213 >
real 10m47.346s
user 11m31.835s
sys 1m44.985s
Is it similar to zgrep, which can search both in compressed and not compressed, or -c fastx should be explicit?
I wonder what is the license of bioawk?
Hi?
ARGIND is not support by bioawk?
If we support ARGIND , it will be a easy way to process one file based on another file.
such as finding the intersect part of them.
Best Regards
Hi- First, thanks for developing bioawk!
I noticed that the description of the gff format from bioawk -c help
has 10 fields, last two being "group" and "attribute".
bioawk -c help
...
gff:
1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
...
Shouldn't there be 9 fields, where the 9th is either called "attribute" or "group"? For reference: https://www.sanger.ac.uk/resources/software/gff/spec.html#t_2 or http://genome.ucsc.edu/FAQ/FAQformat.html#format3
Dario
how to solve this problem when trying to run make in bioawk file?
yacc -d awkgram.y
awkgram.y: warning: 43 shift/reduce conflicts [-Wconflicts-sr]
awkgram.y: warning: 85 reduce/reduce conflicts [-Wconflicts-rr]
mv y.tab.c ytab.c
mv y.tab.h ytab.h
gcc -g -Wall -O2 -c ytab.c
make: gcc: Command not found
Makefile:50: recipe for target 'ytab.o' failed
make: *** [ytab.o] Error 127
GTF file:
$ zcat ./test.gtf.gz
chr21 hg19_knownGene exon 33026871 33027740 0.000000 - . gene_id "uc002yoz.1"; transcript_id "uc002yoz.1";
chr21 hg19_knownGene exon 33030247 33030540 0.000000 - . gene_id "uc002yoz.1"; transcript_id "uc002yoz.1";
chr21 hg19_knownGene exon 33031710 33031813 0.000000 - . gene_id "uc002yoz.1"; transcript_id "uc002yoz.1";
chr21 hg19_knownGene start_codon 33032083 33032085 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene CDS 33032083 33032154 0.000000 + 0 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene exon 33031935 33032154 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene CDS 33036103 33036199 0.000000 + 0 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene exon 33036103 33036199 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene CDS 33038762 33038831 0.000000 + 2 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene exon 33038762 33038831 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene CDS 33039571 33039688 0.000000 + 1 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene exon 33039571 33039688 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene CDS 33040784 33040888 0.000000 + 0 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene stop_codon 33040889 33040891 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene exon 33040784 33041243 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
List some fields of GTF with bioawk:
$ ./bioawk -c gff '{ print $feature,$start,$group }' ./test.gtf.gz
exon 33026871 gene_id "uc002yoz.1"; transcript_id "uc002yoz.1";
exon 33030247 gene_id "uc002yoz.1"; transcript_id "uc002yoz.1";
exon 33031710 gene_id "uc002yoz.1"; transcript_id "uc002yoz.1";
start_codon 33032083 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
CDS 33032083 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
exon 33031935 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
CDS 33036103 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
exon 33036103 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
CDS 33038762 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
exon 33038762 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
CDS 33039571 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
exon 33039571 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
CDS 33040784 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
stop_codon 33040889 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
exon 33040784 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
It would be nice if for the GFF group field and the VCF info field, $group and $info,
is an array which used each subfeature as key:
$ ./bioawk -c gff '{ print $feature,$start,$group[gene_id],$group[transcript_id] }' ./test.gtf.gz
exon 33026871 uc002yoz.1 uc002yoz.1
exon 33030247 uc002yoz.1 uc002yoz.1
exon 33031710 uc002yoz.1 uc002yoz.1
start_codon 33032083 uc002ypa.3 uc002ypa.3
CDS 33032083 uc002ypa.3 uc002ypa.3
exon 33031935 uc002ypa.3 uc002ypa.3
CDS 33036103 uc002ypa.3 uc002ypa.3
exon 33036103 uc002ypa.3 uc002ypa.3
CDS 33038762 uc002ypa.3 uc002ypa.3
exon 33038762 uc002ypa.3 uc002ypa.3
CDS 33039571 uc002ypa.3 uc002ypa.3
exon 33039571 uc002ypa.3 uc002ypa.3
CDS 33040784 uc002ypa.3 uc002ypa.3
stop_codon 33040889 uc002ypa.3 uc002ypa.3
exon 33040784 uc002ypa.3 uc002ypa.3
At the moment it is not supported and I get the following error:
./bioawk: can't assign to group; it's an array name.
source line number 1
I noticed that NR
is off by the size of a VCF header, as if NR
was set before header is removed (the header normally doesn't print with -c vcf
flag).
This means if my VCF has header of 10 lines, the first variant is at NR==11
, which is not expected.
I get some warnings when compiling bioawk:
$ make
yacc -d awkgram.y
conflicts: 43 shift/reduce, 85 reduce/reduce
mv y.tab.c ytab.c
mv y.tab.h ytab.h
gcc -g -Wall -O2 -ansi -c ytab.c
gcc -g -Wall -O2 -ansi -c -o b.o b.c
gcc -g -Wall -O2 -ansi -c -o main.o main.c
gcc -g -Wall -O2 -ansi -c -o parse.o parse.c
gcc -g -Wall -O2 -ansi maketab.c -o maketab
./maketab >proctab.c
gcc -g -Wall -O2 -ansi -c -o proctab.o proctab.c
gcc -g -Wall -O2 -ansi -c -o tran.o tran.c
gcc -g -Wall -O2 -ansi -c -o lib.o lib.c
gcc -g -Wall -O2 -ansi -c -o run.o run.c
gcc -g -Wall -O2 -ansi -c -o lex.o lex.c
lex.c: In function ‘gettok’:
lex.c:157:9: warning: ignoring return value of ‘strtod’, declared with attribute warn_unused_result
gcc -g -Wall -O2 -ansi -c -o addon.o addon.c
addon.c: In function ‘bio_getrec’:
addon.c:274:3: warning: implicit declaration of function ‘fileno’
gcc -g -Wall -O2 -ansi ytab.o b.o main.o parse.o proctab.o tran.o lib.o run.o lex.o addon.o -o awk -lm -l
fileno is a POSIX extension:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37771
http://stackoverflow.com/questions/9427145/how-do-i-remove-the-following-implicit-declaration-of-function-warnings
Compiling bioawk with the following flags (without -ansi, doesn't give the addon.c:274:3: warning: implicit declaration of function ‘fileno’ error.
make CFLAGS='-g -Wall -O2'
% awk 'BEGIN{print 0x4}'
4
% bioawk 'BEGIN{print 0x4}'
0
Are there some test? I want run some tests on different OS tocompare their running time.thanks very much!
eg.
bioawk -c fastx -c2 sam 'NR==FNR{...}NR!=FNR{...}' test.fq test.sam
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.