bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database

Home Page: http://kaiju.binf.ku.dk

License: GNU General Public License v3.0

C++ 14.13% Makefile 0.31% C 84.09% Perl 0.11% Shell 1.36%

metagenomics next-generation-sequencing bioinformatics taxonomic-classification taxonomy

kaiju's Issues

Make a custom database failed

I tried to make a custom database based on Arthropoda proteins. I created the database FASTA file, named kaiju_test, and ran the following in the kaiju/bin directory:

mkbwt -n 5 -a ACDEFGHIKLMNPQRSTVWY -o proteins kaiju_test

and this is what I got:

-bash: kaiju: command not found

The file mkbwt is in the kaiju/bin directory. What could be the problem? Thanks for your time!

Problem with intermediate ranks when creating report file

I am not sure whether that is a feature or a bug but kaijuReport includes intermediate ranks in the created report file resulting in percentage values whose sum is >100%.

Example: The full lineage of the species Acinetobacter baumannii includes Acinetobacter calcoaceticus/baumannii complex (species group). The report file (created for the species rank) then contains both A. baumannii (57.76%) and A. calcoaceticus/baumannii complex (52.67%).

It would be great if one could avoid that by using an additional parameter when calling kaijuReport.

Edit 1: Used version: v1.4-6-gc547bc3

Edit 2: Moreover, this issue affects the percentage and number of sequences classified above the given rank.
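
One way to keep per-rank percentages from summing to more than 100% is to credit each read to at most one taxon: the closest ancestor whose rank matches the requested rank exactly. The Python sketch below only illustrates that idea and is not kaijuReport's actual code; the miniature `NODES` table, its taxon IDs, and the function names are made up for the example.

```python
from collections import Counter

# Hypothetical miniature taxonomy: taxon_id -> (parent_id, rank).
# The IDs are invented for this example; real values come from nodes.dmp.
NODES = {
    1: (1, "no rank"),          # root points to itself
    10: (1, "genus"),           # e.g. Acinetobacter
    20: (10, "species group"),  # e.g. A. calcoaceticus/baumannii complex
    30: (20, "species"),        # e.g. A. baumannii
    31: (20, "species"),        # another species in the same group
}

def ancestor_at_rank(taxon_id, rank, nodes):
    """Return the closest ancestor (or the taxon itself) with exactly `rank`, else None."""
    current = taxon_id
    while True:
        parent, current_rank = nodes[current]
        if current_rank == rank:
            return current
        if parent == current:  # reached the root without finding the rank
            return None
        current = parent

def count_at_rank(read_taxa, rank, nodes):
    """Count reads per taxon of the given rank; every read is counted exactly once."""
    counts = Counter()
    for taxon_id in read_taxa:
        counts[ancestor_at_rank(taxon_id, rank, nodes)] += 1
    return counts

reads = [30, 30, 31, 20, 10]  # taxon IDs assigned to five reads
counts = count_at_rank(reads, "species", NODES)
total = sum(counts.values())
for taxon, n in counts.items():
    label = taxon if taxon is not None else "classified above rank"
    print(label, n, f"{100 * n / total:.1f}%")
```

Reads assigned to the species group or the genus fall into a single "classified above rank" bucket here, so the species group never appears as its own row in a species-level report.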

addTaxonNames - naming hierarchy inconsistent among sequences

When using the addTaxonNames helper function to add the full taxon path to each sequence, the number of elements in the taxon path is inconsistent, making it very difficult to consistently identify a given taxon level when working with the data in other software (e.g. R). For example, to extract all sequences annotated as belonging to a certain family, I would parse the full taxon path into substrings based on the semicolon divider - but the position of 'family' in the full taxon path ranges from the 5th to 10th element depending on the sequence. It would be possible to extract family names by looking for the -aceae ending, but this does not work for genera or for taxonomic levels that do not have a consistent naming convention. Thus it is currently impossible to do this type of analysis where simultaneous information on multiple taxonomic levels is needed.

Similarly, the most precise taxonomic identification is always added as the final element of the full taxon path, so there is always a duplicate element added that makes it difficult to interpret whether a given taxon name is the name assigned at a given rank, or if that is simply the most precise rank available, or is in fact some other type of annotation (for example "environmental samples" is given as a taxonomic rank in some cases).

Would it be possible to add an option to force the full taxon path output from addTaxonNames to include only a given subset of taxonomic levels (e.g. phylum/class/order/family/genus/species), or to force consistency so that a given element within the taxon path always represents a given taxonomic level (e.g. if family is always the 10th element it can be extracted).

And would it be possible to add an option to suppress the duplicate reporting of the most precise taxonomic identification possible as part of the full taxon path? This could still be reported in a separate field but as it stands it is placed at the end of the full taxon path.

Here is an example of the output of addTaxonNames for a dataset analyzed using kaiju and the nr+euk database, the same issue is present when using the complete genomes and nr databases.

C M02360:6:000000000-AC61R:1:1101:22111:5800 1033 341 1033, SLQPNSGAQGEYAGLLAIRAYHRSRGEAHRKVCLIPSSAHGTNPASASMVGMDVVVVACDARGDVDVEDLR, cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Rhizobiales; Bradyrhizobiaceae; Afipia; Afipia;
C M02360:6:000000000-AC61R:1:1101:23444:5841 418856 72 418856, PAVLPDDQTYTHR, cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Nevskiales; Sinobacteraceae; Nevskia; Nevskia soli; Nevskia soli;
C M02360:6:000000000-AC61R:1:1101:14687:5896 1248916 258 1248916, IEKESSVNEAIDRMRHSATRALLERDDVIIVASVSCLYGIGSVETYSAMTFALK,IEKESSVNEAIDRMRHSATRALLERDDVIIVASVSCLYGIGSVETYSAMTFALK, cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Sandarakinorhabdus; Sandarakinorhabdus sp. AAP62; Sandarakinorhabdus sp. AAP62;
C M02360:6:000000000-AC61R:1:1101:25907:5899 1262833 73 1262833, DILIIGGGITGLSSAYF,DILIIGGGITGLSSAYF, cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium; environmental samples; Clostridium sp. CAG:710; Clostridium sp. CAG:710;
C M02360:6:000000000-AC61R:1:1101:17581:5925 1302620 246 1302620, DHAAETGAAIPTEPVVFMKDPSTVVGPFDEVLVPRGSTKTDWEVELGVVIG,DHAAETGAAIPTEPVVFMKDPSTVVGPFDEVLVPRGSTKTDWEVELGVVIG, cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Micrococcales; Cellulomonadaceae; Cellulomonas; Cellulomonas sp. URHD0024; Cellulomonas sp. URHD0024;
C M02360:6:000000000-AC61R:1:1101:17429:5938 1016849 244 1016849, YNPETDGKELVRAFRNIPGVETSSVFALNLLQLAPGGHLGRFIIWTSSAFSAL,YNPETDGKELVRAFRNIPGVETSSVFALNLLQLAPGGHLGRFIIWTSSAFSAL, cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya; Ascomycota; saccharomyceta; Pezizomycotina; leotiomyceta; Eurotiomycetes; Chaetothyriomycetidae; Chaetothyriales; Herpotrichiellaceae; Exophiala; Exophiala sideris; Exophiala sideris;
C M02360:6:000000000-AC61R:1:1101:23427:5942 1760 122 1760,1883,1888,1892,1907,1930,1938,1963,29303,35619,39478,40318,44060,47758,49185,54571,55952,58344,66378,66874,66875,67257,67298,67373,68213,68231,68249,68268,73044,76728,78355,83656,89050,100226,105425,114687,131568,146537,159449,193462,227882,249567,285535,352211,455632,457427,498367,500153,645465,661399,1055352,1078086,1136432,1156844,1157634,1157635,1157637,1172179,1203592,1288080,1288083,1380346,1380770,1428628,1463841,1463850,1463853,1463856,1463857,1463858,1463877,1463888,1463901,1463917,1463920,1463926,1463932,1576605,1580539,1592326,1592327,1592330,1592727,1650571,1736503, GRWKAVVVGSYERGDRAVTVQRLAELADFY, cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Actinobacteria;
C M02360:6:000000000-AC61R:1:1101:20712:8466 290425 165 302406,310575, IVVLTPGMYNSAYFEHTFLAQQMGVELVEGKDL,IVVLTPGMYNSAYFEHTFLAQQMGVELVEGKDL,TIVVLTPGMYNSAYFEHTFLAQQMGVELVEGKDL,TIVVLTPGMYNSAYFEHTFLAQQMGVELVEGKDL, cellular organisms; Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Alcaligenaceae; Advenella; Advenella;
C M02360:6:000000000-AC61R:1:1101:13179:8502 1391654 91 1391654, RQMIQAHIAKKATATVAAIPVPR,RQMIQAHIAKKATATVAAIPVPR, cellular organisms; Bacteria; Proteobacteria; delta/epsilon subdivisions; Deltaproteobacteria; Myxococcales; Sorangiineae; Labilitrichaceae; Labilithrix; Labilithrix luteola; Labilithrix luteola;
C M02360:6:000000000-AC61R:1:1101:16385:8539 1120523 173 1120523, LAPNTSVTCTANYTVTQADVDSGKVTNTATATGTPPTG, cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces; Streptomyces sp. MJM8645; Streptomyces sp. MJM8645;

Compile error

I get this error when compiling the newest version of Kaiju (using gcc version 5.2.0):

kaiju2krona.cpp: In function ‘int main(int, char**)’:
kaiju2krona.cpp:37:47: error: ‘getopt’ was not declared in this scope
  while ((c = getopt (argc, argv, "huvn:t:i:o:")) != -1) {
                                               ^
make: *** [kaiju2krona.o] Error 1

Run makeDB.sh again when it fails

I had a problem when extracting protein sequences:

Can't locate IO/Uncompress/AnyUncompress.pm in @INC (@INC contains: /storage/software/perl-5.22.0/lib /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /gpfs/hpchome/a51256/bin/gbk2faa.pl line 15.
BEGIN failed--compilation aborted at /gpfs/hpchome/a51256/bin/gbk2faa.pl line 15.

I made some changes and reran makeDB.sh, but it started to download all genomes (RefSeq) again. Would it be possible to make the script first check whether all genome files have already been downloaded and, if so, skip this step?

Kristjan

paired and unpaired reads

Hi,
I want to analyse a set of reads with Kaiju.
After the trimming step of my paired-end data, some of the reads lost their mates, so I now have a fastq file with the right mates, a fastq file with the left mates, and a fastq file with singletons.
Could you please tell me whether it is possible to include all the reads (paired and singleton) in the Kaiju analysis?

Thank you in advance for your reply,

addTaxonNames – Uneven taxonomic hierarchies due to missing high-level taxonomic information

When using updated code allowing for choosing specific taxonomic levels in addTaxonNames (e.g. addTaxonNames -r phylum,class,order, family, genus, species), I obtain the correct set of taxonomic levels, but comparison of hierarchies among taxa is often incoherent because some high taxonomic levels (e.g. order, family) are missing or are unresolved in many taxa that still have proper genus and species names.

Example of kaiju output:
C M02360:6:000000000-AC61R:1:1101:23278:4875 59803 12 59803, GDAPLFPFGYGL, Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Sphingomonas; Sphingomonas echinoides;
C M02360:6:000000000-AC61R:1:1101:8983:4922 360054 11 360054, KILVHGHRGAR, Acidobacteria; Solibacteres; Solibacterales; Bryobacter; Bryobacter aggregatus;

Here Sphingomonas echinoides is fully resolved, while Bryobacter aggregatus lacks a family. (The GenBank taxon file specifies "unclassified Solibacterales" in lieu of a family, with the label "no rank".) As a result, a given position in the taxonomic name vectors may correspond to different, non-comparable taxonomic levels among sequences.

Following on from an initial suggestion by skembel (issue #4), would it be possible to output an NA value for a taxonomic level when that level is missing, so that annotations made down to the same taxonomic level are comparable when parsed into columns?
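
The requested behaviour can be sketched in a few lines of Python. This is an illustration only, not addTaxonNames code: the `TAXA` table is a hand-made stand-in for nodes.dmp and names.dmp, all of its taxon IDs except 360054 are invented, and `fixed_lineage` is a hypothetical helper name.

```python
RANKS = ["phylum", "class", "order", "family", "genus", "species"]

# Hypothetical miniature taxonomy: taxon_id -> (parent_id, rank, name).
# Only 360054 appears in the issue; the other IDs are invented.
TAXA = {
    1: (1, "no rank", "root"),
    2: (1, "phylum", "Acidobacteria"),
    3: (2, "class", "Solibacteres"),
    4: (3, "order", "Solibacterales"),
    5: (4, "no rank", "unclassified Solibacterales"),
    6: (5, "genus", "Bryobacter"),
    360054: (6, "species", "Bryobacter aggregatus"),
}

def fixed_lineage(taxon_id, taxa, ranks=RANKS, missing="NA"):
    """Return one name per requested rank, using `missing` where a rank is absent."""
    by_rank = {}
    current = taxon_id
    while True:
        parent, rank, name = taxa[current]
        if rank in ranks:
            # Keep the first (closest) taxon seen for each rank.
            by_rank.setdefault(rank, name)
        if parent == current:  # the root is its own parent
            break
        current = parent
    return [by_rank.get(rank, missing) for rank in ranks]

print("; ".join(fixed_lineage(360054, TAXA)))
```

With this layout every record has the same number of fields, so the family is always the fourth element even when it is missing from the lineage.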

Segfault using proGenomes

Hi,
When I use proGenomes as the reference, running kaiju gives me a segfault within seconds (with both the greedy and MEM options). I ran it on an HPC cluster using 32 GB RAM and 8 cores. My sample files are Illumina 2x150 bp paired-end FASTA files (about 2.5 GB each).
24855 Segmentation fault kaiju -t ~/kaiju/DB/nodes.dmp -f ~/kaiju/DB/kaiju_db.fmi -i trimmed-fasta/K1.R1.trimmed.fasta -j trimmed-fasta/K1.R2.trimmed.fasta -o K1.kaiju.tax.out -z 8 -a greedy -e 10

However, when I used RefSeq as the reference, it worked. Any idea what the problem could be, or should I change some parameters?

Thanks

Summary file

Hi,
I have noticed that the summary files differ depending on whether the input is a DNA sequence or a protein.
For proteins, my output looks like this:

C       k141_11175_1    1282    Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus; Staphylococcus epidermidis; 
C       k141_6414_1     31957   Actinobacteria; Actinobacteria; Propionibacteriales; Propionibacteriaceae; NA; NA;

If the input is DNA:

C	k141_336_2	1574624	511	1574624,	WP_002548245.1,WP_063811380.1,	MNTLHSGILVVDKPAGVTSHQVVGRVRRLMGTRKVGHAGTLDPMANGVLIVGINRATRLLGHLSLHDKDYTATMRLGVGTVTDDAEGEVTATTDASAIDDE,	Actinobacteria; Actinobacteria; Propionibacteriales; Propionibacteriaceae; Cutibacterium; [Propionibacterium] namnetense;

Why is there such a difference? I am not asking about the taxonomy itself (it is just a random example) but about the difference in the output format.

Inspect indexes ?

Hi,
Is it possible to inspect which species are present in the index files downloaded from your website? I would like to check whether certain species are present.

Slow reading nr database

Hi,
I have downloaded the latest indexes from the Kaiju website and am running paired-end classification of a sample's reads. Strangely, it takes almost two hours just to read the database files (see the times below). I'm running kaiju on a Linux cluster with SGE (setting the memory and thread parameters as follows: -l h_vmem=60G and -pe parallel_smp 10).

Why is this taking so long?

kaiju -v -z 10 -a greedy -m 5 -s 70 -x -t /db/kaiju/nodes.dmp -f /db/kaiju/kaiju_db_nr_euk.fmi -i 1.contigs2reads.R1.clean.fq -j 1.contigs2reads.R2.clean.fq -o 1.out.ncbi_id ;

11:55:10 Reading database
 Reading taxonomic tree from file /db/kaiju/nodes.dmp
 Reading index from file /db/kaiju/kaiju_db_nr_euk.fmi
Output file: files/taxonomy_reads_mapped_to_contigs//1/1.1.contigs2reads.R2.fq.ncbi_id
13:50:27 Start classification using 10 threads.
13:56:45 Finished.

*.fmi file is not generated by makeDB.sh

Hello,
I'm running the latest version of kaiju (1.5.0) and have been trying to construct the reference database and index using makeDB.sh. But, it seems that the script terminates prematurely and no fmi file is generated.

Below is the stdout I got from makeDB.sh.

###Start###
memb-main@membmain-empty[kaijudb] makeDB.sh -e -t 12
Downloading file taxdump.tar.gz
2017-05-02 10:08:37 URL: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz [2752] -> ".listing" [1]
2017-05-02 10:09:20 URL: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz [39222156] -> "taxdump.tar.gz" [1]
Extracting file taxdump.tar.gz
Downloading file nr.gz
2017-05-02 10:09:25 URL: ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz [3725] -> ".listing" [1]
2017-05-02 13:14:40 URL: ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz [29460199431] -> "nr.gz" [1]
Downloading file prot.accession2taxid.gz
2017-05-02 13:14:44 URL: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz [1623] -> ".listing" [1]
2017-05-02 14:04:00 URL: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz [2957842802] -> "prot.accession2taxid.gz" [1]
Unpacking prot.accession2taxid.gz
Converting NR file to Kaiju database
Reading taxonomic tree from file nodes.dmp
Reading taxa from file /mnt/Storage/Metagenome/kaiju/bin/taxonlist.tsv
Reading accession to taxon id map from file prot.accession2taxid
Processing NR file
Creating BWT from Kaiju database

infilename= kaiju_db_nr_euk.faa

outfilename= kaiju_db_nr_euk

Alphabet= ACDEFGHIKLMNPQRSTVWY

nThreads= 12

length= 0.000000

checkpoint= 5

caseSens=OFF

revComp=OFF

term= *

revsort=OFF

help=OFF

Sequences read time = 522.420483s
SLEN 36316848503
NSEQ 102635280
ALPH *ACDEFGHIKLMNPQRSTVWY
SA NCHECK=0
Sorting done, time = 42943.242368s
memb-main@membmain-empty[kaijudb]
###End###

makeDB.sh terminates immediately after generating the *.bwt and *.sa files. How can I solve this problem and get the *.fmi file?

Thanks.

Core Dump during makeDB.sh -e

Hello- I'm having an issue creating the Kaiju DB including eukarya with the following error... any help would be great!

# infilename= kaiju_db_nr_euk.faa
# outfilename= kaiju_db_nr_euk
# Alphabet= ACDEFGHIKLMNPQRSTVWY
# nThreads= 2
# length= 0.000000
# checkpoint= 5
# caseSens=OFF
# revComp=OFF
# term= *
# revsort=OFF
# help=OFF 
readFasta: Failed to alloc seq of length 26921682502
Sequences read time = 0.000400s
Segmentation fault (core dumped)

Feature request

Support gzip-compressed read files as input by detecting the file format, rather than requiring the gunzip -c approach.
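
Format detection of this kind usually relies on the two magic bytes (0x1f 0x8b) that start every gzip file. The Python sketch below shows the idea under that assumption; it is not part of Kaiju, and `open_reads` is a hypothetical helper name.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip member

def open_reads(path):
    """Open a reads file as text, transparently handling gzip compression."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "r")

# Small demonstration: the same record is read back from both file types.
record = "@read1\nACGT\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "reads.fq")
    packed = os.path.join(tmp, "reads.fq.gz")
    with open(plain, "w") as out:
        out.write(record)
    with gzip.open(packed, "wt") as out:
        out.write(record)
    with open_reads(plain) as a, open_reads(packed) as b:
        plain_text, packed_text = a.read(), b.read()

print(plain_text == packed_text)
```

Checking the content rather than the file extension means a compressed file without a .gz suffix is still handled correctly.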

makeDB does not create kaiju_db.fmi

Hi,
I'm new to Kaiju and would like to test it for our purposes as an alternative to QIIME.
The installation was straightforward (thanks!). However, when downloading the NCBI database, the kaiju_db.fmi file is not created, which means that I'm unable to run kaiju.
My call was:
$../bin/makeDB.sh -v
and the screen output looks OK:
Extracting protein sequences from downloaded files...
Creating Borrows-Wheeler transform...

infilename= kaiju_db.faa

outfilename= kaiju_db

Alphabet= ACDEFGHIKLMNPQRSTVWY

nThreads= 5

length= 0.000000

checkpoint= 3

caseSens=OFF

revComp=OFF

term= *

revsort=OFF

help=OFF

Sequences read time = 86.800000s
SLEN 5926785889
NSEQ 18269638
ALPH *ACDEFGHIKLMNPQRSTVWY
SA NCHECK=0
Sorting done, time = 5752.120000s

$ ls
assembly_summary.archaea.txt
assembly_summary.bacteria.txt
downloadlist.txt
genomes
kaiju_db.bwt
kaiju_db.faa
kaiju_db.sa
names.dmp
nodes.dmp
taxdump.tar.gz

Wrong size of downloaded index

Yesterday I was about to download the index from http://kaiju.binf.ku.dk/server, but today the download seems to be redirected to http://159.226.251.230:80/videoplayer/kaiju_index_nr_euk.tgz?ich_u_r_i=ad3ed0865c98c0c04f7acae0be6268d5&ich_s_t_a_r_t=0&ich_e_n_d=0&ich_k_e_y=1745018910750163332482&ich_t_y_p_e=1&ich_d_i_s_k_i_d=9&ich_u_n_i_t=1

Is there a problem? May I ask when it can be fixed?

Here is my download command:
wget -c -t 0 http://kaiju.binf.ku.dk/database/kaiju_index_nr_euk.tgz

Thanks.

Segmentation fault (core dumped)

Hi!
I am trying to use kaiju as a quick sequence aligner. I do not need taxonomic classification; on the contrary, I want to keep the original FASTA headers in the database. I built the database with mkbwt and mkfmi without changing the headers in the FASTA file. When I launch kaijup, it works perfectly for a while and then stops with a "Segmentation fault (core dumped)". I am using nr and a big query file, but I am monitoring resource usage and there is plenty available on my machine (nearly 300 GB of RAM is still free at the point of the core dump). Most interestingly, I relaunched it many times with exactly the same setup and it crashed at a different point each time, so I am not sure that the headers in the database are the cause. Do you have any idea what could cause this failure?
Also, have you considered adding a special mode for using kaiju simply as a BLAST alternative? It is extremely fast, and for many metagenomics tasks it would be very useful. If the output included the lengths of both aligned sequences, the original identifiers (FASTA headers), and the length of the match, it would be much more universal.
Thank you for your work!
Ksenia

about the database

Hi, thanks for this lovely program,

I'm wondering whether I can extract per-rank statistics, e.g. for each genus, from the latest downloaded NCBI BLAST nr+euk database.
Specifically, I would like to know the percentage of protein sequences and the exact number of amino acids for each genus.

Thanks much

OTU table

Hi,
I was wondering whether you already have a script to build an OTU table from kaiju output.
In my case I have used Kaiju to assign taxonomy to different bins.

Thanks,
david
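
A table of this kind can be assembled from the first three tab-separated columns of the kaiju output (classification flag, sequence name, taxon ID). The Python sketch below is one possible approach, not a script shipped with Kaiju; the sample names, the in-memory example data, and the function names are assumptions for illustration.

```python
import io
from collections import Counter

def count_taxa(lines):
    """Count classified sequences per taxon ID in one kaiju output file."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "C":  # skip unclassified (U) rows
            counts[fields[2]] += 1
    return counts

def build_table(samples):
    """samples: dict of sample name -> iterable of lines. Returns (header, rows)."""
    per_sample = {name: count_taxa(lines) for name, lines in samples.items()}
    names = sorted(per_sample)
    taxa = sorted({t for c in per_sample.values() for t in c}, key=int)
    header = ["taxon_id"] + names
    rows = [[t] + [per_sample[n][t] for n in names] for t in taxa]
    return header, rows

# Made-up example data standing in for two kaiju output files.
sample_a = io.StringIO("C\tread1\t1282\nC\tread2\t1282\nU\tread3\t0\n")
sample_b = io.StringIO("C\tread1\t31957\nC\tread2\t1282\n")
header, rows = build_table({"sampleA": sample_a, "sampleB": sample_b})

print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

For real data, each `io.StringIO` object would be replaced by an open file handle, one per sample or bin.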

missing convertNR ?

makeDB.sh in a fresh git clone (3/1/2016) has a dependency on convertNR (line 72) which is not part of the makefile nor in the utils/ directory.

Error when merging outputs from Kraken and Kaiju with -c option

command:
mergeOutputs -i 56Fnonhunman_kaiju_sort -j 56Fnonhunman_kraken.output_sort -v -c lowest -t /home/zong/programs_and_scripts/kaiju/nodes.dmp -o combined.out

terminate called after throwing an instance of 'std::out_of_range'
what(): _Map_base::at
Aborted (core dumped)

Merging succeeded with -c 1 and -c 2, but failed with -c lowest and -c lca.

thanks

Segmentation fault in various lines

Hello,

I'm running Kaiju on a cluster and I keep running into segfaults. A different line causes the segfault every time I run the script. The script and the error are below. Any ideas why this might be happening?

SCRIPT:

#!/bin/bash

#SBATCH --job-name="kaiju"
#SBACTH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=14-00:00:00
#SBATCH [email protected]
#SBATCH --mail-type=END,FAIL
#SBATCH -e slurm-%j.err-%N
#SBATCH -o slurm-%j.out-%N

module load kaiju/f81d2ca
module load db-kaiju/20170113-e

#change these paths according to where the project files are
SEQS=/rhome/taruna/shared/GOMRI/data-clean/GOM_concat1.7_allF04combo_10Jan14.fna
OUT=/bigdata/biklab/taruna/gomri-kaiju.out

#change output directory name

kaiju \
-t $KAIJU_DB \
-f $KAIJU_DB \
-i $SEQS \
-o $OUT

ERROR:

/var/spool/slurmd/job99594/slurm_script: line 28:  7035 Segmentation fault      kaiju -t $KAIJU_DB -f $KAIJU_DB -i /rhome/taruna/shared/GOMRI/data-clean/GOM_concat1.7_allF04combo_10Jan14.fna -o /bigdata/biklab/taruna/gomri-kaiju.out

Thanks for your help!

error:Taxon ID in output file is not contained in taxonomic tree file nodes.dmp

Hi
After running kaiju to annotate a marine virus database, I tried to use kt2krona to visualize the result. The program prints many "Taxon ID xx in output file is not contained in taxonomic tree file nodes.dmp" messages on the screen, and the sequences are mostly annotated as influenza virus. The nodes.dmp file is already the newest version (105.7 MB).

The warnings look like this:
Warning: Taxon ID 7006495 in output file is not contained in taxonomic tree file nodes.dmp.
Warning: Taxon ID 7517995 in output file is not contained in taxonomic tree file nodes.dmp.

and I can find the taxon ID in the .out file, for example:
C gene_24783|GeneMark.hmm|115_aa|-|336|683 7006495

So is this taxon ID in the names.dmp file or in the .fmi file?
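
To see how many taxon IDs are affected, the IDs in the kaiju output can be compared against the first column of nodes.dmp. The Python sketch below assumes the usual `|`-delimited layout of nodes.dmp and the tab-separated kaiju output; the example nodes.dmp lines are abbreviated, and the function names are hypothetical.

```python
import io

def load_node_ids(nodes_lines):
    """Collect the first column (taxon ID) of each nodes.dmp line."""
    ids = set()
    for line in nodes_lines:
        first = line.split("|", 1)[0].strip()
        if first:
            ids.add(first)
    return ids

def missing_taxon_ids(output_lines, node_ids):
    """Return taxon IDs of classified sequences that are absent from node_ids."""
    missing = set()
    for line in output_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "C" and fields[2] not in node_ids:
            missing.add(fields[2])
    return missing

# Abbreviated stand-ins for nodes.dmp and a kaiju output file.
nodes = io.StringIO("1\t|\t1\t|\tno rank\t|\n10239\t|\t1\t|\tsuperkingdom\t|\n")
output = io.StringIO(
    "C\tgene_24783|GeneMark.hmm|115_aa|-|336|683\t7006495\n"
    "C\tgene_1\t10239\n"
    "U\tgene_2\t0\n"
)
missing = missing_taxon_ids(output, load_node_ids(nodes))
print(sorted(missing))
```

If many IDs turn up as missing, that may indicate that the nodes.dmp file and the database index were built from different sources or taxonomy versions.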

make error

I get the following error when invoking make

[rosema1@demeter kaiju-master]$ cd src/
[rosema1@demeter src]$ make
make -C bwt/
make[1]: Entering directory `/home/rosema1/BioInfo/bin/kaiju-master/src/bwt'
gcc -O3 -g -Wno-unused-result -c -o mkbwt.o mkbwt.c
gcc -O3 -g -Wno-unused-result -c -o readFasta.o readFasta.c
gcc -O3 -g -Wno-unused-result -c -o suffixArray.o suffixArray.c
gcc -O3 -g -Wno-unused-result -c -o multikeyqsort.o multikeyqsort.c
gcc -O3 -g -Wno-unused-result -c -o sequence.o sequence.c
gcc mkbwt.o readFasta.o suffixArray.o multikeyqsort.o sequence.o -lpthread -lm -o mkbwt
gcc -O3 -g -Wno-unused-result -c -o mkfmi.o mkfmi.c
gcc -O3 -g -Wno-unused-result -c -o bwt.o bwt.c
gcc -O3 -g -Wno-unused-result -c -o compactfmi.o compactfmi.c
gcc mkfmi.o bwt.o suffixArray.o compactfmi.o -lpthread -lm -o mkfmi
make[1]: Leaving directory `/home/rosema1/BioInfo/bin/kaiju-master/src/bwt'
g++ -ansi -pedantic -O3 -pthread -std=c++11 -g -DNDEBUG -Wall -Wconversion -Wno-unused-function -I./include/ProducerConsumerQueue/src -I./include/ncbi-blast+ -c -o kaiju.o kaiju.cpp
cc1plus: error: unrecognized command line option "-std=c++11"
make: *** [kaiju.o] Error 1

Help would be appreciated.

Mark

Prefix all commands with kaiju

The following command names are too generic and can cause conflicts when they are on the $PATH. Please consider prefixing them, e.g. kaiju-makeDB.sh

addTaxonNames
convertNR
gbk2faa.pl
makeDB.sh
mergeOutputs
mkbwt
mkfmi

These are ok:

kaiju
kaiju2krona
kaijuReport
kaijup
kaijux

how to merge kaijuReport output files of multiple samples into one file?

Hi Peter
I have 16 kaijuReport output files and want to merge them.
Does kaiju have a program for that? If not, I suggest adding this function to kaiju.
Another question: does kaiju take genome size into account? Species with large genomes should have more reads sequenced than species with small genomes.
By the way, thanks for such a good tool. I tried Kraken, but fewer than 10% of the metagenomic reads were classified, while kaiju gives me about 65% with the greedy-5 mode and the nr database.

Core Dump after running kaiju

Missing argument -p for kaijuReport

kaijuReport doesn't take -p as an argument for printing full taxon paths.

OS X ld does not support --whole-archive option

Since 1.4.5, building kaiju from source fails on OSX.

gcc   mkbwt.o readFasta.o suffixArray.o multikeyqsort.o sequence.o  -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -lm -o mkbwt
ld: unknown option: --whole-archive
clang: error: linker command failed with exit code 1 (use -v to see invocation)

I managed to build using -all_load and -noall_load instead of --whole-archive and --no-whole-archive, respectively, but this will not build on Linux.

I guess you could use something like

OS := $(shell uname)
ifeq ($(OS),Darwin)
LDLIBS = -Wl,-all_load -lpthread -Wl,-noall_load -lm 
else
LDLIBS = -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -lm
endif

in both makefiles as a quick fix to the problem, or use autoconf to build portable binaries.

Hope this helps,
Hadrien.

PS: Thank you for your work, your software is awesome! 🎉 🐨 😃

Problem with rank Order.

Hi ,
While running kaiju I noticed a problem with the class and order ranks. See the following example:

C       k141_1685_2     169292  689     169292, WP_049377218.1, TELRRFRSDQGVKPSQKVPGRLDFAAADLAGQEELVRNLANTTAPGEDFDPSASIEVRLSQATVEVTLDTHGAVDVEAERKRLEKDLAKANKELEQTGKKLGNENFLSKAPEEVVNKIKERQQIAREEVERITSRLEGLK,        Actinobacteria; Corynebacteriales; Actinobacteria; Corynebacteriaceae; Corynebacterium; Corynebacterium aurimucosum;

NCBI taxid 169292 corresponds to:
cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Corynebacteriales; Corynebacteriaceae; Corynebacterium; Corynebacterium aurimucosum

I'm trying to get "Phylum","Class","Order","Family","Genus","Species" through the command
addTaxonNames -t nodes.dmp -n names.dmp -i kaiju_output.ncbi_id -r phylum,order,class,family,genus,species

According to NCBI, the order is Corynebacteriales, but Kaiju returns Actinobacteria.

Maybe I'm doing something wrong? I have downloaded taxdump.tar.gz as suggested.

Plasmids as a reference database

Is it possible to add plasmid protein sequences from RefSeq as a reference database?
ftp://ftp.ncbi.nih.gov/refseq/release/plasmid/

SEG filtering on the nr protein database?

Hi,

Thanks for Kaiju, it's very nice!

Is the SEG filtering for low-complexity sequences performed on the nr database when makeDB.sh is run? I understand that I can add this functionality to my input sequences when running kaiju -x, but I'm wondering if the nr database has also been filtered? If not, would you suggest running SEG independently on the nr sequences before building the kaiju database?

Thanks!

Output only one taxon with kaijuReport

This is a feature request to specify which taxon is shown with kaijuReport. For example only bacteria or archaea.

Feature request: add full path to kaijuReport

Is it possible to add full taxonomic path up to the specified rank?
For example if the rank is family then also columns for kingdom, phylum, class and order are inserted.

Thanks,
Kristjan

Documentation: -m / -c options are mutually exclusive

I was initially puzzled about the -c and -m options since the usage (below) does not make explicit that they are mutually exclusive. The error message when using both is fine. I suppose it's obvious in retrospect, but the initial usage could spell this out.

makeDB.sh: wget --show-progress not in old versions of wget

I'm on a system using wget version 1.12. Running makeDB.sh with a version of wget older than 1.16 causes the script to fail. A suggested workaround can be found here:
http://stackoverflow.com/questions/4686464/how-to-show-wget-progress-bar-only/32491843#32491843

Perhaps makeDB.sh could be updated to support older versions of wget?

Compiling error (tried using gcc version 5.1.1 and 5.4.0):

Hi, I tried updating the compiler to 5.4.0 but still get the same error.
This link suggests that the problem could be that -lpthread is included several times although I fail to see where.

make
make -C bwt/
make[1]: Entering directory `/project/genomics/Gisle/Bin/Programs/kaiju-master/src/bwt'
gcc -O3 -g -Wno-unused-result   -c -o mkbwt.o mkbwt.c
gcc -O3 -g -Wno-unused-result   -c -o readFasta.o readFasta.c
gcc -O3 -g -Wno-unused-result   -c -o suffixArray.o suffixArray.c
gcc -O3 -g -Wno-unused-result   -c -o multikeyqsort.o multikeyqsort.c
gcc -O3 -g -Wno-unused-result   -c -o sequence.o sequence.c
gcc   mkbwt.o readFasta.o suffixArray.o multikeyqsort.o sequence.o  -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -lm -o mkbwt
/usr/lib64/libpthread_nonshared.a(pthread_atfork.oS): In function `__pthread_atfork':
/usr/src/packages/BUILD/glibc-2.11.3/nptl/pthread_atfork.c:57: multiple definition of `__pthread_atfork'
/usr/lib64/libpthread_nonshared.a(pthread_atfork.oS):/usr/src/packages/BUILD/glibc-2.11.3/nptl/pthread_atfork.c:57: first defined here
/usr/lib64/libpthread_nonshared.a(pthread_atfork.oS): In function `__pthread_atfork':
/usr/src/packages/BUILD/glibc-2.11.3/nptl/pthread_atfork.c:57: multiple definition of `pthread_atfork'
/usr/lib64/libpthread_nonshared.a(pthread_atfork.oS):/usr/src/packages/BUILD/glibc-2.11.3/nptl/pthread_atfork.c:57: first defined here
collect2: error: ld returned 1 exit status
make[1]: *** [mkbwt] Error 1
make[1]: Leaving directory `/project/genomics/Gisle/Bin/Programs/kaiju-master/src/bwt'
make: *** [bwt/mkbwt] Error 2

Found bad number (out of range error)

I was trying kaiju out using the kaiju_index.tgz db and ran the command:
kaiju-v1.5.0-linux-x86_64-static/bin/kaiju -t *.dmp -f *.fmi -i firstfile.fq -j secondfile.fq -z 32

Please advise whether this is expected.

Thank you.

Contigs classification

Hi,
I have used a set of contigs (corresponding to one specific bin) to assign taxonomy.
The report file can be misleading if the input data are contigs or bins.

I have attached a screenshot of my xls file, in which I computed the summary from the kaiju output.
Kaiju says that 13.29% of reads correspond to B. subtilis.
The problem is that all contigs are counted equally, meaning that the length of a contig is not taken into account. If you instead sum the lengths of the contigs (column "sum" in the attached file), you can see that B. subtilis represents 84.55% of the total length.
In this case the sample is a synthetic mock that contains B. subtilis, so 84% is closer to reality.
I think that when the input consists of contigs or other FASTA sequences rather than reads, the summary file should report the sum of lengths rather than the number of reads classified, or at least issue a warning, since the counts could be misleading.
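
The difference between the two summaries can be illustrated with a short Python sketch. It is not Kaiju code; the contig lengths below are made up and do not reproduce the numbers in the attached file, and `summarize` is a hypothetical helper.

```python
from collections import defaultdict

def summarize(assignments):
    """assignments: iterable of (taxon_name, contig_length) pairs.

    Returns taxon -> (percent of contigs, percent of total length).
    """
    counts = defaultdict(int)
    lengths = defaultdict(int)
    for taxon, length in assignments:
        counts[taxon] += 1
        lengths[taxon] += length
    total_n = sum(counts.values())
    total_len = sum(lengths.values())
    return {
        taxon: (100 * counts[taxon] / total_n, 100 * lengths[taxon] / total_len)
        for taxon in counts
    }

# Made-up example: one long contig and three short ones.
contigs = [
    ("Bacillus subtilis", 90000),
    ("other", 2000),
    ("other", 3000),
    ("other", 5000),
]
summary = summarize(contigs)
for taxon, (by_count, by_length) in summary.items():
    print(f"{taxon}\t{by_count:.2f}% of contigs\t{by_length:.2f}% of length")
```

In this toy case the long contig accounts for a quarter of the contigs but most of the sequence length, which is the kind of discrepancy described above.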

Thanks

Kaiju NR

Is it possible to include all the eukaryotic sequences in the kaiju database as well when building the database from the NCBI NR FASTA file? Currently convertNR seems to discard anything that is not prokaryotic or viral. I expect certain metagenomes that I am analyzing to contain fungi and protists (based on Kraken results), which I hope to classify using your wonderful tool.

Taxonomy assignment

Hi,
I have the following DNA sequence that cannot be properly assigned with Kaiju. NCBI BLAST returns Staphylococcus epidermidis. I used the Kaiju web server with default parameters and the nr+euk database, and I submitted only this sequence.

Thanks ,

>k141_447_50 # 42411 # 43247 # -1 # ID=423_50;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.323
ATGGTGAATTATAAAGAGAAGTTTGCAGAAGCCAAGACAATTGCCGTAAATGAGGGGTTTGAATCAACTC
GTGCCGAATGGTTGTTTTTAGATGTTTTTGGTTGGTCGAAAACAGATTATTTAATTCATAAAGATGAGCA
AATGTCTTTGACATCAATTAACAAATTGGATAAAGCGTTGGATAGAATGATCACAGGAGAACCTATTCAA
TACATTGTTGGATTTCAGTCTTTTTATGGTTATCAATATAAAGTGAATCAACACTGTCTTATACCAAGGC
CTGAAACCGAGGAAGTTATGTTGCATTTTTTAGAATTGTGTAAAAAGACTGATACCATAGCAGATATTGG
AACTGGAAGTGGTGCTATAGCAATTACGCTTAAGTTACTGCAACCTGAATTAAATGTTATTGCAACAGAT
TTGTATGAAGATGCTTTAAATGTAGCTAAGCAAAATGCTAGTCATTATCACCAAAATATTCAGTTTTTGC
GTGGAAATGCTTTAAAACCGCTAATTGAAAATGATATAAAATTGGATGGGCTGATATCTAATCCACCATA
CATAGGCCATAGTGAAATAATAGATATGGAGTCAACAGTACTAAATTATGAGCCACATCATGCTCTATTT
GCTGAGAAAAACGGATTTGCTATTTATGAGTCAATATTAGAAGATTTACCATTTGTAATGAAACAAGGTG
GACATGTTGTTTTTGAAATAGGTTATAGTCAAGGAGATATCTTAAAAAGAATGATTCAAGATTTATATCC
TGAAAAAGAAGTAGAGATTTTCAAAGATATCAATGGAAATCAGCGTATTATATCTATTATTTGGTAG

Taxonomy assignment of predicted proteins

Hi,
I'm running an Illumina metagenomics WGS experiment (2x150bp).
I have assembled my metagenomes with Megahit, predicted genes using prodigal, and converted those to proteins.
Now I would like to assign taxonomy to the proteins and was wondering whether kaiju is suitable for that (using RefSeq as the database).

(A second option would be to run HMM searches against a marker gene database, but I find this too restrictive at the moment since some species may be missed, although there are at least 4 to 5 marker gene databases out there that could be used.)

Thanks for your advice.
david

KaijuReport not assigning viruses to any taxonomy rank

Hi,

I'm having an issue where kaijuReport doesn't provide any classification below Viruses. The output just says the following, regardless of which taxonomic rank I use:

46.769950 216000 Viruses
2.405404 11109 cannot be assigned to a order

Any idea why this is and how to fix it?

Thanks

Kaiju and Conda

Hi,

I've been trying to get Kaiju to work as a package in bioconda, but I've been having some problems with the makeDB script. The issue is explained here: https://bitbucket.org/snakemake/snakemake/issues/465/conda-and-symbolic-links. Could you maybe consider other solutions than using e.g. $SCRIPTDIR/convertNR, such as first checking if it's on PATH? Or maybe there can be a solution with readlink? Thanks!

Cheers,
Rasmus
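
The two suggestions above (check PATH first, otherwise resolve symbolic links before deriving the script directory) can be sketched as follows. This is only an illustration of the lookup order in Python, not the makeDB script itself. The function `locate_helper` and the file names in the demo are hypothetical.

```python
import os
import shutil
import tempfile

def locate_helper(name, script_path):
    """Find a helper program such as convertNR.

    1. Prefer an executable of that name found on PATH.
    2. Otherwise look next to the script, after resolving symbolic links.
    Returns the path, or None if neither lookup succeeds.
    """
    on_path = shutil.which(name)
    if on_path is not None:
        return on_path
    real_dir = os.path.dirname(os.path.realpath(script_path))
    candidate = os.path.join(real_dir, name)
    return candidate if os.path.isfile(candidate) else None

# Demo with made-up names: the script is called through a symlink that lives
# in a different directory than the helper program.
with tempfile.TemporaryDirectory() as tmp:
    real_dir = os.path.join(tmp, "real")
    link_dir = os.path.join(tmp, "links")
    os.mkdir(real_dir)
    os.mkdir(link_dir)
    helper = "hypothetical_helper_not_on_path"
    for name in ("script.sh", helper):
        open(os.path.join(real_dir, name), "w").close()
    link = os.path.join(link_dir, "script.sh")
    os.symlink(os.path.join(real_dir, "script.sh"), link)
    found = locate_helper(helper, link)
    expected = os.path.join(os.path.realpath(real_dir), helper)

print(found == expected)
```

Without the `realpath` step the lookup would search the directory that holds the symlink, which is the situation described in the linked issue.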

Doc on best match

Hi,
In the kaiju help explaining kaiju output it says:

4- the length or score of the best match used for classification.

Does this correspond to the total length of the best target or only to the length of the aligned part of the target? It's not clear to me which length is being reported. And if it is a score, which score is it?

Thanks for clarification,

error in kaiju.report

./kaijuReport -t kaiju_db/kaiju_index/nodes.dmp -n kaiju_db/kaiju_index/names.dmp -i RefSeq_13_S28.kaiju.out.csv -r species -o RefSeq_13_S28.kaiju.out.summary

I'm getting these lines for viruses:

`
% reads species

3.420722 1661 Alistipes finegoldii
0.988529 480 Bacteroides cellulosilyticus
0.683733 332 Bacteroides fragilis
0.595177 289 Bacteroides thetaiotaomicron
0.570464 277 Bacteroides ovatus
0.313034 152 Bacteroides caecimuris

66.775131 32424 Viruses
3.562823 1730 cannot be assigned to a species

21.152461 10271 unclassified
`
It seems that kaiju can't split viruses into species in the report file, but is able to do so in the Krona file.

makedb.sh -e failing

I have attempted to build and use the makedb.sh -e kaiju database. It has failed with the following output:

$ ls -l ~/kaijudb_e/kaijudb_e/
total 167947636
-rw-rw-r-- 1 ubuntu ubuntu 33420490846 Mar 17 02:25 kaiju_db_nr_euk.bwt
-rw-rw-r-- 1 ubuntu ubuntu 35484801934 Mar 17 01:00 kaiju_db_nr_euk.faa
-rw-rw-r-- 1 ubuntu ubuntu 48369697476 Mar 17 02:37 kaiju_db_nr_euk.fmi
-rw-rw-r-- 1 ubuntu ubuntu 9380484230 Mar 17 02:25 kaiju_db_nr_euk.sa
-rw-r--r-- 1 ubuntu ubuntu 837101 Mar 16 23:20 merged.dmp
-rw-r--r-- 1 ubuntu ubuntu 138034822 Mar 16 23:20 names.dmp
-rw-r--r-- 1 ubuntu ubuntu 107384758 Mar 16 23:20 nodes.dmp
-rw-rw-r-- 1 ubuntu ubuntu 27940725654 Mar 16 07:45 nr.gz
-rw-rw-r-- 1 ubuntu ubuntu 14285367259 Mar 16 23:57 prot.accession2taxid
-rw-rw-r-- 1 ubuntu ubuntu 2811685571 Mar 12 08:38 prot.accession2taxid.gz
-rw-rw-r-- 1 ubuntu ubuntu 38816191 Mar 16 23:20 taxdump.tar.gz
$ ~/kaiju/bin/kaiju -t ~/kaijudb_e/kaijudb_e/nodes.dmp -f ~/kaijudb_e/kaijudb_e/kaiju_db.fmi -i /mnt/work/hisat/unaligned/unaligned_SRR926282qc.fq -v -o kaiju_e_test
03:20:10 Reading database
Reading taxonomic tree from file /home/ubuntu/kaijudb_e/kaijudb_e/nodes.dmp
Reading index from file /home/ubuntu/kaijudb_e/kaijudb_e/kaiju_db.fmi
Could not open file /home/ubuntu/kaijudb_e/kaijudb_e/kaiju_db.fmi
Kaiju 1.5.0
Copyright 2015,2016 Peter Menzel, Anders Krogh
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html

Usage:
/home/ubuntu/kaiju/bin/kaiju -t nodes.dmp -f kaiju_db.fmi -i reads.fastq [-j reads2.fastq]

Mandatory arguments:
-t FILENAME Name of nodes.dmp file
-f FILENAME Name of database (.fmi) file
-i FILENAME Name of input file containing reads in FASTA or FASTQ format

Optional arguments:
-j FILENAME Name of second input file for paired-end reads
-o FILENAME Name of output file. If not specified, output will be printed to STDOUT
-z INT Number of parallel threads (default: 1)
-a STRING Run mode, either "mem" or "greedy" (default: mem)
-e INT Number of mismatches allowed in Greedy mode (default: 0)
-m INT Minimum match length (default: 11)
-s INT Minimum match score in Greedy mode (default: 65)
-x Enable SEG low complexity filter
-p Input sequences are protein sequences
-v Enable verbose output

I had ample RAM to build the database, and the hard drive has an extra ~43 GB of space after the makedb.sh -e command finishes. Additionally, the output of makedb.sh -e informs me that the building process has finished.

I ran makedb.sh -p and it worked fine.

Any guidance on this issue would be greatly appreciated.
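
One thing visible in the output above: the directory listing shows an index named kaiju_db_nr_euk.fmi, while the command passes kaiju_db.fmi to -f, which may be why the file could not be opened. A quick check along these lines can be sketched as below. The helper `resolve_index` is hypothetical and not part of Kaiju, and the demo uses a temporary directory rather than the real database path.

```python
import os
import tempfile

def resolve_index(requested, suffix=".fmi"):
    """Return (path, candidates).

    path is `requested` if that file exists, otherwise None.
    candidates lists files with the same suffix in the same directory,
    which helps to spot a differently named index.
    """
    if os.path.isfile(requested):
        return requested, []
    db_dir = os.path.dirname(requested) or "."
    if not os.path.isdir(db_dir):
        return None, []
    candidates = sorted(
        os.path.join(db_dir, name)
        for name in os.listdir(db_dir)
        if name.endswith(suffix)
    )
    return None, candidates

# Demo: the directory only contains the index name from the listing above.
with tempfile.TemporaryDirectory() as db_dir:
    open(os.path.join(db_dir, "kaiju_db_nr_euk.fmi"), "w").close()
    path, candidates = resolve_index(os.path.join(db_dir, "kaiju_db.fmi"))
    candidate_names = [os.path.basename(c) for c in candidates]

print(path, candidate_names)
```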

verbose (-v) output "accession number" not shown

I've enabled verbose output mode because I want to know the accession number of the sequence with the best match. However, the 6th column (the accession numbers of all database sequences with the best match) is missing from my output, while all other fields match your description of the verbose output. Here is an example of the output I get:

U readname1 0
U readname2 0
C readname3 5125 18 5113, PEYNILKNAYSLLGFKHS,
C readname4 930090 33 930090, ENPLKSRHDICFNYKNFRGCYLWTSKTTGKQYI,
C readname5 930090 33 930090, ENPLKSRHDICFNYKNFRGCYLWTSKTTGKQYI,

Your description:
1 either C or U, indicating whether the read is classified or unclassified.
2 name of the read
3 NCBI taxon identifier of the assigned taxon
4 the length or score of the best match used for classification
5 the taxon identifiers of all database sequences with the best match
6 the accession numbers of all database sequences with the best match
7 matching fragment sequence(s)

As you can see, there is no column matching description number 6.

Thank you
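
The mismatch can be made explicit by counting columns. The sketch below assumes the output is tab-separated (the tabs are not visible in the pasted example) and uses field names invented here to mirror the seven-item description. The accession `ACC0001` in the second line is made up to show what a complete line would look like.

```python
# Field names are hypothetical labels for the seven documented columns.
FIELDS = [
    "status",                 # 1: C or U
    "read_name",              # 2
    "taxon_id",               # 3
    "match_length_or_score",  # 4
    "match_taxon_ids",        # 5
    "match_accessions",       # 6
    "match_fragments",        # 7
]

def parse_verbose_line(line, sep="\t"):
    """Map a line onto the documented columns if the count matches.

    Lines with fewer columns (unclassified reads, or output lacking a
    column) are returned with the raw columns so the gap stays visible.
    """
    parts = line.rstrip("\n").split(sep)
    if len(parts) == len(FIELDS):
        return dict(zip(FIELDS, parts))
    return {"status": parts[0], "raw_columns": parts}

# Line as reported above (assuming tabs): only six columns.
reported = parse_verbose_line("C\treadname3\t5125\t18\t5113,\tPEYNILKNAYSLLGFKHS,")
# Made-up line with all seven documented columns.
complete = parse_verbose_line(
    "C\treadname3\t5125\t18\t5113,\tACC0001,\tPEYNILKNAYSLLGFKHS,"
)

print(len(reported["raw_columns"]), complete["match_accessions"])
```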

Compilation issues

Hi, I'm trying to compile the program, but I'm facing this error because the header file which is included in file hspfilter_collector.c is not available:
error: No rule to make target '/algo/blast/core/blast_hits_priv.h', needed by 'obj/hspfilter_collector.o'.

-Gazal

Wrong unclassified pct. value in report file

The percentage of the unclassified sequences seems to be wrong when using kaijuReport. E.g. the script reports 18.13% though there are 3888 of 17555 unclassified sequences, so 22.15% would be correct. Other percentage values seem to be calculated correctly.

Edit
Used version: release 1.4

Version 1.4-6 reports correct percentage values.
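
For reference, the expected value quoted above can be reproduced with a short calculation. This only checks the arithmetic from the counts given in the report and says nothing about how kaijuReport computes it internally.

```python
unclassified = 3888   # unclassified sequences, from the report above
total = 17555         # all sequences, from the report above

pct = 100.0 * unclassified / total
print(f"{pct:.2f}%")
```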
