When using the addTaxonNames helper function to add the full taxon path to each sequence, the number of elements in the taxon path is inconsistent, making it very difficult to consistently identify a given taxon level when working with the data in other software (e.g. R). For example, to extract all sequences annotated as belonging to a certain family, I would parse the full taxon path into substrings based on the semicolon divider - but the position of 'family' in the full taxon path ranges from the 5th to 10th element depending on the sequence. It would be possible to extract family names by looking for the -aceae ending, but this does not work for genera or for taxonomic levels that do not have a consistent naming convention. Thus it is currently impossible to do this type of analysis where simultaneous information on multiple taxonomic levels is needed.
Similarly, the most precise taxonomic identification is always added as the final element of the full taxon path, so there is always a duplicate element added that makes it difficult to interpret whether a given taxon name is the name assigned at a given rank, or if that is simply the most precise rank available, or is in fact some other type of annotation (for example "environmental samples" is given as a taxonomic rank in some cases).
Would it be possible to add an option to force the full taxon path output from addTaxonNames to include only a given subset of taxonomic levels (e.g. phylum/class/order/family/genus/species), or to force consistency so that a given element within the taxon path always represents a given taxonomic level (e.g. if family is always the 10th element it can be extracted).
And would it be possible to add an option to suppress the duplicate reporting of the most precise taxonomic identification possible as part of the full taxon path? This could still be reported in a separate field but as it stands it is placed at the end of the full taxon path.
Here is an example of the output of addTaxonNames for a dataset analyzed using kaiju and the nr+euk database, the same issue is present when using the complete genomes and nr databases.
C M02360:6:000000000-AC61R:1:1101:22111:5800 1033 341 1033, SLQPNSGAQGEYAGLLAIRAYHRSRGEAHRKVCLIPSSAHGTNPASASMVGMDVVVVACDARGDVDVEDLR, cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Rhizobiales; Bradyrhizobiaceae; Afipia; Afipia; C M02360:6:000000000-AC61R:1:1101:23444:5841 418856 72 418856, PAVLPDDQTYTHR, cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Nevskiales; Sinobacteraceae; Nevskia; Nevskia soli; Nevskia soli; C M02360:6:000000000-AC61R:1:1101:14687:5896 1248916 258 1248916, IEKESSVNEAIDRMRHSATRALLERDDVIIVASVSCLYGIGSVETYSAMTFALK,IEKESSVNEAIDRMRHSATRALLERDDVIIVASVSCLYGIGSVETYSAMTFALK, cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Sandarakinorhabdus; Sandarakinorhabdus sp. AAP62; Sandarakinorhabdus sp. AAP62; C M02360:6:000000000-AC61R:1:1101:25907:5899 1262833 73 1262833, DILIIGGGITGLSSAYF,DILIIGGGITGLSSAYF, cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium; environmental samples; Clostridium sp. CAG:710; Clostridium sp. CAG:710; C M02360:6:000000000-AC61R:1:1101:17581:5925 1302620 246 1302620, DHAAETGAAIPTEPVVFMKDPSTVVGPFDEVLVPRGSTKTDWEVELGVVIG,DHAAETGAAIPTEPVVFMKDPSTVVGPFDEVLVPRGSTKTDWEVELGVVIG, cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Micrococcales; Cellulomonadaceae; Cellulomonas; Cellulomonas sp. URHD0024; Cellulomonas sp. URHD0024; C M02360:6:000000000-AC61R:1:1101:17429:5938 1016849 244 1016849, YNPETDGKELVRAFRNIPGVETSSVFALNLLQLAPGGHLGRFIIWTSSAFSAL,YNPETDGKELVRAFRNIPGVETSSVFALNLLQLAPGGHLGRFIIWTSSAFSAL, cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya; Ascomycota; saccharomyceta; Pezizomycotina; leotiomyceta; Eurotiomycetes; Chaetothyriomycetidae; Chaetothyriales; Herpotrichiellaceae; Exophiala; Exophiala sideris; Exophiala sideris; C M02360:6:000000000-AC61R:1:1101:23427:5942 1760 122 1760,1883,1888,1892,1907,1930,1938,1963,29303,35619,39478,40318,44060,47758,49185,54571,55952,58344,66378,66874,66875,67257,67298,67373,68213,68231,68249,68268,73044,76728,78355,83656,89050,100226,105425,114687,131568,146537,159449,193462,227882,249567,285535,352211,455632,457427,498367,500153,645465,661399,1055352,1078086,1136432,1156844,1157634,1157635,1157637,1172179,1203592,1288080,1288083,1380346,1380770,1428628,1463841,1463850,1463853,1463856,1463857,1463858,1463877,1463888,1463901,1463917,1463920,1463926,1463932,1576605,1580539,1592326,1592327,1592330,1592727,1650571,1736503, GRWKAVVVGSYERGDRAVTVQRLAELADFY, cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Actinobacteria; C M02360:6:000000000-AC61R:1:1101:20712:8466 290425 165 302406,310575, IVVLTPGMYNSAYFEHTFLAQQMGVELVEGKDL,IVVLTPGMYNSAYFEHTFLAQQMGVELVEGKDL,TIVVLTPGMYNSAYFEHTFLAQQMGVELVEGKDL,TIVVLTPGMYNSAYFEHTFLAQQMGVELVEGKDL, cellular organisms; Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Alcaligenaceae; Advenella; Advenella; C M02360:6:000000000-AC61R:1:1101:13179:8502 1391654 91 1391654, RQMIQAHIAKKATATVAAIPVPR,RQMIQAHIAKKATATVAAIPVPR, cellular organisms; Bacteria; Proteobacteria; delta/epsilon subdivisions; Deltaproteobacteria; Myxococcales; Sorangiineae; Labilitrichaceae; Labilithrix; Labilithrix luteola; Labilithrix luteola; C M02360:6:000000000-AC61R:1:1101:16385:8539 1120523 173 1120523, LAPNTSVTCTANYTVTQADVDSGKVTNTATATGTPPTG, cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces; Streptomyces sp. MJM8645; Streptomyces sp. MJM8645;