timkahlke / basta

Basic Sequence Taxonomy Annotator

License: GNU General Public License v3.0

Python 100.00%

basta's People


maxibor dnieuw sdwfrost bioinfo-dirty-jobs1 vsnishtala liupfskygre genostack kebarr lwwal78 tkahlke davised

basta's Issues

Question: is the mapping database necessary?

Hi, timkahlke

According to the docs, the mapping database may be optional?
# download and set up genbank and uniprot mappings
# NOTE: this might not be needed for you. See Wiki for details
basta download gb
basta download prot

But when running BASTA, a mapping database must be given:

# Infer one LCA for each query sequence of blast against uniprot
basta sequence BLAST_OUTPUT_FILE BASTA_OUTPUT_FILE prot
# Infer one LCA for the complete blast output file
basta single BLAST_OUTPUT_FILE prot
# Infer one LCA for each blast output file in a given directory
basta multiple BLAST_OUTPUT_DIRECTORY BASTA_OUTPUT_FILE prot

I just want to use BASTA to compute the LCA from DIAMOND output. Is the mapping parameter required? Thanks.

multiple doesn't work

Line 101 in AssignTaxonomy.py has to be changed to:

lca = self._assign_single(os.path.join(blast_dir,bf),db_file,best)
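
The join matters because os.listdir returns bare file names, so opening them without the directory prefix only works when the current working directory happens to be the BLAST directory. A minimal sketch of the idea, using a hypothetical helper `list_blast_files` that is not part of BASTA:

```python
import os


def list_blast_files(blast_dir):
    """Return full paths of the regular files in blast_dir (hypothetical helper)."""
    paths = []
    for name in sorted(os.listdir(blast_dir)):
        # os.listdir yields bare names such as "sample1.tab"; without the join
        # the file would be looked up relative to the current working directory.
        full_path = os.path.join(blast_dir, name)
        if os.path.isfile(full_path):
            paths.append(full_path)
    return paths
```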

Why doesn't BASTA assign taxonomy with only 1 hit?

Hi!
I'm trying to assign taxonomy to my data using a custom database built from GenBank and local sequences, with a high percentage of similarity (99%).
I ran blastn to obtain all hits to my sequences with >99% similarity and then passed the result to BASTA to obtain the LCA taxonomy.
I was confused by how few matches resulted, so I ran the analysis with the verbose option to see whether the taxonomy of my hits was very divergent, and I noticed that none of my sequences with only one blastn hit were assigned the taxonomy of that hit. Is there any way to change this?
Best regards,
Marion

Multi thread usage

Hello:
My DIAMOND output is a file of about 10 GB, and I next want to use BASTA to estimate the species in it.
However, BASTA has been running for 10 days.
Could BASTA use more threads to make it faster?
What can I do to speed it up?
Thanks
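
One possible workaround, sketched here under assumptions rather than taken from BASTA itself, is to split the large hit table into smaller files and run `basta sequence` on each piece. The hypothetical helper `split_by_query` below assumes tab-separated output in which all hits of a query sit on consecutive lines, so that no query is split across two chunks:

```python
import itertools


def split_by_query(in_path, out_prefix, queries_per_chunk=100000):
    """Write hits to chunk files without splitting any query's hits (sketch)."""
    chunk_paths = []
    out = None
    with open(in_path) as handle:
        # Group consecutive lines that share the same first column (query ID).
        groups = itertools.groupby(handle, key=lambda line: line.split("\t", 1)[0])
        for index, (_, hits) in enumerate(groups):
            if index % queries_per_chunk == 0:
                if out:
                    out.close()
                chunk_paths.append("%s.%04d.tsv" % (out_prefix, len(chunk_paths)))
                out = open(chunk_paths[-1], "w")
            out.writelines(hits)
    if out:
        out.close()
    return chunk_paths
```

The per-chunk BASTA outputs can be concatenated afterwards, since each query is annotated independently in `sequence` mode.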

Conda installation error

Conda installation on MacOS returns this error:

CondaVerificationError: The package for krona located at /Users/tomasz/miniconda3/pkgs/krona-2.7.1-pl526_1
appears to be corrupted. The path 'opt/krona/lib/._KronaTools.pm'
specified in the package manifest cannot be found.

'filter' object is not subscriptable

I installed BASTA from the Conda package (Python 3) but I am not able to set up the taxonomy.

taxdump.tar.gz.md5 100%[===============================================================>] 49 --.-KB/s in 0s

2022-02-15 15:38:52 (5.04 MB/s) - ‘/root/.basta/taxonomy/taxdump.tar.gz.md5’ saved [49]

[BASTA STATUS] Checking MD5 sum of file

Traceback (most recent call last):
  File "/opt/miniconda/envs/basta_py3/bin/basta", line 4, in <module>
    __import__('pkg_resources').run_script('BASTA==1.4', 'basta')
  File "/opt/miniconda/envs/basta_py3/lib/python3.10/site-packages/pkg_resources/__init__.py", line 662, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/opt/miniconda/envs/basta_py3/lib/python3.10/site-packages/pkg_resources/__init__.py", line 1459, in run_script
    exec(code, namespace, namespace)
  File "/opt/miniconda/envs/basta_py3/lib/python3.10/site-packages/BASTA-1.4-py3.10.egg/EGG-INFO/scripts/basta", line 118, in <module>
    main.run_basta(args)
  File "/opt/miniconda/envs/basta_py3/lib/python3.10/site-packages/BASTA-1.4-py3.10.egg/basta/BastaMain.py", line 89, in run_basta
    self._basta_taxonomy(args)
  File "/opt/miniconda/envs/basta_py3/lib/python3.10/site-packages/BASTA-1.4-py3.10.egg/basta/BastaMain.py", line 186, in _basta_taxonomy
    dutils.down_and_check("ftp://ftp.ncbi.nih.gov/pub/taxonomy/","taxdump.tar.gz",args.directory)
  File "/opt/miniconda/envs/basta_py3/lib/python3.10/site-packages/BASTA-1.4-py3.10.egg/basta/DownloadUtils.py", line 60, in down_and_check
    while(check_md5(md5,out_dir)):
  File "/opt/miniconda/envs/basta_py3/lib/python3.10/site-packages/BASTA-1.4-py3.10.egg/basta/DownloadUtils.py", line 46, in check_md5
    filehash.update(open(os.path.join(path,l[1])).read())
TypeError: 'filter' object is not subscriptable

Can you help me understand the problem?
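
For context: in Python 3, filter() returns a lazy iterator instead of a list, so indexing its result (the `l[1]` in the last traceback frame) raises exactly this TypeError. The sketch below shows one way such a checksum test could be written for Python 3. It assumes the .md5 file holds a hash followed by a file name on its first line; the function name `md5_matches` is hypothetical and this is not BASTA's actual code.

```python
import hashlib
import os


def md5_matches(md5_file, directory):
    """Compare the checksum stored in md5_file with the file it names."""
    with open(md5_file) as handle:
        # str.split() with no argument already drops empty fields, so no
        # filter() call (a lazy iterator in Python 3) is needed.
        fields = handle.readline().split()
    expected, file_name = fields[0], fields[1]

    digest = hashlib.md5()
    # Read in binary mode: hashlib only accepts bytes in Python 3.
    with open(os.path.join(directory, file_name), "rb") as data:
        for chunk in iter(lambda: data.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```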

No taxon found ...

"No taxon found" printed multiple times for the same non-found taxon.

Installation in conda

Hi,

Upon attempting to install basta in conda I got the following error. The yaml file seems to be badly interpreted.
$ wget https://github.com/timkahlke/BASTA/blob/master/environment_linux.yml
$ conda env create -f environment_linux.yml

...

$ /data/anaconda3/bin/conda-env create -f ./environment_linux.yml

Traceback (most recent call last):
  File "/data/anaconda3/lib/python3.6/site-packages/conda/exceptions.py", line 640, in conda_exception_handler
    return_value = func(*args, **kwargs)
  File "/data/anaconda3/lib/python3.6/site-packages/conda_env/cli/main_create.py", line 78, in execute
    directory=os.getcwd())
  File "/data/anaconda3/lib/python3.6/site-packages/conda_env/specs/__init__.py", line 20, in detect
    if spec.can_handle():
  File "/data/anaconda3/lib/python3.6/site-packages/conda_env/specs/yaml_file.py", line 14, in can_handle
    self._environment = env.from_file(self.filename)
  File "/data/anaconda3/lib/python3.6/site-packages/conda_env/env.py", line 80, in from_file
    return from_yaml(yamlstr, filename=filename)
  File "/data/anaconda3/lib/python3.6/site-packages/conda_env/env.py", line 68, in from_yaml
    data = yaml.load(yamlstr)
  File "/data/anaconda3/lib/python3.6/site-packages/ruamel_yaml/main.py", line 75, in load
    return loader.get_single_data()
  File "/data/anaconda3/lib/python3.6/site-packages/ruamel_yaml/constructor.py", line 60, in get_single_data
    node = self.get_single_node()
  File "/data/anaconda3/lib/python3.6/site-packages/ruamel_yaml/composer.py", line 53, in get_single_node
    document = self.compose_document()
  File "/data/anaconda3/lib/python3.6/site-packages/ruamel_yaml/composer.py", line 76, in compose_document
    self.get_event()
  File "/data/anaconda3/lib/python3.6/site-packages/ruamel_yaml/parser.py", line 136, in get_event
    self.current_event = self.state()
  File "/data/anaconda3/lib/python3.6/site-packages/ruamel_yaml/parser.py", line 215, in parse_document_end
    token = self.peek_token()
  File "/data/anaconda3/lib/python3.6/site-packages/ruamel_yaml/scanner.py", line 144, in peek_token
    self.fetch_more_tokens()
  File "/data/anaconda3/lib/python3.6/site-packages/ruamel_yaml/scanner.py", line 239, in fetch_more_tokens
    return self.fetch_value()
  File "/data/anaconda3/lib/python3.6/site-packages/ruamel_yaml/scanner.py", line 598, in fetch_value
    self.get_mark())
ruamel_yaml.scanner.ScannerError: mapping values are not allowed here
  in "<unicode string>", line 323, column 24:
      <!-- blob contrib key: blob_contributors:v21:0b8a2db9 ...

Regards,

Thierry
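
A likely cause, judging from the HTML comment in the last line of the traceback, is that the `/blob/` URL returns GitHub's HTML page for the file rather than the YAML itself. The sketch below converts such a page URL into the raw-content form; `github_raw_url` is a hypothetical helper name.

```python
def github_raw_url(blob_url):
    """Convert a github.com 'blob' page URL to its raw-content URL."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError("not a GitHub blob URL: " + blob_url)
    # "owner/repo" comes before "/blob/", "branch/path/to/file" after it.
    owner_repo, path = blob_url[len(prefix):].split("/blob/", 1)
    return "https://raw.githubusercontent.com/" + owner_repo + "/" + path
```

Downloading the resulting URL with wget should yield a file that starts with YAML rather than HTML markup.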

error when using GTDB mapping file in create_db

Hey @timkahlke

I am using a custom mapping database from GTDB in the format specified in the wiki. But when I use it with the create_db command, it gives the following error:

mapping file-
accession accession.version taxid gi
GCA007129655 GCA007129655.1 2022
GCF000979455 GCF000979455.1 669
GCA007280465 GCA007280465.1 3546
GCF000025865 GCF000025865.1 4262
GCA004525545 GCA004525545.1 3017
GCA007118145 GCA007118145.1 2546

code used-
/home/j/jigyasa-arora/local/BASTA/bin/basta create_db accession2taxid.tsv prot_mapping.db 0 2

error-
Creating database

[BASTA STATUS] Reading mapping file
This might take a while, please be patient ...

Traceback (most recent call last):
File "/home/j/jigyasa-arora/local/BASTA/bin/basta", line 115, in <module>
main.run_basta(args)
File "/home/j/jigyasa-arora/.local/lib/python3.7/site-packages/BASTA-1.3.2.3-py3.7.egg/basta/BastaMain.py", line 86, in run_basta
self._basta_create_db(args)
File "/home/j/jigyasa-arora/.local/lib/python3.7/site-packages/BASTA-1.3.2.3-py3.7.egg/basta/BastaMain.py", line 171, in _basta_create_db
dbutils.create_db(args.directory,args.input,args.output,args.key,args.value)
File "/home/j/jigyasa-arora/.local/lib/python3.7/site-packages/BASTA-1.3.2.3-py3.7.egg/basta/DBUtils.py", line 67, in create_db
lookup.put(ls[i1],ls[i2])
TypeError: Argument 'key' has incorrect type (expected bytes, got str)
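
The error message says the database `put` call expects bytes, while a file opened in text mode under Python 3 yields str. A minimal sketch of the conversion, assuming a tab-separated mapping file; `to_bytes` and `mapping_pairs` are hypothetical helpers, not BASTA functions:

```python
def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding str input (hypothetical helper)."""
    if isinstance(value, bytes):
        return value
    return str(value).encode(encoding)


def mapping_pairs(lines, key_col, value_col):
    """Yield (key, value) byte pairs from tab-separated mapping lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(key_col, value_col):
            continue  # skip short or empty lines
        yield to_bytes(fields[key_col]), to_bytes(fields[value_col])
```

Each yielded pair could then be handed to a call such as `lookup.put(key, value)` without triggering the type check.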

create_db

Custom db doesn't have ending on database directory

INDEX ERROR

Hi,
I ran blastn online, downloaded the hits table as a CSV and converted it to a tab-delimited file. This is supposed to be the required input for basta. However, I still get the following error:

#
# INDEX ERROR WHILE CHECKING e-value, alingment length OR percent  identity!!!.
# Are you sure that your input file has the correct format?
# (For details check https://github.com/timkahlke/BASTA/wiki/3.-BASTA-Usage#input-file-format)
#
#####

Please advise on how to fix this.
Thanks,
Ilya.
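
A quick way to investigate is to check the file against the standard 12-column tabular BLAST layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The sketch below assumes that layout is what BASTA's default configuration expects; `find_bad_lines` is a hypothetical helper.

```python
def find_bad_lines(path, limit=10):
    """Return (line_number, reason) for lines that do not look like tabular BLAST output."""
    problems = []
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) != 12:
                problems.append((number, "expected 12 columns, found %d" % len(fields)))
            else:
                try:
                    float(fields[2])   # percent identity
                    int(fields[3])     # alignment length
                    float(fields[10])  # e-value
                except ValueError:
                    problems.append((number, "non-numeric identity, length or e-value"))
            if len(problems) >= limit:
                break
    return problems
```

Header rows, leftover commas from the CSV conversion, or extra columns from the web download would all show up in the returned list.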

Any perspective for BASTA python 3?

Updates on Python 3 version?

Hi @timkahlke
Any updates on when a Python 3 version of BASTA will be available? I am hoping to use BASTA within a computational pipeline and would like to avoid having to replace BASTA when Python 2 becomes unsupported by my university's HPC.
Thanks!

Species level not reached

Dear @timkahlke,
I've been trying out BASTA on simulated data; however, I can never get down to the species level.
Here is an example of my blast output:

tmp19	NC_029448.1	91.67	48	4	0	53	100	9950	9997	2e-10	67.6
tmp19	NC_029330.1	91.30	46	4	0	54	99	10854	10899	3e-09	63.9
tmp19	NC_023799.1	91.30	46	4	0	54	99	9948	9993	3e-09	63.9
tmp19	NC_022507.1	90.00	50	4	1	51	100	9961	10009	3e-09	63.9
tmp20	NC_035317.1	100.00	100	0	0	1	100	60015	60114	5e-46	185
tmp21	NC_035995.1	100.00	100	0	0	1	100	24700	24799	5e-46	185
tmp21	NC_029485.1	100.00	100	0	0	1	100	23785	23884	5e-46	185
tmp21	NC_028523.1	100.00	100	0	0	1	100	24181	24280	5e-46	185

For the sequence tmp20, there is only one hit, so I should be able to go down to the species level, since the full taxonomic lineage is known for NC_035317.1.
However, BASTA only goes to the genus level:

tmp20	Eukaryota;Streptophyta;Liliopsida;Alismatales;Hydrocharitaceae;Stratiotes;

Here is the basta command line I used:

basta sequence blast_results_100.out basta_results_100.out gb -m 1 -n 10 -i 99

strip() error

Strip on empty taxa throws error

create_db

When adding a new mapping db, files have to be in /taxonomy (only the name is needed, the path is ignored), and only gzipped files are accepted.

basta taxonomy problem: a loop of re-downloading - md5 sum mismatching

Hi,
could you please help me with basta taxonomy problem?
The taxdump.tar.gz gets downloaded but the md5 sum does not match so the file is re-downloaded... resulting in a never-ending loop of re-downloading and md5 sum mismatching...
I don't know if the problem is with me or with NCBI.
I would appreciate any advice on that.
With best regards,
Dasa

No mappings found

Hi,

I've been trying to use basta on output from DIAMOND. I believe my DIAMOND results are in the format that is the default for basta (-outfmt 6), and the accessions I'm finding (via grep) are in the prot.accession2taxid.FULL that was used to generate my database. However, I am not getting any taxa names when I run basta.

Here is an example line for my basta input:
M01019:41:000000000-A5RV8:1:1114:11342:10949 MBS1567671.1 92.2 51 4 0 2 154 139 189 1.43e-26 107

When I use basta, I then get "No mapping found for MBS1567671" and the resulting output file has everything as "unknown".
It looks like for some reason basta is ignoring the ".1", so although it should search for MBS1567671.1 it's searching for MBS1567671. Both my diamond and basta databases were generated using the same version of the prot.accession2taxid.FULL.

It looks like this is similar to a previously posted issue: #11. I re-ran with -v and the file just has all my query sequence names in this format:

###M01019:41:000000000-A5RV8:1:1101:14530:2789


###M01019:41:000000000-A5RV8:1:1101:12152:2947

Any idea what I might need to do to fix this?
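
One way to narrow this down is to test which form of the accession the mapping was keyed on, with or without the version suffix. The sketch below is an assumption-laden illustration: `mapping` stands for a plain dict loaded from a slice of the mapping file, and `find_taxid` is a hypothetical helper, not BASTA's lookup code.

```python
def find_taxid(mapping, accession):
    """Look up accession, falling back to the form without a version suffix."""
    candidates = [accession]
    base, dot, version = accession.rpartition(".")
    if dot and version.isdigit():
        candidates.append(base)  # e.g. "MBS1567671.1" -> "MBS1567671"
    for key in candidates:
        if key in mapping:
            return key, mapping[key]
    return None, None
```

If only one of the two forms is ever found, the key column chosen when the mapping database was created probably does not match the form that is being searched for.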

How do I specify a custom database directory?

I downloaded and created an NCBI taxonomy database with the "-d" option. When I run "basta sequence $INPUT_FILE $OUTPUT_FILE gb", I get the warning "# [BASTA ERROR] No database gb_mapping.db found in /home/XX/.basta/taxonomy. Did you forget to create the specified database or was it a typo?". How do I specify a custom database directory?

BASTA sequence output

Hello @timkahlke, I have a question related to the basta sequence output.

If none of the hits from a query meet the criteria defined by the basta sequence arguments, will this query ID be present in the output as "Unknown"?

I have a DIAMOND output containing hits from 274,861,379 queries. I expected an output containing a line for each query, but my output has 168,316,701 lines. Thus, my Krona chart displays a wrong percentage of Unknown sequences.
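
To see which queries are absent, the first columns of the two files can be compared. A minimal sketch, assuming both files are tab-separated with the query ID in the first column; `query_ids` and `missing_queries` are hypothetical helpers, and holding hundreds of millions of IDs in a set needs a lot of memory, so testing on a subset first is advisable:

```python
def query_ids(path):
    """Collect the distinct first-column IDs of a tab-separated file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                ids.add(line.split("\t", 1)[0].strip())
    return ids


def missing_queries(diamond_path, basta_path):
    """Return query IDs present in the DIAMOND output but not in the BASTA output."""
    return query_ids(diamond_path) - query_ids(basta_path)
```

The size of the returned set is the number of sequences that would have to be added to the Unknown fraction for the Krona percentages to add up.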

running Basta on already downloaded mapping file

Hey @timkahlke

I followed the tutorial on how to create a database on already downloaded NCBI mapping file https://github.com/timkahlke/BASTA/wiki/2.-Initial-Setup#1-download-ncbi-sequence-databases.

The steps I am running-
$wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz #custom for nr database
$gunzip prot.accession2taxid.gz
$/work/BASTA-1.3.2.3/bin/basta create_db prot.accession2taxid prot_mapping.db 1 2
#create_db, using second column (1)->to get new accession ids (eg-WP_090162531.1) , and third column (2)-to get taxids

$/work/BASTA-1.3.2.3/bin/basta sequence basta-COG0552.txt basta-result-COG0552.txt prot
#running the basta on blast output

When I try the new database on my blast output, I get an empty result. Whereas just "grepping" the accession number to the prot.accession2taxid gives an output.

Where am I going wrong? (Using basta on uniref90 database works though)

Comments regarding arg parsing

Hi Tim,

I got everything up and running, now I'm working on figuring out the best settings for my particular dataset. I didn't run into any additional issues after the db setup steps.

Here are some unsolicited tips that I found while working with python that might help you out, as I've written a few tools and have had to go through a similar learning process as you have here. Not sure if you're still working with python regularly but I digress.

At any rate, I noticed that you take some flags and convert them to bool using argparse. This is generally not recommended:

https://docs.python.org/3/library/argparse.html#type and search for bool, you'll see the relevant section.

If you do want True/False values, you'd generally use action='store_true', or if you want --foo and --no-foo as options, you can use action=argparse.BooleanOptionalAction (available since Python 3.9).

I'm sure the vast majority of the time this is a non-issue, but people might be confused if they have to run basta --quiet True to turn on quiet, and then try basta --quiet False and see that it's still quiet. You can confirm this yourself easily (and I'm sure logically you know this already):

$ python -c 'print(bool("False"))'
True

Lastly, for future projects where you have multiple subparsers that share options, you can set them up as I have here:

https://github.com/davised/get_assemblies/blob/main/get_assemblies/__main__.py#L234 and scroll to the for p in all_p: line.

and iterate over them to add the options to each command in a loop. That way you reduce the redundant code and copy/paste errors when you want to change something (like the bool thing above, only having to change it one place). There may be even other, better ways to handle the subparsers, but this method has worked for me for several different projects.
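
Both suggestions can be sketched in a few lines. The program name `demo` is a placeholder, and the subcommands merely borrow the names `sequence` and `single`; this is not BASTA's actual parser:

```python
import argparse


def build_parser():
    """Build a parser whose subcommands share a --quiet flag (sketch)."""
    parser = argparse.ArgumentParser(prog="demo")
    subparsers = parser.add_subparsers(dest="command")
    sequence = subparsers.add_parser("sequence")
    single = subparsers.add_parser("single")
    # Options shared by several subcommands are added once, in a loop.
    for sub in (sequence, single):
        # store_true: False unless the flag is given; no "True"/"False" strings.
        sub.add_argument("--quiet", action="store_true")
    return parser
```

With this setup `demo sequence --quiet` turns quiet mode on and omitting the flag leaves it off, so there is no string-to-bool conversion to trip over.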

Cheers,

Single has no output file

Multi error

When trying multi get an error

Modify basta2krona.py function (suggestion)

basta2krona.py fails to parse basta output, as there are single-column rows with ### and empty lines.

Existing code:

def _parseBASTA(bf):

    counts = {}
    with open(bf,"r") as f:
        for line in f:
            ls = line.split("\t")
            try:
                counts[ls[1]] += 1 
            except KeyError:
                counts[ls[1]] = 1
    return counts

Proposed code:

def _parseBASTA(bf):

    counts = {}
    with open(bf, "r") as f:
        for line in f:
            if  len(line.strip())!=0 and not line.startswith("###"):
                ls = line.split("\t")
                try:
                    counts[ls[1]] += 1 
                except KeyError:
                    counts[ls[1]] = 1
    return counts

Test data:

###contig_764080
23	Eukaryota;
23	Eukaryota;Arthropoda;
23	Eukaryota;Arthropoda;Insecta;
23	Eukaryota;Arthropoda;Insecta;Lepidoptera;
23	Eukaryota;Arthropoda;Insecta;Lepidoptera;Bombycidae;
23	Eukaryota;Arthropoda;Insecta;Lepidoptera;Bombycidae;Bombyx;
23	Eukaryota;Arthropoda;Insecta;Lepidoptera;Bombycidae;Bombyx;Bombyx_mori;


###contig_765902
1	Bacteria;
1	Bacteria;Proteobacteria;
1	Bacteria;Proteobacteria;Betaproteobacteria;
1	Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;
1	Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Oxalobacteraceae;
1	Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Oxalobacteraceae;Candidatus_Zinderia;
1	Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Oxalobacteraceae;Candidatus_Zinderia;Candidatus_Zinderia_insecticola;

Index will be out of bounds with existing code.

How can I create the complete_taxa.db file using basta create_db?

Dear man,
I have already downloaded the taxdump files. When I run basta sequence, it says I should create complete_taxa.db. What do I need to do to create complete_taxa.db? I found that an input file is needed, but I don't know which file to use as the input file.

taxonomy classification of assembled genome bins

Hi, developer,

Thanks for developing this amazing software. May I please know whether it can be used for taxonomy classification of a de novo assembled genome bin? Thank you!

UniProt database doesn't match Uniref90.fasta headers

Hey @timkahlke

I am running an LCA search for diamond blast output against the Uniref90 database using BASTA.
There are two things that I want to ask-

I first downloaded the NCBI protein database, and then the Uniprot database. But I got confused and deleted the prot.accession2taxid.gz file. The other mapping file and folder it generated were still intact before I downloaded the Uniprot database.
Would the absence of the prot.accession2taxid.gz file affect the BASTA search against the Uniprot database?
After I got a no-match against the Uniprot database, I grepped some Uniref90 IDs from my blast output to idmapping_selected.tab.gz, and I couldn't find common IDs.
Could you suggest a reason for that?
I downloaded the latest version from https://www.uniprot.org/downloads

Basta2Krona output html file is empty. (help)

Hi ! Thank you for the reply. I tried generating Krona chart with the script and basta output. Krona (output) html is empty.

 $ head Basta_output.tsv 

contig_48	Unknown	Eukaryota;Arthropoda;Insecta;Coleoptera;Carabidae;Amara;Amara_sp._KAO-2002;
contig_65	Unknown	Eukaryota;Arthropoda;Insecta;Coleoptera;Carabidae;Amara;Amara_alpina;
contig_117	Unknown	Eukaryota;Arthropoda;Insecta;Hymenoptera;Vespidae;Vespula;Vespula_pensylvanica;
contig_130	Unknown	Unknown
contig_214	Unknown	Viruses;unknown;unknown;unknown;Polydnaviridae;Bracovirus;Cotesia_sesamiae_bracovirus;
contig_375	Unknown	Eukaryota;Arthropoda;Insecta;Coleoptera;Carabidae;Zabrus;Zabrus_ignavus;
contig_408	Viruses;Phixviricota;Malgrandaviricetes;Petitvirales;Microviridae;Sinsheimervirus;Escherichia_virus_phiX174;	Viruses;Phixviricota;Malgrandaviricetes;Petitvirales;Microviridae;Sinsheimervirus;Escherichia_virus_phiX174;
contig_565	Unknown	Eukaryota;Arthropoda;Insecta;Coleoptera;Carabidae;Amara;Amara_alpina;
contig_597	Unknown	Eukaryota;Arthropoda;Insecta;Coleoptera;Zopheridae;Verodes;Verodes_sp._nov._C_ER-2011;
contig_619	Unknown	Eukaryota;Arthropoda;Insecta;Lepidoptera;Bombycidae;Bombyx;Bombyx_mori;

Command:

$ python3 basta2krona.py Basta_output.tsv Krona.html

I would appreciate if you could share an example working output so that I can troubleshoot output I have. Link to download the output file I have : https://docs.google.com/spreadsheets/d/1gIrihuvNo2mV3X0JgQGsKaCuiPd4Xp2IyAKZfLYpY6Q/edit?usp=sharing. Link contains a tsv file from personal gmail account.

Which mapping file for NR database

Hi,

If I use the NR database, which mapping file should I download?

Thanks

Warning messages for mappings not found

Hi @timkahlke,

Is there some argument to hide the warning messages for mappings not found?

The printing of these warnings increases the time needed to finish the process (basta sequence).

Add LCA count

Hello @timkahlke ,
Right now it is possible to get a count of hits per DB reference sequence (with the verbose flag), but it is not possible to get a count of hits per LCA. Would it be doable to add it?

Example:

###k79_72
11      Eukaryota;
11      Eukaryota;Streptophyta;
11      Eukaryota;Streptophyta;unknown;
11      Eukaryota;Streptophyta;unknown;Caryophyllales;
11      Eukaryota;Streptophyta;unknown;Caryophyllales;Chenopodiaceae;
9       Eukaryota;Streptophyta;unknown;Caryophyllales;Chenopodiaceae;Beta;
8       Eukaryota;Streptophyta;unknown;Caryophyllales;Chenopodiaceae;Beta;Beta_vulgaris;
8       Eukaryota;Streptophyta;unknown;Caryophyllales;Chenopodiaceae;Beta;Beta_vulgaris;
;
1       Eukaryota;Streptophyta;unknown;Caryophyllales;Chenopodiaceae;Beta;Beta_macrocarpa;
1       Eukaryota;Streptophyta;unknown;Caryophyllales;Chenopodiaceae;Beta;Beta_macrocarpa;
;
1       Eukaryota;Streptophyta;unknown;Caryophyllales;Chenopodiaceae;Chenopodium;
1       Eukaryota;Streptophyta;unknown;Caryophyllales;Chenopodiaceae;Chenopodium;Chenopodium_quinoa;
1       Eukaryota;Streptophyta;unknown;Caryophyllales;Chenopodiaceae;Chenopodium;Chenopodium_quinoa;
;
1       Eukaryota;Streptophyta;unknown;Caryophyllales;Chenopodiaceae;Spinacia;
1       Eukaryota;Streptophyta;unknown;Caryophyllales;Chenopodiaceae;Spinacia;Spinacia_oleracea;
1       Eukaryota;Streptophyta;unknown;Caryophyllales;Chenopodiaceae;Spinacia;Spinacia_oleracea;
;

Here the LCA would be Chenopodiaceae with 11 hits

What I'm doing on my side is combining the "regular" output of BASTA (which gives the LCA) with the "verbose" output (which gives the counts) to get a count per LCA, but that's not really clean since I'm parsing both of these output files...
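
The workaround described here could look roughly like the following sketch. It assumes the verbose layout shown in the example (a `###query` line followed by `count  taxon_path` lines) and that the LCA paths from the regular output have already been read into a dict; `lca_counts` is a hypothetical helper, not a BASTA function.

```python
def lca_counts(verbose_lines, lcas):
    """Map each query to the hit count of its LCA path.

    verbose_lines: lines of the verbose output; lcas: {query_id: lca_path}.
    """
    counts = {}
    query = None
    for line in verbose_lines:
        line = line.strip()
        if line.startswith("###"):
            query = line[3:]
            continue
        parts = line.split(None, 1)  # count, then taxon path
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # blank lines and lone ";" separators
        if query in lcas and parts[1] == lcas[query]:
            counts[query] = int(parts[0])
    return counts
```

For the example above, a query `k79_72` whose LCA path ends at Chenopodiaceae would be reported with a count of 11.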

Receiving: BASTA WARNING No taxon found with custom mapping file

Hello,

I am trying to use BASTA on some DIAMOND tsv output. When I run basta sequence, however, I get a warning # [BASTA WARNING] No taxon found for 1346611 and the basta output file contains only "unknown" annotations.

My DIAMOND output looks as follows:
T070O:00666:05788\tWP_005933513.1_853\t78.3\t69\t15\t0\t208\t2\t1\t69\t1.7e-24\t119.0
I have created a custom config file to indicate that the first column is the read_id, the second is the annotation_id, etc.

WP_005933513.1_853 is basically the accession number and the taxon id concatenated by a "_", but in order to perform the mapping with BASTA I have created a mapping file, mapping each annotation_id to the corresponding taxon_id like this:
WP_005933513.1_853\t853
I have created a mapping db of that file using basta create_db.

I think that I have done everything correctly, but BASTA is unable to find the taxon even though I can find the taxon in the mapping file and in the taxdump I downloaded, using grep. I'm out of ideas...

database creation issue for uniprot e.g. type uni

Hi @timkahlke
I was running basta download -d /mnt/Indices/genomes/basta/taxonomy/ uni to download uniprot database. It is downloading the idmapping_selected.tab and creating the database with name "prot_mapping.db".

After this, I am running basta sequence dataset_19472.txt dataset_19493.txt uni --directory /mnt/Indices/genomes/basta/taxonomy/

[BASTA ERROR] No database uni_mapping.db found in /mnt/Indices/genomes/basta/taxonomy/. Did you forget to create the specified.

Please let me know the reason for the failure.

Buggy output

Dear Tim,

when running the basta commands (from a cloned git repository in the env's bin folder) I get the following buggy output when downloading the taxdump and database association files:
Traceback (most recent call last):

File "./basta", line 117, in <module>
main.run_basta(args)
File "./../basta/BastaMain.py", line 71, in run_basta
self._basta_download(args)
File "./../basta/BastaMain.py", line 147, in _basta_download
dutils.down_and_check(args.ftp,map_file,args.directory)
File "./../basta/DownloadUtils.py", line 56, in down_and_check
self.down(ftp,md5,out_dir)
NameError: global name 'self' is not defined

The files are downloaded, however.

When running a primary trial analysis a full blown error appears:
$/data/anaconda3/envs/basta/bin/BASTA/bin/basta sequence ./data/assembly-nt.tab ./results/assembly-lca.test gb

Traceback (most recent call last):
File "/data/anaconda3/envs/basta/bin/BASTA/bin/basta", line 9, in <module>
from basta import BastaMain as bm
File "/data/anaconda3/envs/basta/bin/BASTA/bin/../basta/BastaMain.py", line 6, in <module>
import plyvel
ModuleNotFoundError: No module named 'plyvel'

The lacking module has been installed.
$ conda list

...
plyvel 0.8 py27_0 bnoon
...

I hope you can help.

Kind regards,

Thierry

[BASTA ERROR] Couldn't find complete_taxa.db in /work/uniprot Did you run initial 'basta download'?

Hi!
I am trying to assign taxonomy to DIAMOND blastp results using basta.
I downloaded the protein (prot) database using basta download prot.
In the database directory I have these two files: prot.accession2taxid.gz.md5, prot.accession2taxid.gz and the
prot_mapping.db folder.

I keep getting an error; it looks like basta is looking for a db called complete_taxa.db.
Is there something that I am missing in my command?

thank you

Cannot run parallel jobs

Hey @timkahlke

I am working on the taxonomy of multiple metagenomes, and I was wondering if it's possible to parallelize the job?

When I try to do so, I get an error even though I am using different databases for two different runs-

Traceback (most recent call last):
File "/work/student/jigyasa-arora/BASTA-1.3.2.3/bin/basta", line 115, in <module>
main.run_basta(args)
File "/home/j/jigyasa-arora/.local/lib/python2.7/site-packages/BASTA-1.3.2.2-py2.7.egg/basta/BastaMain.py", line 80, in run_basta
self._basta_multiple(args)
File "/home/j/jigyasa-arora/.local/lib/python2.7/site-packages/BASTA-1.3.2.2-py2.7.egg/basta/BastaMain.py", line 121, in _basta_multiple
assigner._assign_multiple(args.blast,db_file,args.best_hit)
File "/home/j/jigyasa-arora/.local/lib/python2.7/site-packages/BASTA-1.3.2.2-py2.7.egg/basta/AssignTaxonomy.py", line 101, in _assign_multiple
lca = self._assign_single(os.path.join(blast_dir,bf),db_file,best)
File "/home/j/jigyasa-arora/.local/lib/python2.7/site-packages/BASTA-1.3.2.2-py2.7.egg/basta/AssignTaxonomy.py", line 79, in _assign_single
(tax_lookup, map_lookup) = self._get_lookups(db_file)
File "/home/j/jigyasa-arora/.local/lib/python2.7/site-packages/BASTA-1.3.2.2-py2.7.egg/basta/AssignTaxonomy.py", line 108, in _get_lookups
tax_lookup = db._init_db(os.path.join(self.directory,"complete_taxa.db"))
File "/home/j/jigyasa-arora/.local/lib/python2.7/site-packages/BASTA-1.3.2.2-py2.7.egg/basta/DBUtils.py", line 75, in _init_db
lookup = plyvel.DB(os.path.abspath(db))
File "plyvel/_plyvel.pyx", line 247, in plyvel._plyvel.DB.__init__
File "plyvel/_plyvel.pyx", line 88, in plyvel._plyvel.raise_for_status
plyvel._plyvel.IOError: IO error: lock /home/j/jigyasa-arora/.basta/taxonomy/complete_taxa.db/LOCK: Resource temporarily unavailable
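
The LOCK error arises because a LevelDB database can be opened by only one process at a time, and both runs open the same complete_taxa.db even when their mapping databases differ. A possible workaround, sketched under the assumption that there is enough disk space, is to give every job its own copy of the taxonomy directory and point BASTA at it with the -d option. `private_db_copy` is a hypothetical helper.

```python
import os
import shutil
import tempfile


def private_db_copy(db_dir):
    """Copy db_dir to a fresh temporary directory and return the new path."""
    parent = tempfile.mkdtemp(prefix="basta_db_")
    # copytree needs a destination that does not exist yet.
    return shutil.copytree(db_dir, os.path.join(parent, "taxonomy"))
```

Each parallel job would call this once, run BASTA against the returned directory, and remove the copy when finished.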

How to create custom DataBase of de novo assembly?

I assembled a genome de novo from fastq files and want to remove organelle genomes (mitochondria, chloroplasts, etc.) and plasmid genomes. How should I set up a custom database of organelle and plasmid genomes? The organelle and plasmid genomes were also assembled de novo.

verbose file doesn't update

Check for existing file and, if already there, warn or remove
