jolespin / veba

A modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes

License: GNU Affero General Public License v3.0

Shell 2.62% Python 97.08% R 0.08% Dockerfile 0.09% Rich Text Format 0.13%
bioinformatics genomics metagenomics metatranscriptomics microbial microbiome

veba's People

Contributors

jolespin

veba's Issues

[Bug] IndexError: Replacement index 1 out of range for positional args tuple when running `binning-eukaryotic.py`

Describe the bug:
MetaEuk runs to completion but the compile_metaeuk_identifiers.py script throws an error when there are duplicate gene identifiers.

Versions

v1.0.3

Command used to produce error:
Steps to reproduce the behavior

(VEBA-binning-eukaryotic_env) compile_metaeuk_identifiers.py --cds veba_output/binning/eukaryotic/G00009_30-PE-D703D505-1_S109/intermediate/2__metaeuk/metaeuk.codon.fas --protein veba_output/binning/eukaryotic/G00009_30-PE-D703D505-1_S109/intermediate/2__metaeuk/metaeuk.fas -o veba_output/binning/eukaryotic/G00009_30-PE-D703D505-1_S109/intermediate/2__metaeuk -b gene_models
Parsing MetaEuk headers: 100%|███████████████████████████████████████████████████████████████████████████████| 4124/4124 [00:00<00:00, 143228.78 gene/s]
Traceback (most recent call last):
  File "/expanse/projects/jcl110/anaconda3/envs/VEBA-binning-eukaryotic_env/bin/compile_metaeuk_identifiers.py", line 353, in <module>
    main()
  File "/expanse/projects/jcl110/anaconda3/envs/VEBA-binning-eukaryotic_env/bin/compile_metaeuk_identifiers.py", line 227, in main
    print("There are {} duplicate gene identifiers:{}".format((geneid_value_counts > 1).sum()), file=sys.stderr)
IndexError: Replacement index 1 out of range for positional args tuple
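The traceback points at the print call itself: the format string has two {} placeholders but .format() receives only one argument. A minimal sketch of a possible fix, assuming geneid_value_counts is a pandas value_counts Series as the variable name suggests (the toy data here is illustrative):

import sys
import pandas as pd

# Toy stand-in for the gene identifiers parsed from the MetaEuk headers;
# geneid_value_counts is assumed to be a pandas value_counts Series.
geneid_value_counts = pd.Series(["g1", "g2", "g2", "g3", "g3"]).value_counts()

# The failing line passed ONE argument to a format string with TWO {} placeholders.
# Supplying both values (the count and the identifiers) avoids the IndexError:
n_duplicates = int((geneid_value_counts > 1).sum())
duplicate_ids = ", ".join(geneid_value_counts[geneid_value_counts > 1].index)
print("There are {} duplicate gene identifiers: {}".format(n_duplicates, duplicate_ids), file=sys.stderr)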

[Feature Request] Expanding Viral Identification & Analysis Toolset

Dear Josh,

I hope all is well. Erfan again!

I wanted to suggest expanding VEBA's viral binning toolkit. I specifically request this since you mentioned VEBA is there to speed up the ad hoc commands of meta-omics pipelines. Viral analysis is definitely one of the most troublesome steps, since multiple tools are needed to classify contigs.

I've attached my personal pipeline for viral contig identification:
[attachment: viral omics pipeline diagram]

  • I now use VirSorter2
  • DeepVirFinder is just a deep-network version of VirFinder, so it might be redundant

The assembly phase usually relies on more than one tool if you'd like to make a convincing argument about viral diversity (e.g., PPR-Meta is the only one that identifies phages with high accuracy). The output then needs integration to remove redundant sequences, pool contigs, perform consensus classification, and ultimately rename them for consistency and clarity (e.g., which software each came from), all of which I think VEBA is well placed to resolve.

I think this would give VEBA the edge in analyzing viral contigs in a well-structured manner, a field that I think is growing exponentially every day. I think even including just CheckV, VirSorter2, DeepVirFinder/VirFinder, and PPR-Meta should be more than enough.

Hope these help, and look forward to seeing VEBA grow.

Best,

Erfan

[Bug] different number of reads in paired input files; repair.sh does not seem to be working in preprocess.py

Describe the bug:
The preprocessing module is failing due to a different number of reads in the paired input files, either at step 2 (Bowtie2) or step 3 (BBDuk). I believe this is meant to be fixed automatically by repair.sh within the preprocess module, but it does not seem to be working. To get around the error, I ran repair.sh first and then VEBA's preprocess.py module.

Versions

-bash-4.2$ source activate VEBA-preprocess_env 
(VEBA-preprocess_env) -bash-4.2$ preprocess.py -v
preprocess.py v2023.2.28

Command used to produce error:

This occurs when running the preprocessing module.

Please provide the following files:
Two types of errors are occurring:

  1. Failing at step 2 (Bowtie2):
cat ../veba_output/preprocess/G_1_S97_L002/log/2__bowtie2.e
Error, fewer reads in file specified with -2 than in file specified with -1
terminate called after throwing an instance of 'int'
(ERR): bowtie2-align died with signal 6 (ABRT) (core dumped)
  2. Failing at step 3 (BBDuk):
cat ../veba_output/preprocess/G_2_S98_L002/log/3__bbduk.e
java -ea -Xmx85810m -Xms85810m -cp /home/yac027/mambaforge3/envs/VEBA-preprocess_env/opt/bbmap-39.01-0/current/ jgi.BBDuk zl=1 overwrite=t threads=4 in1=/panfs/yac027/VEBA_analyses/Tatsuya_shotgun/veba_output/preprocess/G_2_S98_L002/intermediate/2__bowtie2/cleaned_1.fastq.gz in2=/panfs/yac027/VEBA_analyses/Tatsuya_shotgun/veba_output/preprocess/G_2_S98_L002/intermediate/2__bowtie2/cleaned_2.fastq.gz ref=/panfs/yac027/VEBA_analyses/veba_database/Contamination/kmers/ribokmers.fa.gz k=31 minlen=75 out1=/panfs/yac027/VEBA_analyses/Tatsuya_shotgun/veba_output/preprocess/G_2_S98_L002/intermediate/3__bbduk/non-kmer_hits_1.fastq.gz out2=/panfs/yac027/VEBA_analyses/Tatsuya_shotgun/veba_output/preprocess/G_2_S98_L002/intermediate/3__bbduk/non-kmer_hits_2.fastq.gz outm1=/panfs/yac027/VEBA_analyses/Tatsuya_shotgun/veba_output/preprocess/G_2_S98_L002/intermediate/3__bbduk/kmer_hits_1.fastq.gz outm2=/panfs/yac027/VEBA_analyses/Tatsuya_shotgun/veba_output/preprocess/G_2_S98_L002/intermediate/3__bbduk/kmer_hits_2.fastq.gz
Executing jgi.BBDuk [zl=1, overwrite=t, threads=4, in1=/panfs/yac027/VEBA_analyses/Tatsuya_shotgun/veba_output/preprocess/G_2_S98_L002/intermediate/2__bowtie2/cleaned_1.fastq.gz, in2=/panfs/yac027/VEBA_analyses/Tatsuya_shotgun/veba_output/preprocess/G_2_S98_L002/intermediate/2__bowtie2/cleaned_2.fastq.gz, ref=/panfs/yac027/VEBA_analyses/veba_database/Contamination/kmers/ribokmers.fa.gz, k=31, minlen=75, out1=/panfs/yac027/VEBA_analyses/Tatsuya_shotgun/veba_output/preprocess/G_2_S98_L002/intermediate/3__bbduk/non-kmer_hits_1.fastq.gz, out2=/panfs/yac027/VEBA_analyses/Tatsuya_shotgun/veba_output/preprocess/G_2_S98_L002/intermediate/3__bbduk/non-kmer_hits_2.fastq.gz, outm1=/panfs/yac027/VEBA_analyses/Tatsuya_shotgun/veba_output/preprocess/G_2_S98_L002/intermediate/3__bbduk/kmer_hits_1.fastq.gz, outm2=/panfs/yac027/VEBA_analyses/Tatsuya_shotgun/veba_output/preprocess/G_2_S98_L002/intermediate/3__bbduk/kmer_hits_2.fastq.gz]
Version 39.01

Set threads to 4
0.046 seconds.
Initial:
Memory: max=89992m, total=89992m, free=89925m, used=67m

Added 12672418 kmers; time: 	5.325 seconds.
Memory: max=89992m, total=89992m, free=88785m, used=1207m

Input is being processed as paired
Started output streams:	0.403 seconds.
[E::bgzf_read_block] [E::bgzf_read_block] Invalid BGZF header at offset 210118798Invalid BGZF header at offset 202178030

[E::bgzf_read] [E::bgzf_read] Read block operation failed with error 6 after 23296 of 65536 bytesRead block operation failed with error 6 after 38912 of 65536 bytes

Error 7 in block starting at offset 209941143(C837297)
Error 7 in block starting at offset 202013184(C0A7A00)
java.lang.AssertionError: 
There appear to be different numbers of reads in the paired input files.
The pairing may have been corrupted by an upstream process. It may be fixable by running repair.sh.
	at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:499)
	at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:364)
	at stream.ConcurrentGenericReadInputStream.run0(ConcurrentGenericReadInputStream.java:208)
	at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:184)
	at java.base/java.lang.Thread.run(Thread.java:833)

Also, please provide the input files used so I can reproduce the error and further diagnose (I'll be able to help you better). In the meantime, I am currently working on more useful error logs so patience is appreciated.
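Until the root cause is fixed, pairing can also be sanity-checked before preprocess.py by counting records in both files; a minimal sketch in plain Python (the paths are placeholders):

import gzip

def count_fastq_records(path):
    # FASTQ stores one record per four lines, so count lines and divide by 4.
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4

r1 = "reads_1.fastq.gz"  # placeholder paths
r2 = "reads_2.fastq.gz"

n1, n2 = count_fastq_records(r1), count_fastq_records(r2)
print("R1: {} reads | R2: {} reads".format(n1, n2))
if n1 != n2:
    print("Pairing is broken; run repair.sh before preprocess.py")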

How do I start with pre-filtered data ready to assemble?

Hello,
I will start by saying that I do not have much experience with coding, and I apologize if my question is very basic.
I have data that has been filtered and is ready to assemble. I have only two files, 05_350.nohost.fq1.gz (forward) and 05_350.nohost.fq2.gz (reverse). I tried to run the following:

R1=/nohost/05_350.nohost.fq1.gz
R2=/nohost/05_350.nohost.fq2.gz
OUT_DIR= veba_output/assembly

CMD="source activate VEBA && veba --module assembly --params "-1 ${R1} -2 ${R2} -o ${OUT_DIR} -p ${N_JOBS} -P metaspades.py""

I did not get any errors, but simply nothing happened. I am not sure what I am doing wrong. Please help.

Thank you!

VEBA-binning-prokaryotic module - step 7_dastool error: /bin/sh: 1: Syntax error: redirection unexpected

Hi Josh,

Whilst running the VEBA-binning-prokaryotic module, I get this error at step 7_dastool - /bin/sh: 1: Syntax error: redirection unexpected.

Not sure if this is an issue with my setup or with the script running DAS Tool?

Here's a link to the scaffolds.fasta and mapped.sorted.bam files from sample SRR17458603: https://www.dropbox.com/sh/fcjjkcojsvthh2a/AABHfFUupdUbdj7y7VSHeAuJa?dl=0

Cheers

[Bug] classify-prokaryotic.py excludes archaeal genomes (Downstream Issue in building custom HUMAnN db)

Describe the bug:
When trying to perform pathway profiling, I can't complete it because the number of genomes in the taxonomy does not match the number of genomes in the identifier mapping. On further review, I found that the missing genomes are binned as prokaryotes but don't get classified by GTDB-Tk. How can I continue the pathway profiling?

Versions

(VEBA-profile_env) [ec2-user@ip-172-31-82-237 workdir]$ profile-pathway.py -v
profile-pathway.py v2023.10.16

Command used to produce error:

zcat veba_output/cluster/output/global/identifier_mapping.proteins.tsv.gz  | cut -f1,3 | tail -n +2 | compile_custom_humann_database_from_annotations.py -a veba_output/annotation/output/annotations.proteins.tsv.gz -s veba_output/misc/all_genomes.all_proteins.lt100k.faa -o veba_output/misc/humann_uniref_annotations.tsv -t veba_output/classify/taxonomy_classifications.tsv
--identifier_mapping <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
 * 2034593 proteins
 * 663 genomes
--annotations veba_output/annotation/output/annotations.proteins.tsv.gz
 * 1890856 proteins
 * 1007210 UniRef hits
--taxonomy veba_output/classify/taxonomy_classifications.tsv
 * 659 genomes
 * 411 taxonomic classifications
Calculating length of proteins: veba_output/misc/all_genomes.all_proteins.lt100k.faa: 2034593 Proteins [00:06, 323460.54 Proteins/s]
Traceback (most recent call last):
  File "/home/ec2-user/veba_efs/mambaforge/envs/VEBA-profile_env/bin/compile_custom_humann_database_from_annotations.py", line 119, in <module>
    main()
  File "/home/ec2-user/veba_efs/mambaforge/envs/VEBA-profile_env/bin/compile_custom_humann_database_from_annotations.py", line 95, in main
    assert A2 == C1, "Genomes in --identifier_mapping do not match genomes in --taxonomy.\n\nThe following genomes are specific to --identifier_mapping: {}\n\nThe following genomes are specific to --taxonomy".format("\n".join(A2 - C1), "\n".join(C1 - A2))
AssertionError: Genomes in --identifier_mapping do not match genomes in --taxonomy.


The following genomes are specific to --identifier_mapping: P5_S2_R__METABAT2__P.1__bin.20
M7_S10_R__CONCOCT__P.1__168
M10_1_S12_R__CONCOCT__P.1__46_sub
P10_S5_R__METABAT2__P.1__bin.28_sub

The following genomes are specific to --taxonomy

Please provide the following files:

1__gtdbtk.e.txt
1__gtdbtk.o.txt
1__gtdbtk.returncode.txt
P5_S2_R_bins.list.txt
taxonomy.tsv.txt

Thanks in advance
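To see exactly which genomes differ before re-running, the two genome sets can be diffed straight from the files; a minimal sketch, where the column positions are assumptions based on the cut -f1,3 usage in the command above:

import gzip

# Genomes referenced in the protein identifier mapping (genome id assumed to be
# in the third column of identifier_mapping.proteins.tsv.gz, per `cut -f1,3` above).
with gzip.open("veba_output/cluster/output/global/identifier_mapping.proteins.tsv.gz", "rt") as f:
    next(f)  # skip header
    mapping_genomes = {line.rstrip("\n").split("\t")[2] for line in f}

# Genomes that received a taxonomic classification (genome id assumed in column 1).
with open("veba_output/classify/taxonomy_classifications.tsv") as f:
    next(f)
    taxonomy_genomes = {line.rstrip("\n").split("\t")[0] for line in f}

print("In identifier mapping but not taxonomy:", sorted(mapping_genomes - taxonomy_genomes))
print("In taxonomy but not identifier mapping:", sorted(taxonomy_genomes - mapping_genomes))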

[Feature Request] bioconda recipe

Installing VEBA is quite involved and not easy for non-bioinformaticians. Is there a bioconda recipe/package for VEBA, or plans for one?

Issue: Broken pipe error in kofamscan step

Dear Josh,

I hope this message finds you well. I am currently using VEBA 1.3.0 on a server, and I am encountering an issue during the annotation step, specifically with the kofamscan tool.

> 
> =============
> . kofamscan .
> =============
> Input: ['veba_output/misc/all_genomes.all_proteins.lt100k.faa', '/home/ec2-user/veba_efs/database/Annotate/KOFAM/profiles', '/home/ec2-user/veba_efs/database/Annotate/KOFAM/ko_list']
> Output: ['veba_output/annotation/intermediate/8__kofamscan/output.tsv.gz']
> 
> Command:
> ( mkdir -p veba_output/annotation/tmp/kofamscan && /home/ec2-user/veba_efs/mambaforge/envs/VEBA-annotate_env/bin/exec_annotation --cpu 36 -f detail-tsv -p /home/ec2-user/veba_efs/database/Annotate/KOFAM/profiles -k /home/ec2-user/veba_efs/database/Annotate/KOFAM/ko_list --tmp-dir veba_output/annotation/tmp/kofamscan veba_output/misc/all_genomes.all_proteins.lt100k.faa | grep "*" > veba_output/annotation/intermediate/8__kofamscan/output.tsv ) && pigz -f -p 36 veba_output/annotation/intermediate/8__kofamscan/output.tsv 
> 
> Validating the following input files:
> [=] File exists (905 MB): veba_output/misc/all_genomes.all_proteins.lt100k.faa
> [=] Directory exists (6832 MB): /home/ec2-user/veba_efs/database/Annotate/KOFAM/profiles
> [=] File exists (2 MB): /home/ec2-user/veba_efs/database/Annotate/KOFAM/ko_list
> 
> Running. .. ... .....
> Check log files to diagnose error:
> cat veba_output/annotation/log/8__kofamscan.*
> 
> cat veba_output/annotation/log/8__kofamscan.*
> /home/ec2-user/veba_efs/mambaforge/envs/VEBA-annotate_env/bin/lib/kofam_scan/parallel.rb:28:in `write': Broken pipe (Errno::EPIPE)
> 	from /home/ec2-user/veba_efs/mambaforge/envs/VEBA-annotate_env/bin/lib/kofam_scan/parallel.rb:28:in `puts'
> 	from /home/ec2-user/veba_efs/mambaforge/envs/VEBA-annotate_env/bin/lib/kofam_scan/parallel.rb:28:in `block in exec'
> 	from /home/ec2-user/veba_efs/mambaforge/envs/VEBA-annotate_env/lib/ruby/2.7.0/open3.rb:219:in `popen_run'
> 	from /home/ec2-user/veba_efs/mambaforge/envs/VEBA-annotate_env/lib/ruby/2.7.0/open3.rb:101:in `popen3'
> 	from /home/ec2-user/veba_efs/mambaforge/envs/VEBA-annotate_env/bin/lib/kofam_scan/parallel.rb:27:in `exec'
> 	from /home/ec2-user/veba_efs/mambaforge/envs/VEBA-annotate_env/bin/lib/kofam_scan/executor.rb:107:in `run_hmmsearch'
> 	from /home/ec2-user/veba_efs/mambaforge/envs/VEBA-annotate_env/bin/lib/kofam_scan/executor.rb:35:in `execute'
> 	from /home/ec2-user/veba_efs/mambaforge/envs/VEBA-annotate_env/bin/lib/kofam_scan/executor.rb:8:in `execute'
> 	from /home/ec2-user/veba_efs/mambaforge/envs/VEBA-annotate_env/bin/lib/kofam_scan/cli.rb:21:in `run'
> 	from /home/ec2-user/veba_efs/mambaforge/envs/VEBA-annotate_env/bin/exec_annotation:7:in `<main>'
> 1
> 
> 

I have attempted to resolve the issue by referring to the GitHub repository (takaram/kofam_scan#7), but I couldn't find a solution that works for my case.

I would appreciate any guidance or assistance you could provide to help resolve this issue.
Thank you for your time and support.
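One way to sidestep the pipe entirely is to let exec_annotation write its detail-tsv output to a file and filter the significant hits afterwards; a minimal sketch of the filtering step, assuming the significant rows are the ones kofamscan flags with a leading '*' (paths are placeholders):

# Post-hoc filter replacing the `| grep "*"` stage: keep only rows that
# kofamscan marks as significant with a leading '*'. Paths are placeholders.
with open("kofamscan_raw.tsv") as fin, open("output.tsv", "w") as fout:
    for line in fin:
        if line.startswith("*"):
            fout.write(line)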

Duplicate scaffold/contig identifiers are not allowed

Hi Josh,

I am comparing the regular workflow to the co-assembly method on a small subset of my own samples. When running cluster.py on the co-assemblies, I get a duplicate scaffold/contig identifiers error; however, it works perfectly fine with the regular workflow.

For instance

1__global_clustering.e

"Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/VEBA-cluster_env/bin/scripts/global_clustering.py", line 466, in
main(sys.argv[1:])
File "/home/ubuntu/miniconda3/envs/VEBA-cluster_env/bin/scripts/global_clustering.py", line 207, in main
assert id_scaffold not in scaffold_to_mag, "Duplicate scaffold/contig identifiers are not allowed: {} from {}".format(id_scaffold, row["genome"])
AssertionError: Duplicate scaffold/contig identifiers are not allowed: coassembly__k141_330954 from veba_output/binning/prokaryotic/S12/output/genomes/S12__METABAT2__P.2__bin.7.fa"

Is there a way around this?

Cheers,

Matt
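One way around it is to make the identifiers unique before binning/clustering by prefixing each co-assembly's contig headers with a sample tag; a minimal sketch (file names and the tag are placeholders):

# Prefix every contig header with a unique tag so identifiers stay unique when
# several co-assemblies enter cluster.py. File names and the tag are placeholders.
def prefix_fasta_headers(in_path, out_path, tag):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                fout.write(">{}__{}".format(tag, line[1:]))
            else:
                fout.write(line)

prefix_fasta_headers("coassembly_A/scaffolds.fasta", "coassembly_A/scaffolds.renamed.fasta", "coassemblyA")

Note that any matching BAM files would then need to be regenerated (or have their reference names renamed the same way) so mapping and binning stay consistent.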

EnvironmentLocationNotFound error for installation on HPC without sudo rights

Thanks for the tool. I am currently trying to test it. I am working on an HPC without sudo rights. That said, I managed to install all modules by setting my conda environments to a specific folder. However, when I try to run, e.g., annotate, it fails.

EnvironmentLocationNotFound: Not a conda environment: /xxx/global/software/skylake/software/Mamba/23.3.1-1/envs/VEBA-annotate_env

Traceback (most recent call last):
  File "/home/xxx/.conda/envs/VEBA/bin/annotate.py", line 10, in <module>
    from genopype import *
ModuleNotFoundError: No module named 'genopype'

I'm guessing this will happen no matter the module. How can I fix this?
Thanks in advance.
Cheers,
Joao

[Feature Request] Adding MEGAHIT to assembly module

Hi and thanks for the great suite!

I'm a deep sea genomic data analyst and I've found your pseudo-co-assembly method and the eukaryotic database extremely helpful in identifying microeukaryotic metagenomes in complex and difficult-to-reach environments.

I wanted to kindly ask whether a rapid assembler like MEGAHIT could be added to the assembly module. I acknowledge the high-quality performance of metaSPAdes, but for large datasets the computational requirements can be extremely high (it is the only assembly software my labmates and I cannot run on our server). I personally use MEGAHIT and was wondering if it (or any assembler of its kind) could be added to the suite.

Best,

Erfan Shekarriz

[Question] What does the homogeneity value mean in the output taxonomy.tsv file of classify-eukaryotic?

Hello,

I was recently trying to run VEBA 1.2.0 on some of my already-built eukaryotic MAGs to assess their taxonomy with the veba_classify module. While the program worked smoothly, I don't know if I understand the output of the taxonomy.tsv file given by the classify-eukaryotic.py script. What does the homogeneity value represent? Some of the taxonomy given is very specific (species level), but how accurate is it?

Thank you!

PD. Fantastic work with VEBA! Looks like a very promising tool for microeukaryotic metagenomics.

[Feature Request] Assembly-level Eukaryotic Metatranscriptomic Functional Annotation & Read-count Mapping

Hi Josh,

I wanted to update you that the previous paper I was working on has been accepted and should be published in the coming weeks. We made the decision to drop the metagenomics and eukaryotic binning analysis and focus on transcriptomics (for practical reasons, but we hope to do the eukaryotic binning soon for another study).

I still used VEBA for analyzing metatranscriptomic data. I know that for eukaryotic genomes VEBA annotates at the bin level, but realistically this limits its usage for metatranscriptomic applications (you can find the poor performance of eukaryotic metatranscriptomic binning here: https://taylorreiter.github.io/2017-10-02-Binning-metatranscriptomes/).

Metatranscriptomics is still the best way to analyze eukaryotic genes in the environment, and since VEBA's selling point is eukaryotic (+viral) annotation, I think a pipeline for counting their genes would be really helpful. Personally, I don't have a pipeline for this and instead use some form of alignment (e.g. BLAST, HMMs) to extract the genes from the assembly first and then perform counting on them.

I think this would be a good feature to look into to really cater to those wanting to do eukaryotic omics analysis.

Best,

Erfan

[Question] CHECKVDB environment variable NOT set

Please confirm that you've checked the FAQ section:
https://github.com/jolespin/veba/blob/main/FAQ.md
Checked
If you still have a question, feel free to ask here.

Hello there, I have installed VEBA and run the check_installation.sh script.
The output shows that almost everything passes, but the CHECKVDB environment variable check fails.
See this:

./check_installation.sh
[Pass] VEBA-annotate_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-annotate_env
[Pass] VEBA-assembly_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-assembly_env
[Pass] VEBA-binning-eukaryotic_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-binning-eukaryotic_env
[Pass] VEBA-binning-prokaryotic_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-binning-prokaryotic_env
[Pass] VEBA-binning-viral_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-binning-viral_env
[Pass] VEBA-biosynthetic_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-biosynthetic_env
[Pass] VEBA-classify_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-classify_env
[Pass] VEBA-cluster_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-cluster_env
[Pass] VEBA-database_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-database_env
[Pass] VEBA-mapping_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-mapping_env
[Pass] VEBA-phylogeny_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-phylogeny_env
[Pass] VEBA-preprocess_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-preprocess_env
[Pass] CHECKM2DB environment SUCCESSFULLY set in VEBA-binning-prokaryotic_env
[Pass] GTDBTK_DATA_PATH environment SUCCESSFULLY set in VEBA-classify_env
[Fail] CHECKVDB environment variable NOT set in VEBA-binning-viral_env

Is there anything I could do to make sure the CHECKVDB environment variable is set correctly?

[Bug] v1.0.3* conda installation error VEBA-assembly_env perl-xml-parser-2.44-4

Dear Josh,

I'm setting up the new version of VEBA (1.0.3d) on a colleague's computer, and there seems to be a conflict when setting up the conda environment.

I'm specifically getting this error:

Could not solve for environment specs
Encountered problems while solving:
  - package perl-xml-parser-2.44-4 requires perl >=5.22.0,<5.23.0, but none of the providers can be installed

The environment can't be solved, aborting the operation 

perl-xml-parser-2.44-4 seems to need an extremely narrow range of perl versions to work with. Any ways to work around this?

Best,

Erfan

[Typo] Installation Guide bash update_environment_variables.sh

Hi Josh,

I've noticed a small typo here.

  • In the installation guide you mention:
    bash update_environment_variables.sh

  • In the update guide you mention
    bash veba/install/update_environment_variables.sh /path/to/veba_database

From the bash script, it seems like the database directory needs to be the first command argument so I thought it might be a typo.

I also wanted to ask if you prefer me to report typos by other means like email to stop the issues on GitHub from getting overcrowded.

Let me know which is best,

Erfan

[Bug] featureCounts error in binning-eukaryotic.py | ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.

Describe the bug:
ERROR: Subread failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
An example of attributes included in your GTF annotation is 'ID=k127_9079703_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.446;conf=98.06;score=17.06;cscore=14.96;sscore=2.10;rscore=-1.19;uscore=-0.92;tscore=4.21;contig_id=k127_9079703;gene_id=k127_9079703_1;gene_biotype=protein_coding;

See the attached file gene_models.eukaryotic.gff.

Now see the attached subread log file indicating that -g should be 'gene_id'.

Versions

(base) conda activate VEBA-binning-eukaryotic_env
(VEBA-binning-eukaryotic_env) binning-eukaryotic.py -v
binning-eukaryotic.py v2023.10.16

Command used to produce error:
4__featurecounts
cat binning_euks/mendo_euks/intermediate/3__busco/filtered/genomes/*.gff > binning_euks/mendo_euks/tmp/gene_models.eukaryotic.gff && mkdir -p binning_euks/mendo_euks/tmp/featurecounts && ( /projects/navarro_lab/envs/VEBA-binning-eukaryotic_env/bin/featureCounts -G mendoG_euks.fasta -a binning_euks/mendo_euks/tmp/gene_models.eukaryotic.gff -o binning_euks/mendo_euks/intermediate/4__featurecounts/featurecounts.orfs.tsv -F GTF --tmpDir binning_euks/mendo_euks/tmp/featurecounts -T 30 -g gene_id -t CDS -p --countReadPairs T111_all_s.bam T121_all_s.bam T141_all_s.bam T311_all_s.bam T321_all_s.bam T331_all_s.bam T411_all_s.bam T421_all_s.bam T431_all_s.bam ) && gzip -f binning_euks/mendo_euks/intermediate/4__featurecounts/featurecounts.orfs.tsv && rm -rf binning_euks/mendo_euks/tmp/gene_models.eukaryotic.gff

Please provide the following files:
4__featurecounts.returncode.txt
4__featurecounts.o.txt
4__featurecounts.e.txt

The input files are too large to upload.

Thanks a lot in advance!
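The attribute string in the error message is GFF3-style (gene_id=...;), whereas featureCounts with -F GTF expects GTF-style quoting (gene_id "...";). A minimal conversion sketch, assuming a plain tab-separated annotation file (paths are placeholders):

# Rewrite GFF3-style attributes (key=value;) in column 9 into GTF-style
# (key "value";) so featureCounts -F GTF can find the gene_id attribute.
def gff_attributes_to_gtf(in_path, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith("#") or not line.strip():
                fout.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            pairs = [p for p in fields[8].strip(";").split(";") if "=" in p]
            fields[8] = " ".join('{} "{}";'.format(*p.split("=", 1)) for p in pairs)
            fout.write("\t".join(fields) + "\n")

gff_attributes_to_gtf("gene_models.eukaryotic.gff", "gene_models.eukaryotic.gtf")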

[Question] VEBA-biosynthetic_env/bin/mmseqs: No such file or directory

Hello there, first, yes, I have read the FAQ section :)
I am running the VEBA-biosynthetic pipeline, and the protocol fails at the 4__mmseq2_protein step.
When I check the log for this step, I get the following error:

/bin/bash: /projects/navarro_lab/envs/VEBA-biosynthetic_env/bin/mmseqs: No such file or directory

Also, I want to add that I reinstalled VEBA from the latest release, and when I run conda env list, I do not see VEBA-biosynthetic_env listed among my environments. Perhaps this is related to the mmseqs issue?

Could you advise?
Best!

[Question] Why isn't `binning-eukaryotic.py` finding any eukaryotic bins and why are the output files empty?

Hello, I've been trying to run the binning-eukaryotic.py script on a small assembly to predict eukaryotic MAGs. However, the process fails every time, regardless of whether I try to do the binning with metabat2 or concoct.

The input contigs are already predicted to be eukaryotic by whokaryote (a eukaryotic contig predictor that uses Tiara). I mapped the metagenomic samples with bwa-mem2 and used those sorted BAM files as input too. Yet, the output files are empty except for the "unbinned.fasta".

The output log file says:

Command:
echo '' > ../analyses/veba_eukaryotic_binning/binning/nano.G2/nano.G2/intermediate/1__binning_metabat2/scaffolds_to_bins.tsv && mkdir -p ../analyses/veba_eukaryotic_binning/binning/nano.G2/nano.G2/intermediate/1__binning_metabat2/bins && /opt/cesga/system/software/Core/veba-binning-eukaryotic_env/1.2.0/bin/coverm contig --threads 12 --methods metabat --bam-files ../analyses/veba_eukaryotic_binning/bam/nano.G2/PR173.nano.G2.euk.sorted.bam ../analyses/veba_eukaryotic_binning/bam/nano.G2/PR201.nano.G2.euk.sorted.bam ../analyses/veba_eukaryotic_binning/bam/nano.G2/PR217.nano.G2.euk.sorted.bam ../analyses/veba_eukaryotic_binning/bam/nano.G2/PR233.nano.G2.euk.sorted.bam ../analyses/veba_eukaryotic_binning/bam/nano.G2/PR249.nano.G2.euk.sorted.bam > ../analyses/veba_eukaryotic_binning/binning/nano.G2/nano.G2/intermediate/1__binning_metabat2/intermediate/coverage.tsv && /opt/cesga/system/software/Core/veba-binning-eukaryotic_env/1.2.0/bin/metabat2 -i ../data/coassembly_euk/nano.G2.corrected.fasta -o ../analyses/veba_eukaryotic_binning/binning/nano.G2/nano.G2/intermediate/1__binning_metabat2/bins/bin -a ../analyses/veba_eukaryotic_binning/binning/nano.G2/nano.G2/intermediate/1__binning_metabat2/intermediate/coverage.tsv -m 1500 -t 12 --minClsSize 2000000 --seed 1 --verbose 

Validating the following input files:
[=] File exists (830 MB): ../data/coassembly_euk/nano.G2.corrected.fasta
[=] File exists (2235 MB): ../analyses/veba_eukaryotic_binning/bam/nano.G2/PR173.nano.G2.euk.sorted.bam
[=] File exists (1832 MB): ../analyses/veba_eukaryotic_binning/bam/nano.G2/PR201.nano.G2.euk.sorted.bam
[=] File exists (3314 MB): ../analyses/veba_eukaryotic_binning/bam/nano.G2/PR217.nano.G2.euk.sorted.bam
[=] File exists (3013 MB): ../analyses/veba_eukaryotic_binning/bam/nano.G2/PR233.nano.G2.euk.sorted.bam
[=] File exists (3578 MB): ../analyses/veba_eukaryotic_binning/bam/nano.G2/PR249.nano.G2.euk.sorted.bam

Running. .. ... .....
Check log files to diagnose error:
cat ../analyses/veba_eukaryotic_binning/binning/nano.G2/nano.G2/intermediate/1__binning_metabat2/intermediate/log/1__binning_metabat2.*
No eukaryotic bins

And the metabat2 error log file:

[WARN] you may switch on flag -g/--remove-gaps to remove spaces
Executing pipeline:   0%|          | 0/2 [00:00<?, ?it/s]
Executing pipeline:   0%|          | 0/2 [00:05<?, ?it/s]
cat: ../analyses/veba_eukaryotic_binning/binning/nano.G2/nano.G2/tmp/scaffolds_to_bins.tsv: No such file or directory

MetaBAT2 is executable and prints the help page when I load the VEBA environment, though, so it should work.

I am running Veba 1.2.0 and here's the command that I used:

name=nano.G2
outbam=../analyses/veba_eukaryotic_binning/bam/${name}
coassembly=../data/coassembly_euk/${name}.corrected.fasta
outbin=../analyses/veba_eukaryotic_binning/binning/${name}

binning-eukaryotic.py -f ${coassembly} -b ${outbam}/*euk.sorted.bam -o ${outbin} -p 12 -n ${name}

The output file from the SLURM server I am working on gives me a bit more details, but I am still clueless:

Executing pipeline:   0%|          | 0/5 [00:00<?, ?it/s]
Executing pipeline:   0%|          | 0/5 [00:18<?, ?it/s]
Traceback (most recent call last):
  File "/opt/cesga/system/software/Core/veba-database/1.2.0/veba-1.2.0/src/binning-eukaryotic.py", line 1116, in <module>
    main()
  File "/opt/cesga/system/software/Core/veba-database/1.2.0/veba-1.2.0/src/binning-eukaryotic.py", line 1111, in main
    pipeline.execute(restart_from_checkpoint=opts.restart_from_checkpoint)
  File "/opt/cesga/system/software/Core/veba-binning-eukaryotic_env/1.2.0/lib/python3.8/site-packages/genopype/genopype.py", line 804, in execute
    validate_file_existence(output_filepaths, prologue="\nValidating the following output files:", f_verbose=self.f_verbose, whitelist_empty_files=attrs["whitelist_empty_output_files"])
  File "/opt/cesga/system/software/Core/veba-binning-eukaryotic_env/1.2.0/lib/python3.8/site-packages/genopype/genopype.py", line 90, in validate_file_existence
    assert size_bytes >= minimum_filesize, "The following file appears to be empty ({} bytes): {}".format(size_bytes, path)
AssertionError: The following file appears to be empty (0 bytes): ../analyses/veba_eukaryotic_binning/binning/nano.G2/nano.G2/intermediate/1__binning_metabat2/scaffolds_to_bins.tsv

I do not know what I am doing wrong. Are my input files the problem? Or is it something to do with genopype? The IT guys on my server had to install Biopython separately because it wasn't installed by default and it raised an error with some other modules, so maybe that messed up the whole installation?
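Since the BAM files were produced outside VEBA (with bwa-mem2), one thing worth ruling out is a naming mismatch between the BAM references and the assembly headers; a minimal check, assuming pysam is available in the environment (paths are placeholders):

import pysam

# Contig names recorded in the BAM header by the aligner.
bam_refs = set(pysam.AlignmentFile("PR173.nano.G2.euk.sorted.bam", "rb").references)

# Contig names in the assembly FASTA (first whitespace-delimited token per header).
with open("nano.G2.corrected.fasta") as f:
    fasta_ids = {line[1:].split()[0] for line in f if line.startswith(">")}

print("contigs in BAM:   ", len(bam_refs))
print("contigs in FASTA: ", len(fasta_ids))
print("shared:           ", len(bam_refs & fasta_ids))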

[Feature Request] Walkthrough for running the cluster module on genomes sourced from NCBI

Is your feature request related to a problem? Please describe.

I'll start by stating that I'd be happy to make a walkthrough for this use case if I can get it to work, as I think it could be broadly applicable.

My goal is to run the cluster module on around 70 Alteromonas macleodii genomes from NCBI. Here's what I've done so far:

  1. I downloaded a test-run dataset comprised of 4 of them: ASM284987v1 (reference genome), ASM17263v2, ASM2380539v1, and ASM1482594v1. Specifically, I downloaded the RefSeq genome sequences (FASTA), annotation features (GFF), genomic coding sequences (FASTA), and proteins (FASTA).
  2. I changed the extensions of the CDS files from .fna to .ffn to match the format of a previous dataset that ran successfully. This may have been a mistake, but we'll get to that in a bit.
  3. I created a genomes_table.tsv file using a custom script (which I can include in the walkthrough)
prokaryotic	SAMN02603229	GCF_000172635.2	ncbi_dataset/data/GCF_000172635.2/GCF_000172635.2_ASM17263v2_genomic.fna	ncbi_dataset/data/GCF_000172635.2/protein.faa	ncbi_dataset/data/GCF_000172635.2/cds_from_genomic.ffn	ncbi_dataset/data/GCF_000172635.2/genomic.gff
prokaryotic	SAMN06093175	GCF_002849875.1	ncbi_dataset/data/GCF_002849875.1/GCF_002849875.1_ASM284987v1_genomic.fna	ncbi_dataset/data/GCF_002849875.1/protein.faa	ncbi_dataset/data/GCF_002849875.1/cds_from_genomic.ffn	ncbi_dataset/data/GCF_002849875.1/genomic.gff
prokaryotic	SAMN13259279	GCF_014825945.1	ncbi_dataset/data/GCF_014825945.1/GCF_014825945.1_ASM1482594v1_genomic.fna	ncbi_dataset/data/GCF_014825945.1/protein.faa	ncbi_dataset/data/GCF_014825945.1/cds_from_genomic.ffn	ncbi_dataset/data/GCF_014825945.1/genomic.gff
prokaryotic	SAMN28402784	GCF_023805395.1	ncbi_dataset/data/GCF_023805395.1/GCF_023805395.1_ASM2380539v1_genomic.fna	ncbi_dataset/data/GCF_023805395.1/protein.faa	ncbi_dataset/data/GCF_023805395.1/cds_from_genomic.ffn	ncbi_dataset/data/GCF_023805395.1/genomic.gff

It has the following columns (all but the last of which are mandatory according to the walkthroughs/documentation):

  • [organism_type][id_sample][id_mag][genome][proteins][cds][gene_models]
  • In this case, id_sample is the BioSample and id_mag is the RefSeq accession.
  • You may also notice that the extension for the genomes files is .fna instead of .fa. I don't think this should cause any issues but I'm noting it just in case.
  4. I then made and ran this cmd_cluster.sh script.
N=cluster
N_JOBS=12
CMD="source activate VEBA && veba --module cluster --params \" -i genomes_table.tsv -o cluster_output -p ${N_JOBS}\""
sbatch -J ${N} -p ind-shared -N 1 -c ${N_JOBS} --ntasks-per-node=1 -A [REDACTED] -o logs/${N}.o -e logs/${N}.e --export=ALL -t 30:00:00 --mem=24G --wrap="${CMD}"

The job failed almost immediately. Here is the error from the log file (log/1__global_clustering.e):

Organizing identifiers: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/expanse/projects/jcl122/miniconda3/envs/VEBA-cluster_env/bin/scripts/global_clustering.py", line 735, in <module>
    main(sys.argv[1:])
  File "/expanse/projects/jcl122/miniconda3/envs/VEBA-cluster_env/bin/scripts/global_clustering.py", line 312, in main
    assert id_protein in protein_to_sequence, "CDS sequence identifier must be in protein fasta: {} from {}".format(id_protein, row["cds"])
AssertionError: CDS sequence identifier must be in protein fasta: lcl|NC_018632.1_cds_WP_039228897.1_1 from ncbi_dataset/data/GCF_000172635.2/cds_from_genomic.ffn
realpath: missing operand
Try 'realpath --help' for more information.

Here is the top of the file mentioned in the error:

(base) [REDACTED]$ head -2 ncbi_dataset/data/GCF_000172635.2/cds_from_genomic.ffn
>lcl|NC_018632.1_cds_WP_039228897.1_1 [gene=dnaA] [locus_tag=MASE_RS00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_039228897.1] [location=410..2065] [gbkey=CDS]
ATGTCCTTGTGGAACCAATGCCTTGAAAGATTGCGTCAAGAATTACCAACGCAGCAGTTTAGTATGTGGATACGACCGCT

After looking at the code chunk where the error was triggered (in global_clustering.py), I'm wondering if the error is due to unexpected formatting of the .ffn headers. It has a local sequence identifier (lcl|NC_018632.1_cds_WP_039228897.1_1), but maybe it's not being recognized. If you have thoughts on this please let me know.
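If the header format is indeed the issue, one workaround is to rewrite the CDS headers to the bare protein accession so they match protein.faa; a minimal sketch that pulls the accession from the [protein_id=...] tag shown in the header above (paths are placeholders, and CDS records without a protein_id, e.g. pseudogenes, would need handling):

import re

# NCBI cds_from_genomic headers embed the protein accession as a [protein_id=...]
# tag, while protein.faa headers start with the bare accession (WP_039228897.1).
# Rewriting the CDS headers to that accession makes the two files agree.
protein_id = re.compile(r"\[protein_id=([^\]]+)\]")

with open("cds_from_genomic.ffn") as fin, open("cds_renamed.ffn", "w") as fout:
    for line in fin:
        if line.startswith(">"):
            match = protein_id.search(line)
            if match is None:
                raise ValueError("No protein_id tag in header: " + line.strip())
            fout.write(">" + match.group(1) + "\n")
        else:
            fout.write(line)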

Describe the solution you'd like

If you've noticed an error in my approach to preparing the genomes_table.tsv file or have any suggestions, please let me know. I think this could make for a good use-case walkthrough if I'm able to run it successfully.

Describe alternatives you've considered

I tried running this without changing the extensions of the CDS files from .fna to .ffn but I got the same error, so I think it's due to the formatting rather than the file extension itself.

Additional context

Directory structure before running cmd_cluster.sh:

(base) [REDACTED]$ tree
.
└── test_run
    ├── ncbi_dataset
    │   └── data
    │       ├── assembly_data_report.jsonl
    │       ├── dataset_catalog.json
    │       ├── data_summary.tsv
    │       ├── GCF_000172635.2
    │       │   ├── cds_from_genomic.fna
    │       │   ├── GCF_000172635.2_ASM17263v2_genomic.fna
    │       │   ├── genomic.gff
    │       │   └── protein.faa
    │       ├── GCF_002849875.1
    │       │   ├── cds_from_genomic.fna
    │       │   ├── GCF_002849875.1_ASM284987v1_genomic.fna
    │       │   ├── genomic.gff
    │       │   └── protein.faa
    │       ├── GCF_014825945.1
    │       │   ├── cds_from_genomic.fna
    │       │   ├── GCF_014825945.1_ASM1482594v1_genomic.fna
    │       │   ├── genomic.gff
    │       │   └── protein.faa
    │       └── GCF_023805395.1
    │           ├── cds_from_genomic.fna
    │           ├── GCF_023805395.1_ASM2380539v1_genomic.fna
    │           ├── genomic.gff
    │           └── protein.faa
    └── README.md

Directory structure after running cmd_cluster.sh:

(base) [REDACTED]$ tree
.
├── cluster_output
│   ├── checkpoints
│   │   └── 1__global_clustering
│   ├── commands.sh
│   ├── intermediate
│   │   └── 1__global_clustering
│   │       ├── checkpoints
│   │       ├── intermediate
│   │       │   └── prokaryotic
│   │       │       ├── clusters
│   │       │       ├── genome_identifiers.list
│   │       │       └── genomes.list
│   │       ├── log
│   │       ├── output
│   │       │   ├── pangenome_core_sequences
│   │       │   ├── pangenome_tables
│   │       │   └── serialization
│   │       └── tmp
│   ├── log
│   │   ├── 1__global_clustering.e
│   │   ├── 1__global_clustering.o
│   │   └── 1__global_clustering.returncode
│   ├── output
│   └── tmp
├── cmd_cluster.sh
├── genomes_table.tsv
├── global -> cluster_output/output/global
├── logs
│   ├── cluster.e
│   └── cluster.o
├── ncbi_dataset
│   └── data
│       ├── assembly_data_report.jsonl
│       ├── dataset_catalog.json
│       ├── data_summary.tsv
│       ├── GCF_000172635.2
│       │   ├── cds_from_genomic.ffn
│       │   ├── GCF_000172635.2_ASM17263v2_genomic.fna
│       │   ├── genomic.gff
│       │   └── protein.faa
│       ├── GCF_002849875.1
│       │   ├── cds_from_genomic.ffn
│       │   ├── GCF_002849875.1_ASM284987v1_genomic.fna
│       │   ├── genomic.gff
│       │   └── protein.faa
│       ├── GCF_014825945.1
│       │   ├── cds_from_genomic.ffn
│       │   ├── GCF_014825945.1_ASM1482594v1_genomic.fna
│       │   ├── genomic.gff
│       │   └── protein.faa
│       └── GCF_023805395.1
│           ├── cds_from_genomic.ffn
│           ├── GCF_023805395.1_ASM2380539v1_genomic.fna
│           ├── genomic.gff
│           └── protein.faa
└── README.md

[Bug] DAS_Tool starts but fails after calculating contig lengths (binning-prokaryotic.py)

Describe the bug:
binning-prokaryotic.py fails at the 7__dastool step. DAS_Tool appears to start but fails after calculating contig lengths.

Versions
veba_binning-prokaryotic_1.4.1.sif and the equivalent conda install/env

Command used to produce error:
When running binning-prokaryotic using the container veba_binning-prokaryotic_1.4.1.sif, DAS_Tool (step 7) does not complete, causing the workflow to fail.

I'm setting all of the inputs as per the docs:

export VEBA_DATABASE=/scratch3/bis068/veba/db

N_JOBS=32
N_ITER=1 #this is set to 1 to make the error show faster, set as 10 usually as per docs
ID=548348

OUT_DIR=veba_output/binning/prokaryotic/

FASTA=veba_output/binning/viral/${ID}/output/unbinned.fasta
BAM=veba_output/assembly/${ID}/output/mapped.sorted.bam

and then the command used to run the workflow module was:

singularity run veba_binning-prokaryotic_1.4.1.sif binning-prokaryotic.py -f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -o ${OUT_DIR} -m 1500 -I ${N_ITER} --skip_maxbin2

It's kind of weird: it looks like DAS_Tool starts running and then stops for some reason. I had the same issue when running via the conda environments. I switched to the containers (conda envs get complicated on our HPC), thinking this might solve the problem (or really avoid it, I guess), but it didn't.

The previous steps all seem to run without problems using the containers (preprocess, assembly, binning-viral).

log files
Returncodes for all steps prior are "0" (1__ to 6__)

7__dastool.e.txt
7__dastool.o.txt
7__dastool.returncode.txt

[Question] Retaining k-mer and non-kmer hits in the bbduk.sh step of preprocess.py

Dear Josh,

I've found that in the documentation of preprocess.py (v2022.01.19) there are two flags:

--retain_kmer_hits RETAIN_KMER_HITS
Retain reads that map to k-mer database. 0=No, 1=yes [Default: 0]
--retain_non_kmer_hits RETAIN_NON_KMER_HITS
Retain reads that do not map to k-mer database. 0=No, 1=yes [Default: 0]

These two appear to contradict each other. It's slightly confusing how, by default, this process both does not retain k-mer hits and also does not retain non-k-mer hits. To me this translates into "does not keep any of the reads at all", which I don't see how could be possible.

I thought it might be a typo. If it is BBDuk lingo, some clarity in the walkthrough would be great.

Best,

Erfan

[Question] Is it possible to use the VEBA Microeukaryotic database with MMSEQS2 taxonomy?

Hello, it's me again.

I wondered if the Microeukaryotic DB can be used with the MMseqs2 taxonomy module. I am interested in using this database to assign taxonomy at the contig level rather than the whole-genome level. However, MMseqs2 requires taxdump-like information from the database. Since the Microeukaryotic DB has a taxonomy format similar to GTDB, do you think it can be done? Perhaps you already have the nodes.dmp and names.dmp files?
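MMseqs2's taxonomy tooling does expect NCBI-style dump files, so one general-purpose approach (not a VEBA utility, just a sketch of the idea) is to synthesize minimal names.dmp/nodes.dmp from the lineage strings; note that real NCBI dumps carry many more columns, and how strictly tools parse them varies:

# Sketch: assign integer taxids to each node of GTDB-style lineage strings and
# write minimal NCBI-style nodes.dmp/names.dmp. Ranks and lineages here are
# illustrative assumptions; real dump files carry additional columns.
ranks = ["domain", "phylum", "class", "order", "family", "genus", "species"]
lineages = [
    "d__Eukaryota;p__Chlorophyta;c__Mamiellophyceae",
    "d__Eukaryota;p__Haptophyta",
]

taxid_of = {"root": 1}
parent_of = {1: 1}
rank_of = {1: "no rank"}

for lineage in lineages:
    parent = 1
    for depth, name in enumerate(lineage.split(";")):
        if name not in taxid_of:
            taxid_of[name] = len(taxid_of) + 1
            rank_of[taxid_of[name]] = ranks[depth]
            parent_of[taxid_of[name]] = parent
        parent = taxid_of[name]

with open("nodes.dmp", "w") as f:
    for name, taxid in taxid_of.items():
        f.write("{}\t|\t{}\t|\t{}\t|\n".format(taxid, parent_of[taxid], rank_of[taxid]))
with open("names.dmp", "w") as f:
    for name, taxid in taxid_of.items():
        f.write("{}\t|\t{}\t|\t\t|\tscientific name\t|\n".format(taxid, name))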

[Bug] binning-prokaryotic.py module fails with scaffolds_to_bins.tsv file missing (penultimate step)

Validating the following output files:
Traceback (most recent call last):
  File "/opt/linux/rocky/8.x/x86_64/pkgs/veba/1.0.0/envs/VEBA-binning-prokaryotic_env/bin/binning-prokaryotic.py", line 2384, in <module>
    main()
  File "/opt/linux/rocky/8.x/x86_64/pkgs/veba/1.0.0/envs/VEBA-binning-prokaryotic_env/bin/binning-prokaryotic.py", line 2380, in main
    pipeline.execute(restart_from_checkpoint=opts.restart_from_checkpoint)
  File "/opt/linux/rocky/8.x/x86_64/pkgs/veba/1.0.0/envs/VEBA-binning-prokaryotic_env/lib/python3.8/site-packages/genopype/genopype.py", line 780, in execute
    validate_file_existence(output_filepaths, prologue="\nValidating the following output files:", f_verbose=self.f_verbose)
  File "/opt/linux/rocky/8.x/x86_64/pkgs/veba/1.0.0/envs/VEBA-binning-prokaryotic_env/lib/python3.8/site-packages/genopype/genopype.py", line 78, in validate_file_existence
    assert os.path.exists(path), "The following path does not exist: {}".format(path)
AssertionError: The following path does not exist: veba_output/binning/prokaryotic/sim_1/intermediate/63__cpr_adjustment/scaffolds_to_bins.tsv

If I copy the file scaffolds_to_bins.tsv to the folder, I then get:

Validating the following output files:
[=] File exists (105063 bytes): veba_output/binning/prokaryotic/sim_1/intermediate/63__cpr_adjustment/scaffolds_to_bins.tsv
Traceback (most recent call last):
  File "/opt/linux/rocky/8.x/x86_64/pkgs/veba/1.0.0/envs/VEBA-binning-prokaryotic_env/bin/binning-prokaryotic.py", line 2394, in <module>
    main()
  File "/opt/linux/rocky/8.x/x86_64/pkgs/veba/1.0.0/envs/VEBA-binning-prokaryotic_env/bin/binning-prokaryotic.py", line 2390, in main
    pipeline.execute(restart_from_checkpoint=opts.restart_from_checkpoint)
  File "/opt/linux/rocky/8.x/x86_64/pkgs/veba/1.0.0/envs/VEBA-binning-prokaryotic_env/lib/python3.8/site-packages/genopype/genopype.py", line 780, in execute
    validate_file_existence(output_filepaths, prologue="\nValidating the following output files:", f_verbose=self.f_verbose)
  File "/opt/linux/rocky/8.x/x86_64/pkgs/veba/1.0.0/envs/VEBA-binning-prokaryotic_env/lib/python3.8/site-packages/genopype/genopype.py", line 78, in validate_file_existence
    assert os.path.exists(path), "The following path does not exist: {}".format(path)
AssertionError: The following path does not exist: veba_output/binning/prokaryotic/sim_1/intermediate/63__cpr_adjustment/binned.list

Installation error. bash install_veba.sh

I have been trying to install VEBA for the past week with no success. I am following all the steps in the installation guide, but I am getting the following errors. What might I be doing wrong?

(base) reserchui-iMac:install research$ bash install_veba.sh
Updating permissions for scripts in /Users/research/Desktop/tools/veba/install/../src
Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: done

All requested packages already installed.

Creating VEBA-annotate_env environment
*Copying VEBA modules into VEBA-annotate_env environment path
cp: /Users/research/miniconda3/envs/VEBA-annotate_env/bin is not a directory
*Copying VEBA utility scripts into VEBA-annotate_env environment path
cp: /Users/research/miniconda3/envs/VEBA-annotate_env/bin: No such file or directory
cp: /Users/research/Desktop/tools/veba/install/../src/scripts/: unable to copy extended attributes to /Users/research/miniconda3/envs/VEBA-annotate_env/bin: No such file or directory
cp: /Users/research/miniconda3/envs/VEBA-annotate_env/bin/filter_hmmsearch_results.py: No such file or directory

......
cp: /Users/research/miniconda3/envs/VEBA-preprocess_env/bin/deprecated/compile_viral_classifications.py: No such file or directory
*Symlinking VEBA utility scripts into VEBA-preprocess_env environment path
ln: /Users/research/miniconda3/envs/VEBA-preprocess_env/bin/: No such file or directory


[VEBA ASCII banner]
...............................
Installation Complete
...............................
Please run 'download_databases.sh' script available in the installation directory. If you need to redownload:
wget https://raw.githubusercontent.com/jolespin/veba/main/install/download_databases.sh
For help or instructions, refer to the installation walkthrough:
https://github.com/jolespin/veba/blob/main/install/README.md
(base) reserchui-iMac:install research$ conda activate VEBA-database_env

EnvironmentNameNotFound: Could not find conda environment: VEBA-database_env
You can list all discoverable environments with conda info --envs.

[Database] Delete compressed human reference after decompression

Do you want to also delete GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz when this step has uncompressed the bowtie indexes?

tar xvzf ${DATABASE_DIRECTORY}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz -C ${DATABASE_DIRECTORY}/Contamination/grch38

Note that the download of nr.gz could go faster with Aspera too, e.g.:

ascp -i $ASPERAKEY -QT -l 500m anonftp@ftp.ncbi.nlm.nih.gov:/blast/db/FASTA/nr.gz .

wget -v -P ${DATABASE_DIRECTORY} https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

[Bug] ModuleNotFoundError: No module named 'Bio'

Hi Josh,

When I run the assembly (default settings) with megahit (VEBA 1.0.3e) I get the following error:

Traceback (most recent call last):
  File "/home/eshekarriz/miniconda3/envs/VEBA-assembly_env/bin/scripts/fasta_to_saf.py", line 5, in <module>
    from Bio.SeqIO.FastaIO import SimpleFastaParser
ModuleNotFoundError: No module named 'Bio'
Renaming final.contigs.fa -> scaffolds.fasta
1

I don't think the biopython package has been put into the new conda environment for assembly.

I fixed it using conda install -c conda-forge biopython

Best,

Erfan

Issue with --skip_maxbin2 and --skip_concoct params

Hello, thank you for this useful package! I'm actually testing some modules and found an issue with the aforementioned params.

In lines 156 & 188, where you define the null commands, you should add 'seed' as an input parameter, since you predefined it earlier.

def get_maxbin2_null_cmd( input_filepaths, output_filepaths, output_directory, directories, opts, prefix):

def get_concoct_null_cmd( input_filepaths, output_filepaths, output_directory, directories, opts, prefix):

Have a nice day!
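If I've read this right, the fix is just threading seed through the two null-command builders so the call sites that pass it don't fail; a hypothetical reconstruction (bodies are stand-ins, only the signatures matter here):

# Hypothetical reconstruction of the suggested fix: the call sites pass `seed`,
# so the null-command builders must accept it. Bodies are placeholders.
def get_maxbin2_null_cmd(input_filepaths, output_filepaths, output_directory,
                         directories, opts, prefix, seed):
    return []  # placeholder for the real null command

def get_concoct_null_cmd(input_filepaths, output_filepaths, output_directory,
                         directories, opts, prefix, seed):
    return []  # placeholder for the real null command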

[Question] Why do I get permission error for binning-prokaryotic.py only and not for binning-viral modules?

Hi, I was trying to run VEBA on an HPC cluster for a small test dataset. I am getting some permission errors only when I run the prokaryotic-binning script 'binning-prokaryotic.py'. I followed the guided instructions and successfully ran the viral-binning script 'binning-viral.py' before the prokaryotic binning.

The error file when I run the prokaryotic-binning script gives the following message.

Traceback (most recent call last):
 File "/home/....../conda_mamba/envs/VEBA-binning-prokaryotic_env/bin/binning-prokaryotic.py", line 1326, in <module>
    main()
  File "/home/....../conda_mamba/envs/VEBA-binning-prokaryotic_env/bin/binning-prokaryotic.py", line 1286, in main
    directories["project"] = create_directory(opts.project_directory)
  File "/home/....../conda_mamba/envs/VEBA-binning-prokaryotic_env/lib/python3.8/site-packages/genopype/genopype.py", line 56, in create_directory
    os.makedirs(directory)
  File "/home/....../conda_mamba/envs/VEBA-binning-prokaryotic_env/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/....../conda_mamba/envs/VEBA-binning-prokaryotic_env/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/binning'

None of the following things worked for me:

  1. Delete the /binning folder and re-run prokaryotic-binning
  2. Change the folder permissions of the /binning folder to give write access.
  3. Delete the viral-binning folder and run prokaryotic-binning first. It still gave the same error. But when I ran viral-binning afterward it was fine. No permission error for binning-viral

Am I doing something wrong? Did I not install the modules correctly or something?
I am using srun instead of sbatch to run all these cmds. I also tried it without the cluster but I still get the error.

My prokaryotic-binning cmd looks like this.
binning-prokaryotic.py -f ${FASTA} -b ${BAM} -n ${SampleID} -p ${N_JOBS} -m 1500 -I ${N_ITER} -o ${OUT_DIR}/binning/prokaryotic

My viral binning cmd looks like this.
binning-viral.py -f ${FASTA} -b ${BAM} -n ${SampleID} -p ${N_JOBS} -m 1500 -o ${OutDir}/binning/viral --include_provirus_detection

Thank you for your help!

[Bug] Package conflicts for VEBA-process_env.yml environment installation

Describe the bug:

Versions

v1.0.3

Command used to produce error:

conda env create -n VEBA-preprocess_env -f VEBA-preprocess_env.yml

Please provide the following files:

NA

Also, please provide the input files used so I can reproduce the error and further diagnose (I'll be able to help you better). In the meantime, I am currently working on more useful error logs so patience is appreciated.

Here's the error:

Package cryptography conflicts for:
pyopenssl==22.0.0=pyhd8ed1ab_0 -> cryptography[version='>=35.0']
cryptography==36.0.2=py37h38fbfac_0

The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.28=0
  - feature:|@/linux-64::__glibc==2.28=0
  - backports.zoneinfo==0.2.1=py37h5e8e339_4 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - bbmap==38.92=he522d1c_0 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - blast==2.5.0=hc0b0e79_3 -> libgcc-ng[version='>=4.9'] -> __glibc[version='>=2.17']
  - bmfilter==3.101=hc9558a2_3 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - bmtool==3.101=he1b5a44_3 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - boost-cpp==1.77.0=h359cf19_1 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - boost==1.77.0=py37h796e4cb_0 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - bowtie2==2.3.5.1=py37he513fc3_0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
  - brotlipy==0.7.0=py37h5e8e339_1003 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - bzip2==1.0.8=h7f98852_4 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - c-ares==1.18.1=h7f98852_0 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - cffi==1.15.0=py37hd667e15_1 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - cryptography==36.0.2=py37h38fbfac_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - curl==7.82.0=h7bff187_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - expat==2.4.7=h27087fc_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - fastp==0.23.2=h79da9fb_0 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - fontconfig==2.13.96=h8e229c2_2 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - freetype==2.10.4=h0708190_1 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - hdf5==1.10.6=nompi_h6a2412b_1114 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - htslib==1.14=h9093b5e_0 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - icu==69.1=h9c3ff4c_0 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - isa-l==2.30.0=ha770c72_4 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - keyutils==1.6.1=h166bdaf_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - krb5==1.19.3=h3790be6_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - libcurl==7.82.0=h7bff187_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - libdeflate==1.7=h7f98852_5 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - libedit==3.1.20191231=he28a2e2_2 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - libev==4.33=h516909a_1 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - libffi==3.3=h58526e2_2 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - libiconv==1.16=h516909a_0 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - libnghttp2==1.47.0=h727a467_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - libnsl==2.0.0=h7f98852_0 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - libopenblas==0.3.18=pthreads_h8fe5266_0 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - libpng==1.6.37=h21135ba_2 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - libssh2==1.10.0=ha56f1ee_2 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - libuuid==2.32.1=h7f98852_1000 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - libxml2==2.9.12=h885dcf4_1 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - libzlib==1.2.11=h36c2ea0_1013 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - lz4-c==1.9.3=h9c3ff4c_1 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - ncbi-ngs-sdk==2.11.2=pl5321h629fbf0_1 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - ncurses==6.3=h9c3ff4c_0 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - numpy==1.21.5=py37hf2998dd_0 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - openssl==1.1.1o=h166bdaf_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - ossuuid==1.6.2=hf484d3e_1000 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
  - perl-alien-build==2.48=pl5321hec16e2b_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - perl-alien-libxml2==0.17=pl5321hec16e2b_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - perl-data-dumper==2.183=pl5321hec16e2b_1 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - perl-encode==3.17=pl5321hec16e2b_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - perl-mime-base64==3.16=pl5321hec16e2b_2 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - perl-pathtools==3.75=pl5321hec16e2b_3 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - perl-xml-libxml==2.0207=pl5321h661654b_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - perl==5.32.1=2_h7f98852_perl5 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - python==3.7.11=h12debd9_0 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - readline==8.1=h46c0cb4_0 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - samtools==1.15=h3843a85_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - scandir==1.10.0=py37h5e8e339_4 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - sqlite==3.37.1=h4ff8645_0 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - sra-tools==2.11.0=pl5321ha49a11a_3 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - srprism==2.4.24=h96824bc_3 -> libgcc-ng[version='>=4.9'] -> __glibc[version='>=2.17']
  - tbb==2020.2=h4bd325d_4 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - tk==8.6.12=h27826a3_0 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
  - trf==4.09.1=hec16e2b_2 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17']
  - xz==5.2.5=h516909a_1 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - zlib==1.2.11=h36c2ea0_1013 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - zstd==1.5.2=ha95c52a_0 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']

Your installed version is: 2.28
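One thing worth checking here (a sketch, not a confirmed fix): the installed glibc 2.28 satisfies every >=2.17 constraint above, so the solver may simply not be detecting the __glibc virtual package. Conda supports overriding virtual packages via an environment variable:

# Check which virtual packages conda detects (look for a __glibc entry)
conda info | grep -A 5 "virtual packages"

# If __glibc is missing or misdetected, overriding it is one possible workaround;
# set the value to the system's actual glibc version (see `ldd --version`)
export CONDA_OVERRIDE_GLIBC=2.28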

[Feature Request] Add a checkpoint after assembly in 1__assembly (before index building)

Hi Josh,

In my last bug report, I saw that biopython was missing. All my assemblies were finished, but the 1__assembly step didn't complete because of the missing module.

If I re-run the assembly.py command, it will start assembling from scratch.

I was wondering if a checkpoint could be added in GenoPype right after the assembly itself finishes. My assembly took around 4 days, and now I have to run it again because the first action of the 1__assembly step is to remove all previous files.

Would appreciate it if an extra checkpoint could be added right after assembly!
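In the meantime, a crude safeguard I'm using before any re-run (just a sketch; the paths are hypothetical and should be adjusted to the actual step directory):

# Copy the finished assembly out of the working directory so a re-run's
# cleanup step cannot delete it (paths are hypothetical)
mkdir -p assembly_backups
cp veba_output/assembly/${ID}/intermediate/1__assembly/scaffolds.fasta assembly_backups/${ID}.scaffolds.fasta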

Best,

Erfan

[Question] Is there documentation on how to run VEBA using the Docker containers?

Please confirm that you've checked the FAQ section:
https://github.com/jolespin/veba/blob/main/FAQ.md

If you still have a question, feel free to ask here.

I can't run Docker on the cluster, but I have been able to run Singularity, provided I pull the containers to the cluster first, as shown here:

singularity pull docker://jolespin/veba_binning-viral:2.0.0

Could you tell me where I can find details on how to run that container (veba_binning-viral:2.0.0)? In particular, how do I set the --path_config and --veba_database parameters?

Do I need to run the veba_database container first? If so, which command should I run?
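Based on the help text below, here is the shape of invocation I'd expect; the /volumes/* mount points and host paths are my assumptions, not documented values, and --path_config can presumably stay at its default (CONDA_PREFIX inside the container, per the help text):

# Host paths and the /volumes/* mount points below are assumptions; adjust to your cluster
export LOCAL_DATABASE=/scratch/$USER/veba_database   # host copy of the VEBA database

singularity run \
    -B ${LOCAL_DATABASE}:/volumes/database \
    -B $PWD:/volumes/project \
    veba_binning-viral_2.0.0.sif \
    binning-viral.py \
        -f /volumes/project/scaffolds.fasta \
        -n SAMPLE_1 \
        -o /volumes/project/veba_output/binning/viral \
        -p 8 \
        --veba_database /volumes/database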

$ singularity run veba_binning-viral_2.0.0.sif binning-viral.py -h
usage: binning-viral.py -f  -l  -n  -o  [Requires at least 20GB]

    Running: binning-viral.py v2023.11.30 via Python v3.10.9 | /opt/conda/bin/python

options:
  -h, --help            show this help message and exit

Required I/O arguments:
  -f FASTA, --fasta FASTA
                        path/to/scaffolds.fasta
  -n NAME, --name NAME  Name of sample
  -o PROJECT_DIRECTORY, --project_directory PROJECT_DIRECTORY
                        path/to/project_directory [Default: veba_output/binning/viral]
  -b BAM [BAM ...], --bam BAM [BAM ...]
                        path/to/mapped.sorted.bam files separated by spaces.

Utility arguments:
  --path_config PATH_CONFIG
                        path/to/config.tsv [Default: CONDA_PREFIX]
  -p N_JOBS, --n_jobs N_JOBS
                        Number of threads [Default: 1]
  --random_state RANDOM_STATE
                        Random state [Default: 0]
  --restart_from_checkpoint RESTART_FROM_CHECKPOINT
                        Restart from a particular checkpoint [Default: None]
  -v, --version         show program's version number and exit

Database arguments:
  --veba_database VEBA_DATABASE
                        VEBA database location.  [Default: $VEBA_DATABASE environment variable]

Binning arguments:
  -a ALGORITHM, --algorithm ALGORITHM
                        Binning algorithm to use: {genomad, virfinder}  [Default: genomad]
  -m MINIMUM_CONTIG_LENGTH, --minimum_contig_length MINIMUM_CONTIG_LENGTH
                        Minimum contig length.  [Default: 1500]
  --include_provirus_detection
                        Include provirus viral detection

Gene model arguments:
  --prodigal_genetic_code PRODIGAL_GENETIC_CODE
                        Prodigal-GV -g translation table (https://github.com/apcamargo/prodigal-gv) [Default: 11]

geNomad arguments
Using --relaxed mode by default.  Adjust settings according to the following table: https://portal.nersc.gov/genomad/post_classification_filtering.html#default-parameters-and-presets:
  --genomad_qvalue GENOMAD_QVALUE
                        Maximum accepted false discovery rate. [Default: 1.0; 0.0 < x ≤ 1.0]
  --sensitivity SENSITIVITY
                        MMseqs2 marker search sensitivity. Higher values will annotate more proteins, but the search will be slower and consume more memory. [Default: 4.0; x ≥ 0.0]
  --splits SPLITS       Split the data for the MMseqs2 search. Higher values will reduce memory usage, but will make the search slower. If the MMseqs2 search is failing, try to increase the number of splits. Also used for VirFinder. [Default: 0; x ≥ 0]
  --composition COMPOSITION
                        Method for estimating sample composition. (auto|metagenome|virome) [Default: auto]
  --minimum_score MINIMUM_SCORE
                        Minimum score to flag a sequence as virus or plasmid. By default, the sequence is classified as virus/plasmid if its virus/plasmid score is higher than its chromosome score, regardless of the value. [Default: 0; 0.0 ≤ x ≤ 1.0]
  --minimum_plasmid_marker_enrichment MINIMUM_PLASMID_MARKER_ENRICHMENT
                        Minimum allowed value for the plasmid marker enrichment score, which represents the total enrichment of plasmid markers in the sequence. Sequences with multiple plasmid markers will have higher values than the ones that encode few or no markers.[Default: -100]
  --minimum_virus_marker_enrichment MINIMUM_VIRUS_MARKER_ENRICHMENT
                        Minimum allowed value for the virus marker enrichment score, which represents the total enrichment of plasmid markers in the sequence. Sequences with multiple plasmid markers will have higher values than the ones that encode few or no markers. [Default: -100]
  --minimum_plasmid_hallmarks MINIMUM_PLASMID_HALLMARKS
                        Minimum number of plasmid hallmarks in the identified plasmids.  [Default: 0; x ≥ 0]
  --minimum_virus_hallmarks MINIMUM_VIRUS_HALLMARKS
                        Minimum number of virus hallmarks in the identified viruses.  [Default: 0; x ≥ 0]
  --maximum_universal_single_copy_genes MAXIMUM_UNIVERSAL_SINGLE_COPY_GENES
                        Maximum allowed number of universal single copy genes (USCGs) in a virus or a plasmid. Sequences with more than this number of USCGs will not be classified as viruses or plasmids, regardless of their score.  [Default: 100]
  --genomad_options GENOMAD_OPTIONS
                        geNomad | More options (e.g. --arg 1 ) [Default: '']

VirFinder arguments:
  --virfinder_pvalue VIRFINDER_PVALUE
                        VirFinder statistical test threshold [Default: 0.05]
  --mmseqs2_evalue MMSEQS2_EVALUE
                        Maximum accepted E-value in the MMseqs2 search. Used by genomad annotate when VirFinder is used as binning algorithm [Default: 1e-3]
  --use_qvalue          Use qvalue (FDR) instead of pvalue
  --use_minimal_database_for_taxonomy
                        Use a smaller marker database to annotate proteins. This will make execution faster but sensitivity will be reduced.
  --virfinder_options VIRFINDER_OPTIONS
                        VirFinder | More options (e.g. --arg 1 ) [Default: '']

CheckV arguments:
  --checkv_options CHECKV_OPTIONS
                        CheckV | More options (e.g. --arg 1 ) [Default: '']
  --multiplier_viral_to_host_genes MULTIPLIER_VIRAL_TO_HOST_GENES
                        Minimum number of viral genes [Default: 5]
  --checkv_completeness CHECKV_COMPLETENESS
                        Minimum completeness [Default: 50.0]
  --checkv_quality CHECKV_QUALITY
                        Comma-separated string of acceptable arguments between {High-quality,Medium-quality,Complete} [Default: High-quality,Medium-quality,Complete]
  --miuvig_quality MIUVIG_QUALITY
                        Comma-separated string of acceptable arguments between {High-quality,Medium-quality,Complete} [Default: High-quality,Medium-quality,Complete]

featureCounts arguments:
  --long_reads          featureCounts | Use this if long reads are being used
  --featurecounts_options FEATURECOUNTS_OPTIONS
                        featureCounts | More options (e.g. --arg 1 ) [Default: ''] | http://bioinf.wehi.edu.au/featureCounts/

Copyright 2021 Josh L. Espinoza ([email protected])



Thanks in advance for your help.

[Question] Where have all the mRNAs gone?

Hi @jolespin,

I've used your script compile_metaeuk_identifiers.py mentioned here. It works seamlessly; however, I'm wondering where all the mRNAs have gone. See here:

…from MetaEuk

scaffold2569_size8692	MetaEuk	gene	6442	7410	344	-	.	Target_ID=UniRef90_UPI0023DCB954;TCS_ID=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441
scaffold2569_size8692	MetaEuk	mRNA	6442	7410	344	-	.	Target_ID=UniRef90_UPI0023DCB954;TCS_ID=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_mRNA;Parent=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441
scaffold2569_size8692	MetaEuk	exon	7231	7410	64	-	.	Target_ID=UniRef90_UPI0023DCB954;TCS_ID=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_exon_0;Parent=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_mRNA
scaffold2569_size8692	MetaEuk	CDS	7231	7410	64	-	.	Target_ID=UniRef90_UPI0023DCB954;TCS_ID=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_CDS_0;Parent=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_exon_0
scaffold2569_size8692	MetaEuk	exon	6886	7158	149	-	.	Target_ID=UniRef90_UPI0023DCB954;TCS_ID=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_exon_1;Parent=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_mRNA
scaffold2569_size8692	MetaEuk	CDS	6886	7158	149	-	.	Target_ID=UniRef90_UPI0023DCB954;TCS_ID=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_CDS_1;Parent=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_exon_1

…and from compile_metaeuk_identifiers.py

scaffold10000_size3469	MetaEuk	gene	518	2727	1039.0	+	.	target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);
scaffold10000_size3469	MetaEuk	CDS	518	2727	1039.0	+	.	target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);
scaffold10000_size3469	MetaEuk	exon	518	712	1039.0	+	.	target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);exon_id=scaffold10000_size3469_517:2726(+).exon_1
scaffold10000_size3469	MetaEuk	exon	760	1704	1039.0	+	.	target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);exon_id=scaffold10000_size3469_517:2726(+).exon_2
scaffold10000_size3469	MetaEuk	exon	1762	2055	1039.0	+	.	target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);exon_id=scaffold10000_size3469_517:2726(+).exon_3
scaffold10000_size3469	MetaEuk	exon	2109	2507	1039.0	+	.	target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);exon_id=scaffold10000_size3469_517:2726(+).exon_4

I know the two files are not in the same order. Nevertheless, I cannot find any mRNA records in the compiled file, and I need them for downstream analysis.
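For now I'm considering regenerating them myself: in the MetaEuk excerpt above, each gene has exactly one mRNA with the same coordinates as the gene record, so the mRNA lines can be derived from the gene lines. A minimal sketch (file names are hypothetical, and it assumes the one-mRNA-per-gene pattern holds throughout):

# Re-emit each gene record as an mRNA record with a derived ID and a
# Parent attribute pointing back at the gene (assumes one mRNA per gene)
awk 'BEGIN { FS = OFS = "\t" }
{
    print                                           # keep every original record
    if ($3 == "gene") {
        id = $9
        sub(/.*ID=/, "", id); sub(/;.*/, "", id)    # extract the gene ID value
        $3 = "mRNA"
        sub(/ID=[^;]+/, "ID=" id ".mRNA;Parent=" id, $9)
        print                                       # emit the derived mRNA record
    }
}' gene_models.gff > gene_models.with_mrna.gff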

Cheers @bheimbu

[Question] Why did I get a KeyError: 'TMPDIR' when running binning-prokaryotic.py?

Dear Josh,

Hope all is well. Erfan here again. I also wanted to kindly request tutorials for people hoping to enter the pipeline at a specific point.

Since the pipeline is very structured, say I've done my assembly with different software and now want to do my binning with your pipeline: I need to create many directories and subdirectories, which can get confusing.
In my case, I ran into an issue with the prokaryotic binning step. My code:

Python version: 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51)  [GCC 9.4.0]
Python path: /home/eshekarriz/miniconda3/envs/VEBA-binning-prokaryotic_env/bin/python
Script version: 2022.10.26
VEBA Database: /home/eshekarriz/hdd_16t/database/veba
Moment: 2022-11-15 02:18:13
Directory: /home/eshekarriz/hdd_16t/analysis/SeepFungiNet2022/metagenomics/veba
Commands:
['/home/eshekarriz/miniconda3/envs/VEBA-binning-prokaryotic_env/bin/binning-prokaryotic.py', '-f', 'veba_output/assembly/SMM2/output/scaffolds.fasta', '-b', 'veba_output/assembly/SMM2/output/mapped.sorted.bam', '-n', 'SMM2', '-p', '24', '-o', 'veba_output/binning/prokaryotic/', '-m', '1500', '-I', '10']

The error I got:
Traceback (most recent call last):
  File "/home/eshekarriz/miniconda3/envs/VEBA-binning-prokaryotic_env/bin/binning-prokaryotic.py", line 2380, in <module>
    main()
  File "/home/eshekarriz/miniconda3/envs/VEBA-binning-prokaryotic_env/bin/binning-prokaryotic.py", line 2370, in main
    pipeline = create_pipeline(
  File "/home/eshekarriz/miniconda3/envs/VEBA-binning-prokaryotic_env/bin/binning-prokaryotic.py", line 1859, in create_pipeline
    cmd = get_checkm_cmd(**params)
  File "/home/eshekarriz/miniconda3/envs/VEBA-binning-prokaryotic_env/bin/binning-prokaryotic.py", line 661, in get_checkm_cmd
    "--tmpdir {}".format(os.environ["TMPDIR"]), # Hack around: OSError: AF_UNIX path too long
  File "/home/eshekarriz/miniconda3/envs/VEBA-binning-prokaryotic_env/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'TMPDIR'

I'm guessing this is because intermediate files created in the previous steps are missing. Correct me if I'm wrong, but I've also checked the source code, and it seems I should have a TMPDIR set in my OS environment, which I don't.
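Setting TMPDIR explicitly before calling the module should at least avoid the KeyError, since the script only reads os.environ["TMPDIR"] to pass CheckM a --tmpdir. A sketch (the path is arbitrary, but the source comment about "AF_UNIX path too long" suggests keeping it short):

# Define a short, writable TMPDIR before running the module
mkdir -p /home/eshekarriz/tmp
export TMPDIR=/home/eshekarriz/tmp
binning-prokaryotic.py -f veba_output/assembly/SMM2/output/scaffolds.fasta -b veba_output/assembly/SMM2/output/mapped.sorted.bam -n SMM2 -p 24 -o veba_output/binning/prokaryotic/ -m 1500 -I 10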

Would appreciate your help, and also perhaps tutorials on how to "insert" ourselves into a specific module from a different pipeline seamlessly.

Best,

Erfan

[Typo] In Viral Metatranscriptomics Tutorial

In the viral metatranscriptomics walkthrough:

CMD="source activate VEBA-assembly_env && assembly.py -1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} --P rnaspades.py"

The option flag should be -P instead of --P.
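i.e., the corrected line would be:

CMD="source activate VEBA-assembly_env && assembly.py -1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -P rnaspades.py"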

A quick catch. Hope it helps.

Best,

Erfan

[Bug] Error when creating VEBA environments

Hi! I installed VEBA and am now checking if the installation was successful. I have two environments created (VEBA-assembly_env and VEBA-database_env), but not the following:

VEBA-amplicon_env
VEBA-annotate_env
VEBA-binning-eukaryotic_env
VEBA-binning-prokaryotic_env
VEBA-binning-viral_env
VEBA-classify_env
VEBA-cluster_env
VEBA-mapping_env
VEBA-phylogeny_env
VEBA-preprocess_env

Does this mean there was a problem with my installation? What do you recommend? Thank you
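In the meantime, a quick way to see which environments exist, plus a sketch of how one might recreate a single missing one (the yml location and file names are assumptions; check the install directory of your local VEBA checkout):

# List the VEBA environments conda knows about
conda env list | grep "VEBA-"

# Hypothetical recovery for one missing environment; verify the actual
# yml path and name in your local checkout before running
conda env create -n VEBA-preprocess_env -f install/environments/VEBA-preprocess_env.yml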


[Question] Length mismatch during merging step of annotation

Hi Josh,

I am getting this error during the merging step (stops at 88%) of the annotation module.

Reading identifier mapping table: veba_output/cluster/output/global/identifier_mapping.proteins.tsv.gz
Traceback (most recent call last):
  File "/scratch/pawsey0159/mcampbell/conda/envs/VEBA-annotate_env/bin/scripts/merge_annotations.py", line 384, in <module>
    main()
  File "/scratch/pawsey0159/mcampbell/conda/envs/VEBA-annotate_env/bin/scripts/merge_annotations.py", line 129, in main
    df_identifier_mapping.columns = ["id_protein", "id_contig", "id_genome"]
  File "/scratch/pawsey0159/mcampbell/conda/envs/VEBA-annotate_env/lib/python3.7/site-packages/pandas/core/generic.py", line 5500, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
  File "/scratch/pawsey0159/mcampbell/conda/envs/VEBA-annotate_env/lib/python3.7/site-packages/pandas/core/generic.py", line 766, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/scratch/pawsey0159/mcampbell/conda/envs/VEBA-annotate_env/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 216, in set_axis
    self._validate_set_axis(axis, new_labels)
  File "/scratch/pawsey0159/mcampbell/conda/envs/VEBA-annotate_env/lib/python3.7/site-packages/pandas/core/internals/base.py", line 58, in _validate_set_axis
    f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 6 elements, new values have 3 elements

Have you seen this before?
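In case it helps, here is how I'd check how many columns the mapping table actually has (the path is taken from the log above); the ValueError implies it has 6 where merge_annotations.py assigns only 3 names:

# Count the columns in the first few rows of the identifier mapping table
zcat veba_output/cluster/output/global/identifier_mapping.proteins.tsv.gz | head -n 5 | awk -F'\t' '{ print NF " columns" }'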

Cheers,

Matt
