linxingchen / cobra Goto Github PK

A tool to raise the quality of viral genomes assembled from short-read metagenomes via resolving and joining of contigs fragmented during de novo assembly.

License: MIT License

Python 100.00%

metagenomics viral-metagenomics genome-assembly manual-curation short-read-sequencing

cobra's Introduction

COBRA (Contig Overlap Based Re-Assembly) is a bioinformatics tool to get higher quality viral genomes assembled from metagenomes of short paired-end reads. COBRA was written in Python. COBRA has so far only been tested on assembled contigs and scaffolds from metaSPAdes, IDBA_UD, and MEGAHIT.

# Developed by Dr. LinXing Chen
# University of California, Berkeley
# The Banfield Lab
# Email: [email protected]

Versions

v1.2.2 (released on 2023-09-03) - initial release
v1.2.3 (released on 2024-02-26)

The GC function issue due to the update of Biopython.
The abnormal exit in the middle of processing some samples.
If none of the queries was extended, the process will break. If your runs do not have the expected output files, see the log file.

Citation

The paper is out at Nature Microbiology (https://www.nature.com/articles/s41564-023-01598-2). Please cite as follows if you find COBRA is helpful for your analyses. Chen, L., Banfield, J.F. COBRA improves the completeness and contiguity of viral genomes assembled from metagenomes. Nat Microbiol (2024). https://doi.org/10.1038/s41564-023-01598-2

Introduction

1. Why metagenomic contigs are fragmented?

The genomes assembled from short paired-end reads based metagenomes are usually fragmented due to (1) intra-genome repeats, (2) inter-genome shared region, and (3) within-population variations, as the widely utilized assemblers based on de Bruijn graphs, e.g., metaSPAdes, IDBA_UD and MEGAHIT, tend to have a breaking point when multiple paths are available instead of making risky extension (see example in Figure 1).

Figure 1. Example of how assemblers break in assembly when within-population occurs.

2. Contigs may be joined with expected end overlap.

According to the principles of the abovementioned assemblers, the broken contigs have an end overlap with determined length, that is the max-kmer (maxK hereafter) used in de nono assembly for metaSPAdes and MEGAHIT, and the maxK-1 for IDBA_UD, which we termed as "expected overlap length" (Figures 1 and 2).

Note: as COBRA will use the information provided by paired-end reads, thus only those samples sequenced by paired-end technology should work.

Figure 2. The "expected overlap length" has been documented in manual genome curation, see Chen et al. 2020 for details.

How COBRA works

COBRA determines the "expected overlap length" (both the forward direction and reverse complement direction) for all the contigs from an assembly, then looks for the valid joining path for each query that users provide (should be a fraction of the whole assembly) based on a list of features including contig coverage, contig overlap relationships, and contig continuity (based on paired-end reads mapping) (Figure 3).

Note that scaffolds (for example, metaSPAdes assembly) could be used as input for COBRA extension, however, we suggest not using scaffolds from IDBA_UD as the potential errors in the scaffolding step (see Chen et al. 2020 for details). Thus, for IDBA_UD and MEGAHIT assembly, the contigs should be used.

Given that COBRA has only tested for contigs/scaffolds from IDBA_UD, metaSPAdes and MEGAHIT, it will be risky to use it on contigs/scaffolds from any other assemblers.

Figure 3. The workflow of COBRA.

Dependencies

COBRA is a Python script (tested for version 3.7 or higher) that uses a list of frequently used Python packages including:

Bio
Bio.Seq
collections
argparse
math
pysam
time

The only third-party software that COBRA will use is BLASTn.

Installation

COBRA could now be installed via different ways.

(1) git

git clone https://github.com/linxingchen/cobra.git

cd cobra

python cobra.py -h

(2) pip

pip install cobra-meta

To confirm the installment,

cobra-meta -h

which shows your something like this

usage: cobra-meta [-h] -q QUERY -f FASTA -a {idba,megahit,metaspades} -mink MINK -maxk MAXK -m MAPPING -c COVERAGE [-lm LINKAGE_MISMATCH] [-o OUTPUT] [-t THREADS] [-v]

...

(3) conda

conda create -n cobra python=3.8

conda activate cobra

conda install bioconda::cobra-meta or conda install linxingchen1987::cobra-meta

To confirm the installment,

cobra-meta -h

which shows your something like this

usage: cobra-meta [-h] -q QUERY -f FASTA -a {idba,megahit,metaspades} -mink MINK -maxk MAXK -m MAPPING -c COVERAGE [-lm LINKAGE_MISMATCH] [-o OUTPUT] [-t THREADS] [-v]

...

Update

pip install --upgrade cobra-meta

conda

conda activate cobra (if cobra is the conda environment name)

conda update cobra-meta

Input files

(1) COBRA needs four files as inputs, i.e.,

-f/--fasta: A fasta format file containing all the contigs from a single assembly, note that IDBA_UD and MEGAHIT usually save contigs with a minimum length of 200 bp.
-c/--coverage: a two columns (separated by tab) file of the sequencing coverage of all contigs in the -f/--fasta file, example below:

contig-140_1    42.1388
contig-140_2    14.6023
contig-140_3    15.4817
contig-140_4    41.2746
...

-q/--query: the query contigs that the user wants COBRA to extend, could be provided in a fasta format file, or a one-column text file with the names of the query contigs. Please make sure the names are exactly the same format as in the -f/--fasta file, otherwise, COBRA may have problems extending them.
-m/--mapping: the paired-end reads mapping file of all contigs in the -f/--fasta file, could be sam or bam file.

(2) and three parameters

-a/--assembler: the name of the de novo assembler used, currently only 'idba' (for IDBA_UD), 'metaspades' (for metaSPAdes), and 'megahit' (for MEGAHIT).
-maxk/--maxk: the largest kmer used in de novo assembly.
-mink/--mink: the smallest kmer used in de novo assembly.

(3) Optional flags

-lm/--linkage_mismatch: the number of read mapping mismatches allowed when determining if two contigs were spanned by paired reads.
-o/--output: the name of the output folder, otherwise it will be "{-q/--query}.COBRA" if not provided.
-t/--threads: the number of threads used for BLASTn search.

How to obtain the mapping file

The mapping file could be obtained with tools like Bowtie2 and BBMap, please refer to the manual descriptions for details of the tools. Below is the general way to get the sorted sam/bam file, you thus need to be available to samtools (which could be get here - https://github.com/samtools/samtools).

For example,

contig file = "contigs.fasta"
first read file = "R1.fastq.gz"
second read file = "R2.fastq.gz"

(1) with Bowtie2 (https://github.com/BenLangmead/bowtie2)

bowtie2-build contigs.fasta contigs.fasta
bowtie2 -p 16 -x contigs.fasta -1 R1.fastq.gz -2 R2.fastq.gz -S output.sam && samtools view -bS output.sam | samtools sort -o sorted_output.bam -

(2) with BBMap (https://github.com/BioInfoTools/BBMap)

bbmap.sh ref=contigs.fasta in1=R1.fastq.gz in2=R2.fastq.gz threads=16 out=output.sam (good)
samtools view -bS output.sam > output.bam
samtools sort -o sorted_output.bam output.bam

How to obtain the coverage file

(1) with jgi_summarize_bam_contig_depths

Once the sorted sam or bam file is ready, the tool of "jgi_summarize_bam_contig_depths" from MetaBAT (https://bitbucket.org/berkeleylab/metabat/src/master/), or could be used to obtain the coverage file, the resulting profile should be transferred to get a two-column file divided by tab.

jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *sam
jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt *bam

The output file from jgi_summarize_bam_contig_depths could be converted to a two-column file divided by tab using the script provided in this study (coverage.transfer.py).

python coverage.transfer.py -i original.coverage.txt -o coverage.txt

(2) CoverM

CoverM is a fast DNA read coverage and relative abundance calculator focused on metagenomics applications. Usage could be found here (https://github.com/wwood/CoverM).

(3) pyCoverM

pyCoverM is a simple Python interface to CoverM's fast coverage estimation functions, which could be found here (https://github.com/apcamargo/pycoverm).

How to run

(1) The users can only specify the required parameters:

cobra-meta -f input.fasta -q query.fasta -c coverage.txt -m mapping.sam -a idba -mink 20 -maxk 140

(2) The users could also include the optional parameters like output name (-o), mismatch of mapped reads for linkage identification (-mm)

cobra-meta -f all.contigs.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a idba -mink 20 -maxk 140 -mm 2

cobra-meta -f all.contigs.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a metaspades -mink 21 -maxk 127 -mm 2

cobra-meta -f all.contigs.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a megahit -mink 21 -maxk 141 -mm 2

Output files

Below is a general list of output files in the output folder:

COBRA_category_i_self_circular_queries_trimmed.fasta
COBRA_category_i_self_circular_queries_trimmed.fasta.summary.txt
COBRA_category_ii_extended_circular_unique (folder)
COBRA_category_ii_extended_circular_unique.fasta
COBRA_category_ii_extended_circular_unique.fasta.summary.txt
COBRA_category_ii_extended_circular_unique_joining_details.txt
COBRA_category_ii_extended_failed.fasta
COBRA_category_ii_extended_failed.fasta.summary.txt
COBRA_category_ii_extended_partial_unique (folder)
COBRA_category_ii_extended_partial_unique.fasta
COBRA_category_ii_extended_partial_unique.fasta.summary.txt
COBRA_category_ii_extended_partial_unique_joining_details.txt
COBRA_category_iii_orphan_end.fasta
COBRA_category_iii_orphan_end.fasta.summary.txt
COBRA_joining_status.txt
COBRA_joining_summary.txt
intermediate.files (folder)
log
debug.txt
contig.new.fa

For all the queries, COBRA assigns them to different categories based on their joining status (detailed in the COBRA_joining_status.txt file), i.e.,

"self_circular" - the query contig itself is a circular genome.
"extended_circular" - the query contig was joined and extended into a circular genome.
"extended_partial" - the query contig was joined and extended but not to circular.
"extended_failed" - the query contig was not able to be extended due to COBRA rules.
"orphan end" - neither end of a given contig shares "expected overlap length" with others.

For the joined and extended queries in each category, only the unique ones (*.fasta) will be saved for users' following analyses, and the sequence information (e.g., length, coverage, GC, num of Ns) is summarized in the *fasta.summary.txt files. For categories of "extended_circular", and "extended_partial", the joining details of each query are included in the corresponding folder and *joining_details.txt file, and summarized in the COBRA_joining_summary.txt file, an example shown below:

QuerySeqID      QuerySeqLen     TotRetSeqs      TotRetLen       AssembledLen    ExtendedLen     Status
contig-140_100  47501   3       50379   49962   2461    Extended_circular
contig-140_112  45060   3       62549   62132   17072   Extended_circular
contig-140_114  44829   2       45342   45064   235     Extended_circular
contig-140_160  40329   2       41018   40740   411     Extended_circular
contig-140_188  38386   5       48986   48291   9905    Extended_circular
...

log file: The log file includes the content of each processing step, an example shown below:

1. INPUT INFORMATION
# Assembler: IDBA_UD
# Min-kmer: 20
# Max-kmer: 140
# Overlap length: 139 bp
# Read mapping max mismatches for contig linkage: 2
# Query contigs: file-path
# Whole contig set: file-path
# Mapping file: file-path
# Coverage file: file-path
# Output folder: file-path

2. PROCESSING STEPS
[01/22] [2023/05/04 12:48:15] Reading contigs and getting the contig end sequences. A total of 311739 contigs were imported.
[02/22] [2023/05/04 12:48:33] Getting shared contig ends.
[03/22] [2023/05/04 12:48:40] Writing contig end joining pairs.
[04/22] [2023/05/04 12:48:41] Getting contig coverage information.
[05/22] [2023/05/04 12:48:42] Getting query contig list. A total of 2304 query contigs were imported.
[06/22] [2023/05/04 12:48:43] Getting contig linkage based on sam/bam. Be patient, this may take long.
[07/22] [2023/05/04 13:00:01] Detecting self_circular contigs.
[08/22] [2023/05/04 13:00:37] Detecting joins of contigs. 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% finished.
[09/22] [2023/05/04 14:29:20] Saving potential joining paths.
[10/22] [2023/05/04 14:29:23] Checking for invalid joining: sharing queries.
[11/22] [2023/05/04 14:29:28] Getting initial joining status of each query contig.
[12/22] [2023/05/04 14:29:28] Getting final joining status of each query contig.
[13/22] [2023/05/04 14:29:28] Getting the joining order of contigs.
[14/22] [2023/05/04 14:29:28] Getting retrieved contigs.
[15/22] [2023/05/04 14:29:32] Saving joined seqeuences.
[16/22] [2023/05/04 14:29:35] Checking for invalid joining using BLASTn: close strains.
[17/22] [2023/05/04 14:30:11] Saving unique sequences of "Extended_circular" and "Extended_partial" for joining checking.
[18/22] [2023/05/04 14:30:12] Getting the joining details of unique "Extended_circular" and "Extended_partial" query contigs.
[19/22] [2023/05/04 14:30:12] Saving joining summary of "Extended_circular" and "Extended_partial" query contigs.
[20/22] [2023/05/04 14:30:15] Saving joining status of all query contigs.
[21/22] [2023/05/04 14:30:15] Saving self_circular contigs.
[22/22] [2023/05/04 14:30:16] Saving the new fasta file.

======================================================================================================================================================
3. RESULTS SUMMARY
# Total queries: 2304
# Category i   - Self_circular: 74
# Category ii  - Extended_circular: 120 (Unique: 82)
# Category ii  - Extended_partial: 1088 (Unique: 889)
# Category ii  - Extended_failed (due to COBRA rules): 245
# Category iii - Orphan end: 777
# Check "COBRA_joining_status.txt" for joining status of each query.
# Check "COBRA_joining_summary.txt" for joining details of "Extended_circular" and "Extended_partial" queries.
======================================================================================================================================================

cobra's People

Contributors

Stargazers

Watchers

Forkers

apcamargo zjyzjjzmt healthvivo liupfskygre hocnonsense peterk87 tarsli tigerxu

cobra's Issues

BLAST error

Thank you very much for your software output, which provides a new workflow for viral group research!

Building a new DB, current time: 03/09/2024 10:19:13
New DB name:   /home/zhongpei/hard_disk_sda2/zhongpei/Virome/rawdata/upload_20230812/zhongpei_analyse/fastp/final_assembly/clean_TPM_Decontam_contig/Vir_result/AF11_metaSPAdes_COBRA/blastdb_1.fa
New DB title:  blastdb_1.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
BLAST options error: File blastdb_1.fa is empty
BLAST Database error: No alias or index file found for nucleotide database [blastdb_1.fa] in search path [/home/zhongpei/hard_disk_sda2/zhongpei/Virome/rawdata/upload_20230812/zhongpei_analyse/fastp/final_assembly/clean_TPM_Decontam_contig/Vir_result/AF11_metaSPAdes_COBRA::]

When I run COBRA, echo gives the following error
But it seems my process is complete. Is this reasonable? And is it a "BLAST error" because my -q sequence is not extended at all?
here is my log file
log.txt
It would be extremely helpful to hear from you!

cobra-meta missing

Hi Dr Chen
After conda activate cobra, we tried to run cobra pipeline, with this error occured as
"cobra-meta" missing
do i miss to install something?

thanks.

Application of COBRA to stool whole metagenome datasets

Hello,

First of all, thank you for developing this great tool!

I recently read your published paper.
I would like to ask about the current recommendations or precautions for applying COBRA to stool whole metagenome datasets.

From what I understand based on the version uploaded to bioRxiv, there was exploration into its use for microbial genomes. However, it seems this part was excluded from the published version during the revision process (I saw it in peer review comments).

Are there any current recommendations for applying COBRA to whole metagenome datasets?
Is performing contig extension with COBRA before binning currently the best practice, as mentioned in the revision comments?
Is it possible to apply COBRA like bewlow workflow (mutiple binning tools followed by binning refinement)

contig extension with COBRA

binning using multiple binner (such as MetaBAT, MaxBin, CONCOCT, Binsanity etc.)

binning refinement (e.g., DAS tool) process.

I would assume performing this process individually for each sample wouldn't raise any problems, but I wonder if you have any experiences or concerns regarding this workflow.

Thank you for taking the time to read my questions.
If there are any misunderstandings on my part, I would appreciate your kind correction.

Regards,

Usage of cobra

Cobra is an interesting software, but I have some questions. Is the input data must be viral metagenome data? I'm not sure if cobra could works for common metagenome data and will cobra produce some bacteria or archaea MAGs? Thanks.

input files

Thanks for this nice tool! Is it possible to add advice in the readme text on how produce coverage.txt and mapping.sam ?

Conda installation

Hi Linxing! I just wanted to recommend adding detailed installation instructions for a conda environment that users could just copy and paste. I know it's silly, but many users are not savvy in python but have conda available, and I think it would extend your user list if you just do this one small thing.

How does COBRA handle joining two query sequences?

Hi Linxing,

Great tool. I'm really interested in using it with some of my datasets. One question that came up in discussing COBRA with others was how it handles cases where e.g. Query Contigs Q1 and Q2 could be joined by non-query contigs, e.g. Q1-c1-c2-Q2. Apologies if I missed this part of the manuscript but does COBRA have the ability to handle this, or does it start from the assumption that it is only looking for ways it can extend query contigs in isolation?

Best,

Luke

About the usage of COBRA

Dear Linxing,

Thank you for providing an excellent tool for viral metagenomic studies. I have a question about the usage of COBRA:

I want to improve the genome quality of viruses recovered from multi-sample coassembled metagenomes (such as A1, A2, A3). In this case, I merged the mapping files of each metagenome (i.e., A1.bam A2.bam A3.bam > merge.bam) as the bam input and calculate the coverage file using jgi_summarize_bam_contig_depths (the first column was contig name, the second to fourth columns were the coverage of A1, A2, A3) as the coverage input. I'm not sure if my inputs to COBRA were correct.

Regards,
Mengzhi

Using a co-assembyl

Hello, thank you for the great tool!
My question is that can the assembly come from several different but similar sources? I have gut microbiome samples from the same organisms but from different sampling times and indivduals and made a coassembly from them.
Thanks, Balazs

Input assemblies for COBRA

Hi,

You mention that COBRA has so far only been tested on assembled contigs from metaSPAdes, IDBA_UD, and MEGAHIT. Would you recommend to use it on viral contigs that have been assembled using single cell mode (--sc) from SPAdes (used in our study for metaviromes)?

Thank you,
Asier

Questions on input fasta files

Thanks for this great tool! However, I have a few questions on input for cobra

Now I'm using megahit to assembly (I just know from #3 that it may generate many chimeric contigs, but spades may run out of memory when assembly environmental samples). In megahit intermediate results, it will provide a file named ./intermediate_contigs/k141.contigs.fa, in which there are many small contigs < 200 bp (which will be filtered in ./final.contigs.fa). My question is, which contig file is more recommended to be used as --fasta FASTA input?
Is it possible to just use the final.contigs.fa as --query QUERY file, or a filtered version with all contigs longer than given length (i.e. 1000 or 2500 bp)? In another words, can cobra be used before virus contigs annotation and MAG binning?

Regards,
hwrn

Cobra prints contig names and numbers without providing any clarification of what they represent

Hi @linxingchen

I've noticed that sometimes COBRA prints contig names and numbers on the screen without a clear explanation of what they represent. It appears that the code in these lines is responsible for this. Could it be leftover debug code?

KeyError for extended_circular_fasta.write(header2joined_seq[contig2extended_status[contig]] + '\n')

Hey LinXing,

First, thanks for developing this super useful tool! I've so far ran it on a couple hundred virome samples and it's working great! I have stumbled across an issue with 2 samples though that I was wondering if you could help me with. For each of the 2 metagenomes I get this error (but obviously a different key value differing the messages):

Traceback (most recent call last):                                                                                                                                            
  File ".snakemake/conda/e7dfc9b82d378646d96b58efd8350561_/bin/cobra-meta", line 12, in <module>                  
    sys.exit(main())                                                                                                                                                          
  File ".snakemake/conda/e7dfc9b82d378646d96b58efd8350561_/lib/python3.10/site-packages/cobra.py", line 1598, in m
ain                                                                                                                                                                           
    extended_circular_fasta.write(header2joined_seq[contig2extended_status[contig]] + '\n')                                                                                   
KeyError: 'NODE_404_length_6556_cov_28.183972'

I checked for these keys in the whole contig set (the scaffold.fasta output by metaspades); query file (all scaffolds >2.5kbp from the metaspades scaffolds.fasta); and the coverage file. It is present in each of those files with no additional characters. So, I'm a bit confused with what the source of the problem could be (especially since the couple of hundred other samples i've processed worked fine!). Any help you could give would be much appreciated!

The the log file looks like this:

1. INPUT INFORMATION
# Assembler: metaSPAdes
# Min-kmer: 21
# Max-kmer: 55
# Overlap length: 55 bp
# Read mapping max mismatches for contig linkage: 2
# Query contigs: /data/san/data0/users/chris/data/processed/assemblies/spades_CTRLM08LongB03Faeces/scaffolds_over_2.5kb.fasta
# Whole contig set: /data/san/data0/users/chris/data/processed/assemblies/spades_CTRLM08LongB03Faeces/scaffolds.fasta
# Mapping file: /data/san/data0/users/chris/data/processed/assemblies/spades_CTRLM08LongB03Faeces/scaffolds_sorted.bam
# Coverage file: /data/san/data0/users/chris/data/processed/assemblies/spades_CTRLM08LongB03Faeces/scaffolds.coverage.formatted.txt
# Output folder: /data/san/data0/users/chris/data/processed/assemblies/spades_CTRLM08LongB03Faeces/cobra

2. PROCESSING STEPS
[01/23] [2024/02/16 00:22:53] Reading contigs and getting the contig end sequences. A total of 255643 contigs were imported.
[02/23] [2024/02/16 00:22:58] Getting shared contig ends.
[03/23] [2024/02/16 00:23:00] Writing contig end joining pairs.
[04/23] [2024/02/16 00:23:00] Getting contig coverage information.
[05/23] [2024/02/16 00:23:00] Getting query contig list. A total of 2314 query contigs were imported.
[06/23] [2024/02/16 00:23:00] Getting contig linkage based on sam/bam. Be patient, this may take long.
[07/23] [2024/02/16 00:23:38] Parsing the linkage information.
[08/23] [2024/02/16 00:23:43] Detecting self_circular contigs.                                                                                                                     "zEQ$" 23:44 15-Feb-24
[09/23] [2024/02/16 00:23:43] Detecting joins of contigs. 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% finished.
[10/23] [2024/02/16 00:23:46] Saving potential joining paths.
[11/23] [2024/02/16 00:23:46] Checking for invalid joining: sharing queries.
[12/23] [2024/02/16 00:23:47] Getting initial joining status of each query contig.
[13/23] [2024/02/16 00:23:47] Getting final joining status of each query contig.
[14/23] [2024/02/16 00:23:47] Getting the joining order of contigs.
[15/23] [2024/02/16 00:23:47] Getting retrieved contigs.
[16/23] [2024/02/16 00:23:47] Saving joined seqeuences.
[17/23] [2024/02/16 00:23:47] Checking for invalid joining using BLASTn: close strains.
[18/23] [2024/02/16 00:23:48] Saving unique sequences of "Extended_circular" and "Extended_partial" for joining checking.

KeyError with input file

Hello,
I am running into this error:

cobra-meta -f final.contigs.fa -q test.fa -c coverage.txt -m sorted_output.sam -a idba -mink 20 -maxk 140

/home/pck/anaconda3/envs/cobra/lib/python3.8/site-packages/Bio/SeqUtils/init.py:144: BiopythonDeprecationWarning: GC is deprecated; please use gc_fraction instead.
warnings.warn(
Traceback (most recent call last):
File "/home/pck/anaconda3/envs/cobra/bin/cobra-meta", line 10, in
sys.exit(main())
File "/home/pck/anaconda3/envs/cobra/lib/python3.8/site-packages/cobra.py", line 878, in main
if header2len[line.reference_name] > 1000:
KeyError: 'k141_2210 flag=1 multi=2.0000 len=345'

k141 is the very first contig in my input file, the only output file besides the log is COBRA_end_joining_pairs.txt

log:

INPUT INFORMATION

Assembler: IDBA_UD

Min-kmer: 20

Max-kmer: 140

Overlap length: 139 bp

Read mapping max mismatches for contig linkage: 2

Query contigs: /home/pck/metagenome_seq/megahit_output/test.fa

Whole contig set: /home/pck/metagenome_seq/megahit_output/final.contigs.fa

Mapping file: /home/pck/metagenome_seq/megahit_output/sorted_output.sam

Coverage file: /home/pck/metagenome_seq/megahit_output/coverage.txt

Output folder: /home/pck/metagenome_seq/megahit_output/test.fa_COBRA

PROCESSING STEPS
[01/23] [2024/02/09 13:12:51] Reading contigs and getting the contig end sequences. A total of 3468 contigs were imported.
[02/23] [2024/02/09 13:12:51] Getting shared contig ends.
[03/23] [2024/02/09 13:12:51] Writing contig end joining pairs.
[04/23] [2024/02/09 13:12:51] Getting contig coverage information.
[05/23] [2024/02/09 13:12:51] Getting query contig list. A total of 1 query contigs were imported.
[06/23] [2024/02/09 13:12:51] Getting contig linkage based on sam/bam. Be patient, this may take long.

BLAST options error: File blastdb_1.fa is empty

Hi there,

I'm excited about the potential this tool offers!

Currently, I'm receiving an error in the python command about the BLAST db COBRA needs:

$ python cobra.py 
-f final.contigs.renamed.fa /
-q samp_471_concoct_280.fa /
-c coverage_tail.txt /
-m final.contigs.renamed.sam /
-a megahit /
-mink 21 /
-maxk 99

Command line error:

Building a new DB, current time: 06/30/2023 13:24:18
New DB name:  samp_471_concoct_280.fa_COBRA/blastdb_1.fa
New DB title:  blastdb_1.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
BLAST options error: File blastdb_1.fa is empty
BLAST Database error: No alias or index file found for nucleotide database [blastdb_1.fa] in search path [samp_471_concoct_280.fa_COBRA::/reference-data/blast:]

Any guidance would be greatly appreciated.

Tiny dataset is available ?

Hi.
This is great tool.
I would like to use cobra for reconstructing complete virome genome-set from my gut metagenomic reads.
Is it possible to provide some small test data (OR mock) on the repo. ?
I am hoping to confirm that it works successfully in my enviroment (ubuntu20, python3.10).

Best,

Missing `COBRA_retrieved_for_joining` contig file

Hello Cobra authors!

Thanks for building this tool.

I keep getting this error of a contig file not being created.

Traceback (most recent call last):
  File "/home/mall0133/miniconda3/envs/cobra/bin/cobra-meta", line 10, in <module>
    sys.exit(main())
  File "/home/mall0133/miniconda3/envs/cobra/lib/python3.8/site-packages/cobra.py", line 1747, in main
    '\t'.join([contig, str(header2len[contig]), summarize(contig), query2current[contig]]) + '\n')
  File "/home/mall0133/miniconda3/envs/cobra/lib/python3.8/site-packages/cobra.py", line 575, in summarize
    b = count_seq('COBRA_retrieved_for_joining/{0}_retrieved.fa'.format(item))  # number of retrieved contigs
  File "/home/mall0133/miniconda3/envs/cobra/lib/python3.8/site-packages/cobra.py", line 482, in count_seq
    a = open(fasta_file, 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'COBRA_retrieved_for_joining/NODE_1990_length_17225_cov_13.396421_retrieved.fa'

Every time I try re-running, it fails at this stage with a FileNotFoundError for a different contig.

This is my Cobra command. I'm using the latest version (version 1.2.3).

cobra-meta -f contigs.fasta -q query_contigs.fasta -c coverage.tsv -m sorted_reads.bam -a metaspades -mink 21 -maxk 127

My query_contigs.fasta file contains about 2300 contig sequences. I've also attached the log file of the run.
log.txt

Any advice on how to fix this error and get Cobra running will be appreciated.

Thanks!

Can I combine COBRA with binning tools?

Hi, @linxingchen
According to issues #2, #22, and #27, I think maybe I can extend the virus contigs with COBRA then binning the new contigs with binner tools like vRhyme to get virus genomes?

Questions about the application of COBRA

Dear Linxing,

Good afternoon. Thank you for this excellent paper and tool! I read this paper and shared it with my lab members. I have some questions about the application of COBRA:

For the metaSPAde you mentioned in the README file and paper, have you tested the MetaviralSPAdes(https://academic.oup.com/bioinformatics/article-abstract/36/14/4126/5837667) which is the specific viral mode for assembly? I'm working on some metagenomic data and trying to find the viral sequence in it. I think it would be a little different if we use metaSPAde and then identify viral contigs by other tools like VIBRANT or directly identify viral contigs based on MetaviralSPAdes.
I think it could be a similar question with closed#22.
We're working on the Mitochondria sequence also, and do you think COBRA could be used into others like mitochondria sequences, to re-assemble those sequences, using reference genome sequence as input1 and assembled results as input2? Or we have to use those inputs from the same data?

Thank you for your time and patience!

Best Regards,
Xinpeng

Error about blastn

Hi Dr Chen：
I try to run COBRA today with

source activate cobra
cobra-meta -f Megahit_out/611PES1/611PES1_megahit.contigs.fa -q checkv_in/611RW.fa -o COBRA/611_RW_MV_COBRA -c coverage/611RW_coverage.txt -m mapping/611RW-MV.sam -a megahit -mink 21 -maxk 141
conda deactivate

and it works at beginning.
But it stop at cobra.py, line 1531:r = open('blastdb_2.vs.blastdb_1', 'r')
The error is

makeblastdb: error while loading shared libraries: libzstd.so.1: cannot open shared object file: No such file or directory
blastn: error while loading shared libraries: libzstd.so.1: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/data01nfs/user/songq/anaconda3/envs/cobra/bin/cobra-meta", line 10, in <module>
    sys.exit(main())
  File "/data01nfs/user/songq/anaconda3/envs/cobra/lib/python3.8/site-packages/cobra.py", line 1531, in main
    r = open('blastdb_2.vs.blastdb_1', 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'blastdb_2.vs.blastdb_1'

blastdb_1.fa and blastdb_2.fa are written and not empty, and I also install BLAST in cobra env, so I guess The error was not caused by BLAST.
Please help me solve this problem, I am willing to provide you with any other information you need.
Thanks

multiple bam files

Hi Dr. Chen

Thank you for introducing this awesome tool!
What should I do if I have multiple bam files for the assembly built from multiple samples?
Thanks!

new installation of miniconda I run into this error

ImportError Traceback (most recent call last)
Cell In[23], line 4
2 path='show'
3 test_path=("data1/densiflorus (plastid)- MT740250.1 retained.fasta")
----> 4 BCAWT_auto_test.auto_test(path,test_path)
5 BCAWT_auto_test.auto_check_file(path)

File ~/miniconda3/lib/python3.11/site-packages/BCAWT/BCAWT_auto_test.py:24, in auto_test(path, test_file)
21 from BCAWT import BCAWT
22 file = open(test_file, "r")
---> 24 BCAWT.BCAW(main_fasta_file = [file] , save_path= path, Auto=True)

File ~/miniconda3/lib/python3.11/site-packages/BCAWT/BCAWT.py:34, in BCAW(main_fasta_file, save_path, ref_fasta_file, genetic_code_, Auto)
32 #from Bio.Alphabet import IUPAC
33 from Bio.Seq import Seq
---> 34 from Bio.SeqUtils import GC
35 from Bio.Data import CodonTable
36 #from Bio.Alphabet import generic_dna

ImportError: cannot import name 'GC' from 'Bio.SeqUtils' (/home/u1/miniconda3/lib/python3.11/site-packages/Bio/SeqUtils/init.py)

Question on Input for cobra

Thanks for this great tool! However, I have a few questions on input for cobra

Now I'm using megahit to assembly (I just know from #3 that it may generate many chimeric contigs, but spades may run out of memory when assembly environmental samples). In megahit intermediate results, it will provide a file named ./intermediate_contigs/k141.contigs.fa, in which there are many small contigs < 200 bp (which will be filtered in ./final.contigs.fa). My question is, which contig file is more recommended to be used as --fasta FASTA input?
Is it possible to just use the final.contigs.fa as --query QUERY file, or a filtered version with all contigs longer than given length (i.e. 1000 or 2500 bp)? In another words, can cobra be used before virus contigs annotation and MAG binning?

Regards,
hwrn

Feasibility of COBRA for Metatranscriptomic Data

Hi Xinglin,

Thank you for your outstanding paper and tool!

I'm looking to identify RNA viruses in the metatranscriptome. After using Megahit for assembly, I'm curious if COBRA could be used to improve the quality of contigs.

Best,
Shujie

ImportError: cannot import name 'GC' from 'Bio.SeqUtils'

On a new installation of cobra I run into this error:

❯ cobra-meta -h                                                                                  (cobra)
Traceback (most recent call last):
  File "/lila/home/funnellt/mambaforge/envs/cobra/bin/cobra-meta", line 6, in <module>
    from cobra import main
  File "/lila/home/funnellt/mambaforge/envs/cobra/lib/python3.8/site-packages/cobra.py", line 13, in <module>
    from Bio.SeqUtils import GC
ImportError: cannot import name 'GC' from 'Bio.SeqUtils' (/lila/home/funnellt/mambaforge/envs/cobra/lib/python3.8/site-packages/Bio/SeqUtils/__init__.py)

I installed cobra using the instructions in the readme, which installed Biopython 1.82

❯ python                                                                                         (cobra)
Python 3.8.18 | packaged by conda-forge | (default, Dec 23 2023, 17:21:28)
[GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import Bio
>>> Bio.__version__
'1.82'

Looking through the Biopython NEWS.rst file, it looks like the GC module has been deprecated
https://github.com/biopython/biopython/blob/master/NEWS.rst

The role of cobra in classic virome pipeline

Hi Linxing:
Thank you for developing this nice tool! I am curious about what's the functions does COBRA have in the classic virome pipeline as it's a new software.
For example, I used MEGAHIT to get contigs from metagenome reads, and then I used geNomad to identified the viral contigs from the total contigs, and then if I use COBRA in the follow step (i.e. put the geNomad results:<prefix>_summary/<prefix>_virus.fna as the input of COBRA) , does COBRA help me to bin the identified viral contigs together to get a higher completeness here? Can it work before or after geNomad well? I read about that COBRA can identify more circular viral genome and huge phage. Can I consider COBRA as a binner tool or a circular/huge phages identifier?
Thanks!

Randomly not getting full output

Hello,

I have noticed that with some COBRA runs, COBRA randomly aborts after "Saving joining summary of "Extended_circular" and "Extended_partial" query contigs". The log files just end there, and the fasta files in question are missing. Everything else is there. Instead of the missing self circular files, I get what appears to be blast databases (blastdb_1 and blastdb_2) as output.

Rerunning the metagenomic analysis on the exact same original fastq files, starting with megahit assembly, it randomly works fine. All output files up until running coverage_transfer.py are identical (with slight contig naming differences because of megahit), I can not tell a difference in runs up until COBRA comes in the picture, and even log and debug files are identical up until the "Saving joining summary of "Extended_circular" and "Extended_partial" query contigs"

Very strange bug, and not sure how to figure out what's wrong since it appears random. Happy to provide successful/unsuccessful run data, just not sure what would help and don't want to dump it all here.

Docker/Singularity

Hi,
Do you have any plan to release a Docker/Singularity version of COBRA?
Thanks

Is it suitable for refinning bins

I read the paper. In discussion you mentioned that "COBRA can also be applied to microbial genomes and whole metagenonmes" and "using COBRA only to extend the subset of longer contigs". How about to use COBRA as a binning refiner? Although some refiner has been reported, such as https://github.com/dparks1134/RefineM and https://apcamargo.github.io/magpurify2, purifing bins is difficult in extremely complex microbial community. Maybe COBRA have more performance as a binning refiner.

biopython issue / GC deprecation

Thank you for this amazing tool!

I installed cobra using micromamba. biopython=1.81 is installed but the GC deprecation issue still ensued.

$ cobra-meta -f pelv_comb_mghit_fixed.fa -q bin.10.fa -c pelv.comb.cov.txt -m sorted_pelv_comb.bam -a megahit -mink 21 -maxk 141 -t 8

/Users/andriangajigan/mambaforge/envs/cobra/lib/python3.8/site-packages/Bio/SeqUtils/init.py:144: BiopythonDeprecationWarning: GC is deprecated; please use gc_fraction instead.
warnings.warn(
Traceback (most recent call last):
File "/Users/andriangajigan/mambaforge/envs/cobra/bin/cobra-meta", line 10, in
sys.exit(main())
File "/Users/andriangajigan/mambaforge/envs/cobra/lib/python3.8/site-packages/cobra.py", line 818, in main
cov[line[0]] = round(float(line[1]), 3)
ValueError: could not convert string to float: 'contigLen

$ conda list
packages in environment at /Users/andriangajigan/mambaforge/envs/cobra:

Name Version Build Channel
biopython 1.81 py38hae2e43d_1 conda-forge
blast 2.15.0 pl5321h91c44f7_1 bioconda
bzip2 1.0.8 h10d778d_5 conda-forge
c-ares 1.26.0 h10d778d_0 conda-forge
ca-certificates 2024.2.2 h8857fd0_0 conda-forge
cobra-meta 1.2.2 pyhdfd78af_0 bioconda
curl 8.5.0 h726d00d_0 conda-forge
entrez-direct 16.2 h193322a_1 bioconda
gettext 0.21.1 h8a4c099_0 conda-forge
krb5 1.21.2 hb884880_0 conda-forge
libblas 3.9.0 21_osx64_openblas conda-forge
libcblas 3.9.0 21_osx64_openblas conda-forge
libcurl 8.5.0 h726d00d_0 conda-forge
libcxx 16.0.6 hd57cbcb_0 conda-forge
libdeflate 1.18 hac1461d_0 conda-forge
libedit 3.1.20191231 h0678c8f_2 conda-forge
libev 4.33 h10d778d_2 conda-forge
libffi 3.4.2 h0d85af4_5 conda-forge
libgfortran 5.0.0 13_2_0_h97931a8_2 conda-forge
libgfortran5 13.2.0 h2873a65_2 conda-forge
libiconv 1.17 hd75f5a5_2 conda-forge
libidn2 2.3.7 h10d778d_0 conda-forge
liblapack 3.9.0 21_osx64_openblas conda-forge
libnghttp2 1.58.0 h64cf6d3_1 conda-forge
libopenblas 0.3.26 openmp_hfef2a42_0 conda-forge
libsqlite 3.44.2 h92b6c6a_0 conda-forge
libssh2 1.11.0 hd019ec5_0 conda-forge
libunistring 0.9.10 h0d85af4_0 conda-forge
libzlib 1.2.13 h8a1eda9_5 conda-forge
llvm-openmp 17.0.6 hb6ac08f_0 conda-forge
ncbi-vdb 3.0.0 pl5321h9722bc1_0 bioconda
ncurses 6.4 h93d8f39_2 conda-forge
numpy 1.24.4 py38h9a4a08f_0 conda-forge
openssl 3.2.1 hd75f5a5_0 conda-forge
pcre 8.45 he49afe7_0 conda-forge
perl 5.32.1 7_h10d778d_perl5 conda-forge
perl-archive-tar 2.40 pl5321hdfd78af_0 bioconda
perl-carp 1.50 pl5321hd8ed1ab_0 conda-forge
perl-common-sense 3.75 pl5321hd8ed1ab_0 conda-forge
perl-compress-raw-bzip2 2.201 pl5321h775f41a_0 conda-forge
perl-compress-raw-zlib 2.202 pl5321h775f41a_0 conda-forge
perl-encode 3.19 pl5321hb7f2c08_0 conda-forge
perl-exporter 5.74 pl5321hd8ed1ab_0 conda-forge
perl-exporter-tiny 1.002002 pl5321hd8ed1ab_0 conda-forge
perl-extutils-makemaker 7.70 pl5321hd8ed1ab_0 conda-forge
perl-io-compress 2.201 pl5321h7133b54_2 bioconda
perl-io-zlib 1.14 pl5321hdfd78af_0 bioconda
perl-json 4.10 pl5321hdfd78af_0 bioconda
perl-json-xs 2.34 pl5321hcd10b59_5 bioconda
perl-list-moreutils 0.430 pl5321hdfd78af_0 bioconda

Thanks in advance!

Using multiple thread/cores?

Hi.

Thank you for making this amazing tool.

I am currently using this tool to improve my metagenomic viral contigs obtained from MegaHIT.

So, I have 80 shotgun metagenomic samples which I did coassembly in MegaHIT followed by viral contig detection by VirSorter2.

I want to extend my viral contigs by COBRA, but it not showing any progress for the past 6 days.

This is the code I use (I am using my institute cluster)

python cobra.py -f ~/final.contigs.fa -q ~/final.contigs.phages_combined.fna \
-o VirSorter -c ~/CoverM/For_virus/Coverage.txt -m ~/MegaHit_final.contigs.fa.All.sorted.bam \
-a megahit -mink 21 -maxk 77

This is the logfile

1. INPUT INFORMATION
# Assembler: MEGAHIT
# Min-kmer: 21
# Max-kmer: 77
# Overlap length: 77 bp
# Read mapping max mismatches for contig linkage: 2
# Query contigs: /lustre7/home/bhimbiswa/MAGs/Virus/COBRA/final-viral-combined.fa
# Whole contig set: /lustre7/home/bhimbiswa/MAGs/MegaHit/output_folder/final.contigs.fa
# Mapping file: /lustre7/home/bhimbiswa/MAGs/CoverM/For_virus/bam_files/MegaHit_final.contigs.fa.All.sorted.bam
# Coverage file: /lustre7/home/bhimbiswa/MAGs/CoverM/For_virus/Coverage.txt
# Output folder: /lustre7/home/bhimbiswa/MAGs/Virus/COBRA/Virsorter2

2. PROCESSING STEPS
[01/22] [2023/06/07 14:43:28] Reading contigs and getting the contig end sequences. A total of 1259677 contigs were imported.
[02/22] [2023/06/07 14:44:13] Getting shared contig ends.
[03/22] [2023/06/07 14:44:34] Writing contig end joining pairs.
[04/22] [2023/06/07 14:44:35] Getting contig coverage information.
[05/22] [2023/06/07 14:44:39] Getting query contig list. A total of 47249 query contigs were imported.
[06/22] [2023/06/07 14:44:40] Getting contig linkage based on sam/bam. Be patient, this may take long.
[07/22] [2023/06/08 08:29:30] Detecting self_circular contigs. 
[08/22] [2023/06/08 09:14:19] Detecting joins of contigs.

I can understand that I have lots of query files that why this is very slow, but is there any way I can make this faster?

P.S.: COBRA worked fine with Phigaro viral contigs (only 285 query contigs).

DTRs and ITRs

Hi.

After your suggestion last time, I reassembled my data with metaSpades (single assembly) using (-k 21,33,55,77,99,127).

From the output assembly, I filtered viral contigs using Vibrant (1983 putative phages were identified).

From here I fed these predicted phages into vRhyme and COBRA

This is the result of vRhyme for circular contigs.

Scaffold	type	mismatches	length	repeat
NODE_31851_length_2369_cov_1.518733	DTR	0	127	CCTCCATCAAGATGAATACAACTCCATTTACCAAGATTGTTGGCAAAATAGCTGTCATCCTCGGCAAAATTGAAACTGTCATGCACAAAGACAGCTACGCGCGTCACAGTAATGCGCCAGCGCCCAT
NODE_18158_length_3813_cov_11.723820	DTR	0	127	TAGCCGGTTGCTCTGGCACCTCTCGTACCGGCTCCTGCTTCACCCGCAGAGGGTTGATGTCCCGGCCCGAAAGGATGTAGTCCCCCGCGCCGAATCGCCAGCGGAAGGTAACCGCCAGTTCCGCCAC
NODE_14185_length_4746_cov_1.590171	DTR	0	127	ACTTCTGTTTAAATTCTTCTTTGTTAATATTTGTGTACCAGCCCCATCCATTGTAATCATGTCCGCCTGTCAAATGGTATGCTCTTTTTAATGAAGTATAATTCCCTGAAAGATGGACGCCTGTAAT
NODE_421_length_73347_cov_5.589907	ITR	0	116	GGATTGCCACCAGATATGACAAATTGGATACGTCTTTTCTCGCCTTCGTTTATATTGCCGCTATCGCTATTTTGTTAATTTAGCTCATCCCTCCTTATTTTTCAAACAACCCCTAA
NODE_4030_length_14370_cov_3.630134	DTR	0	40	ATAGAAAATAGAATAAAAAACTCATTTGTCACATGAAAAA
NODE_18745_length_3708_cov_3.479754	ITR	1	111	AAATATTTCCGATATCTATCAAAACAAAAGCGGTGGCGAATCAATCAAAACGAAGCGTTTCATACAGCAATTTTACCAGGGCTTTTGTCTTAAGTTCGCGCCTATGGGGCT

COBRA identified only 2 circular contigs

SeqID	Length	Coverage	GC	Ns	DTR_length	
NODE_31851_length_2369_cov_1.518733_self_circular	2369	10.331	57.957	0	127	
NODE_18158_length_3813_cov_11.723820_self_circular	3813	77.942	57.383	0	127

COBRA even extended one circular contig as "COBRA_category_ii-b_extended_partial_unique"

SeqID	Length	Coverage	GC	Ns	
NODE_421_length_73347_cov_5.589907_extended_partial	88998	38.204	57.323	0

Rest are "COBRA_category_ii-c_extended_failed"

SeqID	Length	Coverage	GC	Ns
NODE_4030_length_14370_cov_3.630134	14370	24.458	53.598	0	
NODE_18745_length_3708_cov_3.479754	3708	22.521	53.479	0

and "COBRA_category_iii_orphan_end"

SeqID	Length	Coverage	GC	Ns
NODE_14185_length_4746_cov_1.590171	4746	10.854	35.04	0

So, here are my questions

Is it possible for COBRA to identify all DTRs and ITRs?
As COBRA extended one of the vRhyme predicted circular contig, how should I interpret COBRA result? Is it correct to be extended by COBRA?
VIBRANT output is sometimes fragmented contigs whose name is not present in my assembly. These names are not recognised by COBRA. Is there any way for these fragments to be included? (In this sample there are 73 query)

Query NODE_464_length_69899_cov_2.489623_fragment_1 is not in your whole contig fasta file, please check!