soedinglab / plass

sensitive and precise assembly of short sequencing reads

License: GNU General Public License v3.0

bioinformatics metagenomics sequence-assembler proteins opensource proteomics metatranscriptomics

plass's Introduction

PLASS and PenguiN assembler

Plass (Protein-Level ASSembler) and PenguiN (Protein-guided Nucleotide assembler) assemble protein sequences or DNA/RNA contigs from short-read sequencing data. They are designed to work best on complex metagenomic or metatranscriptomic datasets. Plass and PenguiN are GPL-licensed open-source software implemented in C++, available for Linux and macOS, and designed to run on multiple cores.

Plass: Steinegger M, Mirdita M and Soeding J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods, doi: 10.1038/s41592-019-0437-4 (2019).

PenguiN: Jochheim A, Jochheim FA, Kolodyazhnaya A, Morice E, Steinegger M, Soeding J. Strain-resolved de-novo metagenomic assembly of viral genomes and microbial 16S rRNAs. bioRxiv (2024)

Soil Reference Catalog (SRC) and Marine Eukaryotic Reference Catalog (MERC)

SRC was created by assembling 640 soil metagenome samples. MERC was assembled from the metatranscriptomics datasets created by the TARA ocean expedition. Both catalogs were redundancy-reduced to 90% sequence identity at 90% coverage. Each catalog is a single FASTA file containing the sequences; the header identifiers contain the Sequence Read Archive (SRA) identifiers. The catalogs can be downloaded here. We also provide an HH-suite3 database called "BFD" containing sequences from Metaclust, SRC, MERC and UniProt here.
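Since the catalog headers carry SRA identifiers, listing which accessions contributed to a FASTA file can be sketched as below. The exact header layout is not described here, so the pattern only assumes that an accession of the usual form (SRR/ERR/DRR for runs, plus the X/S/P variants for experiments, samples and projects) appears somewhere in each header line. The function name `sra_accessions` is hypothetical.

```shell
# List the distinct SRA-style accessions found in the header lines of a FASTA file.
# Assumption: accessions look like SRR123, ERX45, DRS6, ... and appear in the headers.
sra_accessions() {
    grep '^>' "$1" | grep -oE '[SED]R[RXSP][0-9]+' | sort -u
}

# Usage (not executed here; the file name is a placeholder):
#   sra_accessions catalog.fasta
```

Only header lines are scanned, so a sequence that happens to contain a matching string is ignored.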

PenguiN - Protein-guided Nucleotide assembler

PenguiN is software that assembles short-read sequencing data at the nucleotide level. In a first step, it assembles coding sequences using the information from the translated protein sequences. In a second step, it links them across non-coding regions. The main purpose of PenguiN is the assembly of complex metagenomic and metatranscriptomic datasets. It was tested especially for the assembly of viral genomes and 16S rRNA gene sequences. It assembles 3-40 times more complete viral genomes and six times as many 16S rRNA sequences as state-of-the-art assemblers like Megahit and the SPAdes variants.

Install Plass and PenguiN

Our software can be installed via conda, as a Docker image, or as statically compiled binaries. It requires a 64-bit Linux or macOS system.

 # install from bioconda
 conda install -c conda-forge -c bioconda plass 
 # pull the Docker image
 docker pull ghcr.io/soedinglab/plass:latest
 # static build with AVX2 (fastest)
 wget https://mmseqs.com/plass/plass-linux-avx2.tar.gz; tar xvfz plass-linux-avx2.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH
 # static build with SSE4.1
 wget https://mmseqs.com/plass/plass-linux-sse41.tar.gz; tar xvfz plass-linux-sse41.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH
 # universal build for macOS (Intel or Apple Silicon)
 wget https://mmseqs.com/plass/plass-osx-universal.tar.gz; tar xvfz plass-osx-universal.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH

Other precompiled binaries for SSE2, ARM and PowerPC can be found at mmseqs.com/plass.
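Choosing between the AVX2 and SSE4.1 Linux builds can be scripted from the CPU flag list. The following is a minimal sketch, assuming the flag names `avx2` and `sse4_1` as they appear in /proc/cpuinfo on Linux. The function name `plass_build_url` is hypothetical, and only the two download URLs quoted above are used.

```shell
# Print the download URL of the static Linux build matching a CPU flag list.
# $1 is a space-separated flag list such as the "flags" line of /proc/cpuinfo.
plass_build_url() {
    flags=" $1 "
    case "$flags" in
        *" avx2 "*)   echo "https://mmseqs.com/plass/plass-linux-avx2.tar.gz" ;;
        *" sse4_1 "*) echo "https://mmseqs.com/plass/plass-linux-sse41.tar.gz" ;;
        *)            echo "no AVX2 or SSE4.1 flag found" >&2; return 1 ;;
    esac
}

# Usage on Linux (not executed here):
#   flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
#   wget "$(plass_build_url "$flags")"
```

AVX2 is checked first because the list above marks it as the fastest build.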

How to assemble

Plass and PenguiN can assemble both paired-end reads (FASTQ) and single reads (FASTA or FASTQ):

  # assemble paired-end reads 
  plass assemble examples/reads_1.fastq.gz examples/reads_2.fastq.gz assembly.fas tmp

  # assemble single-end reads 
  plass assemble examples/reads_1.fastq.gz assembly.fas tmp

  # assemble single-end reads using stdin
  cat examples/reads_1.fastq.gz | plass assemble stdin assembly.fas tmp

Important parameters:

 --min-seq-id         Adjusts the overlap sequence identity threshold
 --min-length         Minimum codon length for ORF prediction (default: 40)
 -e                   E-value threshold for overlaps
 --num-iterations     Number of iterations of assembly
 --filter-proteins    Switches the neural network protein filter off/on (0/1)

Plass workflows:

  plass assemble      Assembles proteins (i:Nucleotides -> o:Proteins)

PenguiN workflows:

  penguin guided_nuclassemble  Assembles nucleotides using protein and nucleotide information (i:Nucleotides -> o:Nucleotides)
  penguin nuclassemble         Assembles nucleotides using only nucleotide information (i:Nucleotides -> o:Nucleotides)

Assemble using MPI

Both tools can be distributed over several homogeneous computers. However, the tmp folder has to be shared between all nodes (e.g. via NFS). The following command assembles on several nodes:

RUNNER="mpirun -np 42" plass assemble examples/reads_1.fastq.gz examples/reads_2.fastq.gz assembly.fas tmp

Compile from source

Compiling from source has the advantage that the build is optimized for the specific system, which should improve performance. To compile, git, g++ (4.9 or higher) and cmake (3.0 or higher) are required. After compilation, the PLASS and PenguiN binaries will be located in the build/bin directory.

  git clone https://github.com/soedinglab/plass.git
  cd plass
  git submodule update --init
  mkdir build && cd build
  cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
  make -j 4 && make install
  export PATH="$(pwd)/bin/:$PATH"

❗ If you want to compile PLASS or PenguiN on macOS, please install and use gcc from Homebrew. The default macOS clang compiler does not support OpenMP, so PLASS would not be able to run multithreaded. Use the following cmake call:

  CXX="$(brew --prefix)/bin/g++-13" cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
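The tool versions required for compiling (g++ 4.9 or higher, cmake 3.0 or higher) can be checked before starting a build. Below is a small sketch of a dotted-version comparison. The helper name `version_ge` is hypothetical, and it compares numeric components only, so suffixes such as "-rc1" are not handled.

```shell
# Succeed if dotted version $1 is greater than or equal to dotted version $2.
# Missing components count as 0, so "13" compares as "13.0".
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = x[i] + 0; yi = y[i] + 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Usage (not executed here):
#   version_ge "$(cmake --version | awk 'NR == 1 { print $3 }')" 3.0 || echo "cmake too old"
#   version_ge "$(g++ -dumpversion)" 4.9 || echo "g++ too old"
```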

Dependencies

When compiling from source, our software requires zlib and bzip2 to be installed.

Use the docker image

We also provide a Docker image of Plass. You can mount the current directory containing the reads to be assembled and run plass with the following command:

  docker run -ti --rm -v "$(pwd):/app" -w /app ghcr.io/soedinglab/plass:latest assemble reads_1.fastq reads_2.fastq assembly.fas tmp

Hardware requirements

Plass needs roughly 1 byte of memory per residue to work efficiently. Plass will scale its memory consumption based on the available main memory of the machine. Plass needs a CPU with at least the SSE4.1 instruction set to run.
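To get a feeling for the "1 byte per residue" rule of thumb, one can count the sequence characters in the input reads. This is only an order-of-magnitude sketch: it treats every input base as one residue, which is an assumption made here, since the text does not say whether "residue" refers to input bases or to translated amino acids. The function name `count_read_bases` is hypothetical.

```shell
# Sum the lengths of all sequence lines (line 2 of every 4-line record) in
# uncompressed FASTQ input, read from the given files or from stdin.
count_read_bases() {
    awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }' "$@"
}

# Usage (not executed here):
#   gzip -dc reads_1.fastq.gz reads_2.fastq.gz | count_read_bases
```

Under the assumption above, the printed number is roughly the byte count to compare against the available main memory.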

Known problems

The assembly of Plass includes all ORFs that have a start and an end codon, including even very short ORFs (< 60 amino acids). Many of these short ORFs are spurious, since our neural network cannot distinguish them well. We recommend using other methods to verify their coding potential. Assemblies above 100 amino acids are mostly genuine protein sequences.
By default, Plass searches for ORFs of 40 amino acids or longer. This requires reads of at least 120 nucleotides. To assemble shorter reads, you need to lower the --min-length threshold. Be aware that using short reads (< 100 nucleotides) might result in lower sensitivity.
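The relation between read length and --min-length is plain arithmetic: one codon spans three nucleotides, so 40 codons need 120 nucleotides. The sketch below computes the largest codon count that fits in one read and finds the shortest read in a FASTQ file. Both function names are hypothetical. It ignores paired-end read merging and any special handling of start or stop codons, which are not covered here.

```shell
# Longest ORF, in codons, that fits into a single read of the given length (nt).
max_orf_codons() {
    echo $(( $1 / 3 ))
}

# Shortest read length in an uncompressed FASTQ file
# (the sequence is line 2 of each 4-line record).
min_read_length() {
    awk 'NR % 4 == 2 { n = length($0); if (min == "" || n < min) min = n }
         END { print min + 0 }' "$1"
}

# Usage (not executed here):
#   max_orf_codons "$(min_read_length reads.fastq)"
```

For 100-nucleotide reads this gives 33, so a --min-length at or below that value would be needed for any ORF to fit into an unmerged read.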

plass's Issues

Compilation error: zlib found but not working

Expected Behavior

Compiling the code using cmake

Current Behavior

Not compiling successfully

stdin:

$ cmake -DCMAKE_BUILD_TYPE=RELEASE -DZLIB_LIBRARY=/home/flejzerowicz/softs/zlib-1.2.11/lib -DZLIB_INCLUDE_DIR=/home/flejzerowicz/softs/zlib-1.2.11/include -DCMAKE_INSTALL_PREFIX=. ..

stdout:

-- Source Directory: /home/flejzerowicz/softs/plass/lib/mmseqs
-- Project Directory: /home/flejzerowicz/softs/plass/lib/mmseqs
-- Compiler is GNU 
-- ZSTD VERSION 1.3.8
-- ShellCheck not found
-- Using CPU native flags for SSE optimization:  -march=native
-- Found AVX2 extensions, using flags:  -march=native -mavx2 -mfpmath=sse -Wa,-q
-- Found ZLIB
-- ZLIB does not work
-- Found BZLIB
-- BZLIB does not work
-- Found OpenMP
-- ShellCheck not found
-- Configuring done
-- Generating done
-- Build files have been written to: /home/flejzerowicz/softs/plass/build
flejzerowicz@barnacle:~/softs/plass/build$ make -j 4
[  1%] Built target ksw2
[  2%] Built target cacode
[  7%] Built target alp
[ 20%] Built target tinyexpr
[ 21%] Built target generated
[ 24%] Built target local-generated
[ 25%] Built target version
[ 36%] Built target libzstd_static
[ 38%] Building CXX object lib/mmseqs/src/CMakeFiles/mmseqs-framework.dir/util/convertmsa.cpp.o
[ 38%] Building CXX object lib/mmseqs/src/CMakeFiles/mmseqs-framework.dir/util/view.cpp.o
[ 38%] Building CXX object lib/mmseqs/src/CMakeFiles/mmseqs-framework.dir/util/createtsv.cpp.o
[ 38%] Building CXX object lib/mmseqs/src/CMakeFiles/mmseqs-framework.dir/util/createsubdb.cpp.o
In file included from /home/flejzerowicz/softs/plass/lib/mmseqs/src/util/convertmsa.cpp:5:0:
/home/flejzerowicz/softs/plass/lib/mmseqs/lib/gzstream/gzstream.h:31:10: fatal error: zlib.h: No such file or directory
 #include <zlib.h>
          ^~~~~~~~
compilation terminated.
make[2]: *** [lib/mmseqs/src/CMakeFiles/mmseqs-framework.dir/util/convertmsa.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [lib/mmseqs/src/CMakeFiles/mmseqs-framework.dir/all] Error 2
make: *** [all] Error 2

Context

Providing context helps us come up with a solution and improve our documentation for the future.

zlib is installed in a non-standard location, but it is found anyway. The result is the same when running cmake without the flags
-DZLIB_LIBRARY=/home/flejzerowicz/softs/zlib-1.2.11/lib -DZLIB_INCLUDE_DIR=/home/flejzerowicz/softs/zlib-1.2.11/include
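One detail worth checking in the call above: in CMake's stock FindZLIB module, ZLIB_LIBRARY is the path to the library file itself, whereas the command passes the lib directory. Whether this explains the "ZLIB does not work" message is not established in this report (the reporter sees the same result without the flags), so the following is only a sketch for building the two flags from an installation prefix. The function name `zlib_cmake_flags` and the candidate file names are assumptions.

```shell
# Print -DZLIB_INCLUDE_DIR/-DZLIB_LIBRARY flags for a zlib installed under prefix $1.
# Assumption: the library is named libz.so, libz.a or libz.dylib inside $1/lib.
zlib_cmake_flags() {
    prefix=$1
    lib=""
    for cand in "$prefix/lib/libz.so" "$prefix/lib/libz.a" "$prefix/lib/libz.dylib"; do
        if [ -f "$cand" ]; then lib=$cand; break; fi
    done
    [ -n "$lib" ] || { echo "no libz found under $prefix/lib" >&2; return 1; }
    echo "-DZLIB_INCLUDE_DIR=$prefix/include -DZLIB_LIBRARY=$lib"
}

# Usage (not executed here):
#   cmake -DCMAKE_BUILD_TYPE=RELEASE $(zlib_cmake_flags "$HOME/softs/zlib-1.2.11") -DCMAKE_INSTALL_PREFIX=. ..
```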

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Git commit used: NA (because not installed successfully)
git clone https://github.com/soedinglab/plass.git
was run on July 25th 2019
For self-compiled and Homebrew: Cmake versions used: cmake 3.7.2
available on the server through:
module load cmake_3.7.2
Operating system and version:

uname -a

Linux barnacle.ucsd.edu 2.6.32-504.el6.x86_64 #1 SMP Wed Oct 15 04:27:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Use PLASS in metatranscriptomic data

I have 2x150 bp metatranscriptomic reads (prokaryotic) and I'd like to use PLASS to assemble proteins. Should it be used the same way as for a metagenome? In this case most reads should be translated directly into their protein sequence, and containing a start or a stop codon does not seem so crucial to evaluate if the read belongs to a gene or not, as they will come from genes anyway. Perhaps the beginning of a gene in this case should consider the Shine-Dalgarno sequence + start codon (or an upstream stop codon if the gene is inside a polycistron). What would be the best way to apply PLASS to metatranscriptomes? Should I start by translating whole reads in all frames?

Thank you

Output FASTA header format?

Assembled protein sequences contain additional information, e.g.

[Orf: 39, 242, 18446744073709551615, 1, 1]

What is the meaning of these numbers? Is there any coverage information included? (If not, can it be added?)

Thanks.

Alternative codon table

I recently found this tool and I'll try as soon as possible.
I'm trying to reconstruct a mitochondrial genome without a closely related species, so I think this approach could be a better way to do it, but I'm concerned about the codon table.
Is there any chance to use codon table 5 (invertebrate)?

High level of duplicated protein sequences

Hi,
I am using PLASS (v4.687d7) on a set of metagenomes from ~100 cheese samples and it works very well, but still, I have some questions.
In each dataset a high level of protein sequences (on average 30%) are duplicated (with 100% identity and coverage). I understand that some sequences could be duplicated (originating from closely related species), but 30% seems to be quite high.
Another issue is the total amount of assembled amino acids. As an example, for an initial dataset of 18 million reads (2x150 bp paired-end reads, 2.7 Gbp in total), 7 million proteins are assembled (2e+9 aa in total, almost as much as the total amount of nucleotides, which to me means more amino acids than expected).
Is there an explanation for these results?

I am using PLASS with the following command (other parameters as default):
plass assemble METAG_R1.fastq.gz METAG_R2.fastq.gz METAG_out.fasta -e 0.001 --num-iterations 12 --filter-proteins 1 --remove-tmp-files 1

Thanks
Helene
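The share of exact duplicates described above can be measured directly on the output FASTA. This is a minimal sketch, not part of Plass: it counts records whose full sequence (100% identity and coverage) already occurred earlier in the file. The function name `count_exact_duplicates` is hypothetical. Note that it keeps every distinct sequence in memory, which matters for outputs with billions of residues.

```shell
# Print "<total records> <duplicate records>" for a FASTA file. A record is a
# duplicate if its complete sequence was already seen earlier in the file.
count_exact_duplicates() {
    awk '
        function emit() {
            if (have) { total++; if (seen[seq]++) dup++ }
            seq = ""
        }
        /^>/ { emit(); have = 1; next }
        { seq = seq $0 }
        END { emit(); print total + 0, dup + 0 }
    ' "$1"
}

# Usage (not executed here):
#   count_exact_duplicates METAG_out.fasta
```

Sequences wrapped over several lines are joined before comparison, so line wrapping does not hide duplicates.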

general question to gauge dev opinion/advice on selecting proteins for gene phylogenies

Hi all, thanks for the awesome tool! I have been using it to boost the recovery of specific protein families from metagenomes. Using this tool, I found an increased number of antibiotic resistance genes (ARGs), many of which appear to be genuine variants that were not detected by regular assembly. My intent is to select proteins for phylogenetic analysis to profile their distribution in the environment vs. clinical isolates.

However, I can imagine chimeras and spurious substitutions are an issue. I increased the minimum identity to 97 at first (and now am retrying at min identity = 100). Would you have any other words of wisdom, caution, or advice for using this tool for this purpose?

Thank you!

Connor

Use Plass for euk metagenomics data

I want to extract euk genes/proteins from metagenomics data. I want to build a gene/protein catalog for euk genes.
It seems that MetaEuk is a reference-guided approach (based on MMseqs2) and Plass is a de novo approach (not relying on reference protein sequences).
I don't understand the statement in your paper about Plass on euk protein assembly.
"Our chief limitation is that, unlike nucleotide assemblers, Plass cannot place the assembled protein sequences into genomic context. Furthermore, it cannot assemble intron-containing eukaryotic proteins, although, as shown, it can assemble eukaryotic proteins from transcriptome data. Another drawback is its inability to resolve homologous proteins from closely related strains or species with sequence identities above ~95%. However, the impact on the accuracy of predicted functions is low (Fig. 2) and bacterial phenotypes are determined more by the complement of horizontally acquired accessory genes than by minor variations in protein sequences."
I understand that the methods behind MMseqs2 and Plass are different, but MMseqs2 should be able to handle the 'intron-containing eukaryotic proteins'.
Anyway, could you kindly suggest a good way to identify those euk proteins? (Predicting euk genes from binned euk genomes is so troublesome.)

Assembling big data

Hey,
I have a big dataset (>600M paired-end reads) and I am trying to generate a protein catalog using Plass. I am using version 2.c7e35 on a server with 900 GB of RAM. The run ends without completing because it exceeds the requested resources. I am wondering if it is possible to tweak the parameters to allocate less memory.
Any input will be greatly appreciated.
Thanks,
Livia

Running plass in parallel on same 'tmp' folder

I'm invoking plass in the typical form on our high-throughput computing (slurm/sbatch) array:

plass assemble <reads_R1.fq> <reads_R2.fq> <plass_assembly.faa> tmp

Running on a single node and referencing the node-specific /tmp directory works fine, but in our system not all nodes have the required large amount of tmp space necessary (as in #4 ). Directing plass to use a local tmp folder works fine and has a much larger space allotment.

However, when I take advantage of the parallelization of slurm to run multiple jobs at once, the tmp folder I specify looks like this:

11610453234058486865/
13117816727409383803/
2330489238614308671/
latest -> 2330489238614308671/

Am I correct in envisioning issues with the symlinking approach here? It looks like the multiple tasks might get confused by the "current" symlink as they iterate along.

Thoughts?

Thanks~
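Whether concurrent runs that share one tmp folder actually interfere is not answered in this thread. One way to sidestep the question is to give every task its own subdirectory. The sketch below derives it from SLURM_JOB_ID, which Slurm sets to a distinct value for each array task, and falls back to the shell's process ID outside Slurm. The function name `task_tmpdir` and the `plass_` prefix are hypothetical.

```shell
# Print a per-task tmp directory below the base directory $1.
# Uses SLURM_JOB_ID when set, otherwise the process ID of the calling shell.
task_tmpdir() {
    base=$1
    id=${SLURM_JOB_ID:-$$}
    echo "$base/plass_$id"
}

# Usage inside an sbatch script (not executed here):
#   tmp=$(task_tmpdir /path/to/large/tmp)
#   mkdir -p "$tmp"
#   plass assemble reads_R1.fq reads_R2.fq plass_assembly.faa "$tmp"
```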

Intuitions regarding possible output and previous steps

First of all, thanks for the tools you developed, the extensive tutorials and the fast response. You people rock.

This is not a bug example but rather some doubts regarding the tool.

In our institute we processed 35 metagenomic samples from coastal marine water. I processed the dataset with a typical pipeline (Megahit assembly of the samples + gene prediction with both Prodigal and MetaGeneMark, clustering at 95% identity, 90% cov, --cov-mode 2) to create a gene catalog.

Separately, I ran a similar process using Plass, performing a coassembly of 8/9 samples each, and then clustered the dataset with Linclust (95% identity, 90% cov, --cov-mode 2). Comparing the results, I obtained 30M in the first case and 529M in the second.

But does this improvement in the number make sense? In your paper you talk about a 2- to 10-fold improvement, but this is 20X at least. Maybe I have some errors in my processing. I was about to try, for example, setting --cov-mode to 0, since I mistakenly specified 2 instead of 0, which could be one of the reasons for the high number.
In the paper I was unable to find quality filtering steps. I guess this is because the raw data was trimmed in the respective papers and then archived in SRA. But could you explain possible stringent approaches to data cleaning before Plass to reassure the results obtained? I observed that in the bioRxiv paper from C. Titus Brown they specified the following:

To avoid confounding effects of random
sequencing error in the analysis and increase specificity at the
cost of sensitivity, we focused only on high-abundance data:
we truncated all reads in the query neighborhoods at any
k-mer that appears fewer than five times, and ran Plass on
these abundance-trimmed reads from each neighborhood

Which tradeoff could be less stringent but useful?

Thanks for everything!

Compilation error - cannot find AminoAcidLookupTables.h

Expected Behavior

compilation works :)

Current Behavior

% cmake .
...
% make -j 4
...
[ 96%] Building CXX object src/CMakeFiles/plass.dir/workflow/Assembler.cpp.o
/mnt/home/ctb/plass/src/assembler/filternoncoding.cpp:7:35: fatal error: AminoAcidLookupTables.h: No such file or directory
 #include "AminoAcidLookupTables.h"
                                   ^
compilation terminated.
make[2]: *** [src/CMakeFiles/plass.dir/assembler/filternoncoding.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [src/CMakeFiles/plass.dir/all] Error 2
make: *** [all] Error 2

Error running mpi job with plass

Dear sir:
I have no problem running a single plass job on my Linux cluster. However, I want to try an MPI plass job. Here are the details of the input and log files.
I specify two nodes with 32 cores each with PBS -l nodes=2:ppn=32. In the job, I specify mpirun -np 64. I ran into an error with writing to the /tmp folder. The problem also happens if I specify a local tmp folder.
Please kindly suggest,

The sh file is like this:
#!/bin/bash
#PBS -N mpi
# name you want to give your job
# the default output file will use this
#PBS -q std
# specify the queue you want to use
#PBS -l nodes=2:ppn=32
#PBS -j oe
#PBS -o logs

# =======================================================
# LOAD PBS MODULES
# =======================================================

cd $PBS_O_WORKDIR

module load openmpi3/gcc/64/3.1.4
module load pbs

#cd $PBS_O_WORKDIR

##export OMP_NUM_THREADS=1

#mpirun -np 42 plass assemble examples/reads_1.fastq.gz examples/reads_2.fastq.gz assembly.fas tmp

#plass assemble examples/reads_1.fastq.gz examples/reads_2.fastq.gz assembly.fas tmp
for bin in $(cat list2)
do
echo ${bin}
mpirun -np 64 plass assemble $bin'_tr_hostout_R1.fastq' $bin'_tr_hostout_R2.fastq' $bin'.assembly.plass.protein.fas' /tmp
#plass nuclassemble $bin'_tr_hostout_R1.fastq' $bin'_tr_hostout_R2.fastq' $bin'.assembly.plass.nuc.fas' tmp
echo "###Assembly Sample" $bin" Start###"
date
done

list2 file:
P016_S9

The log file looks like this:

P016_S9
assemble P016_S9_tr_hostout_R1.fastq P016_S9_tr_hostout_R2.fastq P016_S9.assembly.plass.protein.fas /tmp

assemble P016_S9_tr_hostout_R1.fastq P016_S9_tr_hostout_R2.fastq P016_S9.assembly.plass.protein.fas /tmp

MMseqs Version: 3.764a3
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Rescore mode 3
Allow wrapped scoring false
Remove hits by seq. id. and coverage false
E-value threshold 1e-05
Coverage threshold 0
Add backtrace MMseqs Version: 3.764a3
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Rescore mode 3
Allow wrapped scoring false
Remove hits by seq. id. and coverage false
E-value threshold 1e-05
Coverage threshold 0
Add backtrace false
Coverage mode 0
Seq. id. threshold 0.9
Min. alignment length 0
Seq. id. mode 0
Include identical seq. id. false
Sort results 0
Preload mode 0
Threads 64
Compressed 0
Verbosity 3
Alphabet size 13
K-mers per sequence 60
scale k-mers per sequence 0
Adjust k-mer length false
Mask residues 0
Mask lower case residues 0
K-mer size 14
Max sequence length 65535
Shift hash 5
Split memory limit 0
Include only extendable true
Skip repeating k-mers true
Min codons in orf 45
Max codons in length 32734
Max orf gaps 2147483647
Could not delete /tmp/latest!
Could not create symlink of /tmp/531455983002076514!
Could not delete /tmp/latest!
Could not delete /tmp/latest!
Could not delete /tmp/latest!
Could not write file /tmp/531455983002076514/assembler.sh!
Could not delete /tmp/latest!
Could not delete /tmp/latest!
Could not delete /tmp/latest!
Could not write file /tmp/531455983002076514/assembler.sh!
Could not delete /tmp/latest!
Could not delete /tmp/latest!
Could not delete /tmp/latest!
Could not delete /tmp/latest!
Could not delete /tmp/latest!

Option to change genetic code (translation table)?

Expected Behavior

Plass provides a flag for the translation table (genetic code) used to perform the conceptual six-frame translation of nucleotides to amino acids. For sequences expected to be bacterial this would correspond to the translation table 11.

Current Behavior

There is no option for the translation table in the command line flags.

Your Environment

I am running Plass from the static binary hosted at https://mmseqs.com/plass/plass-static_sse41.tar.gz

map of input sequences to assembled sequence

I was wondering if PLASS kept track of which sequences it used for each assembled sequence, and @milot-mirdita told me I would have to search with the assembled sequences against the input sequences to get that information.

Why this is relevant to me: we study a non-model organism using scRNA-seq. We have no high-quality genome for it or any closely related species, so we map our reads against a de novo transcriptome. Owing to the absurd polymorphism levels present in the genome, the usual Trinity pipeline produces close to 1 million "genes", making all downstream analysis very complicated. I thought that going to the amino acid level with a tool like PLASS would improve things.

Using scRNA-seq and de-novo transcriptomes is a great way to study non-model organisms without known/well-annotated genomes (recent examples are the Morpho-Seq paper, or this cell type study in Spongilla). It seems like PLASS could be very useful in this niche. I promise to write the tutorial when this feature is added!

MPI local-tmp

Hi
does PLASS have the option for --local-tmp like in MMseqs2 for MPI jobs? I checked the

Thanks
Antonio

Quantification

Hi,
After a PLASS assembly, is there a way to obtain quantification of those peptides in multiple samples ?
Thank you in advance.
Sebastien

Paired read prediction - mergereads failed

Expected Behavior

Hello,
I am trying to run PLASS on a curated set of marine viral metagenomic reads. I have two read files, and I am trying to run PLASS on them but I am getting the following error:

Start merging reads.
Segmentation fault (core dumped)
Error: mergereads failed
deactivate does not accept arguments
remainder_args: ['PLASS']

Current Behavior

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.

Here is the script I am using for the plass assembler:

conda activate PLASS

/home/delaney/miniconda3/envs/PLASS/bin/plass assemble /home/delaney/5YV/test2/02-kraken2-viruses-only/extracted1.fq /home/delaney/5YV/test2/02-kraken2-viruses-only/extracted2.fq assembly.fas tmp

conda deactivate PLASS

Plass Output (for bugs)

Please make sure to also post the complete output of Plass. You can use gist.github.com for large output.

Include only extendable true
Skip repeating k-mers true
Min codons in orf 45
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Protein Filter Threshold 0.2
Filter Proteins 1
Search iterations 12
Delete temporary files incremental 1
Remove temporary files false
MPI runner
Database type 0
Shuffle input database true
Createdb mode 0
Write lookup file 1

PAIRED END MODE
mergereads /home/delaney/5YV/test2/02-kraken2-viruses-only/extracted1.fq /home/delaney/5YV/test2/02-kraken2-viruses-only/extracted2.fq /home/delaney/5YV/test2/scripts/tmp/1996441830643183315/nucl_reads -v 3

Start merging reads.
Segmentation fault (core dumped)
Error: mergereads failed
deactivate does not accept arguments
remainder_args: ['PLASS']

Context

Providing context helps us come up with a solution and improve our documentation for the future.

These are viral metagenomic reads sequenced on a NovaSeq. They were identified as viral using Kraken2, and the viral reads from my dataset were then put into the two files extracted1.fq and extracted2.fq (for fwd and rev).

Your Environment

I am running this in a conda environment, where I installed plass using bioconda on a Linux machine.

Reduce TMP disk usage

Hi Martin
would it be possible for PLASS to have an option to remove the intermediate files (i.e. pref_, aln_, assembly_) of iterations that are no longer needed in the following steps? For some of the assemblies, the disk usage explodes and goes up to several terabytes. As a temporary solution, I added the following lines to assembler.sh to remove the files from previous steps:

  if [ "${STEP}" -ge 2 ]; then
    PSTEP="$((STEP-2))"
    rm -f "${TMP_PATH}/pref_${PSTEP}"
    rm -f "${TMP_PATH}/pref_${PSTEP}"_*
    rm -f "${TMP_PATH}/pref_${PSTEP}".*
    rm -f "${TMP_PATH}/aln_${PSTEP}"
    rm -f "${TMP_PATH}/aln_${PSTEP}"_*
    rm -f "${TMP_PATH}/aln_${PSTEP}".*
    rm -f "${TMP_PATH}/assembly_${PSTEP}"
    rm -f "${TMP_PATH}/assembly_${PSTEP}"_*
    rm -f "${TMP_PATH}/assembly_${PSTEP}".*
  fi

Many thanks
Antonio

Defining minimum ORF length doesn't work

Hi
when specifying a --min-length to the assemble workflow to set a minimum ORF length it is used as maximum length as seen here

The output of the extractorfs workflow on commit 53a2eff:

plass assemble tm8_1.truncated.fasta assembly.fas tmp --sub-mat PAM30.out --min-length 10
assemble tm8_1.truncated.fasta assembly.fas tmp --sub-mat PAM30.out --min-length 10

extractorfs /vol/attached/arctic/ORFs/tmp/18009091505483605488/nucl_reads /vol/attached/arctic/ORFs/tmp/18009091505483605488/nucl_6f_start --min-length 20 --max-length 10 --max-gaps 0 --contig-start-mode 1 --contig-end-mode 0 --orf-start-mode 0 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --use-all-table-starts 0 --id-offset 0 --threads 28 --compressed 0 -v 3

Thanks
Antonio

Is it possible to read sequences from stdin or to provide multiple unpaired files at once?

Expected Behavior

I'm looking for a way to assemble multiple samples together when I have many non-paired reads as input.

Current Behavior

I get an error like this: Input /dev/fd/63 does not exist.

Steps to Reproduce (for bugs)

When running the command like this:

docker run --rm -it -v "$(pwd):/app" -w /app soedinglab/plass assemble <(cat seqs1.fna seqs2.fna) seqs12.assembly.faa tmp

Plass Output (for bugs)

Input /dev/fd/63 does not exist.

Context

The reason that this would be helpful is that I have a lot of samples that I would like to combine into a single assembly. It would be nice to be able to cat them all in a subshell directly in the call to plass as above so I could avoid writing a big file first.

Alternatively, does plass take multiple unpaired files from different samples at the same time? The usage string looks like this:

Usage: plass assemble <i:fast(a|q)File[.gz]> | <i:fastqFile1_1[.gz] ... <i:fastqFileN_1[.gz] <i:fastqFile1_2[.gz] ... <i:fastqFileN_2[.gz]> <o:fastaFile> <tmpDir> [options]

It looks like you can provide multiple paired-end read files, but only a single unpaired file. Can you also put in multiple unpaired files and have plass treat them as unpaired? Something like this:

plass assemble sample_a.fna sample_b.fna sample_c.fna out tmp

Would sample_a.fna sample_b.fna sample_c.fna all be treated as unpaired?
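The README above documents a `stdin` input keyword for single-end reads. Assuming the installed version supports it, several unpaired files can be streamed into one assembly without writing a combined file first. The sketch below is one way to build that stream. The helper name `stream_reads` is hypothetical, and it assumes that every input file ends with a newline.

```shell
# Write the contents of all given read files to stdout as one plain-text stream,
# decompressing files whose name ends in .gz.
stream_reads() {
    for f in "$@"; do
        case "$f" in
            *.gz) gzip -dc "$f" ;;
            *)    cat "$f" ;;
        esac
    done
}

# Usage (not executed here):
#   stream_reads sample_a.fna sample_b.fna sample_c.fna | plass assemble stdin out.faa tmp
```

Unlike process substitution with <(...), which failed above with "Input /dev/fd/63 does not exist", this passes the data through a regular pipe.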

Your Environment

Git commit: c4f7b2392f26c71e8c596466a3e620297c748fa7. I used the docker version. I have support for SSE4.2.

Protein sequence abundance

Hi there,

I've just started using PLASS, specifically plass assemble, and I really like it!

This isn't an issue but a potential enhancement. I was wondering how I might be able to recover abundance information for each protein assembled. It's not as simple as with nucleotide contigs since I'm unable to align the reads back to the assembly in this case.

Could the abundance of proteins be included in the header, as SPAdes does? Alternatively, a table mapping each header to its respective abundance would be convenient as well. I hope I haven't misinterpreted the output and it's already provided :)

Thanks!
Connor

ID collisions in FASTA output

Expected Behavior

Unique ID for every entry in the resulting FASTA file.

Current Behavior

Non-uniqueness of incrementing number at the end of the FASTA header ID results in collisions.
There seem to be two types of collisions:

numbers used twice (rounding .5?)
scientific notation (formatter?)

Example (sequence data lines removed):

>NXL_FL_2_Virus.fasta_0.5
>NXL_FL_2_Virus.fasta_1.5
>NXL_FL_2_Virus.fasta_2.5
>NXL_FL_2_Virus.fasta_3.5
[...]
>NXL_FL_2_Virus.fasta_99997.5
>NXL_FL_2_Virus.fasta_99998.5
>NXL_FL_2_Virus.fasta_99999.5
>NXL_FL_2_Virus.fasta_100000
>NXL_FL_2_Virus.fasta_100002
>NXL_FL_2_Virus.fasta_100002
>NXL_FL_2_Virus.fasta_100004
>NXL_FL_2_Virus.fasta_100004
>NXL_FL_2_Virus.fasta_100006
>NXL_FL_2_Virus.fasta_100006
[...]
>NXL_FL_2_Virus.fasta_999996
>NXL_FL_2_Virus.fasta_999996
>NXL_FL_2_Virus.fasta_999998
>NXL_FL_2_Virus.fasta_999998
>NXL_FL_2_Virus.fasta_1e+06
>NXL_FL_2_Virus.fasta_1e+06
>NXL_FL_2_Virus.fasta_1e+06
>NXL_FL_2_Virus.fasta_1e+06
>NXL_FL_2_Virus.fasta_1e+06
>NXL_FL_2_Virus.fasta_1e+06
>NXL_FL_2_Virus.fasta_1.00001e+06
>NXL_FL_2_Virus.fasta_1.00001e+06
>NXL_FL_2_Virus.fasta_1.00001e+06
>NXL_FL_2_Virus.fasta_1.00001e+06
[...]
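
The listing above is consistent with the numeric suffix being a half-integer value (index + 0.5) printed with six significant digits, as the printf-style `%g` conversion does. This is only a hypothesis drawn from the listing, not a confirmed diagnosis of the Plass code. A minimal Python sketch (the names `suffix` and `count_collisions` are made up for illustration) reproduces the pattern and counts the duplicates it causes:

```python
from collections import Counter


def suffix(index: int) -> str:
    """Format index + 0.5 with six significant digits (printf-style %g).

    Assumption: the ID suffix is a floating-point value written with
    six-digit precision. This is a guess based on the listing only.
    """
    return "%g" % (index + 0.5)


def count_collisions(n_entries: int) -> int:
    """Return how many of the first n_entries suffixes repeat an earlier one."""
    counts = Counter(suffix(i) for i in range(n_entries))
    return sum(c - 1 for c in counts.values())


if __name__ == "__main__":
    # Indices chosen around the points where the listing changes shape.
    for i in (0, 99999, 100000, 100001, 100002, 999999, 1000004, 1000005):
        print(i, suffix(i))
    print("collisions in first 100010 entries:", count_collisions(100_010))
```

Under this assumption, suffixes stay unique up to `99999.5`, collide pairwise once the value needs more than six digits, and switch to scientific notation at one million, which matches the three regimes shown in the example.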

Steps to Reproduce (for bugs)

@martin-steinegger ran PLASS for me (thanks again!) and provided me with the output FASTA files.

Plass Output (for bugs)

see above

Context

Your Environment

See above.

Incorrect headers for Nuc -> Protein

Hello,

Recently, while using this program, I tried

plass nuclassemble reads_1.fastq.gz assembly_testnu.fas tmp

Output ::

>1541_chr1_0_114757654_114757803_7891_JFMU01000067.1 AGCTGGAATTTCTAAAAAAGATATTAATGGCTTTATGATAAGAAAACTAAAGAATATTGAAATAA

However when trying to use

plass assemble reads_1.fastq.gz assembly_testpep.fas tmp

the original headers are gone and I see this string instead:

>0 2+146 3 RLAFNSRKAMDNVTLTLELPPNAELTPFPGRQTISWTVDLKQGDNVLALPINVLFPGSGKLVAHLDDGTRRKTFSTAIPGNTEPSS*

Any ideas? Thank you

Issue with Docker image soedinglab/plass:latest

When I try to run plass in a docker container from image soedinglab/plass:latest on my Linux box with the command:

user@ubuntu:~/data$ docker run -t -i --rm -v "$(pwd):/app" -w /app soedinglab/plass:latest assemble reads1 reads2 assembly.fas tmp

I get this error message:
/usr/local/bin/plass: 8: /usr/local/bin/plass: Syntax error: newline unexpected (expecting ")")

The container runs fine with soedinglab/plass:version-3 or earlier.

Environment:
Ubuntu 20.04.2 LTS focal x86_64
Kernel: 5.4.0-65-generic
Docker engine: 20.10.3 linux/amd64

Install issues

Expected Behavior

I'm unable to install plass with any of the available installation options. I work on CentOS 8 x86_64.

Current Behavior

Install from bioconda appears to succeed but then the plass executable does not exist
Install from the static file results in:
--2022-01-14 09:04:13-- https://mmseqs.com/plass/plass-static_sse41.tar.gz
Resolving mmseqs.com (mmseqs.com)... 141.5.100.26
Connecting to mmseqs.com (mmseqs.com)|141.5.100.26|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-01-14 09:04:14 ERROR 404: Not Found.
Install from git clone errors during make with:
make -j 4 && make install
Consolidate compiler generated dependencies of target microtar
Consolidate compiler generated dependencies of target ksw2
Consolidate compiler generated dependencies of target cacode
Consolidate compiler generated dependencies of target alp
[ 2%] Built target microtar
[ 2%] Built target ksw2
[ 2%] Built target cacode
[ 7%] Built target alp
[ 7%] Building C object lib/mmseqs/lib/tinyexpr/CMakeFiles/tinyexpr.dir/tinyexpr.c.o
[ 7%] Generating ../generated/VTML80.out.h
[ 7%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/common/entropy_common.c.o
[ 7%] Generating ../generated/assemble.sh.h
: invalid option
make[2]: *** [lib/mmseqs/data/CMakeFiles/generated.dir/build.make:159: lib/mmseqs/generated/VTML80.out.h] Error 1
make[1]: *** [CMakeFiles/Makefile2:803: lib/mmseqs/data/CMakeFiles/generated.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
: invalid option
make[2]: *** [data/CMakeFiles/local-generated.dir/build.make:81: generated/assemble.sh.h] Error 1
make[2]: *** Waiting for unfinished jobs....
[ 7%] Generating ../generated/nuclassemble.sh.h
[ 7%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/common/fse_decompress.c.o
: invalid option
make[2]: *** [data/CMakeFiles/local-generated.dir/build.make:97: generated/nuclassemble.sh.h] Error 1
make[1]: *** [CMakeFiles/Makefile2:861: data/CMakeFiles/local-generated.dir/all] Error 2
[ 8%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/common/threading.c.o
[ 8%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/common/pool.c.o
[ 8%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/common/zstd_common.c.o
[ 9%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/common/error_private.c.o
[ 9%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/compress/hist.c.o
[ 9%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/common/xxhash.c.o
[ 10%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/compress/fse_compress.c.o
[ 11%] Linking C static library libtinyexpr.a
[ 11%] Built target tinyexpr
[ 11%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/compress/huf_compress.c.o
[ 12%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/compress/zstd_compress.c.o
[ 12%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/compress/zstdmt_compress.c.o
[ 12%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/compress/zstd_fast.c.o
[ 13%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/compress/zstd_double_fast.c.o
[ 13%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/compress/zstd_lazy.c.o
[ 13%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/compress/zstd_opt.c.o
[ 14%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/compress/zstd_ldm.c.o
[ 14%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/decompress/huf_decompress.c.o
[ 14%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/decompress/zstd_decompress.c.o
[ 15%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/decompress/zstd_decompress_block.c.o
[ 15%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/decompress/zstd_ddict.c.o
[ 15%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/dictBuilder/cover.c.o
[ 16%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/dictBuilder/fastcover.c.o
[ 16%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/dictBuilder/divsufsort.c.o
[ 16%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/dictBuilder/zdict.c.o
[ 17%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/deprecated/zbuff_common.c.o
[ 17%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/deprecated/zbuff_compress.c.o
[ 17%] Building C object lib/mmseqs/lib/zstd/build/cmake/lib/CMakeFiles/libzstd_static.dir////lib/deprecated/zbuff_decompress.c.o
[ 18%] Linking C static library libzstd.a
[ 18%] Built target libzstd_static
make: *** [Makefile:136: all] Error 2

Context

lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
Stepping: 4
CPU MHz: 3500.000
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
BogoMIPS: 5387.14
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47

Some Issues about the length of protein sequences

Hi there,
First, thank you for this excellent tool for assembling short-read sequencing data at the protein level; it has greatly improved read utilization.
When I used plass assemble, two things puzzled me. First, when I used --min-length to control the minimum length of the output sequences, the output was empty, even though the value was only 100. Second, when I checked the lengths of the output, I found that many sequences are longer than 5000 residues, which seems abnormal. How can we prevent this from happening?
The command I used for the assembly is as follows:
plass assemble --threads 32 --min-seq-id 0.99 clean_reads/ERR_YZYC_1.fastq clean_reads/ERR_YZYC_2.fastq ERR_YZYC_assembly.fas ERR_YZYCt
Plass Version: c4aaa98

Insitu_prot_4747305 len:8946
Insitu_prot_4748790 len:5383
Insitu_prot_4882950 len:3398
......
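
Until the behaviour of --min-length is clarified, lengths can be checked and filtered after assembly. The following is a minimal post-processing sketch and not part of Plass: the function names and the 100 and 5000 residue bounds are illustrative, and excluding a trailing `*` stop character from the length is an assumption made here.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=100, max_len=5000):
    """Keep records whose residue count (ignoring a trailing '*') is in range."""
    for header, seq in records:
        n_residues = len(seq.rstrip("*"))
        if min_len <= n_residues <= max_len:
            yield header, seq


if __name__ == "__main__":
    # Tiny synthetic input; replace with open("ERR_YZYC_assembly.fas").
    demo = [">short", "MKV*", ">ok", "M" * 150 + "*", ">long", "A" * 6000]
    for header, seq in filter_by_length(read_fasta(demo)):
        print(header, len(seq.rstrip("*")))
```

Filtering after the run leaves the assembly itself untouched, so it does not explain why very long sequences are produced in the first place.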

The output file is empty

I have one nucleotide FASTA file (nuc1.fasta) and ran the following command:

plass assemble nuc1.fasta result.fasta tmp

From my understanding, the final output should be result.fasta, but result.fasta is empty.

Current Behavior

result.fasta is empty

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.
checked

Plass Output (for bugs)

Please make sure to also post the complete output of Plass. You can use gist.github.com for large output.
no results

Context

Providing context helps us come up with a solution and improve our documentation for the future.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Plass Version: 41d03ca
compiled version

For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
cmake version 3.16.3
Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
Intel® Core™ i9-9900KF CPU @ 3.60GHz × 16 32g RAM
Operating system and version:
Ubuntu 20.04.1 LTS

Segmentation fault translatenucs

Hi,
I am running into the following error:

extractorfs /tmp/slurm-7400926/tmp/15758272862975753372/nucl_reads /tmp/slurm-7400926/tmp/15758272862975753372/nucl_6f_start --min-length 20 --max-length 45 --max-gaps 0 --contig-start-mode 1 --contig-end-mode 0 --orf-start-mode 0 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --use-all-table-starts 0 --id-offset 0 --threads 8 --compressed 0 -v 3

[=================================================================] 31.46M 23s 943ms
Time for merging files: 0h 0m 0s 423ms
Time for merging files: 0h 0m 0s 521ms
Time for processing: 0h 0m 31s 724ms
translatenucs /tmp/slurm-7400926/tmp/15758272862975753372/nucl_6f_start /tmp/slurm-7400926/tmp/15758272862975753372/aa_6f_start --translation-table 1 --add-orf-stop 1 -v 3 --compressed 0 --threads 8

[=================================================================] 944.07K 0s 331ms
Time for merging files: 0h 0m 0s 515ms
Time for processing: 0h 0m 1s 97ms
extractorfs /tmp/slurm-7400926/tmp/15758272862975753372/nucl_reads /tmp/slurm-7400926/tmp/15758272862975753372/nucl_6f_long --min-length 45 --max-length 32734 --max-gaps 0 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 0 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --use-all-table-starts 0 --id-offset 0 --threads 8 --compressed 0 -v 3

[=================================================================] 31.46M 22s 660ms
Time for merging files: 0h 0m 0s 0ms
Time for merging files: 0h 0m 0s 0ms
Time for processing: 0h 0m 28s 889ms
tmp/15758272862975753372/assembler.sh: line 62: 31039 Segmentation fault      "$MMSEQS" translatenucs "${TMP_PATH}/nucl_6f_long" "${TMP_PATH}/aa_6f_long" ${TRANSLATENUCS_PAR}
Error: translatenucs long step died

Core dump during assembly

My processes are constantly crashing in the middle of the assembly. I get a *** glibc detected *** plass: invalid fastbin entry (free): 0x0000000000815a60 ***.

plass.segmentation.log

I installed Plass through Conda because of the old libraries on the server I'm working on.

I'm using Plass 2.c7e35, 2TB of memory, 16 × Intel(R) Xeon(R) CPU E5-4650 v2 @ 2.40GHz, Red Hat Enterprise Linux Server release 6.5.

Final output file size 0

Expected Behavior

I expect plass to output a fasta with amino acid sequences

Current Behavior

plass runs, but outputs a file with no amino acid sequences

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.

wget -O SRS476121_69.fna.cdbg_ids.reads.fa.gz https://osf.io/p7fqc/download
plass assemble SRS476121_69.fna.cdbg_ids.reads.fa.gz SRS476121_69.cdbg_ids.reads.plass.faa tmp

Plass Output (for bugs)

Log file: 11388349399477705273_log.txt
File sizes in tmp for plass run:
11388349399477705273_file_sizes.txt

Context

I am assembling reads that I think are derived from a single organism from a metagenome (e.g. reads from a spacegraphcats query). The reads are 101 bases long. The read file is 2.2GB, and I am treating it as single end.

Your Environment

I ran plass using conda, with the following environment:

channels:
   - conda-forge
   - bioconda
   - defaults
dependencies:
   - plass=3.764a3
   - cd-hit=4.8.1
   - paladin=1.4.6
   - samtools=1.10
   - salmon=0.15.0

I am on a Linux machine and ran plass with 128 GB of RAM and 8 CPUs (Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-70-generic x86_64))

Update MMseqs2 submodule

Hi @martin-steinegger
I am trying to install PLASS and found that the changes in rescorediagonal.cpp that avoid the infinite loop have not been pulled into the MMseqs2 submodule.

Thanks
Antonio

Using already predicted ORFs from reads

Hi
I would like to use the ORFs that I predicted using another approach. I checked the documentation and I haven't seen an option to deactivate the extractorfs step. Is this possible or do you have any suggestions on how to use already predicted ORFs?

Thank you very much
Antonio

Quality trimming reads?

Dear Plass team,

I am very interested in using this tool for protein assembly of soil metagenomes. I am curious whether you would recommend quality-filtering and trimming the reads first, e.g., using fastp. Will this improve the precision of Plass, or will the potential reduction in read length from trimming come at too great a cost in sensitivity? What would you recommend?

Best,
Tim

Insights needed for comparison with other assemblers

Hi there,

Congratulations on your tool. I'm really excited about PenguiN, as it could be an interesting alternative to explore. I've set out to compare it to our group's gold standard for environmental metagenomics, metaSPAdes, and am getting some really interesting data. Maybe you could help me interpret it, so we can decide whether to switch to PenguiN?

Here's how everything has been run so far on an example environmental metagenome:

metaSPAdes:

python3 metaspades.py -m 1150 -1 $read1 -2 $read2 -t 10 -o ${assembly}_metaspades3.15.5

PenguiN:

penguin guided_nuclassemble $read1 $read2 --threads 10 1 ${assembly}_PenguiN.fasta tmp

PenguiN_wmods:

penguin guided_nuclassemble $read1 $read2 --threads 10 --max-seq-len 1000000 --contig-output-mode 0 --num-iterations 10 --min-length 35 --use-all-table-starts 1 ${assembly}_PenguiN_wmods.fasta tmp

Mapping statistics really vary across methods
Using bowtie2, I mapped each assembly to its reads as below, after filtering each assembly to contain only scaffolds/contigs >1000 bp:

bowtie2-build $assembly bt2/$assembly > bt2/$assembly.log
bowtie2 -p 10 --sensitive -x bt2/$assembly -1 $read1 -2 $read2 2> $assembly.sam.log | shrinksam-master/shrinksam > $assembly.sam

Looking at the log files here's what I see:

Is this something you see a lot in your experience? In principle, I'd say that higher percentages of 'aligned concordantly >1 times' should be indicative of multimapping and thus not a good sign?
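
To compare the multimapping fraction across assemblies, the "aligned concordantly" lines of each bowtie2 log can be extracted programmatically. The sketch below assumes the usual paired-end summary layout that bowtie2 writes to stderr; the function name is hypothetical and the numbers in the example are made up, not taken from the assemblies discussed here.

```python
import re

# One summary line looks like: "    527 (5.27%) aligned concordantly >1 times"
_PATTERN = re.compile(
    r"^\s*(\d+) \(([\d.]+)%\) aligned concordantly "
    r"(0 times|exactly 1 time|>1 times)\s*$",
    re.MULTILINE,
)


def concordant_fractions(log_text: str) -> dict:
    """Map each 'aligned concordantly' category to its percentage of pairs."""
    return {m.group(3): float(m.group(2)) for m in _PATTERN.finditer(log_text)}


if __name__ == "__main__":
    # Made-up numbers in the layout of a bowtie2 paired-end summary.
    example = """\
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
96.70% overall alignment rate
"""
    print(concordant_fractions(example))
```

Running it on each `$assembly.sam.log` gives one small dictionary per assembler, which makes the ">1 times" percentages easy to tabulate side by side.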

Quite a lot of difference in mean/median contig lengths:

Here's a quick plot of median/mean contig lengths (scaffolds for metaSPAdes), with standard deviations as vertical lines from each point:

This makes sense when looking at length frequency distributions for each assembly:

Do you contemplate adding a scaffolding module to PenguiN? I wonder how these values could change with that!
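
For reference, the mean, median, and standard deviation of contig lengths mentioned above can be computed along these lines. This is a minimal sketch with hypothetical function names; it applies the same ">1000 bp" cutoff used for the mapping comparison and uses the sample standard deviation, which is an assumption about how the plotted values were derived.

```python
import statistics
from typing import Dict, Iterable, List


def contig_lengths(fasta_lines: Iterable[str], min_len: int = 1000) -> List[int]:
    """Return lengths of all FASTA records strictly longer than min_len bases."""
    lengths, current = [], None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return [n for n in lengths if n > min_len]


def length_summary(lengths: List[int]) -> Dict[str, float]:
    """Summarise contig lengths; stdev is 0.0 when only one contig is left."""
    if not lengths:
        raise ValueError("no contigs passed the length filter")
    return {
        "n": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Synthetic records; replace with the lines of an assembly FASTA file.
    demo = [">a", "A" * 1000, ">b", "A" * 1500, "A" * 500, ">c", "A" * 4000]
    print(length_summary(contig_lengths(demo)))
```

Because contig length distributions are usually strongly right-skewed, the median is often the more informative of the two central values when comparing assemblers.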

Finally, here's some more general stats on each assembly:

I think there's a lot of potential in PenguiN - I'm still reading up on it, but will take any insights you're willing to offer as you look at this data! I can also share rps3 taxonomic profiles I've run on each assembly if you'd want.

Thanks in advance for the attention!

missing handling of small number of input reads

Expected Behavior

Throw an exception or end the program if the number of reads is so small that the assembly will fail.
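
Until such a check exists in Plass itself, a wrapper script can guard against tiny inputs before starting the assembly. The following is a user-side sketch, not Plass behaviour: the function names are hypothetical, it assumes four lines per FASTQ record, and `min_reads=1000` is an arbitrary placeholder rather than a documented threshold.

```python
import gzip
import sys


def count_fastq_reads(path) -> int:
    """Count records in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def check_enough_reads(paths, min_reads=1000):
    """Raise ValueError if any input file has fewer than min_reads records.

    min_reads is an arbitrary placeholder, not a value documented by Plass.
    """
    for path in paths:
        n_reads = count_fastq_reads(path)
        if n_reads < min_reads:
            raise ValueError(f"{path}: only {n_reads} reads (< {min_reads})")


if __name__ == "__main__":
    # Usage: python check_reads.py R1.fastq R2.fastq
    try:
        check_enough_reads(sys.argv[1:])
    except ValueError as err:
        sys.exit(f"refusing to start assembly: {err}")
```

With the two-read example files from this report, the script exits with an error message instead of handing the input to the assembler.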

Current Behavior

Plass enters an infinite loop running on 24 threads (different from the value passed via --threads).

Steps to Reproduce (for bugs)

Create fastq files

R1.fastq

@SRR11015356.20904790
AGGGAACCAACTACAGACTGGGGGGACAACTCAGTGGTTAGGGTCACGTGTTCTTCTTGTGGCGGATGAGGGTTGGGATCCCAGAACCCACATGGGAGTCTCCAACCACCTGTAACTCTAGTTCCAGGCGACCCGATGATGGCCACCTCTG
+
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
@SRR11015356.20907155
AGGGAACCAACTACAGACTGGGGGGACAACTCAGTGGTTAGGGTCACGTGTTCTTCTTGTGGCGGATGAGGGTTGGGATCCCAGAACCCACATGGGAGTCTCCAACCACCTGTAACTCTAGTTCCAGGCGACCCGATGATGGCCACCTCTG
+
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

R2.fastq

@SRR11015356.20904790
TTTTGAAGCCACTATATGATCCAGAATGAAACTGTAGGAAACCAAGCTCCAGCCTAAAAAAGCCTCATCAATTTTTTTAAAAGAATTTTGTTTTTAGGTTTATGTTTTTATTTTGTGTGCAAGAGAGTTTTTCCTGCCTGTGTGTATGAGC
+
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
@SRR11015356.20907155
TTTTGAAGCCACTATATGATCCAGAATGAAACTGTAGGAAACCAAGCTCCAGCCTAAAAAAGCCTCATCAATTTTTTTAAAAGAATTTTGTTTTTAGGTTTATGTTTTTATTTTGTGTGCAAGAGAGTTTTTCCTGCCTGTGTGTATGAGC
+
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

Plass Output (for bugs)

cmd: plass nuclassemble R1.fastq R2.fastq assembled.fa tmp --threads 2 -v 3

Plass Output Log

MMseqs Version:                     	3.764a3
Check for circular sequences        	true
Minimum contig length               	1000
Clustering threshold                	0.97
Number search iterations            	12
Remove temporary files              	false
Delete temporary files incremental  	1
MPI runner                          	
Substitution matrix                 	nucl:nucleotide.out,aa:blosum62.out
Alphabet size                       	5
Seq. id. threshold                  	0.97
K-mers per sequence                 	60
scale k-mers per sequence           	0.1
Adjust k-mer length                 	false
Mask residues                       	0
Mask lower case residues            	0
Coverage mode                       	0
K-mer size                          	22
Coverage threshold                  	0
Max sequence length                 	65535
Shift hash                          	5
Split memory limit                  	0
Include only extendable             	true
Skip repeating k-mers               	true
Threads                             	2
Compressed                          	0
Verbosity                           	3
Rescore mode                        	3
Allow wrapped scoring               	false
Remove hits by seq. id. and coverage	false
E-value threshold                   	1e-05
Add backtrace                       	false
Min. alignment length               	0
Seq. id. mode                       	0
Include identical seq. id.          	false
Sort results                        	0
Preload mode                        	0
Chop Cycle                          	true

Temporary folder tmp does not exist or is not a directory.
Created directory tmp
PAIRED END MODE
mergereads R1.fastq R2.fastq /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/nucl_reads -v 3 

Start merging reads.
Time for merging to nucl_reads: 0h 0m 0s 0ms
Time for merging to nucl_reads_h: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 0ms
STEP: 0
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/nucl_reads /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/nucl_reads /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

Database size: 4 type: Nucleotide

Estimated memory consumption 0 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 4 0s 0ms      

Adjusted k-mer length 22
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref_0: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/nucl_reads /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/nucl_reads /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_0 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 3 --wrapped-scoring 0 --filter-hits 0 -e 1e-05 -c 0 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 0 -v 3 

[=================================================================] 100.00% 4 0s 0ms      
Time for merging to aln_0: 0h 0m 0s 0ms=====>                     ] 66.67% 3 eta 0s       
Time for processing: 0h 0m 0s 1ms
assembleresults /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/nucl_reads /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_0 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_0 --min-seq-id 0.97 --max-seq-len 65535 --threads 2 -v 3 --rescore-mode 3 

Compute assembly.
[=================================================================] 100.00% 4 0s 0ms      
Time for merging to assembly_0: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 3ms
cyclecheck /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_0 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_0_cycle --max-seq-len 65535 --chop-cycle 1 --threads 2 -v 3 

Time for merging to assembly_0_cycle: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 0ms
STEP: 1
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_0_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_0_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

Database size: 4 type: Nucleotide

Estimated memory consumption 0 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 4 0s 0ms      

Adjusted k-mer length 22
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref_1: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 2ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_0 

Time for processing: 0h 0m 0s 0ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_0_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_0_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_1 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 3 --wrapped-scoring 0 --filter-hits 0 -e 1e-05 -c 0 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 0 -v 3 

[=================================================================] 100.00% 4 0s 0ms      
Time for merging to aln_1: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_0 

Time for processing: 0h 0m 0s 0ms
assembleresults /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_0_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_1 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_1 --min-seq-id 0.97 --max-seq-len 65535 --threads 2 -v 3 --rescore-mode 3 

Compute assembly.
[=================================================================] 100.00% 4 0s 1ms      
Time for merging to assembly_1: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 5ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_0_noneCycle 

Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_0 

Time for processing: 0h 0m 0s 0ms
cyclecheck /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_1 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_1_cycle --max-seq-len 65535 --chop-cycle 1 --threads 2 -v 3 

Time for merging to assembly_1_cycle: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_0_cycle 

Time for processing: 0h 0m 0s 0ms
STEP: 2
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_1_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_1_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

Database size: 4 type: Nucleotide

Estimated memory consumption 0 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 4 0s 0ms      

Adjusted k-mer length 22
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref_2: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_1 

Time for processing: 0h 0m 0s 0ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_1_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_1_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_2 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 3 --wrapped-scoring 0 --filter-hits 0 -e 1e-05 -c 0 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 0 -v 3 

[=================================================================] 100.00% 4 0s 0ms      
Time for merging to aln_2: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_1 

Time for processing: 0h 0m 0s 0ms
assembleresults /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_1_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_2 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_2 --min-seq-id 0.97 --max-seq-len 65535 --threads 2 -v 3 --rescore-mode 3 

Compute assembly.
[=================================================================] 100.00% 4 0s 0ms      
Time for merging to assembly_2: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_1_noneCycle 

Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_1 

Time for processing: 0h 0m 0s 0ms
cyclecheck /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_2 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_2_cycle --max-seq-len 65535 --chop-cycle 1 --threads 2 -v 3 

Time for merging to assembly_2_cycle: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_1_cycle 

Time for processing: 0h 0m 0s 0ms
STEP: 3
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_2_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_3 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_2_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_3 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

Database size: 4 type: Nucleotide

Estimated memory consumption 0 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 4 0s 0ms      

Adjusted k-mer length 22
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref_3: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_2 

Time for processing: 0h 0m 0s 0ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_2_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_2_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_3 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_3 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 3 --wrapped-scoring 0 --filter-hits 0 -e 1e-05 -c 0 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 0 -v 3 

[=================================================================] 100.00% 4 0s 0ms      
Time for merging to aln_3: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_2 

Time for processing: 0h 0m 0s 0ms
assembleresults /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_2_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_3 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_3 --min-seq-id 0.97 --max-seq-len 65535 --threads 2 -v 3 --rescore-mode 3 

Compute assembly.
[=================================================================] 100.00% 4 0s 0ms      
Time for merging to assembly_3: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_2_noneCycle 

Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_2 

Time for processing: 0h 0m 0s 0ms
cyclecheck /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_3 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_3_cycle --max-seq-len 65535 --chop-cycle 1 --threads 2 -v 3 

Time for merging to assembly_3_cycle: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_2_cycle 

Time for processing: 0h 0m 0s 0ms
STEP: 4
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_3_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_4 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_3_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_4 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

Database size: 4 type: Nucleotide

Estimated memory consumption 0 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 4 0s 0ms      

Adjusted k-mer length 22
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref_4: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_3 

Time for processing: 0h 0m 0s 0ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_3_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_3_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_4 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_4 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 3 --wrapped-scoring 0 --filter-hits 0 -e 1e-05 -c 0 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 0 -v 3 

[=================================================================] 100.00% 4 0s 0ms      
Time for merging to aln_4: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_3 

Time for processing: 0h 0m 0s 0ms
assembleresults /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_3_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_4 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_4 --min-seq-id 0.97 --max-seq-len 65535 --threads 2 -v 3 --rescore-mode 3 

Compute assembly.
[=================================================================] 100.00% 4 0s 0ms      
Time for merging to assembly_4: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_3_noneCycle 

Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_3 

Time for processing: 0h 0m 0s 0ms
cyclecheck /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_4 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_4_cycle --max-seq-len 65535 --chop-cycle 1 --threads 2 -v 3 

Time for merging to assembly_4_cycle: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_3_cycle 

Time for processing: 0h 0m 0s 0ms
STEP: 5
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_4_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_5 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_4_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_5 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

Database size: 4 type: Nucleotide

Estimated memory consumption 0 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 4 0s 0ms      

Adjusted k-mer length 22
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref_5: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_4 

Time for processing: 0h 0m 0s 0ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_4_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_4_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_5 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_5 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 3 --wrapped-scoring 0 --filter-hits 0 -e 1e-05 -c 0 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 0 -v 3 

[=================================================================] 100.00% 4 0s 0ms      
Time for merging to aln_5: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_4 

Time for processing: 0h 0m 0s 0ms
assembleresults /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_4_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_5 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_5 --min-seq-id 0.97 --max-seq-len 65535 --threads 2 -v 3 --rescore-mode 3 

Compute assembly.
[=================================================================] 100.00% 4 0s 0ms      
Time for merging to assembly_5: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_4_noneCycle 

Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_4 

Time for processing: 0h 0m 0s 0ms
cyclecheck /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_5 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_5_cycle --max-seq-len 65535 --chop-cycle 1 --threads 2 -v 3 

Time for merging to assembly_5_cycle: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_4_cycle 

Time for processing: 0h 0m 0s 0ms
STEP: 6
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_5_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_6 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_5_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_6 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

Database size: 4 type: Nucleotide

Estimated memory consumption 0 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 4 0s 0ms      

Adjusted k-mer length 22
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref_6: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 2ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_5 

Time for processing: 0h 0m 0s 0ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_5_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_5_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_6 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_6 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 3 --wrapped-scoring 0 --filter-hits 0 -e 1e-05 -c 0 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 0 -v 3 

[=================================================================] 100.00% 4 0s 0ms      
Time for merging to aln_6: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_5 

Time for processing: 0h 0m 0s 0ms
assembleresults /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_5_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_6 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_6 --min-seq-id 0.97 --max-seq-len 65535 --threads 2 -v 3 --rescore-mode 3 

Compute assembly.
[=================================================================] 100.00% 4 0s 0ms      
Time for merging to assembly_6: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 2ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_5_noneCycle 

Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_5 

Time for processing: 0h 0m 0s 0ms
cyclecheck /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_6 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_6_cycle --max-seq-len 65535 --chop-cycle 1 --threads 2 -v 3 

Time for merging to assembly_6_cycle: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_5_cycle 

Time for processing: 0h 0m 0s 0ms
STEP: 7
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_6_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_7 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_6_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_7 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

Database size: 4 type: Nucleotide

Estimated memory consumption 0 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 4 0s 0ms      

Adjusted k-mer length 22
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref_7: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_6 

Time for processing: 0h 0m 0s 0ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_6_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_6_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_7 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_7 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 3 --wrapped-scoring 0 --filter-hits 0 -e 1e-05 -c 0 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 0 -v 3 

[=================================================================] 100.00% 4 0s 0ms      
Time for merging to aln_7: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_6 

Time for processing: 0h 0m 0s 0ms
assembleresults /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_6_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_7 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_7 --min-seq-id 0.97 --max-seq-len 65535 --threads 2 -v 3 --rescore-mode 3 

Compute assembly.
[=================================================================] 100.00% 4 0s 0ms      
Time for merging to assembly_7: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_6_noneCycle 

Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_6 

Time for processing: 0h 0m 0s 0ms
cyclecheck /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_7 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_7_cycle --max-seq-len 65535 --chop-cycle 1 --threads 2 -v 3 

Time for merging to assembly_7_cycle: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 2ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_6_cycle 

Time for processing: 0h 0m 0s 0ms
STEP: 8
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_7_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_8 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_7_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_8 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

Database size: 4 type: Nucleotide

Estimated memory consumption 0 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 4 0s 0ms      

Adjusted k-mer length 22
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref_8: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_7 

Time for processing: 0h 0m 0s 0ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_7_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_7_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_8 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_8 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 3 --wrapped-scoring 0 --filter-hits 0 -e 1e-05 -c 0 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 0 -v 3 

[=================================================================] 100.00% 4 0s 0ms      
Time for merging to aln_8: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_7 

Time for processing: 0h 0m 0s 0ms
assembleresults /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_7_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_8 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_8 --min-seq-id 0.97 --max-seq-len 65535 --threads 2 -v 3 --rescore-mode 3 

Compute assembly.
[=================================================================] 100.00% 4 0s 0ms      
Time for merging to assembly_8: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_7_noneCycle 

Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_7 

Time for processing: 0h 0m 0s 0ms
cyclecheck /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_8 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_8_cycle --max-seq-len 65535 --chop-cycle 1 --threads 2 -v 3 

Time for merging to assembly_8_cycle: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_7_cycle 

Time for processing: 0h 0m 0s 0ms
STEP: 9
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_8_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_9 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_8_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_9 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

Database size: 4 type: Nucleotide

Estimated memory consumption 0 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 4 0s 0ms      

Adjusted k-mer length 22
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref_9: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_8 

Time for processing: 0h 0m 0s 0ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_8_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_8_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_9 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_9 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 3 --wrapped-scoring 0 --filter-hits 0 -e 1e-05 -c 0 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 0 -v 3 

[=================================================================] 100.00% 4 0s 0ms      
Time for merging to aln_9: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 2ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_8 

Time for processing: 0h 0m 0s 0ms
assembleresults /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_8_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_9 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_9 --min-seq-id 0.97 --max-seq-len 65535 --threads 2 -v 3 --rescore-mode 3 

Compute assembly.
[=================================================================] 100.00% 4 0s 0ms      
Time for merging to assembly_9: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_8_noneCycle 

Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_8 

Time for processing: 0h 0m 0s 0ms
cyclecheck /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_9 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_9_cycle --max-seq-len 65535 --chop-cycle 1 --threads 2 -v 3 

Time for merging to assembly_9_cycle: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_8_cycle 

Time for processing: 0h 0m 0s 0ms
STEP: 10
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_9_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_10 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_9_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_10 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

Database size: 4 type: Nucleotide

Estimated memory consumption 0 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 4 0s 0ms      

Adjusted k-mer length 22
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref_10: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_9 

Time for processing: 0h 0m 0s 0ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_9_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_9_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_10 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_10 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 3 --wrapped-scoring 0 --filter-hits 0 -e 1e-05 -c 0 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 0 -v 3 

[=================================================================] 100.00% 4 0s 0ms      
Time for merging to aln_10: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_9 

Time for processing: 0h 0m 0s 0ms
assembleresults /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_9_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_10 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_10 --min-seq-id 0.97 --max-seq-len 65535 --threads 2 -v 3 --rescore-mode 3 

Compute assembly.
[=================================================================] 100.00% 4 0s 0ms      
Time for merging to assembly_10: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_9_noneCycle 

Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_9 

Time for processing: 0h 0m 0s 0ms
cyclecheck /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_10 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_10_cycle --max-seq-len 65535 --chop-cycle 1 --threads 2 -v 3 

Time for merging to assembly_10_cycle: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_9_cycle 

Time for processing: 0h 0m 0s 0ms
STEP: 11
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_10_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_11 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_10_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_11 --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 22 -c 0 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 1 --ignore-multi-kmer 1 --threads 2 --compressed 0 -v 3 

Database size: 4 type: Nucleotide

Estimated memory consumption 0 MB
Generate k-mers list for 1 split
[=================================================================] 100.00% 4 0s 0ms      

Adjusted k-mer length 22
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 0ms
Time for merging to pref_11: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_10 

Time for processing: 0h 0m 0s 0ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_10_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_10_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/pref_11 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_11 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 3 --wrapped-scoring 0 --filter-hits 0 -e 1e-05 -c 0 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 2 --compressed 0 -v 3 

[=================================================================] 100.00% 4 0s 0ms      
Time for merging to aln_11: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 2ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_10 

Time for processing: 0h 0m 0s 0ms
assembleresults /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_10_noneCycle /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/aln_11 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_11 --min-seq-id 0.97 --max-seq-len 65535 --threads 2 -v 3 --rescore-mode 3 

Compute assembly.
[=================================================================] 100.00% 4 0s 0ms      
Time for merging to assembly_11: 0h 0m 0s 0ms

Done.
Time for processing: 0h 0m 0s 1ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_10_noneCycle 

Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_10 

Time for processing: 0h 0m 0s 0ms
cyclecheck /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_11 /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_11_cycle --max-seq-len 65535 --chop-cycle 1 --threads 2 -v 3 

Time for merging to assembly_11_cycle: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 0ms
rmdb /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_10_cycle 

Time for processing: 0h 0m 0s 0ms
Tmp /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/clu_tmp folder does not exist or is not a directory.
Create dir /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/clu_tmp
linclust /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_final /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/clu /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/clu_tmp --alph-size 5 -k 22 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --min-seq-id 0.97 --cov-mode 1 -c 0.99 --wrapped-scoring 1 

Set cluster mode GREEDY MEM.
kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_final /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/clu_tmp/8372675225677313391/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 1 -k 22 -c 0.99 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 4 --compressed 0 -v 3 

kmermatcher /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_final /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/clu_tmp/8372675225677313391/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 5 --min-seq-id 0.97 --kmer-per-seq 60 --kmer-per-seq-scale 0.1 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 1 -k 22 -c 0.99 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 4 --compressed 0 -v 3 

Database size: 0 type: Nucleotide

Estimated memory consumption 0 MB
Time for fill: 0h 0m 0s 0ms
Time for merging to pref: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
rescorediagonal /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_final /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/assembly_final /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/clu_tmp/8372675225677313391/pref /home/mabuelanin/Downloads/bug_tmp/tmp/12805131151232307666/clu_tmp/8372675225677313391/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 1 --filter-hits 0 -e 0.001 -c 0.99 -a 0 --cov-mode 1 --min-seq-id 0.97 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 4 --compressed 0 -v 3
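
In the log above, every assembly step shown reports `Database size: 4`, while the final `linclust` stage on `assembly_final` reports `Database size: 0`. A minimal sketch for spotting such a drop in a saved log, assuming the log is available as plain text; the helper names `database_sizes` and `first_empty_stage` are hypothetical and not part of Plass or MMseqs2:

```python
import re

# Matches lines such as "Database size: 4 type: Nucleotide" from the log.
DB_SIZE = re.compile(r"^Database size: (\d+) type: (\w+)", re.MULTILINE)


def database_sizes(log_text):
    """Return the database sizes reported in the log, in order of appearance."""
    return [int(m.group(1)) for m in DB_SIZE.finditer(log_text)]


def first_empty_stage(log_text):
    """Return the index of the first 'Database size: 0' report, or None."""
    for i, size in enumerate(database_sizes(log_text)):
        if size == 0:
            return i
    return None


sample = """Database size: 4 type: Nucleotide
Adjusted k-mer length 22
Database size: 0 type: Nucleotide
"""
print(database_sizes(sample))     # [4, 0]
print(first_empty_stage(sample))  # 1
```

An index other than `None` only indicates where an empty database first appears; it does not by itself explain why the stage received no sequences.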

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Plass Version: 3.764a3
Conda package
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): Tried on two different systems with the same result: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz with 64 GB RAM, and Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz with 16 GB RAM
Operating system and version: Ubuntu 16.04

mmseqs extractorfs can translate ORFs, but the CLI does not allow specifying the translation table

mmseqs extractorfs can translate ORFs, but the CLI does not allow specifying the translation table.

Unless this is a hidden option.
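
One way to find out whether a given build exposes such an option is to search its help text. The following is a minimal sketch, assuming that `mmseqs extractorfs -h` prints the module's options. The flag name `--translation-table` is taken from the extractorfs calls that appear in logs elsewhere on this page, and `help_mentions_flag` is a hypothetical helper, not part of MMseqs2.

```shell
# Hypothetical helper: read help text on stdin and succeed if it contains
# the given flag as a literal string.
help_mentions_flag() {
    flag="$1"
    # -F: match literally; "--" stops option parsing so that a pattern
    # starting with "--" is not mistaken for a grep option.
    grep -qF -- "$flag"
}
```

A possible use would be `mmseqs extractorfs -h 2>&1 | help_mentions_flag --translation-table && echo "option available"`. If the option is listed, it can presumably be passed on the command line as in those logs.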

Memory or disk issue?

Current Behavior

Plass died. I am unsure whether this is due to a RAM issue or a lack of temporary disk space.
Server: 512 GB, Ubuntu 16.04.

Failed to mmap memory dataSize=0 File=/tmp/6803214812655189031/nucl_6f_long. Error 22.

Thanks
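
The message reports `dataSize=0`, which suggests the file was empty when it was mapped. A quick way to tell an empty file from a missing one is sketched below, assuming the temporary files are still on disk. `check_db_file` is a hypothetical helper and not part of Plass or MMseqs2.

```shell
# Hypothetical helper: print whether a file is missing, empty, or non-empty.
check_db_file() {
    path="$1"
    if [ ! -e "$path" ]; then
        echo "missing"
    elif [ ! -s "$path" ]; then
        # -s is true only for files with a size greater than zero.
        echo "empty"
    else
        echo "non-empty"
    fi
}
```

For example, running `check_db_file /tmp/6803214812655189031/nucl_6f_long` and then `df -h /tmp` would show whether the file is empty and how much temporary space remains.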

Steps to Reproduce (for bugs)

srun -c 48 /mnt/ngsnfs/tools/plass/plass/bin/plass assemble --threads 48 MBCF_117_S38_R1.fastq out.fa /tmp/

Plass Output (for bugs)

Program call:
assemble --threads 48 MBCF_117_S38_R1.fastq out.fa /tmp/

MMseqs Version: 26b5d66
Sub Matrix blosum62.out
Rescore mode 0
Remove hits by seq.id. and coverage false
E-value threshold 1e-05
Coverage threshold 0
Coverage Mode 0
Seq. Id Threshold 0.9
Seq. Id. Mode 0
Include identical Seq. Id. false
Sort results 0
In substitution scoring mode, performs global alignment along the diagonal false
Preload mode 0
Threads 48
Verbosity 3
Alphabet size 13
Kmer per sequence 60
Mask Residues 0
K-mer size 14
Max. sequence length 65535
Shift hash 5
Split Memory Limit 0
Include only extendable true
Skip sequence with n repeating k-mers 8
Min codons in orf 45
Max codons in length 2147483647
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 0
Forward Frames 1,2,3
Reverse Frames 1,2,3
Translation Table 1
Use all table starts false
Offset of numeric ids 0
Protein Filter Threshold 0.2
Filter Proteins 1
Number search iterations 12
Remove Temporary Files false
Sets the MPI runner

Program call:
createdb MBCF_117_S38_R1.fastq /tmp/6803214812655189031/nucl_reads --max-seq-len 65535 --dont-split-seq-by-len 0 --dont-shuffle 1 --id-offset 0 -v 3

MMseqs Version: 26b5d66
Max. sequence length 65535
Split Seq. by len false
Do not shuffle input database true
Offset of numeric ids 0
Verbosity 3

1 Mio. sequences processed
2 Mio. sequences processed
3 Mio. sequences processed
4 Mio. sequences processed
5 Mio. sequences processed
6 Mio. sequences processed
7 Mio. sequences processed
8 Mio. sequences processed
9 Mio. sequences processed
10 Mio. sequences processed
11 Mio. sequences processed
12 Mio. sequences processed
13 Mio. sequences processed
14 Mio. sequences processed
15 Mio. sequences processed
16 Mio. sequences processed
Time for merging files: 0h 0m 2s 140ms
Time for merging files: 0h 0m 2s 28ms
Touch data file /tmp/6803214812655189031/nucl_reads ... Done.
Time for merging files: 0h 0m 15s 353ms
Touch data file /tmp/6803214812655189031/nucl_reads_h ... Done.
Time for merging files: 0h 0m 15s 312ms
Time for processing: 0h 1m 55s 831ms
Program call:
extractorfs /tmp/6803214812655189031/nucl_reads /tmp/6803214812655189031/nucl_6f_start --min-length 20 --max-length 45 --max-gaps 0 --contig-start-mode 1 --contig-end-mode 0 --orf-start-mode 0 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --use-all-table-starts 0 --id-offset 0 --threads 48 -v 3

MMseqs Version: 26b5d66
Min codons in orf 20
Max codons in length 45
Max orf gaps 0
Contig start mode 1
Contig end mode 0
Orf start mode 0
Forward Frames 1,2,3
Reverse Frames 1,2,3
Translation Table 1
Use all table starts false
Offset of numeric ids 0
Threads 48
Verbosity 3

16 Mio. sequences processed
10 Mio. sequences processed
8 Mio. sequences processed
14 Mio. sequences processed
15 Mio. sequences processed
13 Mio. sequences processed
7 Mio. sequences processed
11 Mio. sequences processed
12 Mio. sequences processed
9 Mio. sequences processed
6 Mio. sequences processed
5 Mio. sequences processed
1 Mio. sequences processed
3 Mio. sequences processed
2 Mio. sequences processed
4 Mio. sequences processed
Time for merging files: 0h 0m 0s 96ms
Time for merging files: 0h 0m 0s 95ms
Time for processing: 0h 0m 5s 85ms
Program call:
translatenucs /tmp/6803214812655189031/nucl_6f_start /tmp/6803214812655189031/aa_6f_start --translation-table 1 --add-orf-stop 1 -v 3 --threads 48

MMseqs Version: 26b5d66
Translation Table 1
Add Orf Stop true
Verbosity 3
Threads 48

Time for merging files: 0h 0m 0s 202ms
Time for processing: 0h 0m 0s 452ms
Program call:
extractorfs /tmp/6803214812655189031/nucl_reads /tmp/6803214812655189031/nucl_6f_long --min-length 45 --max-length 2147483647 --max-gaps 0 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 0 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --use-all-table-starts 0 --id-offset 0 --threads 48 -v 3

MMseqs Version: 26b5d66
Min codons in orf 45
Max codons in length 2147483647
Max orf gaps 0
Contig start mode 2
Contig end mode 2
Orf start mode 0
Forward Frames 1,2,3
Reverse Frames 1,2,3
Translation Table 1
Use all table starts false
Offset of numeric ids 0
Threads 48
Verbosity 3

16 Mio. sequences processed
3 Mio. sequences processed
14 Mio. sequences processed
15 Mio. sequences processed
13 Mio. sequences processed
2 Mio. sequences processed
11 Mio. sequences processed
9 Mio. sequences processed
8 Mio. sequences processed
5 Mio. sequences processed
6 Mio. sequences processed
10 Mio. sequences processed
12 Mio. sequences processed
1 Mio. sequences processed
7 Mio. sequences processed
4 Mio. sequences processed
Time for merging files: 0h 0m 0s 1ms
Time for merging files: 0h 0m 0s 1ms
Time for processing: 0h 0m 4s 905ms
Program call:
translatenucs /tmp/6803214812655189031/nucl_6f_long /tmp/6803214812655189031/aa_6f_long --translation-table 1 --add-orf-stop 1 -v 3 --threads 48

MMseqs Version: 26b5d66
Translation Table 1
Add Orf Stop true
Verbosity 3
Threads 48

Failed to mmap memory dataSize=0 File=/tmp/6803214812655189031/nucl_6f_long. Error 22.
Error: translatenucs long step died
srun: error: hpc-rc03: task 0: Exited with exit code 1

Using more stringent parameters to avoid spurious sequences

Hey there,

I've been testing some PLASS assemblies with my datasets and I noticed that it retrieves ~50 times more proteins than my MEGAHIT+Prodigal predictions, which is far more than what is shown in Fig. S2 of the paper.

I ran PLASS with default parameters, which means that a lot of very short peptides were retrieved (as the default value of --min-length is 45). I also noticed that most of them are incomplete (without * at the beginning and the end of the sequence). Even though partial peptides are to be expected in metagenome assemblies, I'm concerned that PLASS may be giving me a substantial amount of spurious sequences. In Swiss-Prot, ~8% of the prokaryotic proteins are between 45 and 100 residues. In my assemblies, 40% of the sequences fall into this interval.

In the paper, you excluded peptides shorter than 100 residues (which I'd expect to achieve by using --min-length 100), but I'm apprehensive about doing that because I don't want to leave short bona fide proteins out of my assembly.

Do you think raising the neural network threshold is a sound idea to solve the problem of spurious sequences? How was the default threshold (0.2) determined?
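
The share of sequences in a length interval, such as the 45 to 100 residue range discussed above, can be measured directly on an output FASTA file. The following is a minimal sketch, assuming that `*` characters mark sequence ends and should not count toward the length. `fraction_in_length_range` is a hypothetical helper, not part of Plass.

```shell
# Hypothetical helper: print the fraction of FASTA records whose sequence
# length lies in [LO, HI]. Wrapped sequence lines are joined first, and
# "*" end markers are removed before measuring.
fraction_in_length_range() {
    lo="$1"
    hi="$2"
    fasta="$3"
    awk -v lo="$lo" -v hi="$hi" '
        function tally() {
            if (seen) {
                total++
                gsub(/\*/, "", seq)
                n = length(seq)
                if (n >= lo && n <= hi) hits++
            }
        }
        /^>/ { tally(); seen = 1; seq = ""; next }
        { seq = seq $0 }
        END {
            tally()
            if (total > 0) printf "%.3f\n", hits / total
            else print "no sequences"
        }
    ' "$fasta"
}
```

Calling `fraction_in_length_range 45 100 out.fa` before and after changing the filter threshold would give a rough indication of how the length distribution shifts, although it says nothing about which short sequences are genuine.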

Empty (len:0) sequences in plass output

Expected Behavior

Sequences shorter than --min-length (45 aa by default) should not be written to the output.

Current Behavior

Sequences shorter than --min-length are being written to the output, even empty ones (len:0).
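
As a stopgap, short and empty records can be dropped from the output after the run. The following is a minimal sketch, assuming that `*` end markers should not count toward the length. `filter_fasta_min_len` is a hypothetical helper and does not address why such records are written in the first place.

```shell
# Hypothetical helper: keep only FASTA records with at least MIN_LEN
# residues. Wrapped sequence lines are joined into one line, and "*"
# end markers are ignored when measuring but kept in the output.
filter_fasta_min_len() {
    min_len="$1"
    fasta="$2"
    awk -v min="$min_len" '
        function emit(   residues) {
            if (header == "") return
            residues = seq
            gsub(/\*/, "", residues)
            if (length(residues) >= min) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$fasta"
}
```

For instance, `filter_fasta_min_len 45 "${FASTA_OUT}" > filtered.fa` would write only records of 45 or more residues to `filtered.fa`, a file name chosen here for illustration.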

Steps to Reproduce (for bugs)

plass \
assemble \
 ${FW} \
 ${RV} \
 ${FASTA_OUT} \
 ${TMP_DIR}

Plass Output (for bugs)

General output: https://gist.github.com/aleixop/76bd8e2fc4e9a88ba7072f470abbc600

Context

Co-assembly of ~300M paired-end reads with default parameters; the run completes without errors.
250 GB of RAM and 48 CPUs.

Your Environment

Include as many relevant details as possible about the environment in which you experienced the bug.

Git commit used (The string after "Plass Version:" when you execute Plass without any parameters): 5d03cce371dc51c23652a251550c33fd0358690d
Which Plass version was used (Statically-compiled, self-compiled, Homebrew, etc.): Statically-compiled
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): AVX2 supported

Combining single and paired reads

Just wondering, is there any way to combine single and paired reads from the same library into a single assembly?

Can not allocate index memory in DBReader

Hi, I am Kihyun,

I ran the assemble command like this:
plass assemble input/I_1.fastq.gz input/I_2.fastq.gz plass_proteome/I.plass.fasta plass_tmp --threads 6 --remove-tmp-files --max-seq-len 30000

The run ended with the following error message (I've replaced some lengthy directory paths with {...}):

Temporary folder plass_tmp does not exist or is not a directory.
Created directory plass_tmp
PAIRED END MODE
mergereads input/I_1.fastq.gz input/I_2.fastq.gz {...}/plass_tmp/2685330570646735821/nucl_reads -v 3

Start merging reads.
Time for merging files: 0h 4m 52s 502ms
Time for merging files: 0h 1m 55s 7ms

Done.
Time for processing: 0h 25m 6s 66ms
extractorfs {...}/plass_tmp/2685330570646735821/nucl_reads {...}/plass_tmp/2685330570646735821/nucl_6f_start --min-length 20 --max-length 45 --max-gaps 0 --contig-start-mode 1 --contig-end-mode 0 --orf-start-mode 0 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --use-all-table-starts 0 --id-offset 0 --threads 6 --compressed 0 -v 3

[=================================================================] 100.00% 100.03M 6m 55s 977ms
Time for merging files: 0h 0m 15s 98ms
Time for merging files: 0h 0m 52s 8ms
Time for processing: 0h 8m 53s 976ms

translatenucs {...}/plass_tmp/2685330570646735821/nucl_6f_start {...}/plass_tmp/2685330570646735821/aa_6f_start --translation-table 1 --add-orf-stop 1 -v 3 --compressed 0 --threads 6

[=================================================================] 100.00% 27.10M 11s 751ms
Time for merging files: 0h 0m 18s 606ms
Time for processing: 0h 0m 37s 850ms

extractorfs {...}/plass_tmp/2685330570646735821/nucl_reads {...}/plass_tmp/2685330570646735821/nucl_6f_long --min-length 45 --max-length 32734 --max-gaps 0 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 0 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --use-all-table-starts 0 --id-offset 0 --threads 6 --compressed 0 -v 3

[=================================================================] 100.00% 100.03M 12m 28s 264ms
Time for merging files: 0h 3m 44s 399ms
Time for merging files: 0h 10m 38s 798ms
Can not allocate index memory in DBReader
Error: extractorfs longest step died

My plass version is 1667488
and I installed this as suggested

# latest static linux build s
 wget https://mmseqs.com/plass/plass-static_sse41.tar.gz; tar xvfz plass-static_sse41.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH

I wonder what causes this error. My guess is that my server does not have enough memory for input data of this size. Or could it be related to the requirement that "Plass needs a CPU with at least the SSE4.1 instruction set"?
By the way, the sizes of the input files I used were:
I_1.fastq.gz 4.7G
I_2.fastq.gz 5.4G

For more information on my working environment, the Linux distribution is CentOS Linux release 7.5.1804 (Core). I am not familiar with CPU hardware, so I can't be sure, but since the /proc/cpuinfo file contains "sse4_1" and "sse4_2" in its lines starting with "flags : ", I assume that these CPUs do support SSE4.1?

Thanks,
Kihyun
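
The /proc/cpuinfo check described above can be scripted. The following is a minimal sketch for Linux, where `has_sse41` is a hypothetical helper. It only confirms the instruction set and does not explain the DBReader allocation error.

```shell
# Hypothetical helper: succeed if the given cpuinfo file (default
# /proc/cpuinfo) lists sse4_1 as a whole word on a "flags" line.
has_sse41() {
    cpuinfo="${1:-/proc/cpuinfo}"
    grep -E '^flags[[:space:]]*:' "$cpuinfo" | grep -qw 'sse4_1'
}
```

Running `has_sse41 && echo "SSE4.1 supported"` prints the message only when the flag is present.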

Recurrently getting "Kmer matching step died"

I keep getting the "Kmer matching step died" error during my assemblies. I wasn't able to pinpoint what may be causing it, because it appears only some of the time, even when I'm assembling the same data with the same parameters.

PLASS Version: a983491

Ubuntu 18.04, 184 GB of memory, 96 × Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz

strict_assembly.log

Automated Docker images with sha256 tags

Expected Behavior

The ideal source for reproducible Docker images is a docker server that allows you to request images by sha256 hash. Quay provides that service with automated integration with GitHub.

Current Behavior

The only issue with the Docker image currently hosted on Docker Hub is that users cannot pull it by sha256 hash, so there is no guarantee that two pulled copies are 100% identical.

Solution

It is rather easy to set up a repository on Quay that automatically builds from a Dockerfile within a GitHub repo on each commit (or release). If you were to set that up once, then users could pull down a single canonical Docker image for each release of Plass. See the docs for more details.
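
Once an image digest is known, a pull can be pinned to it with the `REPOSITORY@sha256:DIGEST` reference form. The following is a minimal sketch that builds such a reference. `pinned_image_ref` is a hypothetical helper, and the repository name in the usage note is a placeholder, not an existing Plass image.

```shell
# Hypothetical helper: build a digest-pinned image reference after
# checking that the digest is exactly 64 lowercase hex characters.
pinned_image_ref() {
    repo="$1"
    digest="$2"
    if ! printf '%s' "$digest" | grep -Eq '^[0-9a-f]{64}$'; then
        echo "invalid sha256 digest" >&2
        return 1
    fi
    printf '%s@sha256:%s\n' "$repo" "$digest"
}
```

A pinned pull would then look like `docker pull "$(pinned_image_ref quay.io/example/plass "$DIGEST")"`, where `DIGEST` holds the value published for the release.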