gatb / simka


Simka and SimkaMin are comparative metagenomics methods dedicated to NGS datasets.

Home Page: https://gatb.inria.fr/software/simka/

License: GNU Affero General Public License v3.0


simka's Introduction


Simka & SimkaMin


This repository contains the Simka and SimkaMin software. This README focuses on Simka; all information about SimkaMin is located in the simkaMin directory.


What is Simka?

Simka is a de novo comparative metagenomics tool. It represents each dataset as a k-mer spectrum and computes several classical ecological distances between them.

Developer: Gaëtan Benoit, PhD, former member of the Genscale team at Inria.

Contact: claire dot lemaitre at inria dot fr


Install a binary release of simka

Retrieve the binary archive file from one of the official simka releases (see the "Releases" tab on the GitHub page of the simka project); the file name is "simka-xyz-bin-Darwin.tar.gz" or "simka-xyz-bin-Linux.tar.gz" (where xyz is a release version).

Then, from the command-line:

gunzip simka-xyz-bin-Darwin.tar.gz
tar -xf simka-xyz-bin-Darwin.tar
cd simka-xyz-Darwin
chmod +x bin/* example/*.sh

The simka binary is in the "bin" folder. You can try the software on your computer as follows:

cd example
./simple_test.sh

In case the software does not run appropriately on your system, consider installing it from its source code, as explained below.

For further instructions on using simka, see User Manual, below.

Install simka from source code: git clone

Requirements: cmake 2.6+ and gcc 4.4.7+ (Linux) or clang 4.1+ (Mac OSX).

From the command-line:

git clone https://github.com/GATB/simka.git
cd simka
sh INSTALL

See the INSTALL file for more information.

Then, you can try the software on your computer, as follows:

cd example
./simple_test.sh

The installation creates 4 executables (in the ./build/bin directory):

simka: main software to be used for your analysis
simkaCount, simkaMerge and simkaCountProcess: not to be used directly, called by 'simka'

All executables must stay in the same folder; if you want to move them elsewhere on your system, keep them together.

For further instructions on using simka, see User Manual, below.

Install simka from source code: using a source release archive

Requirements: cmake 2.6+ and gcc 4.5+ (Linux) or clang 4.1+ (Mac OSX).

Retrieve the source code archive file from one of the official simka releases (see the "Releases" tab on the GitHub page of the simka project); the file name is "simka-xyz-Source.tar.gz" (where xyz is a release version).

Then, from the command-line:

gunzip simka-xyz-Source.tar.gz
tar -xf simka-xyz-Source.tar
cd simka-xyz-Source
sh INSTALL

Then, you can try the software on your computer, as follows:

cd example
./simple_test.sh

For further instructions on using simka, see User Manual, below.

Changelog

  • version 1.5.1 Sept 05, 2019:
    • simkaMin: easier usage of simkaMin, useful for conda packaging
  • version 1.5 Jun 07, 2019:
    • simkaMin software: faster results by subsampling the k-mer space
  • version 1.4 Jun 21, 2017:
    • update gatb-core to version 1.2.2
    • simka now provides gz-compressed results
    • new scripts for result visualization
  • version 1.3.2 Oct 25, 2016:
    • improved memory usage of the symmetrical distances
    • option -data-info to compute information on the input data (number of reads per dataset...)
    • intermediate merge-sort passes to handle large numbers of datasets
    • prevent distances from producing NaN values
    • fix a bug that occurred during k-mer counting
  • version 1.3.0 July 29, 2016:
    • Bray-Curtis computed by default
    • better k-mer statistics
    • fix bug in script for creating heatmaps
    • add "all in memory" k-mer counter when k <= 15
    • fine-grained parallelization for computing distances
    • clean all memory leaks with valgrind
    • update help messages
    • redirect stdout and stderr of parallel processes to specific log files
  • version 1.0.1 March 16, 2016: minor updates and bug fixes, first release on GitHub
  • version 1 Feb 16, 2016: stable version
  • version 0.1 May 28, 2015: initial public release

User manual

Description

Simka computes several classical ecological distances between N (metagenomic) read sets based on k-mer counts. Simka is implemented with the GATB library (http://gatb.inria.fr/).

Input

The input file (-in) lists the datasets. Datasets can be in FASTA or FASTQ format, optionally gzip-compressed (.gz).

List one dataset per line with the following syntax (any number of spaces and/or tabs may separate the fields):

ID1: filename.fasta
ID2: filename.fasta
ID3: filename.fasta

The dataset ID is the name that will appear in the headers of the distance matrices.

You can find an example simka input file in ./example/data/simka_input.txt

If a given dataset has been split into several parts, Simka can automatically concatenate them:

ID1: filename_part1.fasta , filename_part2.fasta , ...

If you have paired files, you can list them separated by a ‘;’:

ID1: filename_pair1.fasta ; filename_pair2.fasta

You can combine concatenated and paired operations:

ID1: filename_part1_pair1.fasta , filename_part2_pair1.fasta ; filename_part1_pair2.fasta , filename_part2_pair2.fasta

The paired syntax is only useful if the -max-reads option of Simka is set.

Example:

If -max-reads is set to 100, then Simka will consider the first 100 reads of the first paired file, the first 100 reads of the second paired file, and so on.
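
For many samples, writing this input file by hand is tedious. Below is a minimal helper sketch (hypothetical, not shipped with Simka) that builds the input file from a directory of paired FASTQ files named <sample>_1.fastq.gz / <sample>_2.fastq.gz:

# build_simka_input.py - hypothetical helper, not part of Simka.
# Writes one dataset per line, with paired files separated by ';' (see syntax above).
import glob
import os

with open("simka_input.txt", "w") as out:
    for r1 in sorted(glob.glob("reads/*_1.fastq.gz")):
        sample = os.path.basename(r1)[:-len("_1.fastq.gz")]
        r2 = r1.replace("_1.fastq.gz", "_2.fastq.gz")
        out.write("%s: %s ; %s\n" % (sample, r1, r2))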

Output

Temporary output

The option -out-tmp controls where the temporary files of Simka are stored.

This option is mandatory since the disk usage of Simka can be high, depending on the input size.

This option must target a directory on your fastest disk with sufficient free space.

One may want to add new datasets to existing Simka results without recomputing everything (for instance, if your metagenomic project is still growing). This is only possible if the temporary files are kept on disk, using the -keep-tmp option of Simka.

Result output

Simka results are distance matrices. A distance matrix is a square matrix of size N×N (where N is the number of input datasets). Each value in the matrix gives the distance between a pair of datasets. These values are usually in the range [0, 1]. A distance value of 0 means that the pair of datasets is identical; the higher the distance value, the more dissimilar the pair of datasets.

Simka results will be stored in the directory indicated by -out option.

By default, Simka computes an abundance-based Bray-Curtis distance matrix and a presence-absence-based Jaccard distance matrix.

The option -simple-dist computes additional ecological distances that are fast to compute (Chord, Hellinger, Kulczynski...).

The option -complex-dist computes other ecological distances that can take much longer to compute (Jensen-Shannon, Canberra, Whittaker...).

Matrix names follow this template:

mat_[abundance|presenceAbsence]_[distanceName].csv.gz

The distance matrices whose names contain 'simka' are distances introduced by the compareads method. These distances have the advantage of existing in both symmetrical and asymmetrical versions.
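
The matrices are plain ';'-separated CSV files, gzip-compressed, so they can be loaded in any environment. A minimal sketch, assuming pandas is installed (the results path and dataset IDs below are placeholders):

# Load a Simka distance matrix; pandas decompresses the .gz transparently.
import pandas as pd

mat = pd.read_csv("results/mat_abundance_braycurtis.csv.gz", sep=";", index_col=0)
print(mat.loc["ID1", "ID2"])  # distance between datasets ID1 and ID2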

Visualize simka results

Simka results can be visualized through heatmaps, hierarchical clustering and PCA (MDS or PCoA to be exact).

Requirements: R, gplots package (only for heatmap)

Use the script run-visualization.py (located in "scripts/visualization" folder).

Example:

python run-visualization.py -in simka_results_dir -out output_figures_dir -pca -heatmap -tree

where simka_results_dir is the folder containing the distance matrices of Simka (the -out directory)

Figures can be annotated by providing a metadata table in standard CSV format:

DATASET_ID;VARIABLE_NAME_1;VARIABLE_NAME_2
A;1;aquatic
B;1;human
C;2;human
D;2;soil
E;3;soil

An example of this table is given at ./example/dataset_metadata.csv

Dataset IDs in the metadata table must match the dataset IDs in the simka distance matrices

Add the following options to activate annotations:

-metadata-in: filename of the metadata table
-metadata-variable: the name of the variable (column) that you want to display in the figures, for instance VARIABLE_NAME_1 in the example above (see the example command below)
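
For example, annotating a heatmap with the first metadata variable (command assembled from the options documented in this section):

python run-visualization.py -in simka_results_dir -out output_figures_dir -heatmap -metadata-in example/dataset_metadata.csv -metadata-variable VARIABLE_NAME_1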

Visualization example commands are given when running simka example (./example/simple_test.sh).

Usage for simka

To see simka command-line help:

./bin/simka

Simka command examples

Run the toy example:

./bin/simka -in example/simka_input.txt -out results -out-tmp temp_output

Compute all the distances that Simka can provide (Bray-Curtis, Jensen-Shannon…):

./bin/simka … -simple-dist -complex-dist

Change the k-mer size:

./bin/simka … -kmer-size 31

Filter out k-mers seen only once (potentially erroneous) and very high-abundance k-mers (potential contaminants):

./bin/simka … -abundance-min 2 -abundance-max 200

Filter on the sequences of the reads and k-mers:

Minimum read size of 90; discard low-complexity reads and k-mers (Shannon index < 1.5):

./bin/simka … -min-read-size 90 -read-shannon-index 1.5 -kmer-shannon-index 1.5

Consider only a subset of the reads of each input dataset (for datasets with a non-uniform number of reads per sample):

Consider all the reads of each sample (default):

./bin/simka … -max-reads -1

Let Simka automatically compute the maximum number of reads per sample (normalization):

./bin/simka … -max-reads 0

Use only the first 1000 reads of each sample:

./bin/simka … -max-reads 1000

Allowing more memory and cores improves the execution time:

./bin/simka … -max-memory 20000 -nb-cores 8

Computer cluster options

Simka can be run on a computer cluster equipped with a job scheduling system such as SGE. Given a job file template and a submission command, Simka takes care of creating and synchronizing the jobs until the end of the execution.

You must provide the filenames of two job templates, one for counting and one for merging (-count-file and -merge-file).

Example job templates are provided in the folder 'example/potara_job'.

You must also provide a submission command for each job type (-count-cmd and -merge-cmd).

Example for SGE:

-count-cmd 'qsub -pe make 8' -merge-cmd qsub

The options -max-count and -max-merge control the maximum number of simultaneous jobs. Set them if your system restricts the number of concurrent jobs.

Command example:

./bin/simka … -count-file example/potara_job/sge/job_count -merge-file example/potara_job/sge/job_merge \
-count-cmd 'qsub -pe make 34' -merge-cmd qsub \
-max-count 6 -max-merge 18 -nb-cores 200 -max-memory 500000

Simka will run at most 6 simultaneous counting jobs, each using 200/6 cores and 500000/6 MB of memory, and at most 18 simultaneous merging jobs. A merging job cannot run on more than 1 core and uses very little memory. By default, Simka uses -nb-cores/2 simultaneous counting jobs and -nb-cores simultaneous merging jobs.

Possible issues with Simka

Too many open files

Simka is a disk-based method. Depending on the chosen options (-nb-cores, -max-memory), Simka may require a large number of open files.

You can fix this issue in two ways:

  • increase the maximum open-file limit imposed by your system: ulimit -n maxFiles
  • reduce the number of files opened by Simka with the options -max-count and -max-merge (a quick way to check the current limits is sketched below)
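
As a quick check, the current limits can be read from Python with the standard library (a minimal sketch, Unix only; equivalent to ulimit -n):

# Print the soft and hard limits on open file descriptors.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit: %d, hard limit: %d" % (soft, hard))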

simka's People

Contributors

cdeltel, clemaitre, gatouresearch, genscale-admin, pgdurand, pierrepeterlongo, rchikhi


simka's Issues

Simka k-mers frequencies table

Hello :)
Thanks for maintaining Simka!
Is it possible to obtain the k-mer count table for each sample analyzed with Simka, to then use it in R to calculate alpha diversity metrics?
Thanks,
L

Several plist problems: It installs the third party executable 'h5cc', etc.

  1. It reinstalls h5cc that is installed by hdf5.
  2. It installs an unconventional file lib/libhdf5.settings that is most likely misplaced or not needed.
  3. It installs directories that mimic the build directory
  4. Headers are installed into my build directory: /usr/ports/biology/simka/work/.build/ext/gatb-core/include/Release/hdf5/H5ACpublic.h

For example, it installs /usr/ports/biology/simka/work/.build/ext/gatb-core/include/Release/hdf5, a directory where I built simka.

run does not finish

Hi,

I'm running simka on around 1000 samples with a total of 45 billion reads.

simka -in simka_input.txt -out results_simka -out-tmp temp_output -simple-dist -max-count 6 -max-merge 18 -nb-cores 112 -max-memory 100000

In the first two days it creates the following folders:

drwxr-xr-t   2 28039 Jun 17 14:32 input
drwxr-xr-t   2     0 Jun 17 14:32 merge_synchro
drwxr-xr-t   2     0 Jun 17 14:32 stats
drwxr-xr-t   2     0 Jun 17 14:32 job_count
drwxr-xr-t   2     0 Jun 17 14:32 job_merge
-rw-r--r--   1 10989 Jun 17 14:32 datasetIds
-rw-r--r--   1 46824 Jun 17 14:38 config.h5
drwxr-xr-t 344  8782 Jun 17 14:38 solid
drwxr-xr-t   2 38016 Jun 19 06:13 log
drwxr-xr-t   7   140 Jun 19 06:15 temp
drwxr-xr-t   2 31850 Jun 19 06:15 kmercount_per_partition
drwxr-xr-t   2 30854 Jun 19 06:15 count_synchro

Since June 19th nothing has happened, but the job is still running. Is this normal? Should I keep waiting?

simple test stuck

My test has been stuck for an hour at:

[Merging datasets ] 81.8 % elapsed: 0 min 1 sec remaining: 0 min 0 sec cpu: 0.0 % mem: [ 11, 11, 12] MB

I tried with -max-merge 4, but now it is stuck at another spot:
[Merging datasets ] 86.4 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: 0.0 % mem: [ 10, 10, 10] MB 21 already merged (remove file /home/bio/Desktop/simka/example/simka_temp_output/simka_output_temp//merge_synchro/21.ok to merge again)

Inches

Dear developers,

I find it really strange that the figure sizes are set in inches. The SI system uses meters; even NASA does science in meters.
Could we at least have an option to set them in cm?

Job does not end

Sometimes, especially when we have a lot of input files, the job stops at the merge stage (v1.3.2 and v1.4.0).

Test regression on Arm64 (Debian Med)

Hello,

Simka has been packaged [1] by the Debian Med Team. However, there is a regression on the Arm64 architecture only, which is blocking the unstable-to-testing migration. I'd rather sort this out before the next Debian release freeze.

Here is the log. The autopkgtest script is here (this is causing a regression and is what gets executed). Any ideas?

Thanks!

simka crashes with empty files

Hi,

Simka crashes with a segfault when using an empty file.

$ touch sample.fastq
$ echo 'sample: sample.fastq' > simka_in.txt
$ /usr/local/bin/simka -in simka_in.txt -out-tmp /tmp

Creating input
        Nb input datasets: 1

Reads per sample used: all


Maximum ressources used by Simka:
         - 1 simultaneous processes for counting the kmers (per job: 16 cores, 5000 MB memory)
         - 16 simultaneous processes for merging the kmer counts (per job: 1 cores, memory undefined)

Segmentation fault

An explicit error message would be welcome.

Thanks,
Florian

Rscripts fail to generate heatmap images

Hi
I have made a simple comparison between two fasta files. Simka performed well; here is one of the matrix files:

;id1;id2
id1;0.000000;0.991121
id2;0.991121;0.000000

But the scripts fail to create the heatmaps:
python create_heatmaps.py ../build/bin/test.txt/

Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
entrée de dendrogramme incorrecte [invalid dendrogram input]
Calls: plot -> plot.hclust ->
Exécution arrêtée [execution halted]

Second question:
Concerning the results, is the distance a factor or a percentage?

Thanks for the help.

Issues during linking stage

Hello,

I am packaging simka as a Debian package [1].

Cloning and compiling directly from source completes successfully. However, per Debian policy, gatb-core is already available as a Debian package in the repository, and a packaged library takes precedence over the gatb-core built with simka. As a result, I have made a patch [2] that adapts cmake to use the system gatb-core instead of the one in thirdparty. This seemingly brings up issues during the linking stage with SimkaAlgorithm.cpp [3].

I would be grateful if this patch could be looked at, or if I could be pointed in the right direction. I am unsure whether gatb-core, or some other cmake file in gatb-core, injects something into simka that makes the bundled build succeed.

Thanks,
Shayan Doust

convert output matrix to triangular format in R

Greetings, I enjoy using simka but I have a question regarding manipulating the output files.
Specifically, is there a way to convert 'mat_abundance_braycurtis.csv' into a lower-triangular matrix?
I'm really interested in working with a triangular matrix, much the same as that produced by vegan.
For example:

library(vegan)
mat <- matrix(1:9, 3, 3)
mat.dis <- vegdist(mat)
mat.dis
           1          2
2 0.11111111           
3 0.20000000 0.09090909
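
A minimal Python sketch (offered as an alternative to an R answer; it assumes pandas is installed and the matrix has been decompressed from its .csv.gz) that prints a Simka matrix in the lower-triangular style shown above:

# Print only the lower triangle of a Simka distance matrix,
# mimicking the layout of vegan's vegdist() output.
import pandas as pd

mat = pd.read_csv("mat_abundance_braycurtis.csv", sep=";", index_col=0)
ids = list(mat.index)
for i in range(1, len(ids)):
    print(ids[i], " ".join("%.8f" % mat.iloc[i, j] for j in range(i)))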

Fails to run

lala
[0.0%] Computing k-mer spectrums [Time: 0:00:00]A job failed (simka_test_tmp/simka_database/kmer_spectrums/A/), exiting simka

Cannot compile source archive from release page

Because the release source archive is missing the .git folder, the commands git submodule init and git submodule update from the INSTALL file each fail with the following error:

fatal: Not a git repository (or any of the parent directories): .git

SimkaMin output file symmetry

Dear SimkaMin Dev,

I recently stumbled upon your SimkaMin tool and tried to use it to compare my 4000 datasets against each other to get information on the similarity of these samples.

I found something odd in the output matrices: they don't seem to be symmetric. While the upper triangle contains mostly values between 0.0 and 1, the lower triangle contains mostly, but not exclusively, zeros. I would understand if the lower triangle were entirely empty, but a non-symmetric output is strange.

In fact, there always seems to be a subpart that is symmetric, but mostly it is not.

I attached a screenshot of parts of the matrix.

Do you know what to do with this information? Should I only use the column-based distances?


Best and thanks,
Hans

simple test stuck

Hi,
The simple test has been stuck for some hours and all files in the temporary folder merge_synchro/ are empty. Nothing has changed in my folders for 5 hours.
Here is the error file:
[Counting datasets ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: -1.0 % mem: [ 12, 12, 12] MB
[Counting datasets ] 20 % elapsed: 0 min 0 sec remaining: 0 min 1 sec cpu: 83.3 % mem: [ 13, 13, 13] MB
[...many repetitions of the same progress lines omitted...]
[Counting datasets ] 100 % elapsed: 0 min 1 sec remaining: 0 min 0 sec cpu: 16.0 % mem: [ 13, 13, 13] MB
[Merging datasets ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: -1.0 % mem: [ 13, 13, 13] MB
[Merging datasets ] 2.08 % elapsed: 0 min 0 sec remaining: 0 min 10 sec cpu: 19.0 % mem: [ 13, 13, 13] MB
[...progress continues in small increments...]
[Merging datasets ] 95.8 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: 21.7 % mem: [ 13, 13, 13] MB

and the log file:
Creating input
Nb input datasets: 5

Reads per sample used: all

Maximum ressources used by Simka:
- 5 simultaneous processes for counting the kmers (per job: 9 cores, 1000 MB memory)
- 48 simultaneous processes for merging the kmer counts (per job: 1 cores, memory undefined)

Nb partitions: 48 partitions

Counting k-mers... (log files are ./simka_temp_output/simka_output_temp//log/count_*)

Kmer repartition
0: 270
1: 294
2: 300
3: 323
4: 344
5: 324
6: 275
7: 239
8: 296
9: 298
10: 297
11: 304
12: 278
13: 312
14: 311
15: 320
16: 362
17: 320
18: 303
19: 340
20: 293
21: 269
22: 358
23: 293
24: 291
25: 274
26: 344
27: 305
28: 323
29: 308
30: 302
31: 277
32: 300
33: 309
34: 338
35: 278
36: 310
37: 300
38: 284
39: 330
40: 303
41: 280
42: 317
43: 324
44: 286
45: 388
46: 294
47: 282

Merging k-mer counts and computing distances... (log files are /simka_temp_output/simka_output_temp//log/merge_*)

How long should this test take?

Thank you very much in advance,
Best regards

test scripts hang with high cores count

Greetings,

While stabilizing the upcoming Debian 11, the CI team noted that the simka test scripts hang under certain circumstances. Further investigation on my end seemed to reveal that the test hangs past 9 cores, so as a workaround we are capping the test suite at 8 cores for the moment. You can refer to Debian bug #986256 for more details.

Do you think this would be an issue within simka, or more something intrinsic to the test data topology?

Kind Regards,
Étienne.

Paired reads input file outputting duplicate rows

Hi, thanks for this useful tool! I'm running into an error with a paired-reads input file, organized as:

1_1b28d4: 1_1b28d4-t_1.fq.gz ; 1_1b28d4-t_2.fq.gz
1_89e808: 1_89e808-t_1.fq.gz ; 1_89e808-t_2.fq.gz
...

I would expect for my output matrices to have one row for each sample, like:

1_1b28d4 1_89e808
1_1b28d4 0 0.774246
1_89e808 0.774246 0

However, my table instead has duplicate rows, one for each of the paired ends, which cannot be distinguished:

1_1b28d4 1_89e808
1_1b28d4 0 0.774246
1_89e808 0.774246 0
1_1b28d4 0 0.774246
1_89e808 0.774246 0

Is there a way to work around this?

Can't open object can't open my reads

Hi there,
I built an input file of my reads, but when I ran simka it failed to read them with the error below. The example tested successfully. Could someone show me where the problem is? Thanks very much!

Here is the error:

Creating input

Nb input datasets: 1

HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
  #000: /scratch/fwang/simka-v1.5.3-Source/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5A.c line 425 in H5Aopen(): unable to load attribute info from object header for attribute: 'version'
    major: Attribute
    minor: Can't open object
  #001: /scratch/fwang/simka-v1.5.3-Source/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5Aint.c line 433 in H5A__open(): unable to load attribute info from object header for attribute: 'version'
    major: Attribute
    minor: Can't open object
  #002: /scratch/fwang/simka-v1.5.3-Source/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5Oattribute.c line 515 in H5O__attr_open_by_name(): can't locate attribute: 'version'
    major: Attribute
    minor: Object not found
HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
  #000: /scratch/fwang/simka-v1.5.3-Source/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5A.c line 704 in H5Aget_space(): not an attribute
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
  #000: /scratch/fwang/simka-v1.5.3-Source/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5S.c line 1013 in H5Sget_simple_extent_dims(): not a dataspace
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
  #000: /scratch/fwang/simka-v1.5.3-Source/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5A.c line 662 in H5Aread(): not an attribute
    major: Invalid arguments to routine
    minor: Inappropriate type

ERROR: Can't open dataset: ID1

Here are the IDs of my reads as input:

ID1: 2018031910_paired_1.fasta ; 2018031910_paired_2.fasta

ID2: 201803193_paired_1.fasta ; 201803193_paired_2.fasta

ID3: 20180319_paired_1.fasta ; 20180319_paired_2.fasta

ID4: 20180403_paired_1.fasta ; 20180403_paired_2.fasta

ID5: 20180405_paired_1.fasta ; 20180405_paired_2.fasta

ID6: 20180410_paired_1.fasta ; 20180410_paired_2.fasta

ID7: 2018041210_paired_1.fasta ; 2018041210_paired_2.fasta

ID8: 201804123_paired_1.fasta ; 201804123_paired_2.fasta

ID9: 20180412_paired_1.fasta ; 20180412_paired_2.fasta

ID10: 2018041710_paired_1.fasta ; 2018041710_paired_2.fasta

ID11: 201804173_paired_1.fasta ; 201804173_paired_2.fasta

ID12: 20180417_paired_1.fasta ; 20180417_paired_2.fasta

ID13: 20180419_paired_1.fasta ; 20180419_paired_2.fasta

ID14: 20180424_paired_1.fasta ; 20180424_paired_2.fasta

ID15: 2018042610_paired_1.fasta ; 2018042610_paired_2.fasta

ID16: 201804263_paired_1.fasta ; 201804263_paired_2.fasta

ID17: 20180426_paired_1.fasta ; 20180426_paired_2.fasta

ID18: 20180502_paired_1.fasta ; 20180502_paired_2.fasta

ID19: 20180503_paired_1.fasta ; 20180503_paired_2.fasta

ID20: 2018050810_paired_1.fasta ; 2018050810_paired_2.fasta

ID21: 201805083_paired_1.fasta ; 201805083_paired_2.fasta

ID22: 20180508_paired_1.fasta ; 20180508_paired_2.fasta

ID23: 2018051110_paired_1.fasta ; 2018051110_paired_2.fasta

ID24: 201805113_paired_1.fasta ; 201805113_paired_2.fasta

ID25: 20180511_paired_1.fasta ; 20180511_paired_2.fasta

ID26: 20180515_paired_1.fasta ; 20180515_paired_2.fasta

ID27: 20180517_paired_1.fasta ; 20180517_paired_2.fasta

ID28: 2018052210_paired_1.fasta ; 2018052210_paired_2.fasta

ID29: 201805223_paired_1.fasta ; 201805223_paired_2.fasta

ID30: 20180522_paired_1.fasta ; 20180522_paired_2.fasta

ID31: 20180524_paired_1.fasta ; 20180524_paired_2.fasta

ID32: 2018052910_paired_1.fasta ; 2018052910_paired_2.fasta

ID33: 201805293_paired_1.fasta ; 201805293_paired_2.fasta

ID34: 20180529_paired_1.fasta ; 20180529_paired_2.fasta

Add option to trim all reads to a given length

Hi,

I would like to run Simka on multiple samples with the -max-reads option to deal with varying sequencing depths.
However, the samples also have varying read lengths.
I guess this may slightly bias the results, as longer reads increase the total number of k-mers.
Would it be possible to add an option to trim all reads to a given length?

Florian

Bioconda package

I'd like to congratulate the developers on a great metagenomics software tool. I'd recommend adding simka to the Bioconda channel as a package, as this would enable easier installation and adoption by the community.

Best wishes,
Muslih.

Reuse simka merge data on subset of samples

Hello, thanks for your work on this tool! I wanted to ask whether it is okay to reuse previous simka merge results on a subset of the data. E.g., I have already run simka on a large set of samples; if I rerun simka using the same temporary directory, but only pass a subset of the original files as the files of interest, will this give the correct distance metrics for this subset of samples?

multifasta file as input

Hi,

I have to compare a multifasta file (200000 sequences) with a chromosomal region. I have already done this with the kmer-db tool (https://doi.org/10.1093/bioinformatics/bty610) and I need to compare the results of kmer-db with another tool such as simka.
But I can't find the correct command to do this.
Kmer-db computes a list of distances between each sequence of the multifasta file and the chromosomal region.

this is my input file:

simka_input.txt:
A: multifasta.fasta
B: chr1_region.fasta

the command line i used:
simka -in ./simka_input.txt -out ./simka_results/ -out-tmp ./simka_temp_output -max-memory 128000 -nb-cores 24

in the simka_results directory: zcat mat_abundance_jaccard.csv.gz
;A;B
A;0.000000;0.999993
B;0.999993;0.000000

I get only 2 values, whereas I have 200000 sequences in my input file. It seems that simka concatenates all the sequences of the multifasta file and then compares the whole file with the other one. How can I avoid that? (A possible workaround is sketched below.)

Thank you
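
Simka treats each file listed in the input file as one dataset, so the whole multifasta is indeed compared as a single sample. A possible workaround, shown as a minimal hypothetical sketch below (note that it creates one small FASTA file per sequence, i.e. 200000 files here, which may hit filesystem limits), is to split the multifasta and list each resulting file as its own dataset:

# split_multifasta.py - hypothetical workaround: one Simka dataset per sequence.
import os

os.makedirs("split", exist_ok=True)
with open("multifasta.fasta") as f, open("simka_input.txt", "w") as out:
    record = None
    for line in f:
        if line.startswith(">"):
            if record:
                record.close()
            name = line[1:].strip().split()[0]
            path = os.path.join("split", name + ".fasta")
            record = open(path, "w")
            out.write("%s: %s\n" % (name, path))  # one dataset per sequence
        if record:
            record.write(line)
    if record:
        record.close()
    out.write("chr1: chr1_region.fasta\n")  # the region to compare against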

Issue with test_simkaMin.py

Hello there,

I'm running into an issue when executing test_simkaMin.py after the package builds. I have tested this both with the Debian package I am building and with a fresh, untouched copy of simka from this GitHub repository. I am getting the following failure, which prevents the test from reaching a 100% success rate:

...
python  ../../simkaMin/simkaMin.py -in  ../../example/simka_input.txt  -out __results__/k21_filter_0-1000_n1 -nb-cores 1 -max-memory 100  -kmer-size 21 -nb-kmers 1000 -bin ../../build/bin/simkaMinCore  -max-reads 0 -filter 
	- TEST ERROR:    mat_presenceAbsence_jaccard.csv
res
;A;B;C;D;E
A;0.000000;0.780808;0.940741;0.780808;0.446000
B;0.000000;0.000000;0.733333;0.000000;0.873737
C;0.000000;0.000000;0.000000;0.733333;0.970370
D;0.000000;0.000000;0.000000;0.000000;0.873737
E;0.000000;0.000000;0.000000;0.000000;0.000000

truth
;A;B;C;D;E
A;0.000000;0.783000;0.984000;0.783000;0.446000
B;0.000000;0.000000;0.918000;0.000000;0.875000
C;0.000000;0.000000;0.000000;0.918000;0.992000
D;0.000000;0.000000;0.000000;0.000000;0.875000
E;0.000000;0.000000;0.000000;0.000000;0.000000

This is currently blocking the completion of the Debian package [1]. If there are any recommendations or remedies for how this can be patched or fixed upstream, I'd greatly appreciate it.

Many thanks,
Shayan Doust

-max-reads 0

Hello! I hope you are well. simka looks like a great tool!

How are the samples normalized with the -max-reads 0 flag? I did not see a description of this in the paper.

Have you considered normalization options such as suggested here:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531 ?

Or transformation options suggested here:
https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full ?

Since the default is not to normalize, is the intended workflow to subset all samples to the same number of reads prior to running simka?

Have you tested how much the size discrepancies actually affect the various distance metrics?

Thanks for the clarification.

best,
Roth

.fastq.gz support

Hi, I ran into this error with the following input (built from the latest version):

WP1310: /condo/ieg/qiqi/Haibei_metaG/WP1310_paired_1.fastq.gz ; /condo/ieg/qiqi/Haibei_metaG/WP1310_paired_2.fastq.gz

the error is: ERROR: Can't open dataset: WP1310

Any idea why?

Testing ran successfully.

It took me half a day to find out why. Is support for fastq.gz not ready?

Thanks,

Jianshu
