ESSCompress v3.1

A tool to compress a set of k-mers represented in FASTA/FASTQ/KFF file(s).

Installation

There are 2 ways to install ESS-Compress: either from source or from pre-compiled binaries.

1. Installation from source

Pre-requisites

  • Linux operating system (64 bit)

  • Git

  • GCC >= 4.8 or a C++11 capable compiler

  • CMake 3.1+
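Before running the install script, it can help to confirm that the listed tools are on your PATH. The following is a minimal sketch, not part of ESS-Compress: check_prereqs is a hypothetical helper, and it only tests that each command exists, not that its version meets the minimum above. It also assumes GCC is the compiler in use.

```shell
# Hypothetical helper (not shipped with ESS-Compress): print the names of
# the given tools that cannot be found on PATH, separated by spaces.
check_prereqs() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Strip the leading space; prints an empty line when nothing is missing.
    echo "${missing# }"
}

# Assumption: gcc is the C++11-capable compiler you intend to use.
check_prereqs git gcc cmake
```

An empty line means all three commands were found; version requirements still need to be checked by hand, for example with gcc --version and cmake --version.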

Steps

Download source and install:

git clone https://github.com/medvedevgroup/ESSCompress
cd ESSCompress
./INSTALL

Upon successful execution of this script, you will see Linux binaries for kff-tools (essAuxKffTools), Blight (essAuxBlight), BCALM (essAuxBcalm), DSK (essAuxDsk and essAuxDsk2ascii) and MFCompress (essAuxMFCompressC and essAuxMFCompressD) in the aux folder, along with essAuxValidate, essAuxCompress, essAuxDecompress and getMaxLen.

2. Installation from pre-compiled binaries

Requirements

  • Linux operating system (64 bit)

Steps

  1. Download the latest Linux 64-bit binaries:
    wget https://github.com/medvedevgroup/ESSCompress/releases/download/v3.1/essCompress-v3.1-linux-64.tar.gz

  2. Extract the .tar.gz file and change into the extracted directory:
    tar xvzf essCompress-v3.1-linux-64.tar.gz
    cd essCompress-v3.1/

  3. You will see two executables in the directory named essCompress and essDecompress.

    • You can either refer to these two executables directly when compressing/decompressing (using the commands ./essCompress and ./essDecompress),

    • Or you can move or copy ALL the executables in essCompress-v3.1/bin to a directory that is already in your PATH. For instance, if /usr/bin is already in your PATH, run the command mv ess* /usr/bin to move all the ESS-Compress executables there. Alternatively, instead of moving or copying the executables, add the location of essCompress-v3.1/bin to your PATH.
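The PATH alternative mentioned above can be sketched as follows. This is only an illustration: ESS_BIN is a made-up variable name, and the snippet assumes your current directory is essCompress-v3.1 (as it is after step 2), so that the executables live in ./bin.

```shell
# Assumption: the current directory is essCompress-v3.1.
# ESS_BIN is a hypothetical variable name used only in this sketch.
ESS_BIN="$PWD/bin"

# Prepend the directory to PATH only if it is not already listed.
case ":$PATH:" in
    *":$ESS_BIN:"*) ;;
    *) PATH="$ESS_BIN:$PATH"; export PATH ;;
esac
```

The change lasts only for the current shell session; to make it permanent you would add an equivalent line, with an absolute path, to your shell startup file.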

Quick start with a step-by-step example

This example assumes that you are currently inside the base directory essCompress-v3.1 after you have completed installing the tool as per the instructions.

Let's say you have a small FASTA file of sequences, examples/smallExample.fa, and
cat examples/smallExample.fa returns

>
AAAAAAACCCCCCCCCC
>
CCCCCCCCCCA

We can compress it using k=11 as follows:

./bin/essCompress -k 11 -i examples/smallExample.fa

Now ls examples will show both the original input file and the compressed file in the same directory:

smallExample.fa
smallExample.fa.essc
...

smallExample.fa.essc is a compressed binary file generated by MFCompress, so it is not in a readable format.

To decompress into a readable format, you can run

./bin/essDecompress examples/smallExample.fa.essc   

You'll now see the decompressed file smallExample.fa.essd in the same directory.
cat examples/smallExample.fa.essd will return:

>
AAAAAAACCCCCCCCCCA

Notice that the decompressed FASTA file is not identical to the original file, but it contains the same k-mers as smallExample.fa. You can double-check this using the command
./bin/essAuxValidate 11 examples/smallExample.fa examples/smallExample.fa.essd
If the two files contain the same k-mers (i.e. 11-mers), you will see output like this:

### SUCCESS: The two files contain same k-mers! ###
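To see why the two files are equivalent, the same comparison can be sketched independently with standard Unix tools. This is not how essAuxValidate is implemented; canonical_kmers is a hypothetical helper that assumes each sequence sits on a single line and contains only A, C, G and T. It treats a k-mer and its reverse complement as equal by keeping the lexicographically smaller of the two.

```shell
# Hypothetical helper: canonical_kmers K FILE prints the sorted set of
# distinct canonical K-mers of a FASTA file (one sequence per line).
canonical_kmers() {
    grep -v '^>' "$2" | awk -v k="$1" '
        BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
        # Reverse complement of s; the extra parameters are local variables.
        function revcomp(s,    i, out) {
            out = ""
            for (i = length(s); i >= 1; i--) out = out comp[substr(s, i, 1)]
            return out
        }
        {
            s = toupper($0)
            for (i = 1; i + k - 1 <= length(s); i++) {
                kmer = substr(s, i, k)
                rev = revcomp(kmer)
                canon = (kmer < rev) ? kmer : rev
                print canon
            }
        }' | sort -u
}

# Recreate the two files from the example above in a temporary directory.
tmp=$(mktemp -d)
printf '>\nAAAAAAACCCCCCCCCC\n>\nCCCCCCCCCCA\n' > "$tmp/orig.fa"
printf '>\nAAAAAAACCCCCCCCCCA\n' > "$tmp/decomp.fa"

canonical_kmers 11 "$tmp/orig.fa" > "$tmp/a.txt"
canonical_kmers 11 "$tmp/decomp.fa" > "$tmp/b.txt"

if cmp -s "$tmp/a.txt" "$tmp/b.txt"; then
    result="same"
else
    result="different"
fi
echo "k-mer sets: $result"
```

Both files yield the same eight distinct 11-mers, so the sketch prints "k-mer sets: same". The single merged sequence is shorter than the two original sequences combined, which is the redundancy the tool removes.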

Usage details

essCompress: compression of a k-mer set

Syntax: ./essCompress [parameters] 

mandatory arguments:
-k [int]          k-mer size (must be >=4)
-i [input-file]   Path to input file. Input file can be either of these 3 formats:
                     1. a single fasta/fastq file (either gzipped or not)   
                     2. a single text file containing the list of multiple fasta/fastq files (one file per line)
                     3. a single .kff file. In this case, output is a .kff file after compressing in UST mode.

optional arguments:
-a [int]          Default=1. Sets a threshold X, such that k-mers that appear less than X times in the input dataset are filtered out. 
-o [output-dir]   Specify output directory
-t [int]          Default=1. Number of threads (used by bcalm, dsk and blight). 
-x [int]          Default=1. Bytes allocated for associated abundance data per k-mer in kff. For highest compression with kff, by default the program limits 1 byte per k-mer (max value 255).   
-f                Fast compression mode: uses less memory, but achieves smaller compression ratio.
-u                UST mode (output an SPSS, which does not contain any duplicate k-mers and the k-mers it contains are exactly the distinct k-mers in the input. A k-mer and its reverse complement are treated as equal.)   
-d                DEBUG mode. If debug mode is enabled, no intermediate files are removed.
-v                Enable verbose mode: print more useful information.
-c                Verify correctness: check that all the distinct k-mers in the input file appears exactly once in compressed file.
-h                Print this Help
-V                Print version number

Input for essCompress

Two important input parameters are

  • input [-i]
  • k-mer size [-k]

If the input is a .kff file, the -k parameter is ignored.

File input format can be
1. a single fasta or fastq file (either gzipped or not)
2. a single text file containing the list of multiple fasta/fastq files (one file per line)
3. a single .kff file. In this case, output is a .kff file after compressing in UST mode.

To pass a single FASTA file as input and compress: ./bin/essCompress -i examples/11mers.fa -k 11

To pass a single KFF file as input and compress: ./bin/essCompress -i examples/kmc_k15.kff

To pass several files as input, generate the list of files (one file per line) as follows:

ls -1 examples/*.fa > list_reads   
./bin/essCompress -i list_reads -k 5
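Since the list file is just one path per line, a stale or mistyped entry is easy to miss. The sketch below checks a list before it is passed to -i; check_list is a hypothetical helper, not part of ESS-Compress, and the demo files are created only for illustration.

```shell
# Hypothetical helper: check_list FILE succeeds only if every non-empty
# line of FILE names an existing regular file.
check_list() {
    rc=0
    while IFS= read -r entry || [ -n "$entry" ]; do
        [ -z "$entry" ] && continue
        if [ ! -f "$entry" ]; then
            echo "missing: $entry" >&2
            rc=1
        fi
    done < "$1"
    return "$rc"
}

# Demo with two throwaway FASTA files in a temporary directory.
demo=$(mktemp -d)
printf '>\nACGTACGT\n' > "$demo/a.fa"
printf '>\nTTTTACGT\n' > "$demo/b.fa"
ls -1 "$demo"/*.fa > "$demo/list_reads"

check_list "$demo/list_reads" && echo "list OK"
```

Paths in the list are resolved relative to the directory you run the check from, so run it from the same directory in which you will run essCompress.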

ESS-Compress uses BCALM 2 under the hood, which ignores paired-end information: all given reads contribute to k-mers in the graph (as long as those k-mers pass the abundance threshold).

Output for essCompress

If using fast mode or normal mode: the compressed output is in a file with the .essc extension.

If using UST mode without kff: the compressed output is in a file with the .fa.essd extension.

If compressing a kff file: the compressed output is in a file with the .compressed.kff extension.

essDecompress: decompression of .essc file

    Syntax: ./essDecompress [file_to_decompress]

Input: a .essc file generated by essCompress
Output: a fasta file with .essd extension, where all the distinct k-mers represented by the input .essc file appear exactly once. In other words, output is a spectrum-preserving string set.

Citation

If you use ESS-Compress in your research, please cite

  • Amatur Rahman, Rayan Chikhi and Paul Medvedev, Disk compression of k-mer sets, WABI 2020.


esscompress's Issues

How to change the number of cpu cores used?

Is it possible to change the number of threads used by essCompress? There does not appear to be a flag, and by default it appears to just use all cores.

Alternatively, is it possible to prevent essCompress from running bcalm2 internally, and instead give it the output of bcalm2 as input?
