Coder Social home page Coder Social logo

felipelouza / egap Goto Github PK

View Code? Open in Web Editor NEW
6.0 3.0 2.0 6.91 MB

External memory BWT and LCP computation for sequence collections with applications [WABI'18, AMB 2019]

Home Page: https://doi.org/10.1186/s13015-019-0140-0

Makefile 0.61% C 61.58% Python 5.21% C++ 32.60%
burrows-wheeler-transform lcp-array string-collections

egap's Introduction

eGap: BWT and LCP computation for sequence collections in external memory

This software is an implementation of the eGap algorithm described in External memory BWT and LCP computation for sequence collections with applications by L. Egidi, F. A. Louza, G. Manzini, and G. P. Telles, Algorithms for Molecular Biology (2019).

Copyright 2017-2019 by the authors.

Prerequisites

  • A relatively recent version of gcc
  • Python 3.X

Install

git clone https://github.com/felipelouza/egap.git
cd egap

Compile

make 

Quick test

./eGap --lcp -m 4096 dataset/reads.fastq

This will produce the file dataset/reads.fastq.bwt and dataset/reads.fastq.2.lcp containing the BWT and LCP array (the latter using 2 bytes per entry). The computation will use 4GB (4096 MB) of RAM

Description

Tool to build the BWT and optionally the LCP, DA and SA array for a collection of sequences in external memory. There are two different usages depending on whether you already have the BWT of the input files:

  • If you do have the BWTs use option -b: you must specify the file names on the command line and use the option -o to specify an output basename. For example: eGap -b --lcp -o merge -m 4096 file1.bwt file2.bwt will produce the output files: merge.bwt, merge.2.lcp, merge.da. Globbing is accepted: multiple file names can be denoted for example as file?.bwt

  • If you don't have the BWTs then your input must consists of a single file with extension .fasta/.fa (one input document per sequence) .fastq (one input document per sequence) .txt (one input document per line) and it is not mandatory to specify the output basename. For example: eGap --lcp -m 4096 file.fasta will produce the output files: file.fasta.bwt, files.fasta.2.lcp

All input and output files are uncompressed. The value 0 is used as the eof symbol in the output BWT.

Main command line options

-m, --mem specify memory assigned to the algorithm in MB. Default is 95% of the available RAM

-o, --out
specify basename for output and temporary files

-b, --bwt
inputs are bwt files (requires -o)

-l, --lcp
compute LCP Array

--rev
compute data structures for the reversed string

--lbytes
number of bytes for each LCP entry (def. 2)

-v verbose output in the log file

-h, --help show usage

Suffix array and document array computation

Use the options:

-d, --da
compute Document Array

-c, --cda
compute colored Document Array (one color for each file)

-s, --sa
compute Suffix Array

--sl
output suffixes' lengths (SL)

--dbytes
number of bytes for each DA entry (def. 4)

--cbytes
number of bytes for each colored DA entry (def. 4)

--sbytes
number of bytes for each SA entry (def. 4)

--slbytes
number of bytes for each SL entry (def. 4)

Document array requirements

If you want to simultaneaouly merge BWT files and compute the Document Array for each input BWT you must provide, in addition to the DA, also a .docs file for containing the number of documents in the file in 64 bits little endian format. The .docs file is automatically computed when the option -d, --da is used.

Example

./eGap -m 4096 --em dataset/file1.fastq -o file1 --da
./eGap -m 4096 --em dataset/file2.fastq -o file2 --da

./eGap -m 4096 --em --bwt -o merge file1.bwt file2.bwt --da

The first two commands compute file1.bwt, file1.da, file1.docs and file2.bwt, file2.da, file2.docs which are used by the third command to compute merge.bwt, merge.da, and merge.docs

Applications

Truncated LCP values

The running time of eGap can be decreased if, instead of the true LCP values, the user settles for computing the LCP values up to a certain threshold k. Using the option --trlcp k, as an altenative to --lcp, the algorithm computes an LCP array in which all values greater than k are replaced by the value k.

--trlcp k truncate LCP values to the value k

Bruijn graph info (BOSS)

Another option offered by eGap, is to compute the info required for the construction of the succinct (BOSS) representation of the de Bruijn graph associated to the input sequences. Using the option --deB k eGap compute two bitfiles with extension .lcpbit0 and .lcpbit1 which, together with the BWT, can be used to compute the BOSS representation of the de Bruijn graph as described in the Application section of the AMB paper.

--deB K compute info for building the order-K deBruijn graph

Notice: if the options --trlcp or --deB are used, suffixes are sorted only up the first k symbols so the resulting BWT will not be the standard one.

Quality score (QS) sequences

eGap can output the Quality Scores (QS) of a FASTQ file permuted according to the BWT of the DNA bases (allowed only for .fastq files).

--qs QS permuted according to the BWT (output file .bwt.qs)

For example: eGap -m 4096 file.fastq --qs will produce the output files: file.fastq.bwt, file.fastq.bwt.qs

Datasets

We have compared eGap with the available BWT/LCP construction tools on the following collections

Name SizeGBs Num Docs Max DocLen Ave DocLen Max LCP Ave LCP Download Link
Shortreads 8.0 85,899,345 100 100 99 27.90 .tar.gz
Longreads 8.0 28,633,115 300 300 299 90.28 .tar.gz
Pacbio.1000 8.0 8,589,934 1,000 1,000 876 18.05 .tar.gz
Pacbio 8.0 942,248 71,561 9,116 3,084 18.32 .tar.gz

We have also used versions of the above collections shortened to 1GB. The shortened versions can be obtained by the above files using simple command line instructions. Check all md5sums after dowloading and extraction.

The results of our experiments are reported on the above AMB paper.

Thanks

Thanks to Pierre Peterlongo by helpful suggestions and debugging.

egap's People

Contributors

felipelouza avatar giovmanz avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

egap's Issues

Paired-End Reads

Does this tool support an input of paired end reads R{1,2}? Is it a planned feature?
If not, does an interleaved fastq file suffice?

Thanks.

Readme toy example failed

Hi there,

I tried today egap on a mac OSX 10.13.6, with 16GB Ram, using Apple LLVM version 10.0.0 (clang-1000.11.45.2).

Despite some warnings, the compilation went fine.
The README toy example led to this error

./eGap --lcp -m 4096 dataset/reads.fastq
Sending logging messages to file: dataset/reads.fastq.eGap.log
==== gSACAK
 Command: ./tools/gsacak -b -m 4096  dataset/reads.fastq 0
Error executing command line:
	./tools/gsacak -b -m 4096  dataset/reads.fastq 0
Check log file: dataset/reads.fastq.eGap.log


more dataset/reads.fastq.eGap.log 
>>> Begin computation
>>> eGap version v2.0
Python command line: ./eGap --lcp -m 4096 dataset/reads.fastq 
--- Phase 1 ---
malloc_count ### free(0x7fb851500200) has no sentinel !!! memory corruption?
gsacak(25142,0x7fff9c76a380) malloc: *** error for object 0x7fb851500200: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
##
RAM = 4.00 GB
max(chunk) = 858993459 symbols

Any ideas?
Pierre

Add pre/post strings to unbwt

in unbwt there should be two optional command line parameters to specify a string to be output before and after each reconstructed sequence. Default: pre="" post = "\n"

Make POS_SIZE a command line parameter for gap

POS_SIZE is the number of bytes used to represent a position inside the final BWT. It is currently set to 5 inside config.h. If we make it a command line parameter we are ready for datasets larger than 2^40. Note that eGap knows in advance the size of the final BWT so it can pass the right value to both gap and mergelcp

.fasta size requirement?

Hello, when input .fasta file size smaller than 2G (I tested on 2.5M, 25M, 242M, and 1.2G), eGap works well. Example of run 1.2G test.fasta:

/egap/eGap -m 4096 test.fasta
Sending logging messages to file: test.fasta.eGap.log
==== gSACAK
Command: egap/tools/gsacak -b -m 4096 test.fasta 0
Elapsed time: 810.9151
==== gap (internal memory)
Command: egap/gap2 -g256 -vaT test.fasta
Elapsed time: 0.0030
==== Done
Total construction time: 810.9186 usec/byte: 0.6478 (outsize: 1251720450)

While when test.fasta is 2.2G, I failed.

egap/eGap -m 4096 test2.fasta
Sending logging messages to file: test2.fasta.eGap.log
==== gSACAK
Command: egap/tools/gsacak -b -m 4096 test2.fasta 0
Error executing command line:
egap/tools/gsacak -b -m 4096 test2.fasta 0
Check log file: test2.fasta.eGap.log

I try to increase -m, but still failed.
egap/eGap -m 8096 test2.fasta
Sending logging messages to file: test2.fasta.eGap.log
==== gSACAK
Command: egap/tools/gsacak -b -m 8096 test2.fasta 0
Error executing command line:
egap/tools/gsacak -b -m 8096 test2.fasta 0
Check log file: test2.fasta.eGap.log

The content in test2.fasta.eGap.log:

Begin computation
Python command line: egap/eGap -m 8096 test2.fasta
--- Phase 1 ---

And I continue to run "egap/tools/gsacak -b -m 8096 test2.fasta 0", I meet a Segmentation fault.
egap/tools/gsacak -b -m 8096 test2.fasta 0

RAM = 7.91 GB
max(chunk) = 1697854259 symbols
K = 1
N = 18446744071741255078
CHUNKS = 1
sizeof(int_t) = 4 bytes

Segmentation fault

Could do tell me what have I done wrong? Thanks.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.