refresh-bio / colord

A versatile compressor of third generation sequencing reads.

License: GNU General Public License v3.0

colord's Introduction

CoLoRd - Compressing long reads

Quick start

git clone --recurse-submodules https://github.com/refresh-bio/colord
cd colord && make
cd bin

INPUT=./../test

# default compression presets (lossy quality, memory priority)
./colord compress-ont ${INPUT}/M.bovis.fastq ont.default 		# Oxford Nanopore
./colord compress-pbhifi ${INPUT}/D.melanogaster.fastq hifi.default	# PacBio HiFi 
./colord compress-pbraw ${INPUT}/A.thaliana.fastq clr.default 		# PacBio CLR/subreads

# print ONT archive information and decompress
./colord info ont.default
./colord decompress ont.default ont.fastq

# compress HiFi reads preserving original quality levels
./colord compress-pbhifi -q org ${INPUT}/D.melanogaster.fastq hifi.lossless

# compress CLR reads with ratio priority using 48 threads
./colord compress-pbraw -p ratio -t 48 ${INPUT}/A.thaliana.fastq clr.ratio

# compress ONT reads w.r.t. reference genome (embed the reference in the archive)
./colord compress-ont -G ${INPUT}/M.bovis-reference.fna -s ${INPUT}/M.bovis.fastq ont.refbased

# decompress the reference-based archive
./colord decompress ont.refbased ont.refbased.fastq

Installation and configuration

CoLoRd comes with a set of precompiled binaries for Windows, Linux, and OS X. They can be found under the Releases tab. The software is also available on Bioconda:

conda install -c bioconda colord

For detailed instructions on how to set up Bioconda, please refer to the Bioconda manual. CoLoRd can also be built from the sources distributed as:

  • Visual Studio 2019 solution for Windows,
  • MAKE project (G++ 8.4 required) for Linux and macOS.

To install G++ under macOS, one can use the Homebrew package manager:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install gcc@10

Before running CoLoRd on macOS, the current limit of file descriptors should be increased:

ulimit -n 2048

Usage

Compression

colord <mode> [options] <input> <archive>

Modes:

  • compress-ont - compress Oxford Nanopore reads,
  • compress-pbhifi - compress PacBio HiFi reads,
  • compress-pbraw - compress PacBio CLR/subreads.

Positionals:

  • input - input FASTQ/FASTA path (gzipped or not),
  • output - archive path.

Options:

  • -h, --help - print help,
  • -k, --kmer-len - k-mer length (15-28, default: auto adjust),
  • -t, --threads - number of threads (default: 12)
  • -p, --priority - compression priority: memory, balanced, ratio (default: memory)
  • -q, --qual - quality compression mode:
    • org - original,
    • none - discard (Q0 for all bases),
    • avg - average over entire file,
    • 2-fix,4-fix,5-fix - 2/4/5 bins with fixed representatives,
    • 2-avg,4-avg,5-avg - 2/4/5 bins with averages as representatives; default value depends on the mode (4-avg for ont, 5-avg for pbhifi, none for pbraw),
  • -T, --qual-thresholds - quality thresholds:
    • single value for 2-fix/2-avg (default: 7),
    • three values for 4-fix/4-avg (default: 7 14 26),
    • four values for 5-fix/5-avg (default: 7 14 26 93),
    • not allowed for avg, org and none modes,
  • -D, --qual-values - bin representatives for decompression:
    • single value for none mode (default: 0),
    • two values for 2-fix mode (default: 1 13),
    • four values for 4-fix mode (default: 3 10 18 35),
    • five values for 5-fix mode (default: 3 10 18 35 93),
    • not allowed for avg, org, 2-avg, 4-avg and 5-avg modes,
  • -G, --reference-genome - optional reference genome path (multi-FASTA gzipped or not), it enables reference-based mode which provides better compression ratios,
  • -s, --store-reference - stores the reference genome in the archive, use only with -G flag,
  • -v, --verbose - verbose mode.
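
To make the interplay of -q, -T and -D more concrete, here is a minimal sketch of fixed-representative binning (the 2-fix/4-fix/5-fix idea). It is not taken from the CoLoRd sources: the function name BinQualities is hypothetical, Phred+33 encoding is assumed, and the boundary rule (a score equal to a threshold goes to the higher bin) is an assumption as well. The -avg modes, where representatives are computed from the data, are not covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not CoLoRd's implementation.
// qual            - quality string, assumed to be Phred+33 encoded
// thresholds      - ascending bin boundaries (the -T values), N-1 entries for N bins
// representatives - one quality value per bin (the -D values), N entries
// Assumption: score q belongs to bin i when thresholds[i-1] <= q < thresholds[i].
std::string BinQualities(const std::string& qual,
                         const std::vector<int>& thresholds,
                         const std::vector<int>& representatives) {
    assert(representatives.size() == thresholds.size() + 1);
    assert(std::is_sorted(thresholds.begin(), thresholds.end()));

    std::string out;
    out.reserve(qual.size());
    for (char c : qual) {
        const int q = c - 33;  // decode Phred+33
        // Number of thresholds that are <= q, i.e. the bin index.
        const auto bin = static_cast<std::size_t>(
            std::upper_bound(thresholds.begin(), thresholds.end(), q) - thresholds.begin());
        out.push_back(static_cast<char>(representatives[bin] + 33));  // re-encode
    }
    return out;
}
```

Calling BinQualities(qual, {7, 14, 26}, {3, 10, 18, 35}) applies the default 4-fix thresholds and representatives listed above, so every score is replaced by one of four values.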

Advanced options (default values may depend on the mode - please run colord --help <mode> to get the details):

  • -a, --anchor-len - anchor length (default: auto adjust),
  • -L, --Lowest-count - minimal k-mer count,
  • -H, --Highest-count - maximal k-mer count,
  • -f, --filter-modulo - k-mers for which hash(k-mer) mod f != 0 will be filtered out before graph building,
  • -c, --max-candidates - maximal number of reference reads considered as reference,
  • -e, --edit-script-mult - multiplier for the predicted cost of storing a read part as an edit script,
  • -r, --max-recurence-level - maximal level of recursion when considering alternative reference reads,
  • --min-to-alt - minimum length of encoding part to consider using alternative read,
  • --min-mmer-frac - if A is the set of m-mers in the read R being encoded, then the read is refused from encoding if |A| < min-mmer-frac * len(R),
  • --min-mmer-force-enc - if A is the set of m-mers in the read R being encoded, then the read is always accepted for encoding if |A| > min-mmer-force-enc * len(R),
  • --max-matches-mult - if the number of matches between the read R being encoded and a reference read is r, then the read is refused from encoding if r > max-matches-mult * len(R),
  • --fill-factor-filtered-kmers - fill factor of filtered k-mers hash table,
  • --fill-factor-kmers-to-reads - fill factor of k-mers to reads hash table,
  • --min-anchors - if the number of anchors common to the read being encoded and a reference candidate is lower than min-anchors, the candidate is refused,
  • -i, --identifier - header compression mode: main/none/org (default: org),
  • -R, --Ref-reads-mode - reference reads mode: all/sparse (default: sparse),
  • -g, --sparse-range - sparse mode range. The probability of reference read acceptance is 1 / pow(id/range_reads, exponent), where range_reads is determined based on the number of symbols, which in turn is determined by the number of trusted unique k-mers (estimated genome length) multiplied by the value of this parameter,
  • -x, --sparse-exponent - sparse mode exponent.
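
The -f rule from the list above (k-mers with hash(k-mer) mod f != 0 are dropped before graph building) can be sketched as follows. This only illustrates the sampling idea: the function name FilterKmers is hypothetical, and std::hash stands in for whatever hash function CoLoRd actually uses, which this document does not specify.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper, not CoLoRd's implementation.
// Keeps only the k-mers for which hash(k-mer) mod f == 0.
// Assumption: std::hash<std::string> is used purely as a stand-in hash.
std::vector<std::string> FilterKmers(const std::vector<std::string>& kmers,
                                     std::uint64_t f) {
    assert(f > 0);
    const std::hash<std::string> hasher;
    std::vector<std::string> kept;
    for (const auto& kmer : kmers) {
        if (hasher(kmer) % f == 0) {
            kept.push_back(kmer);
        }
    }
    return kept;
}
```

With f = 1 every k-mer passes; a larger f keeps a smaller sample (about 1/f of the k-mers if the hash values are evenly distributed), which lowers resource usage at the cost of fewer k-mers available for matching reads.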

Hints

While the number of CoLoRd parameters is large, in most cases the default values will work just fine. In terms of compression, there is always a trade-off between compression ratio and resource requirements (mainly memory and compute time). If the default behavior of CoLoRd is insufficient, the first step should be to change the compression priority mode (-p parameter). Each priority mode sets multiple other parameters that influence the compression ratio. The following priority modes are available (ordered by increasing compression efficiency and resource requirements):

  • memory
  • balanced
  • ratio

The memory priority mode is the default.

Quality scores have a high impact on compression. They are inherently hard to compress and, at the same time (as presented in the paper), their resolution can be safely reduced without affecting downstream analyses. For this reason, in each priority mode, the quality scores are compressed lossily. If it is required to keep the original quality scores, one should use -q org. Note that several other quality compression modes exist (see the paper).

Here are compression results for a large set of human reads NA12878 with a total size of 268,305,314,354 bytes.

Metric                                Lossy            Lossless
Compressed size, memory mode [B]      42,120,596,486   105,807,350,384
Compressed size, balanced mode [B]    39,833,878,505   103,367,993,362
Compressed size, ratio mode [B]       38,832,714,102   101,305,368,675
Time, memory mode [h:mm:ss]           1:12:42          1:26:02
Time, balanced mode [h:mm:ss]         1:33:18          2:11:21
Time, ratio mode [h:mm:ss]            3:18:46          4:57:09
Memory, memory mode [KB]              13,715,168       14,341,128
Memory, balanced mode [KB]            26,728,108       27,293,824
Memory, ratio mode [KB]               97,922,208       99,133,548
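
The sizes reported above can be turned into compression ratios (original size divided by compressed size). The short sketch below does this arithmetic for the memory and ratio modes; the function and constant names are illustrative only, and the byte counts are copied from the results above.

```cpp
#include <cassert>
#include <cstdint>

// Compression ratio = original size / compressed size.
double CompressionRatio(std::uint64_t original_bytes, std::uint64_t compressed_bytes) {
    assert(compressed_bytes > 0);
    return static_cast<double>(original_bytes) / static_cast<double>(compressed_bytes);
}

// Sizes in bytes, copied from the NA12878 results above.
constexpr std::uint64_t kOriginal       = 268305314354ULL;
constexpr std::uint64_t kLossyMemory    = 42120596486ULL;
constexpr std::uint64_t kLossyRatio     = 38832714102ULL;
constexpr std::uint64_t kLosslessMemory = 105807350384ULL;
constexpr std::uint64_t kLosslessRatio  = 101305368675ULL;
```

For instance, CompressionRatio(kOriginal, kLossyMemory) comes out at roughly 6.4 and CompressionRatio(kOriginal, kLosslessMemory) at roughly 2.5, so for this dataset the choice of quality mode matters far more than the choice of priority mode.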

If one wants to check how much CoLoRd can squeeze the input data regardless of the resource requirements, the ratio mode should be used. If more control over execution is needed, the remaining parameters may be configured. The simplest way to decide in which direction to change a parameter, without needing to understand its meaning, is to display the defaults for a given compression priority mode with the --help switch. For example, let's say you want to find out if you should increase or decrease the -f parameter to improve the compression ratio while compressing ONT data. You may run CoLoRd twice with the following parameters:

./colord compress-ont --help -p balanced
./colord compress-ont --help -p ratio

You will notice the default for -f is higher for balanced mode, which means lowering it will increase the compression ratio. The same approach may be applied for other parameters (-L, -H, -c, -r, --min-to-alt, etc.).

In the ratio priority mode all the input reads may serve as a reference to encode other reads. This will increase RAM usage, especially for large datasets. In the remaining modes, only part of the reads may serve as a reference. If needed, -g and -x may be used to adjust how likely a read is to be accepted as a reference.
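
The acceptance formula quoted for -g, 1 / pow(id/range_reads, exponent), can be evaluated with a few lines of code to see how -g and -x shape the sampling. This is a sketch under assumptions: the function name is hypothetical, range_reads is passed in directly instead of being derived from the estimated genome length, and values above 1 are clamped to 1 because the formula exceeds 1 whenever id < range_reads.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper, not CoLoRd's implementation.
// Probability that the read with ordinal number `id` is accepted as a reference
// in sparse mode: 1 / pow(id / range_reads, exponent).
// Assumption: the result is clamped to 1, and id == 0 is always accepted.
double AcceptanceProbability(std::uint64_t id, double range_reads, double exponent) {
    assert(range_reads > 0.0);
    if (id == 0) {
        return 1.0;  // avoids division by zero in the formula
    }
    const double p = 1.0 / std::pow(static_cast<double>(id) / range_reads, exponent);
    return std::min(1.0, p);
}
```

Under these assumptions, reads with id up to range_reads are always accepted, and later reads are accepted with decreasing probability. A larger range keeps more reference reads, while a larger exponent makes the probability fall off faster.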

The values for -k and -a parameters are auto-adjusted based on the size of the data to be compressed. As a general rule, the larger the input, the higher these values should be.

Decompression

colord decompress [options] <archive> <output>

Positionals:

  • input - archive path,
  • output - output file path.

Options:

  • -h, --help - print help,
  • -G, --reference-genome - optional reference genome path (multi-FASTA gzipped or not), required for reference-based archives with no reference genome embedded (-G compression without -s switch),
  • -v, --verbose - verbose mode.

Archive information

colord info <archive>

API

CoLoRd comes with a C++ API allowing straightforward access to an existing archive. Below is an example of using the API in code.

#include "colord_api.h"
#include <iostream>

int main(int argc, char** argv) {
	try {
		colord::DecompressionStream stream("archive.colord");	// load a CoLoRd archive
		auto info = stream.GetInfo();				// get archive information
		std::cerr << "Archive info:\n\n";
		info.ToOstream(std::cerr);				// print it to stderr

		// iterate over records in the archive
		while (auto x = stream.NextRecord()) {
			if (info.isFastq) {
				std::cout << "@" << x.ReadHeader() << "\n";
				std::cout << x.Read() << "\n";
				std::cout << "+" << x.QualHeader() << "\n";
				std::cout << x.Qual() << "\n";
			} else {
				std::cout << ">" << x.ReadHeader() << "\n";
				std::cout << x.Read() << "\n";
			}
		}
	}
	catch (const std::exception& ex) {
		std::cerr << "Error: " << ex.what() << "\n";
		return -1;
	}	
	return 0;
}

Compiling own code utilizing colord API

To use the API, one needs to include the colord_api.h header file and link against libcolord_api.a. libcolord_api.a uses std::thread and zlib, so the -lpthread and -lz flags are needed for linking. For example, to compile and link the code above one could use the following command:

g++ -O3 $SRC_FILE -I$INCLUDE_DIR $LIB_DIR/libcolord_api.a -lz -lpthread -o example -no-pie

where

  • SRC_FILE is the path to the source file,
  • INCLUDE_DIR is the path to the directory containing colord_api.h (when colord is compiled from sources, an include directory is created next to the Makefile),
  • LIB_DIR is the path to the directory containing libcolord_api.a (when colord is compiled from sources, a bin directory is created next to the Makefile; it contains, among others, libcolord_api.a).

Citing

Kokot, M., Gudyś, A., Li, H. and Deorowicz, S. (2022) CoLoRd: Compressing long reads. Nature Methods, https://doi.org/10.1038/s41592-022-01432-3

colord's People

Contributors

agudys, marekkokot

colord's Issues

Nonbinary output format?

Hi, I don't know if this makes sense for the output format, but I was wondering if CoLoRd could output a plaintext format? I am so used to manually inspecting files as a sort of sanity check that having a bespoke binary format might get in the way of our workflow. Or also if the compressed binary file makes its way to another computer without CoLoRd, it is possible that there will be no way to inspect the file.

I understand that it might not be possible with however it's structured and also that it will not be as compressed just like sam vs bam.

So anyway, is there a way to see what's under the hood? A way to pass around plaintext?

Compression with reference genome

Hello, thanks for the excellent compression tool. I have a question about compression with a reference genome.
I am compressing FASTA data, and the reference genome is GRCh38. I sampled the reference genome, that is to say, I did not use all of it. The size of the original reference genome was 3G. I found that whether I use the complete reference genome, half of it, or even one-tenth, the compression ratio barely changes. I used Li Heng's samtools to check the alignment results; in fact, only about one-fifth of the data was aligned. Does the alignment rate have a large impact on the compression performance of colord? Why do I get similar compression results with the full reference genome and with one-tenth of the reference genome (the smaller the reference genome, the lower the alignment rate)?
Looking forward to your reply. Best wishes!

`mimalloc` submodule missing

After the last commit, CoLoRd cannot be compiled; it fails with the following error:

(base) colord$ make
g++ -DMI_MALLOC_OVERRIDE -O3 -DNDEBUG -fPIC -Wall -Wextra -Wno-unknown-pragmas -fvisibility=hidden -ftls-model=initial-exec -fno-builtin-malloc -c -I src/colord/libs/mimalloc/include src/colord/libs/mimalloc/src/static.c -o src/colord/libs/mimalloc/mimalloc.o
g++: error: src/colord/libs/mimalloc/src/static.c: No such file or directory
g++: fatal error: no input files
compilation terminated.
make: *** [Makefile:82: src/colord/libs/mimalloc/mimalloc.o] Error 1

Tested on two environments.

Segmentation fault (core dumped)

Hi, I am trying to compress a PacBio HiFi GIAB sample (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/HG006_NA24694-huCA017E_father/PacBio_CCS_15kb_20kb_chemistry2/uBAMs/m64017_191213_003759.hifi_reads.bam). With this specific sample I always get a "Segmentation fault (core dumped)" message during or after "Counting k-mers". I use the following command:

colord compress-pbhifi --qual org --threads 8 --reference-genome hs38DH.fa m64017_191213_003759.hifi_reads.fastq.gz m64017_191213_003759.hifi_reads.fastq.colord

The BAM file was converted to fastq.gz with the pbtk bam2fastq (from here: https://github.com/PacificBiosciences/pbtk#bam2fastx).

I am using colord 1.2.0.

Other samples worked fine, but I am having trouble with this specific one.
Any ideas?

maxKmerCount not working

Dear colord developers team!

Hope you are doing well.

Recently I found a critical bug in the colord source code that leads to maxKmerCount not working, which breaks the logic described in the paper.
Here is a detailed description and a fix.

Best regards,
Alexey

Malloc error: pointer being freed was not allocated

Hi!

I tried to run this tool on Mac OS (Monterey) and have problems running provided test examples.

I had some problems with conda on my computer, so I compiled it myself.

Running make resulted in clang: error: unsupported option '-static-libgcc'. After I deleted this flag from Makefile it compiled. When I tried to test it using your instructions, I got this error:

> ./colord compress-ont ./../test/M.bovis.fastq ont.default 
colord(28784,0x1064c6600) malloc: *** error for object 0xb43ac010120: pointer being freed was not allocated
colord(28784,0x1064c6600) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort      ./colord compress-ont ./../test/M.bovis.fastq ont.default

I have the same problem for all other test examples.

Is there anything you can suggest? Thanks in advance

ARM Architecture Support (Apple Silicon)

Hello!

I tried to run this tool with M1 MacBook on Mac OS. Unfortunately, it did not work.

Installation via conda install -c bioconda colord shows error:

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - colord

Current channels:

  - https://conda.anaconda.org/bioconda/osx-arm64
  - https://conda.anaconda.org/bioconda/noarch
  - https://repo.anaconda.com/pkgs/main/osx-arm64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/osx-arm64
  - https://repo.anaconda.com/pkgs/r/noarch
  - https://conda.anaconda.org/conda-forge/osx-arm64
  - https://conda.anaconda.org/conda-forge/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

I decided to try to compile it by myself. brew install gcc@10 also leads to the error:

(base) alexey@Alexeys-MacBook-Pro:~$ brew install gcc@10
gcc@10: The x86_64 architecture is required for this software.
Error: gcc@10: An unsatisfied requirement failed this build.

I managed to install gcc and edit the makefile to find and use it. But finally I've got a compilation error:

(base) alexey@Alexeys-MacBook-Pro:~/Downloads/colord$ make
/opt/homebrew/bin/g++ -Wall -O3 -std=c++17 -static-libgcc -static-libstdc++ -pthread  -I src/colord/../common/libs/zlib -I src/colord/libs/kmc_api -I src/colord/libs/edlib -I src/colord/libs/CLI11 -c src/colord/timer.cpp -o src/colord/timer.o
/opt/homebrew/bin/g++ -Wall -O3 -std=c++17 -static-libgcc -static-libstdc++ -pthread  -I src/colord/../common/libs/zlib -I src/colord/libs/kmc_api -I src/colord/libs/edlib -I src/colord/libs/CLI11 -c src/colord/stats_collector.cpp -o src/colord/stats_collector.o
/opt/homebrew/bin/g++ -Wall -O3 -std=c++17 -static-libgcc -static-libstdc++ -pthread  -I src/colord/../common/libs/zlib -I src/colord/libs/kmc_api -I src/colord/libs/edlib -I src/colord/libs/CLI11 -c src/colord/reads_sim_graph.cpp -o src/colord/reads_sim_graph.o
In file included from src/colord/kmer_filter.h:26,
                 from src/colord/reads_sim_graph.h:21,
                 from src/colord/reads_sim_graph.cpp:19:
src/colord/hs.h:21:10: fatal error: mmintrin.h: No such file or directory
   21 | #include <mmintrin.h>
      |          ^~~~~~~~~~~~
compilation terminated.
make: *** [src/colord/reads_sim_graph.o] Error 1

It seems the problem is caused by the use of SSE intrinsics, which are available only on x86/x64 architectures. I think a possible solution is sse2neon, but I am not sure it is the best and most universal one.

Later I can check how this tool works under ARM Linux.

An Exception Test

Dear CoLoRd developers, we used CoLoRd in a reference-free compression experiment on long-read FASTQ data. On the dataset ERR11011595 (https://www.ebi.ac.uk/ena/browser/view/ERR11011595), we ran the following command: /bin/time -v -p colord compress-ont -q org -p ratio -t 16 ERR11011595.fastq ERR11011595.colord . We measured memory and time using /bin/time -v -p. The compression took 45.521 hours, although the dataset is only 4.411 GB, which is not consistent with our understanding of CoLoRd's superior compression performance. Do you know what the problem might be? Thank you!

compile error

Hi,
when I compile the code, it shows the error as follows:

huangneng@bio2:~/tools/CoLoRd-master$ make
g++ -Wall -O3 -std=c++17 -static -Wl,--whole-archive -lstdc++fs -lpthread -Wl,--no-whole-archive  -I src/colord/../common/libs/zlib -I src/colord/libs/kmc_api -I src/colord/libs/edlib -I src/colord/libs/CLI11 -c src/colord/utils.cpp -o src/colord/utils.o
src/colord/utils.cpp:28:10: fatal error: filesystem: No such file or directory
 #include <filesystem>
          ^~~~~~~~~~~~
compilation terminated.
Makefile:83: recipe for target 'src/colord/utils.o' failed
make: *** [src/colord/utils.o] Error 1

OS: Ubuntu 16.04
GCC: 7.5.0

best
Neng
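
For context, the installation notes above list G++ 8.4 as required, and the log shows GCC 7.5.0, so upgrading the compiler is the documented route. Purely as an illustration of why the include fails and how code can be written to tolerate older standard libraries, the sketch below selects <experimental/filesystem> when <filesystem> is missing. This is a general C++ technique, not something the CoLoRd sources are known to do, and the helper FileName is hypothetical.

```cpp
#include <cassert>
#include <string>

// Pick whichever filesystem header the standard library provides.
// Older libstdc++ releases ship only <experimental/filesystem>
// (and need -lstdc++fs at link time).
#if defined(__has_include)
#  if __has_include(<filesystem>)
#    include <filesystem>
namespace fs = std::filesystem;
#  elif __has_include(<experimental/filesystem>)
#    include <experimental/filesystem>
namespace fs = std::experimental::filesystem;
#  else
#    error "No filesystem header available"
#  endif
#else
#  error "__has_include is not supported by this compiler"
#endif

// Hypothetical helper: returns the last component of a path.
std::string FileName(const std::string& path) {
    assert(!path.empty());
    return fs::path(path).filename().string();
}
```

Whether the rest of the code base builds with an older compiler is a separate question that this sketch does not answer.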
