shubhamchandak94 / spring

FASTQ compression

License: Other

fastq-files compression illumina-binning fastq-compression sequencing

spring's Introduction

SPRING


Check out the specialized tool for compressing nanopore long reads: https://github.com/qm2/NanoSpring

SPRING is a compression tool for FASTQ files (containing up to 4.29 billion reads):

  • Near-optimal compression ratios for single-end and paired-end datasets
  • Fast and memory-efficient decompression
  • Supports variable-length short reads of length up to 511 bases (without the -l flag)
  • Supports variable-length long reads of arbitrary length (up to 4.29 billion) (with the -l flag). This mode directly applies general-purpose compression (BSC) to the reads, so compression gains might be lower than without the -l flag.
  • Supports lossless compression of reads, quality scores and read identifiers
  • Supports reordering of reads (while preserving read pairing information) to boost compression
  • Supports quantization of quality values using QVZ, Illumina 8-level binning and binary thresholding
  • Supports decompression of a subset of reads (random access)
  • Supports gzipped FASTQ files as input during compression and gzipped output during decompression
  • Tested on Linux and macOS

Note: If you want to use SPRING only as a tool for reordering reads (approximately according to genome position), take a look at the reorder-only branch.

Install with conda on Linux

To install directly from source or to install on OSX, follow the instructions in the next section.

Spring is now available on conda via the bioconda channel. See this page for installation instructions for conda. Once conda is installed, do the following to install spring.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install spring

Note that if spring is installed this way, it should be invoked with the command spring rather than ./spring. The bioconda help page shows the commands if you wish to install spring in an environment. Also note that the bioconda version is compiled with the SSE4.1 instruction set to allow portability across machines; you might get slightly better performance by compiling from source with the instructions below, which use all instructions available on the target machine. Conversely, older processors that don't support SSE4.1 instructions can fail with an Illegal instruction error. In such cases, please use the instructions below.
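If you prefer to keep spring in its own conda environment (as mentioned above, the bioconda help page shows such commands), a minimal sketch is shown below; the environment name spring_env is just an example:

conda create -n spring_env spring
conda activate spring_env
spring --help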

Download

git clone https://github.com/shubhamchandak94/SPRING.git

Install

The instructions below will create the spring executable in the build directory inside SPRING. If you plan to build and run SPRING on separate architectures, you might need to remove/comment the line set(FLAGS "${FLAGS} -march=native") in CMakeLists.txt (or use flags based on the target architecture). You can also use the -Dspring_optimize_for_portability=ON option for cmake, which restricts the build to SSE4.1 instructions that should work on most processors.
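For example, to build a portable binary, you would replace the plain cmake step in the instructions below with (a sketch using the portability option mentioned above):

cmake -Dspring_optimize_for_portability=ON ..
make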

On Linux with cmake installed and version at least 3.9 (check using cmake --version):

cd SPRING
mkdir build
cd build
cmake ..
make

On Linux with cmake not installed or with version older than 3.12:

cd SPRING
mkdir build
cd build
wget https://cmake.org/files/v3.12/cmake-3.12.4.tar.gz
tar -xzf cmake-3.12.4.tar.gz
cd cmake-3.12.4
./configure
make
cd ..
./cmake-3.12.4/bin/cmake ..
make

On macOS, install the GCC compiler, since Clang has issues with the OpenMP library:

  • Install HomeBrew (https://brew.sh/)
  • Install GCC (this step will be faster if Xcode command line tools are already installed using xcode-select --install):
brew update
brew install gcc@9
  • Set environment variables:
export CC=gcc-9
export CXX=g++-9
  • Delete CMakeCache.txt (if present) from the build directory
  • Follow the steps above for Linux (a consolidated sketch is shown below)
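Putting the macOS steps together, a minimal sketch (assuming the Homebrew gcc@9 install above, so the compilers are gcc-9/g++-9) is:

brew install gcc@9
export CC=gcc-9
export CXX=g++-9
cd SPRING
rm -f build/CMakeCache.txt        # remove stale CMake cache if present
mkdir -p build && cd build
cmake ..
make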

Usage

Run the spring executable /PATH/TO/spring (or just spring if installed with conda) with the options below:

Allowed options:
  -h [ --help ]                   produce help message
  -c [ --compress ]               compress
  -d [ --decompress ]             decompress
  --decompress-range arg          --decompress-range start end
                                  (optional) decompress only reads (or read
                                  pairs for PE datasets) from start to end
                                  (both inclusive) (1 <= start <= end <=
                                  num_reads (or num_read_pairs for PE)). If -r
                                  was specified during compression, the range
                                  of reads does not correspond to the original
                                  order of reads in the FASTQ file.
  -i [ --input-file ] arg         input file name (two files for paired end)
  -o [ --output-file ] arg        output file name (for paired end
                                  decompression, if only one file is specified,
                                  two output files will be created by suffixing
                                  .1 and .2.)
  -w [ --working-dir ] arg (=.)   directory to create temporary files (default
                                  current directory)
  -t [ --num-threads ] arg (=8)   number of threads (default 8)
  -r [ --allow-read-reordering ]  do not retain read order during compression
                                  (paired reads still remain paired)
  --no-quality                    do not retain quality values during
                                  compression
  --no-ids                        do not retain read identifiers during
                                  compression
  -q [ --quality-opts ] arg       quality mode: possible modes are
                                  1. -q lossless (default)
                                  2. -q qvz qv_ratio (QVZ lossy compression,
                                  parameter qv_ratio roughly corresponds to
                                  bits used per quality value)
                                  3. -q ill_bin (Illumina 8-level binning)
                                  4. -q binary thr high low (binary (2-level)
                                  thresholding, quality binned to high if >=
                                  thr and to low if < thr)
  -l [ --long ]                   Use for compression of arbitrarily long read
                                  lengths. Can also provide better compression
                                  for reads with significant number of indels.
                                  -r disabled in this mode. For Illumina short
                                  reads, compression is better without -l flag.
  -g [ --gzipped_fastq ]          enable if compression input is gzipped fastq
                                  or to output gzipped fastq during
                                  decompression
  --gzip-level arg (=6)           gzip level (0-9) to use during decompression 
                                  if -g flag is specified (default: 6)
  --fasta-input                   enable if compression input is fasta file
                                  (i.e., no qualities)                                

Note that SPRING compressed files are tar archives consisting of the different compressed streams; we nevertheless recommend using the .spring extension as in the examples shown below.
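Since the compressed output is a tar archive, you can inspect the contained streams with standard tools; for example (a sketch, the stream names will vary by dataset and options):

tar -tvf file.spring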

Resource usage

For SPRING's memory and CPU usage, please see the paper and the associated supplementary material. Note that SPRING uses some temporary disk space and can fail if disk space is insufficient. Assuming that qualities and ids are not being discarded and SPRING is operating in short-read mode, the additional temporary disk usage is around 10-30% of the original uncompressed file when the -r flag is not specified (i.e., the default lossless mode); it is on the lower end when the quality values come from newer Illumina machines and are more compressible. When the -r flag is specified, SPRING writes all quality values and read ids to a temporary file, leading to significantly higher temporary disk usage, closer to 70-80% of the original file size. These figures are approximate and include the space needed for the final compressed file.
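As a rough pre-flight check based on the approximate 10-30% figure above (a sketch, not an exact requirement; the file name and the 30% factor are placeholders), you can compare the free space in the working directory against a fraction of the input size:

input=file_1.fastq                                        # placeholder input file
need_kb=$(( $(du -k "$input" | cut -f1) * 30 / 100 ))     # ~30% of uncompressed size
free_kb=$(df -k . | awk 'NR==2 {print $4}')               # free space in current dir
echo "need roughly ${need_kb} KB of temporary space, ${free_kb} KB free"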

Example Usage of SPRING

This section contains several examples for SPRING compression and decompression with various modes and options. The compressed SPRING file uses the .spring extension as a convention. If installed using conda, use the command spring instead of ./spring.

Compressing file_1.fastq and file_2.fastq losslessly using the default 8 threads.

./spring -c -i file_1.fastq file_2.fastq -o file.spring

Compressing file_1.fastq.gz and file_2.fastq.gz (gzipped FASTQ files) losslessly using the default 8 threads.

./spring -c -i file_1.fastq.gz file_2.fastq.gz -o file.spring -g

Using 16 threads (Lossless).

./spring -c -i file_1.fastq file_2.fastq -o file.spring -t 16

Compressing with only paired end info preserved, ids not stored, qualities compressed after Illumina binning (recommended lossy mode for older Illumina machines; for NovaSeq files, lossless quality compression is recommended).

./spring -c -i file_1.fastq file_2.fastq -r --no-ids -q ill_bin -o file.spring

Compressing with only paired end info preserved, ids not stored, qualities binary thresholded (qv < 20 binned to 6 and qv >= 20 binned to 40).

./spring -c -i file_1.fastq file_2.fastq -r --no-ids -q binary 20 40 6 -o file.spring

Compressing with only paired end info preserved, ids not stored, qualities quantized using qvz with approximately 1 bit used per quality value.

./spring -c -i file_1.fastq file_2.fastq -r --no-ids -q qvz 1.0 -o file.spring

Compressing only reads and ids.

./spring -c -i file_1.fastq file_2.fastq --no-quality -o file.spring

Compressing single-end long read Fastq losslessly.

./spring -c -l -i file.fastq  -o file.spring

Compressing a single-end file without preserving read order.

./spring -c -i file.fastq -r -o file.spring

Compressing a single-end file with read order preserved (lossless).

./spring -c -i file.fastq -o file.spring

Decompressing (single end) to file.fastq.

./spring -d -i file.spring -o file.fastq

Decompressing (single end) to file.fastq, only decompressing reads from 400 to 1000000.

./spring -d -i file.spring -o file.fastq --decompress-range 400 1000000

Decompressing (paired end) to file.fastq.1 and file.fastq.2.

./spring -d -i file.spring -o file.fastq

Decompressing (paired end) to file_1.fastq and file_2.fastq.

./spring -d -i file.spring -o file_1.fastq file_2.fastq

Decompressing (paired end) to file_1.fastq.gz and file_2.fastq.gz.

./spring -d -i file.spring -o file_1.fastq.gz file_2.fastq.gz -g
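If the gzipped output is only transient, a lower gzip level can make decompression faster (a sketch using the --gzip-level option documented above; level 1 trades a larger output file for speed):

./spring -d -i file.spring -o file_1.fastq.gz file_2.fastq.gz -g --gzip-level 1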

Decompressing (paired end) to file_1.fastq and file_2.fastq, only decompress pairs from 4000000 to 8000000.

./spring -d -i file.spring -o file_1.fastq file_2.fastq --decompress-range 4000000 8000000

Compressing file_1.fasta and file_2.fasta (FASTA files, without qualities) losslessly using the default 8 threads.

./spring -c -i file_1.fasta file_2.fasta -o file.spring --fasta-input

Decompressing (paired end) to file_1.fasta and file_2.fasta (continuing the previous example).

./spring -d -i file.spring -o file_1.fasta file_2.fasta

spring's People

Contributors

shubhamchandak94

spring's Issues

Suitable for production use?

Hi, would you consider this code to be suitable for use in a production system? So far I'm finding this to be the best contender for being able to truly recapitulate the input data and it could save a significant amount of storage and costs for our users.

Use would be within University of Cambridge and UK NHS research environments.

Memory requirements

Hi,
I tested your tool on whole-genome FASTQ files and the results seem quite promising. But a major drawback is the high memory consumption: for a 160 GB gzipped input it needed around 128 GB of RAM. Is this amount of memory expected, and if so, is there any way to reduce it?

Best Regards,
Leon

conda install not working

After installing from conda (bioconda), I am unable to run Spring. Any calls to the command die with 'Illegal instruction', which suggests a compiler incompatibility. I am on an x86-64 system running Debian buster. I am able to clone from git, build, and run the resulting binary without problems.

how to get the gz format file

My dear

Thank you for this excellent compression software, but when using your package I found no way to obtain a fastq.gz file when decompressing. Could you add support for gzipped output during decompression?

Thank you!

Compress fasta sequences

Hello,

I was wondering if Spring could be used to compress fasta sequences. I have files that contain only nucleotide sequences and no quality scores.

Thank you.

MD5 values are different

The compression rate of the software is great. During testing, I compressed the same FASTQ file twice with exactly the same parameters and machine configuration, using lossless compression, yet the MD5 values of the two compressed files were different. Why is that?

installation error

Hi Shubham,
I am trying to install SPRING in Ubuntu 16.04LTS with cmake version 3.14.7

During installation I get the error below. Can you please look into it and advise me how to proceed? Your help would be highly appreciated.

Thanks
Samarth

CMake Error at boost-cmake/cmake/Modules/DownloadBoost.cmake:66 (file):
file DOWNLOAD HASH mismatch

for file: [/home/sgrh/tools/Spring-master/boost-cmake/boost_1_67_0.tar.xz.tmp]
  expected hash: [4256a98911fbc943f7d97aab5f14f2ecde5341d90c5c5441696b7d99425af8fd]
    actual hash: [e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855]
         status: [1;"Unsupported protocol"]

How to read .spring compressed file

Hi Shubham,
Thanks for the fantastic tool. Superb!!!
Can you suggest a tool such as vi to quickly peek through the .spring formatted file after compression of paired end reads?

Shripathi

Allow `--fast` gzip compression on decode

Generally the output file will be transient, and level 1 offers a considerable time saving over level 6, as well as a significant temporary-storage saving over uncompressed output.

This is more important when the output will be a pair of files.

Performance benchmark

Hi,
I find SPRING a really wonderful piece of work. Besides the compression ratio, I think compression and decompression speed are also important. Have you ever benchmarked SPRING against other tools such as DSRC2? In my test, DSRC2 was able to compress 1 GB of FASTQ in less than one second.
Best,
Zekun

Spring's QVZ lossy compressor for PacBio HiFi quality streams?

Hi, I'm looking to see if I could use the QVZ lossy compressor in SPRING for PacBio HiFi reads. I'm currently testing on the following FASTQ file (https://github.com/PacificBiosciences/DevNet/wiki/Sequel-II-System-Data-Release:-Universal-Human-Reference-(UHR)-Iso-Seq), but gdb shows that it gave me a Segfault at spring::qvz::pmf_increment (index=93, pmf=0x55555c3a34c0).

Does it seem to be a problem with the entries for the quality scores in HiFi reads? I'm not exactly sure what to modify in the code, but I found one #define ALPHABET_SIZE 72 in qvz.h and one #define ALPHABET_INDEX_SIZE_HINT 72 in pmf.h. I tried setting them to 125 as HiFi reads have quality score range [33, 126]. While the segfault goes away, compressing one single file (as linked above) with rate set to 1.0 for QVZ seems to take forever (it was stuck in "Preprocessing"; I halted after like 20 minutes) on an i7-12700H laptop with 20G free RAM (with default -t 8). Am I missing something with the above modification?

Thanks very much for your help!

MD5 hash mismatch observed in compressed/decompressed files with Spring

Description

I'm using Spring for compression and decompression of my FASTQ files. However, I'm experiencing an issue where the hash of the compressed and decompressed files in the first iteration is different from the hash of the original file. After the first iteration, the hashes of the compressed and decompressed files are the same.

Reproducible Code

1_sample2_R1.fastq.gz

fastq_1="sample2_R1.fastq.gz"
prefix=${fastq_1%%.fastq.gz}

for i in {1..5}
do 
    it=$((i + 1)) 
    spring -c -g -i ${i}_${prefix}.fastq.gz -o ${i}_${prefix}.spring
    spring -d -g -i ${i}_${prefix}.spring -o ${it}_${prefix}.fastq.gz

    input_md5=`md5sum ${i}_${prefix}.fastq.gz | cut -d' ' -f1`
    output_md5=`md5sum ${it}_${prefix}.fastq.gz | cut -d' ' -f1`

    echo "iteration ${i} input: $input_md5 output: $output_md5" >> results.txt
done

cat results.txt

Result

iteration 1 input: 43df7c96f72f02ba601458b742a7b57c output: c3a55c1ec9987fbbba73744ed2b214ac
iteration 2 input: c3a55c1ec9987fbbba73744ed2b214ac output: c3a55c1ec9987fbbba73744ed2b214ac
iteration 3 input: c3a55c1ec9987fbbba73744ed2b214ac output: c3a55c1ec9987fbbba73744ed2b214ac
iteration 4 input: c3a55c1ec9987fbbba73744ed2b214ac output: c3a55c1ec9987fbbba73744ed2b214ac
iteration 5 input: c3a55c1ec9987fbbba73744ed2b214ac output: c3a55c1ec9987fbbba73744ed2b214ac

Expected Output

I expect the hash of the compressed and decompressed files to match the hash of the original file for all iterations.

Observed Output

In the first iteration, the hash of the input file is 43df7c96f72f02ba601458b742a7b57c and the hash of the file after the compress-decompress round trip is c3a55c1ec9987fbbba73744ed2b214ac. In all subsequent iterations, the input and output hashes match (c3a55c1ec9987fbbba73744ed2b214ac).

Questions

Can you provide any insight into what may be causing this issue and if there are any potential solutions or workarounds?

Thanks

Read of length 0 detected. Aborting...

First off, great tool for lossless compression of fastq files. Perfect candidate for archival of data.

We have a fastq pair with many low-quality reads. When I try to compress it, I get the error mentioned in the title.

We have trimmed the fastq files using Trimmomatic, but we want to compress the original files for archival. Is there any way to bypass this behaviour, or would you recommend against using SPRING for this type of data?

Issue installing with conda

Hi, I'm trying to install with conda following your steps in the README and get the following error:

conda install spring
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - spring

Current channels:

  - https://conda.anaconda.org/conda-forge/osx-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://conda.anaconda.org/bioconda/osx-64
  - https://conda.anaconda.org/bioconda/noarch
  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

I also tried conda install -c bioconda spring with the same result.

How to calculate field-wise compression results

Hi Shubham,

In the supplementary results of the SPRING paper (section 3.1), field-wise results are given for read, quality, and identifier compression for both SPRING and FaStore. I would like to repeat the same for my data, but I couldn't find any options in the spring --help menu or any relevant code in this repository to do this.

Could you please provide the code for obtaining field-wise compression results or point me to a relevant resource to do this?

Best,
Abhay Rastogi (IIT-D)

How to work with spring compression in C++

Hi and thanks for developing Spring,
I was wondering, is it possible to work with the spring format in C++ without first decompressing it? With gzip or bz2 compression I use boost::iostreams (zlib and bz2lib) to work directly in my C++ code. Is there a way to do something like that with Spring? Or is there a zcat analog that I can use to pipe a .spring file to my program?
Cheers,
Artem
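There is no documented streaming interface, but one untested workaround (a sketch that assumes SPRING can write its decompressed output to a named pipe, which may not hold) is to decompress into a FIFO and read that from the C++ program:

mkfifo reads.fastq
./spring -d -i file.spring -o reads.fastq &
./my_program reads.fastq     # hypothetical C++ consumer reading the FIFO
rm reads.fastq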

build issues on Ubuntu 1604

Hi, saw your nice talk at ISMB.

On an Ubuntu 1604 cluster, I can't build the software because the cmake is too old. Fine.

Unfortunately, I can't make cmake via the workaround either (!).

Although libssl-dev is present, I get an error
cmake-3.12.1$ sudo apt install libssl-dev

./configure
make -j 8

error:
fatal error: openssl/hmac.h: No such file or directory

The solution for this error seems to be installing libssl-dev, which is installed.

Do you recommend I go to an earlier cmake, eg 3.10 ?

Or are you planning to build releases for eg Ubuntu 1604 ?

Thanks,
Colin

Errors related to read length

Hi,

Thanks a lot for developing this tool! I am trying to use it to compress several sets of paired-end fastq files, but I keep getting errors related to read length. For example, on several sets of files I get the following:

spring -c -i LC-PBMC-M5_1st/LUNGPBMCM51st5p_S1_L004_R1_001.fastq.gz LC-PBMC-M5_1st/LUNGPBMCM51st5p_S1_L004_R2_001.fastq.gz -o LUNGPBMCM51st5p_S1_L004_001.spring

Temporary directory: ./tmp.tzxlYvKWrO/ Starting compression... Preprocessing ... Max read length without long mode is Max read length without long mode is 511, but found read of length 597 511, but found read of length 593 terminate called recursively terminate called recursively terminate called recursively terminate called recursively terminate called recursively

For others I get an error stating that the read length is not equal to the quality length.

The strange thing is, these are short-read files. R1 is 26bp and R2 is 91bp. If I run them through fastqc I get no outlier reads (they all have the same length), and if I just have a look at the first few reads there is nothing obviously strange, e.g.:

@A00721:163:H7CH5DSXY:4:1101:2356:1000 2:N:0:GCCATTCC
CTGCCCAGCGGTATCCCAAAGCTGCAAACTGAAGGGAATGCCCAGCACCTCAAATCGTTCCATCTCGAAGTCCACTCCAATGGTGGCCTAG
+
F:FFFFFFFFFFF,FF:FFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF,F
@A00721:163:H7CH5DSXY:4:1101:2537:1000 2:N:0:GCCATTCC
ATTTTGGCCAGAGGCCCTCTTTTACTGAGAACAAAATGTGCGTAGAACATTGTTCTGGCTGGCTATGTAAACAGAAGAAAACCTTGCTCTC
+
FFFFFFFFF:F:F:FFFFFFF:,FFFFFFFFFF::FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF,F::FFF,FFFFFFFF:FFFFF:F
@A00721:163:H7CH5DSXY:4:1101:2935:1000 2:N:0:GCCATTCC
GGTGTGGTGCCAGATCTTCTCCATGTCGTCCCAGTTGGTGACGATGCCATGCTCAATGGGGTACTTCAGGGTCAGGATGCCACGCTAGCTC
+
FF:F:FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FF:FF:FFFFFFFFFFFFFFFFFFFFF:FFFFF:FF,F
@A00721:163:H7CH5DSXY:4:1101:3079:1000 2:N:0:GCCATTCC
CATCAACACCTTTAATCACACTCCTTTGAGCTTGATCACCCAATTTTTCTGACTCCTTAATAGCAATACAGGAAGGGATGATAAACAGTGG
+
FF:FFFFFFFF:F,FFFFF,FFFF,:F,F:FFFFFFFFFFFF,FFFF:FFFFFFFFFFFFF,FFFFFFF:FFFFFFFFFFFFF::F,FF,:

These are files that I received from a collaborator, so I can't rule out some strange preprocessing step upstream. Is there any way to print out which read(s) is causing the problem, so that I can try to troubleshoot?

Thanks a lot for the help!
Best,
Sarah

why only one output file

When compressing a pair of input fastq file (paired end reads)

spring can take two input files and two output files:

("output-file,o",
 po::value<std::vector<std::string>>(&outfile_vec)->multitoken(),
 "output file name (for paired end decompression, if only one file is "
 "specified, two output files will be created by suffixing .1 and .2.)")

But in the compress function:

void compress(const std::string &temp_dir,
              const std::vector<std::string> &infile_vec,
              const std::vector<std::string> &outfile_vec, const int &num_thr,
              const bool &pairing_only_flag, const bool &no_quality_flag,
              const bool &no_ids_flag,
              const std::vector<std::string> &quality_opts,
              const bool &long_flag, const bool &gzip_flag, const bool &fasta_flag) {
  ...
  std::string infile_1, infile_2, outfile;
  ...
  if (outfile_vec.size() == 1)
    outfile = outfile_vec[0];
  else {
    std::cerr << LINE << ": why output file must be single?\n";
    throw std::runtime_error("Number of output files not equal to 1");
  }

Why does the program crash when you provide two output files? I have not spent enough time reading the code. Could someone help?

Crash on some data

Spring compression crashes on some data with "Not enough memory!" message. The machine (Ubuntu) has 128 GB of RAM, so physical RAM should not be an issue. The input data size is 6.5 GB.

I uploaded the test data here: http://kirill.med.u-tokai.ac.jp/data/temp/spring-repro-1.fq.gz (0.9 GB compressed, 6.5 GB decompressed).

Spring command: spring -c -i h.fq -o h.spring -w . -t 1 -l --no-quality

Output:

Temporary directory: ./tmp.fa37JncCHr/
Long flag detected.
Starting compression...
Preprocessing ...
Not enough memory! Please check README file for more information.
terminate called after throwing an instance of 'std::runtime_error'
  what():  BSC error.
Aborted (core dumped)

Note the spring seems to work fine on other data (including much larger data) on the same machine.

Full steps to reproduce on Ubuntu machine:

cd /tmp
mkdir spring-repro-1
cd spring-repro-1
git clone https://github.com/shubhamchandak94/SPRING.git
cd SPRING
mkdir build
cd build
cmake ..
make
cd ../..
wget http://kirill.med.u-tokai.ac.jp/data/temp/spring-repro-1.fq.gz
gzip -dc spring-repro-1.fq.gz >h.fq
./SPRING/build/spring -c -i h.fq -o h.spring -w . -t 1 -l --no-quality

By the way, the message is confusing: "Not enough memory! Please check README file for more information." - the README.md file does not seem to have information relevant to this crash.

Let me know if you need more details.

Spring to Conda

Hello, thanks for spring. The speed and compression ratio sound amazing.

Did you consider submitting it to conda? That would make the program so much more accessible.

Running into errors

Hello,

I am using your software to compress fastq files but I am running into the following error.

*** Preprocessing ***
../sagnik/softwares/SPRING/spring: line 52: /usr/bin/time: No such file or directory

The command I am running is nohup ../sagnik/softwares/SPRING/spring -c -1 18359_CI16151_Rar3_raw_data/s_1_1_sequence.txt -t 20 -q qvz -o 18359_CI16151_Rar3_raw_data/s_1_1_sequence.txt.spring -i

Thank you.

recommended file extension

People are used to various compression conventions, like .gz for gzip. I see no such recommendation in the paper or README, not even in the examples.

I think adoption of spring would be greatly helped with a similar convention here, e.g. adding .spring or similar.

e.g. example.fastq vs example.fastq.gz vs example.fastq.spring

loss in the compression of .gz files

I used spring -c with default options to compress paired-end fastq.gz files, then spring -d to decompress them back to fastq.gz. The decompressed files are smaller than the originals and the md5sums differ for both the forward and reverse reads, so it seems some reads are eliminated during compression. There was no such problem when compressing plain fastq files, which was lossless.
Which parameter should I use to compress .gz files in lossless mode? --gzip-level?
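Gzip containers with identical content can still differ in size and md5sum, so a content-level check (a sketch with placeholder file names) can help distinguish container differences from actual read loss:

zcat original_1.fastq.gz | md5sum
zcat roundtrip_1.fastq.gz | md5sum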

Software dependencies

Hello,

This is more of a recommendation than an issue. SPRING depends on other software such as 7z. I would recommend checking that those programs are available before attempting to execute them. This will help save time, since all the computation of the previous steps won't be useful if the programs are missing.

Thank you.

Additional 'A' in FastQ Sequence

Hello,
I think I found a bug in your software:
If you use FastQ files which were generated on Windows (CR LF line endings), the decompressed FastQ files contain an additional 'A' at the end of each sequence.

Best Regards,
Leon Schütz

Piping a fastq file

Is it possible to pipe a fastq file with Spring?

I'd like to use binary thresholding in the compression step to reduce the size of a fastq file, but I want to generate a converted fastq file, not a .spring file. Is it possible to start from a large.fastq and get a small.fastq in which the scores have been transformed to binary to reduce the file size?
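Piping is not documented, but the large.fastq to small.fastq transformation can be approximated with a compress-then-decompress round trip using the documented options (a sketch; the binary thresholding values are just examples, and the plain FASTQ itself stays the same size; the gain appears once the binarized qualities are gzipped):

./spring -c -i large.fastq -q binary 20 40 6 -o tmp.spring
./spring -d -i tmp.spring -o small.fastq
rm tmp.spring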

Compression fail with error "CRITICAL tar: .: file changed as we read it"

Do other users also get multiple compression jobs failing with error message CRITICAL tar: .: file changed as we read it? I am confident we have no processes that change the content of our fastq.gz files.

If other users get this error as well, perhaps an --ignore-failed-read flag should be added to one of the tar commands somewhere in the code? This may sound risky for users that do have processes which change the content of the fastq files, but each user can add checksums like sha256 or md5 to the compression workflow.

Error when using XARGS

I am trying to use spring through xargs.

function my_func() {
    # arg1 = R1
    # arg2 = options
    R1=$(basename $1)
    R2=${R1/_R1/_R2}
    name=${R1%_R1*}
    options=$2

    spring -c -g $options -o $name.spring -i $R1 $R2; echo
}

export -f my_func

options="--num-threads 8"
echo ${files[*]} | xargs -d" " -I{} -n1 -P$num bash -c 'my_func "$@"' _ {} "$options"

And for some fastq.gz files the following error is thrown:
terminate called after throwing an instance of 'std::runtime_error'
what(): Cannot create temporary directory.
environment: line 1: 25179 Aborted (core dumped) spring -c -g $options -o $name.spring -i $R1 $R2

It appears that this issue only occurs when running the script on a remote drive.
The same fastq files on the local drive don't throw the error.
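Since the failure is in creating the temporary directory, a possible workaround (a sketch assuming local scratch space such as /tmp is available) is to point SPRING at a local working directory with the documented -w option:

spring -c -g $options -w /tmp -o $name.spring -i $R1 $R2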

Running into an issue

Hello,

I am running the following command and I am running into the following problem.

spring -c -1 A1_ATCACG_L001_R1_001.fastq -p -t 10 -i -o A1_ATCACG_L001_R1_001.fastq.spring.compressed

The error I am getting is:

*** Preprocessing ***
Max Read length: 100
Total number of reads: 31599231
Total number of reads without N: 31415745
Preprocessing Done!

real 0m41.783s
user 0m13.218s
sys 0m24.238s
Reading file: ./tmp.tf8zs715T7/input_clean.dna
Constructing dictionaries
Reordering reads
Reordering done, 1985451 were unmatched
Writing to file
Done!

real 1m38.688s
user 12m28.488s
sys 0m47.109s
Maximum Read length: 100
Number of non-singleton reads: 30165601
Number of singleton reads: 1250144
Number of reads with N: 183486
Encoding reads
Encoding done:
655929 singleton reads were aligned
128993 reads with N were aligned

real 0m25.394s
user 2m21.037s
sys 0m8.393s
This is bsc, Block Sorting Compressor. Version 3.1.0. 8 July 2012.
Copyright (c) 2009-2012 Ilya Grebnov [email protected].

./tmp.tf8zs715T7/read_pos.tar compressed 30965760 into 6323226 in 2.938 seconds.
This is bsc, Block Sorting Compressor. Version 3.1.0. 8 July 2012.
Copyright (c) 2009-2012 Ilya Grebnov [email protected].

./tmp.tf8zs715T7/read_noise.tar compressed 45281280 into 7072242 in 4.488 seconds.

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,32 CPUs x64)

Can't load './7z.dll' (./7z.so: cannot open shared object file: No such file or directory)

ERROR:
7-Zip cannot find the code that works with archives.

Please help with this.

Thank you.

Few other issues

Hello,

Here is a list of few other issues I encountered.

  1. At line 52, you are attempting to execute ./$harcdir/bin/preprocess.out. I do not have the program spring in my path, so I am providing the full directory path to point to the file location; the '.' before $harcdir makes it search in an incorrect location.
  2. The same issue crops up at lines 57, 58, 85, 86, 88, 91, 92, 97, 103, 108, 111, 116.

Thank you.

New feature: just re-ordering the reads in FASTQ without compression

The re-ordering of reads in FASTQ files, the way it is done in SPRING and HARC, is an incredibly useful operation for many things other than compression of FASTQ files, for example extending read sequences, filling the gap for paired-end reads, etc.

Would it be possible to add a new feature to SPRING that writes just the re-ordered FASTQ file to the output instead of compressing it (and then decompressing it)? Currently the only way to do this is to compress the FASTQ file and then decompress it.
