Comments (6)

veldsla commented on August 19, 2024

I can confirm that this works. Concatenated gz members are quite common in bioinformatics as well. The BGZF format uses them in combination with an index to allow random access to files or concurrent processing. The members are quite small (at most 64 KiB of uncompressed data). Performance seems fine: the time to process a file is comparable to zcat when I use the zlib feature. Miniz is about 1.6 times slower, but that is also the case for single-member gz files.

Partial example that uses the fastq reader from rust-bio:

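use std::io::{BufRead, BufReader};
use bio::io::fastq;

// `file` is assumed to be an already opened std::fs::File
let mut reader = BufReader::new(file);
let mut r = fastq::Record::new();

loop {
    // loop over all possible gzip members; an empty buffer means EOF
    match reader.fill_buf() {
        Ok(b) => if b.is_empty() { break },
        Err(e) => panic!("{}", e)
    }

    // decode the next member; the decoder only borrows `reader`, so it
    // can be reused for the following member (the unwrap matches the
    // flate2 version used here, where new() returns a Result)
    let gz = flate2::bufread::GzDecoder::new(&mut reader).unwrap();
    let mut fqreader = fastq::Reader::new(gz);

    // loop over all records in this member
    loop {
        match fqreader.read(&mut r) {
            Ok(()) => {
                if r.is_empty() {
                    // current gz member finished, more to decode?
                    break;
                }
            },
            Err(err) => panic!("{}", err)
        }
        // do stuff
    }
}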

from flate2-rs.

alexcrichton commented on August 19, 2024

Unfortunately this has also been reported before (#35 and #23), but due to other divergences from zlib (see #40), I'm somewhat inclined to just implement a zlib-compatible gz stream as perhaps a separate type...

nostrademons commented on August 19, 2024

Is there an easy workaround? I don't mind writing extra code on my end, but I don't know any other way to reliably check for EOF, since the GzDecoder takes ownership of its underlying reader and so I can't keep a borrowed reference to the File to check eof().

I suppose for my use-case I could get rid of flate entirely and pipe the output of gunzip -c to stdin, but then I've got to wire everything up with shell scripts, and I'm limited to one input file.

alexcrichton commented on August 19, 2024

You may be able to do something like:

  • Use bufread::GzDecoder instead of read::GzDecoder.
  • Use &mut BufRead instead of a by-value T
  • When one stream finishes decoding, check to see if the buffer has any more bytes in it, and if it does create a new GzDecoder to decode the next stream.

Although perhaps not tested, the GzDecoder shouldn't consume any more bytes than it's supposed to, so if the streams are literally concatenated you should be able to detect this and just pick up where it left off.

I haven't tested this out yet, though, so it may not work :(
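
The borrow-and-check pattern from the list above can be tried without flate2 at all. The following is only a minimal sketch using the standard library: `MemberReader` and `read_all_members` are hypothetical names, and the toy "member" format (one length byte followed by that many payload bytes) merely stands in for a gzip member. What it illustrates is that a decoder taking its reader by value can be handed `&mut reader`, and that `fill_buf()` afterwards tells you whether another member follows.

```rust
use std::io::{self, BufRead, Cursor, Read};

/// Hypothetical stand-in for a decoder such as `bufread::GzDecoder`:
/// it takes its reader by value and consumes exactly one "member".
/// Here a member is one length byte followed by that many payload bytes.
struct MemberReader<R: BufRead> {
    inner: R,
}

impl<R: BufRead> MemberReader<R> {
    fn new(inner: R) -> Self {
        MemberReader { inner }
    }

    /// Read one member and return its payload, consuming nothing beyond it.
    fn read_member(mut self) -> io::Result<Vec<u8>> {
        let mut len = [0u8; 1];
        self.inner.read_exact(&mut len)?;
        let mut payload = vec![0u8; len[0] as usize];
        self.inner.read_exact(&mut payload)?;
        Ok(payload)
    }
}

/// Decode every member in `reader`, creating a fresh `MemberReader` each time.
fn read_all_members<R: BufRead>(mut reader: R) -> io::Result<Vec<Vec<u8>>> {
    let mut members = Vec::new();
    // `fill_buf` returns an empty slice only at end of input.
    while !reader.fill_buf()?.is_empty() {
        // `&mut R` implements `BufRead`, so the decoder borrows the reader
        // instead of taking ownership of it.
        members.push(MemberReader::new(&mut reader).read_member()?);
    }
    Ok(members)
}

fn main() -> io::Result<()> {
    // Two toy members back to back, like two concatenated gzip streams.
    let data = Cursor::new(b"\x03foo\x03bar".to_vec());
    for member in read_all_members(data)? {
        println!("{}", String::from_utf8_lossy(&member));
    }
    Ok(())
}
```

The same loop shape should carry over to a real decoder as long as that decoder reads through the `BufRead` interface and does not consume bytes past the end of its own member.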

alexcrichton commented on August 19, 2024

To clarify, this is what I would expect:

extern crate flate2;

use std::io::prelude::*;
use std::io;

fn main() {
    let mut v = Vec::new();
    flate2::write::GzEncoder::new(&mut v, flate2::Compression::Best)
                             .write_all(b"foo")
                             .unwrap();
    flate2::write::GzEncoder::new(&mut v, flate2::Compression::Best)
                             .write_all(b"bar")
                             .unwrap();

    let mut data = &v[..];

    io::copy(&mut flate2::bufread::GzDecoder::new(&mut data).unwrap(),
             &mut io::stdout()).unwrap();
    io::copy(&mut flate2::bufread::GzDecoder::new(&mut data).unwrap(),
             &mut io::stdout()).unwrap();
}

It's crucial that you use bufread::GzDecoder instead of read::GzDecoder (as that may buffer too much), but otherwise while there's more data you can just keep decoding with a brand new GzDecoder.

nostrademons commented on August 19, 2024

Thanks for looking into this. I've been working on other parts of my system lately, but I'll implement this solution when I return to the Rust code.