xavierleroy / camlzip
Reading and writing zip and gzip files from OCaml
License: Other
Sometimes one needs to compress or decompress data that doesn't come directly from an {in,out}_channel
(for example, as part of a bigger pipeline). Right now I can't find a way of doing that.
The Makefile in master uses the NATIVE_COMPILER variable, which only appeared in ocaml/ocaml@987b081 and so in 4.10.
Hi,
Your library seems to be a good wrapper for zlib, and it might be interesting to use it for Haxe. The next version of Haxe is planned to integrate with OCaml's package manager, OPAM.
What is the current state of this library relative to OPAM? It seems that camlzip is on OPAM, but it was published by a third-party author. Do you know him, and is it a reliable source? It would be better if you were the person who published the package, since this seems to be the home of the project.
Is it possible to publish (opam-publish) a new release of camlzip that includes recent bug fixes?
Right now Merlin has no clue about the functions or their documentation.
The problem lies in this snippet:
https://github.com/xavierleroy/camlzip/blob/master/zip.ml#L582-L587
The crc reference is never updated by the callback.
The spec for the Zip format (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT) says:
4.4.17.1 The name of the file, with optional relative path.
The path stored MUST NOT contain a drive or
device letter, or a leading slash. All slashes
MUST be forward slashes '/' as opposed to
backwards slashes '\' for compatibility with Amiga
and UNIX file systems etc. If input came from standard
input, there is no file name field.
We just observed that using backwards slashes can indeed cause issues with some tools when unzipping on Linux (for Amiga, we couldn't check, unfortunately). Such a problem has also been reported e.g. here
We could argue that it's the responsibility of Camlzip users to know about this constraint, but it seems harmless and useful to add a note to mention it in the docstrings of functions taking an entry path argument.
Going a step further, Camlzip could replace \ with / automatically in entry paths (and also fail on leading slashes or drive?).
@xavierleroy : I'm happy to propose a PR implementing either of these variants if you tell me which one you prefer.
For the interested reader, here is some extra context. By default, Win32 restricts path lengths to about 256 characters. This can be lifted with some global settings, which cannot be expected on a typical Windows machine. The practical workaround is rather to prepend \\?\ to the path, which lifts the restriction; but if we do that, we have to use backslashes in the path -- forward slashes are normally allowed as well, but not when that prefix is used. This means that Windows applications using Camlzip and supporting long paths will need to juggle between backslashes (for opening the file manually, or passing an input file name to copy_file_to_entry) and forward slashes (for the entry path name).
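To make the second variant concrete, here is a minimal sketch of what such a normalization could look like. The names normalize_entry_path and Invalid_entry_path are hypothetical and not part of Camlzip; whether to fail or silently strip a leading slash or drive letter is a design choice left open above, and this sketch assumes failing.

```ocaml
(* Hypothetical helper, not part of Camlzip: normalize a ZIP entry path
   as suggested in the issue above. *)
exception Invalid_entry_path of string

let normalize_entry_path (path : string) : string =
  (* Replace every backslash with a forward slash. *)
  let p = String.map (fun c -> if c = '\\' then '/' else c) path in
  let n = String.length p in
  (* Reject a leading slash. *)
  if n > 0 && p.[0] = '/' then raise (Invalid_entry_path path);
  (* Reject a drive letter such as "C:". *)
  let is_letter c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') in
  if n >= 2 && is_letter p.[0] && p.[1] = ':' then
    raise (Invalid_entry_path path);
  p
```

With this sketch, a Windows application could keep backslashes for the on-disk file name and pass the normalized string as the entry name.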
Since the notorious zlib update, I've been having hard-to-pin-down, somewhat reproducible issues when processing large zip files (700 MB) while using a lot of memory. I can reproduce them every time with my core+camlzip processor on one of my huge files, which I am not able to share publicly, but I have been unable to reduce this to a smaller reproducible case (even adding debug output may make the issue go away).
The problem manifests itself as suddenly being unable to unpack zip file — the Zlib.Error "decompression error" on random files gets thrown, whereas the zip file is itself perfectly fine.
What I've been able to pinpoint so far is that it started occurring with zlib commit madler/zlib@b516b4b, in which Mark Adler added some sanity checks.
The exception gets thrown in camlzip_inflateEnd, because zlib returns an error.
The reason is that inflateStateCheck checks the stream structure, which contains a "state" substructure with a back pointer to the stream. The new check verifies that the two stream pointers are equal (madler/zlib@b516b4b#diff-327188edf18799ffbb5a51cc69f797e8R113), and suddenly they no longer are.
Here's my zlib debug info,
# let lines = Zip.read_entry z entry |> String.split ~on:'\n' in ...
inflatestatecheck failed
strm 0x7f7c7cd8f7b0
state 0x2a35070
state->strm 0x7f7c887b5100
state->mode 16203 (distext)
Uncaught exception:
Zip.Error("weather.zip", "wlask.min", "decompression error")
Called from file "src/exn.ml", line 90, characters 6-10
My guess is that the garbage collector, or something similar, sometimes moves things around and the structure becomes invalid. Any ideas?
(Up-to-date 64-bit archlinux, ocaml 4.04.0 and all via opam)
In zip.ml, replace "open_in" with:
let open_in filename =
  let ic = Pervasives.open_in_bin filename in
  try
    let (cd_entries, cd_size, cd_offset, cd_comment) = read_ecd filename ic in
    let entries =
      read_cd filename ic cd_entries cd_offset (Int32.add cd_offset cd_size) in
    let dir = Hashtbl.create (cd_entries / 3) in
    List.iter (fun e -> Hashtbl.add dir e.filename e) entries;
    { if_filename = filename;
      if_channel = ic;
      if_entries = entries;
      if_directory = dir;
      if_comment = cd_comment }
  with exn ->
    (* Close the channel so the file descriptor is not leaked on error. *)
    Pervasives.close_in ic;
    raise exn
The project README points to http://www.gzip.org/ . Wouldn't https://zlib.net/ be a better reference for zlib?
Some external consumers (e.g. numpy) expect uncompressed_size to be correct, which is not the case in the current implementation for files larger than 4GB. It is probably worth considering raising an exception on inputs exceeding 4GB.
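As a rough illustration of the suggested behaviour, a guard along the following lines could be called before an entry's size is written. This is only a sketch: check_zip32_size and Entry_too_large are hypothetical names, not part of Camlzip, and it assumes the size is available as an int64.

```ocaml
(* Largest value that fits in the 32-bit size fields of a classic
   (non-ZIP64) entry: 2^32 - 1. *)
let max_zip32_size = 0xFFFF_FFFFL

(* Hypothetical exception, not part of Camlzip. *)
exception Entry_too_large of int64

(* Hypothetical guard: raise instead of silently storing a truncated size. *)
let check_zip32_size (size : int64) : unit =
  if Int64.compare size max_zip32_size > 0 then
    raise (Entry_too_large size)
```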
Camlzip does not support ZIP64 extensions. We are currently running into the limit of 64k files. Has anyone worked on adding ZIP64 support?
Hello
I'm currently trying to cross-compile your library properly, but I'm facing an issue.
I noticed the following in your README:
- Edit the three variables at the beginning of the Makefile to reflect the location where Zlib is installed on your system. The defaults are OK for Linux.
I would much prefer to use environment variables rather than modify the sources.
That makes integration into a complete cross-compilation build system easier.
May I suggest the following patch to avoid this issue?
Thanks
Erwan
--- Makefile.ori 2017-11-07 11:41:26.375257045 +0100
+++ Makefile 2017-11-07 11:41:38.719314251 +0100
@@ -5,12 +5,12 @@
 # The directory containing the Zlib library (libz.a or libz.so)
 # Leave empty if libz is in a standard linker directory
-ZLIB_LIBDIR=
+ZLIB_LIBDIR?=
 # ZLIB_LIBDIR=/usr/local/lib
 # The directory containing the Zlib header file (zlib.h)
 # Leave empty if zlib.h is in a standard compiler directory
-ZLIB_INCLUDE=
+ZLIB_INCLUDE?=
 # ZLIB_INCLUDE=/usr/local/include
 # Where to install the library. By default: sub-directory 'zip' of
Reading a large archive that contains a huge number of smaller files fails:
Assert_failure zip.ml:217:4
assert((cd_bound = (LargeFile.pos_in ic)) &&
(cd_entries = 65535 || !entrycnt = cd_entries));
After adding a debug dump, I see that entrycnt is not the same as cd_entries, which suggests that in this case cd_entries is truncated: it is not #ffff, but !entrycnt & 0xffff:
cdbound=34a57695, lpos=34a57695, cd_entries=284c, entrycnt=1284c
Suggested fix:
zip.ml
assert((cd_bound = (LargeFile.pos_in ic)) &&
- (cd_entries = 65535 || !entrycnt = cd_entries));
+ (cd_entries = 65535 || cd_entries = !entrycnt land 0xffff || !entrycnt = cd_entries));
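The condition in the suggested fix can be checked in isolation. The sketch below pulls it out into a standalone predicate (entry_count_consistent is a hypothetical name, not a function in zip.ml) and evaluates it on the values from the debug dump.

```ocaml
(* Hypothetical predicate mirroring the suggested assertion: the 16-bit
   entry count stored in the archive is accepted if it is the 65535
   sentinel, or if it equals the number of entries actually read,
   either modulo 2^16 or exactly. *)
let entry_count_consistent ~(cd_entries : int) ~(entrycnt : int) : bool =
  cd_entries = 65535
  || cd_entries = entrycnt land 0xffff
  || entrycnt = cd_entries

let () =
  (* Values taken from the debug dump: 0x1284c land 0xffff = 0x284c. *)
  assert (entry_count_consistent ~cd_entries:0x284c ~entrycnt:0x1284c)
```

Note that the last disjunct is kept only to mirror the patch; for non-negative counts it is already implied by the modulo comparison.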
Hi,
Is there a way to flush only the internal buffer of the out channel?
Thank you in advance!