Comments (11)

narkode commented on June 6, 2024

I had a similar issue with unzip on Debian-Linux. A workaround is to repair the original zip file:

zip -FF AmsterdamUMCdb-v1.0.2.zip --out AmsterdamUMCdb-v1.0.2_repaired.zip -fz

Afterwards, extracting the new file with unzip works without errors:

unzip AmsterdamUMCdb-v1.0.2_repaired.zip

Best regards,
Julian

from amsterdamumcdb.

patrickthoral commented on June 6, 2024

The most likely reason is an incompatibility with the Mac’s default extraction program, 'Archive Utility', which does not support files over 4 GB. Other extraction utilities (e.g. Commander One/WinZip for Mac) should be fine.

nbenn commented on June 6, 2024

@patrickthoral Thanks for getting in touch. I don't believe the macOS Archive Utility has anything to do with this, mainly because the issue I'm describing shows up when running zipinfo from the command line.

Furthermore, I do not think the problem is limited to macOS. I can reproduce the issue under CentOS 7, for example, again running zipinfo v3.0.0:

[nbennett@eu-login-18 aumc]$ zipinfo AmsterdamUMCdb-v1.0.2.zip
Archive:  AmsterdamUMCdb-v1.0.2.zip
Zip file size: 9143127113 bytes, number of entries: 7
warning [AmsterdamUMCdb-v1.0.2.zip]:  4848159318 extra bytes at beginning or within zipfile
  (attempting to process anyway)
error [AmsterdamUMCdb-v1.0.2.zip]:  start of central directory not found;
  zipfile corrupt.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)

If you want me to, I can also check on Fedora, but I honestly do not believe this is an OS issue. Rather, I suspect there is an issue with how the zip file was created.

patrickthoral commented on June 6, 2024

I do not think the problem lies with the OS itself, but with the common source code most unzip utilities are based upon. I could reproduce the same error a colleague had on a Mac. The workaround was to use Commander One or WinZip for Mac. The file was created on a Windows system with the built-in archiving tools. For the next version, we'll check if there's a format/setting that won't mess up the default archiving tools on *nix based systems. If it won't extract at all, there is probably a transfer error.

nbenn commented on June 6, 2024

> The most likely reason is an incompatibility with the Mac’s default extraction program, 'Archive Utility', which does not support files over 4 GB. Other extraction utilities (e.g. Commander One/WinZip for Mac) should be fine.

For an enlightening SO post on this, see https://stackoverflow.com/a/59518097/3855417.

While you're correct that there is still an issue on macOS 10.15 when trying to create ZIP64 files using Archive Utility, the issue I'm reporting is not affected by this. As stated above, I'm on an Info-ZIP UnZip 6.0 toolchain, which does support extraction of proper ZIP64 files.

> If it won't extract at all, there is probably a transfer error.

If you provide me with a file hash, I'm happy to check. But I'm pretty sure I have the complete file. I also believe that the zip archive you're currently distributing is non-conformant with the ZIP64 specification and therefore extraction will fail for all extraction utilities that are strict about this, such as the default unzip program on many Unix platforms. Are you positive that your zip program is using ZIP64 extensions (which is required to create a compliant zip archive containing files of this size)?
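A minimal sketch of how such a hash comparison could be done in Python, for anyone who wants to rule out a transfer error. The function name `file_sha256` and the choice of SHA-256 are assumptions for illustration. The thread does not say which hash algorithm the maintainers would publish.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so a multi-gigabyte archive never has to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_sha256("AmsterdamUMCdb-v1.0.2.zip")` on both ends and comparing the two strings would show whether the download is byte-identical to the distributed file.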

> The workaround was to use Commander One or WinZip for Mac.

Unfortunately this does not work for my use-case. I'm trying to build a cross-platform pipeline for setting up the AUMC database. 7zip does extract the archive successfully (with warnings) but adding this as a dependency simply for extracting this one file seems unreasonable to me.

If you are planning on putting this off until the next release, do you have an ETA for that?

patrickthoral commented on June 6, 2024

You are right that this is a non-conformance issue, but on the part of those other tools. What happens is that zipinfo uses the Central End Record, ZIP64 Central End Record and ZIP64 Central End Locator incorrectly (not based on version 2 of the ZIP64 specification).

The official PKWARE (the developers of the standard) tools work fine with this file created by the licensed Windows Compressed Folders (part of Windows). In addition, the Python ZipFile module can also display the directory listing fine, however it does not support Deflate64, so extracting is not possible.
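The distinction between listing and extracting can be checked with a short sketch like the one below. It only reads the central directory, so it should work even for members the `zipfile` module cannot decompress. The helper name `list_entries` and the `METHOD_NAMES` table are illustrative assumptions. The numeric method IDs follow the ZIP format, where 9 denotes Deflate64.

```python
import zipfile

# Compression method IDs from the ZIP format. Method 9 (Deflate64) can be
# listed by the zipfile module but not decompressed.
METHOD_NAMES = {0: "stored", 8: "deflate", 9: "deflate64", 12: "bzip2", 14: "lzma"}


def list_entries(path):
    """Return (name, uncompressed size, method name) per member (hypothetical helper)."""
    with zipfile.ZipFile(path) as zf:
        return [
            (
                info.filename,
                info.file_size,
                # Fall back to the raw numeric ID for methods not in the table.
                METHOD_NAMES.get(info.compress_type, str(info.compress_type)),
            )
            for info in zf.infolist()
        ]
```

If every member reports `deflate64`, opening a member with `zipfile` is expected to fail with an unsupported-compression error, which matches the behaviour described above.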

Indeed, a problem with the Zip standard is that its implementation is not open source at all, but proprietary PKWARE technology, and no official open source version exists. The only reason it exists today is that it has been in use for decades (since the MS-DOS era) and ended up in (licensed) technology as a de facto standard.

I will use the cross-platform ZipFile library for the next iteration (which also implies using the better-supported Deflate algorithm), but there's no ETA as of yet. I don't understand, though, why adding that dependency sounds unreasonable. You are not allowed to distribute the files anyway to other users, so it's a one-time extraction.

nbenn commented on June 6, 2024

@patrickthoral Thanks for looking into making extraction easier cross-platform.

> You are not allowed to distribute the files anyway to other users

Obviously I'm not planning on distributing your data. I'm planning on distributing a pipeline that makes obtaining results from your data (together with other datasets) more reproducible and (hopefully) easier to access. It is for this pipeline that I'm trying to keep the number of dependencies as small as possible.

patrickthoral commented on June 6, 2024

The files have been rezipped with the Python ZipFile library, using the Deflate algorithm instead of the Deflate64 algorithm. I've verified that the result works on Windows, macOS and Ubuntu with the built-in tools, so it should be safe to use in most environments. I'll notify you when the new file is available for download.
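For illustration, rezipping a directory with plain Deflate could look roughly like the sketch below. This is not the maintainers' actual script. The function name `zip_directory` and the directory layout are assumptions. `allowZip64=True` (the module's default) lets the archive exceed 4 GB.

```python
import zipfile
from pathlib import Path


def zip_directory(src_dir, dest_zip):
    """Write every file under src_dir into dest_zip using Deflate (hypothetical helper).

    dest_zip should live outside src_dir, otherwise the archive would
    try to include itself.
    """
    src = Path(src_dir)
    with zipfile.ZipFile(
        dest_zip, "w", compression=zipfile.ZIP_DEFLATED, allowZip64=True
    ) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir, with forward slashes.
                zf.write(path, arcname=path.relative_to(src).as_posix())
```

Deflate (method 8) is the variant that the stock unzip tools and Python's own `zipfile` module can all read, which is why it is the safer choice for a cross-platform download.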

patrickthoral commented on June 6, 2024

@jsassenscheidt @nbenn Indeed, most open source implementations have problems reading the directory (but interestingly not Python's zipfile library). The rezipped file has been uploaded to our transfer system, so I expect the file to be available for download for credentialed users in the next couple of days.
Python's implementation sadly lacks a callback for reporting progress when (un)zipping, which is unfortunate when handling large files, so if anybody is interested, I added some sample code in the tools folder to improve this.
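One way to get progress reporting is to copy each member in chunks and invoke a callback after every chunk. The sketch below is an independent illustration, not the sample code from the tools folder. The name `extract_with_progress` and the callback signature `on_progress(bytes_done, bytes_total)` are assumptions.

```python
import os
import zipfile


def extract_with_progress(zip_path, dest_dir, on_progress, chunk_size=1 << 20):
    """Extract all file members, reporting progress per chunk (hypothetical helper)."""
    dest_root = os.path.realpath(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        # Directory entries are skipped; parent folders are created as needed.
        members = [m for m in zf.infolist() if not m.is_dir()]
        total = sum(m.file_size for m in members)
        done = 0
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.filename))
            # Refuse member names that would escape the destination folder.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe member path: {member.filename}")
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with zf.open(member) as src, open(target, "wb") as dst:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    dst.write(chunk)
                    done += len(chunk)
                    on_progress(done, total)
```

A caller could pass something like `lambda done, total: print(f"{done / total:.1%}", end="\r")` to show a running percentage while a large archive is unpacked.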

nbenn commented on June 6, 2024

@patrickthoral Thanks a lot for looking into this so swiftly. I'm happy to check it out. Just to clarify, did you bump the version number? If I have a file AmsterdamUMCdb-v1.0.2.zip for download, does that mean the new file has not propagated through?

patrickthoral commented on June 6, 2024

The version number stays the same (the data has not changed at all), but the new file should be available from DANS as of now.
