Hi all,
I was facing problems extracting the dataset available on https://zenodo.org/record/2620293. @MrtnMndt helped me resolve it. I will describe the problem here followed by the solution that worked.
The md5 checksums of the downloaded zip files are correct, but upon extraction I ran into the following problems:
- With `unzip CODEBRIM_classification_balanced_dataset.zip`, I got:
Archive: CODEBRIM_classification_balanced_dataset.zip
warning [CODEBRIM_classification_balanced_dataset.zip]: 8589934592 extra bytes at beginning or within zipfile
(attempting to process anyway)
file #1: bad zipfile offset (local header sig): 8589934592
(attempting to re-compensate)
creating: classification_dataset_balanced/
error: invalid zip file with overlapped components (possible zip bomb)
- With `7z x CODEBRIM_classification_balanced_dataset.zip`, I got:
...
Extracting classification_dataset_balanced/val/defects/image_0001304_crop_0000004.png
Extracting classification_dataset_balanced/val/defects/image_0001129_crop_0000002.png
Extracting classification_dataset_balanced/val/defects/image_0001126_crop_0000005.png
Extracting __MACOSX/classification_dataset_balanced/val/defects/._image_0001126_crop_0000005.png
Extracting classification_dataset_balanced/val/defects/image_0000300_crop_0000001.png
Sub items Errors: 7575
A closer look at the output showed the following for some files:
...
Extracting classification_dataset_balanced/train/background/image_0000531_crop_0000005.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0000324_crop_0000003.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0001311_crop_0000001.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0000334_crop_0000005.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0000030_crop_0000005.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0000233_crop_0000005.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0000020_crop_0000003.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0000429_crop_0000004.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0000969_crop_0000001.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0000439_crop_0000002.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0000001_crop_0000002.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0000011_crop_0000004.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0000202_crop_0000002.png Unsupported Method
Extracting classification_dataset_balanced/train/background/image_0000399_crop_0000003.png Unsupported Method
...
- With `file-roller`, I could extract only part of the data (~3 GB), and some files were not readable.
![pastedImage](https://user-images.githubusercontent.com/6649096/160586837-69539539-f5f0-45fd-81bd-c9d8dbb41642.png)
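To find out which extracted images are unreadable after a partial extraction, one option is to check each file for the 8-byte PNG signature. The following is only a rough sketch: the helper names `is_png` and `list_bad_pngs` are made up for illustration, and a file that passes this check can still be truncated further in.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any tool mentioned in this thread.

# Succeeds when the first 8 bytes of file $1 are the PNG signature
# (89 50 4E 47 0D 0A 1A 0A).
is_png() {
    sig=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "89504e470d0a1a0a" ]
}

# Prints every *.png under directory $1 that lacks the PNG signature.
# The __MACOSX folder is skipped: its "._*" entries are macOS metadata,
# not images.
list_bad_pngs() {
    find "$1" -type f -name '*.png' ! -path '*/__MACOSX/*' |
    while IFS= read -r f; do
        is_png "$f" || printf '%s\n' "$f"
    done
}
```

For example, `list_bad_pngs classification_dataset_balanced` would print the paths of the files that do not start with a PNG header.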
@MrtnMndt pointed out that my operating system probably does not handle large zip files (beyond 4 GB) automatically. He tried it on his Mac and on Windows, where the extraction worked fine, and suggested I use a different extraction tool, following https://askubuntu.com/questions/959256/cant-extract-a-large-zip-file.
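As a quick sanity check of the 4 GB explanation, the sketch below compares a byte count against 2^32, the largest value that fits in the 32-bit offset fields of a classic (non-ZIP64) zip header. The function names are hypothetical. The last line shows that the 8589934592 bytes reported by `unzip` are an exact multiple of 2^32, which would be consistent with offsets wrapping around, though that interpretation is an assumption on my part.

```shell
#!/bin/sh
# 2^32 bytes (4 GiB): classic zip headers store offsets in 32 bits.
ZIP32_LIMIT=4294967296

# Succeeds when the byte count $1 is at or beyond the 32-bit limit.
size_exceeds_zip32() {
    [ "$1" -ge "$ZIP32_LIMIT" ]
}

# Prints the size of file $1 in bytes.
file_size_bytes() {
    wc -c < "$1" | tr -d ' '
}

# The "extra bytes" figure from the unzip warning, divided by 2^32.
echo $((8589934592 / ZIP32_LIMIT))   # prints 2
```

Usage: `size_exceeds_zip32 "$(file_size_bytes CODEBRIM_classification_balanced_dataset.zip)" && echo "archive is beyond the 32-bit limit"`.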
The solution that worked:
sudo apt-get install fastjar
jar xvf CODEBRIM_classification_balanced_dataset.zip
✔️ This is also the accepted solution on: https://unix.stackexchange.com/questions/438368/unix-unzip-is-failing-but-mac-archive-utility-works
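For anyone scripting this, here is a minimal sketch that checks the md5 first and only then extracts with `jar`. The function names and the expected-digest argument are placeholders; the real checksum should be taken from the Zenodo record page, and `md5sum` is assumed to be available (it is on a stock Ubuntu).

```shell
#!/bin/sh
# Hypothetical wrappers around md5sum and jar (from fastjar).

# Succeeds when the md5 of file $1 equals the expected hex digest $2.
verify_md5() {
    actual=$(md5sum "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}

# Extracts archive $1 into the current directory, but only if its md5
# matches the expected digest $2.
extract_verified() {
    if verify_md5 "$1" "$2"; then
        jar xvf "$1"
    else
        echo "md5 mismatch for $1, not extracting" >&2
        return 1
    fi
}
```

It would be called as `extract_verified CODEBRIM_classification_balanced_dataset.zip <md5-from-zenodo>`, with the digest filled in by hand.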
❌ We also tried the following, which did not work:
sudo apt-get install dtrx
dtrx CODEBRIM_classification_balanced_dataset.zip