ReZipDoc

An uncompressed repacker and diff visualizer for ZIP based files stored in git repos.

Most git repos hosting Open Source Hardware should use ReZipDoc.

What is this?

git does not handle binary files well. They make the repo grow quickly in size (see delta compression), and when you try to see what changed in a commit, you only get this:

Binary files A and B differ!

... not very useful!

ReZipDoc solves both of these issues, though only for ZIP based files, which includes for example FreeCAD and LibreOffice files.

NOTE It does not work for all binary files!

HINT If you are unsure whether a file format is ZIP based, just try to look at it with a tool that can peek into ZIP files.
On Linux or OSX: unzip -l someFile.xyz
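
If no ZIP tool is at hand, the check can be approximated by looking at the first four bytes of the file. The following is a minimal sketch, not part of ReZipDoc; the class and method names are made up for illustration. It only recognizes the standard ZIP signatures at the very start of the file, so archives with prepended data (for example self-extracting ones) are reported as non-ZIP even though unzip -l may still read them.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of ReZipDoc.
class ZipSniffer {

    /** Returns true if the given leading bytes start with a ZIP signature. */
    static boolean looksLikeZip(byte[] head) {
        if (head.length < 4 || head[0] != 'P' || head[1] != 'K') {
            return false;
        }
        // "PK\3\4" starts a local file header, "PK\5\6" marks an empty archive
        return (head[2] == 3 && head[3] == 4) || (head[2] == 5 && head[3] == 6);
    }

    /** Reads the first four bytes of the stream and checks them. */
    static boolean looksLikeZip(InputStream in) throws IOException {
        byte[] head = new byte[4];
        try {
            new DataInputStream(in).readFully(head);
        } catch (EOFException tooShort) {
            return false; // fewer than four bytes can not be a ZIP file
        }
        return looksLikeZip(head);
    }

    public static void main(String[] args) throws IOException {
        for (String file : args) {
            try (InputStream in = Files.newInputStream(Paths.get(file))) {
                String verdict = looksLikeZip(in) ? "looks ZIP based" : "does not look ZIP based";
                System.out.println(file + ": " + verdict);
            }
        }
    }
}
```

Saved as ZipSniffer.java, it can be compiled with javac and run as: java ZipSniffer someFile.xyz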

So if you are storing ZIP based files in your git repo, you probably want to use ReZipDoc.

Project state

This repo contains a heavily revised, refined version of ReZip (and ZipDoc), plus unit tests and helper scripts, which were not available in the original.

How to use

If your git repo makes heavy use of ZIP based files, then you probably want to use ReZipDoc in one of these three ways:

  • install ZipDoc diff viewer - This allows you to see changes within your ZIP based files in a human-readable way when looking at git history. It changes neither your past nor your future git history.

    To use this, install with --diff only.

  • install ReZip filter - This will change your repo's future git history, storing ZIP based files without compression.

    To use this, install with --commit --diff --renormalize.

  • install ReZip filter & filter repo - This changes both the past (<- Caution!) and future history of your repo.

    To use this, create a copy of the repo with filtered history.

Installation

The filter and diff tool require Java 8 or newer.

The helper scripts - which are mostly used for installing the filter - require a POSIX (~= Unix) environment. This is the case on OSX, Linux, BSD, Unix and even Windows, if git is installed.

The recommended procedure is to install the helper scripts once, and then use them to comfortably install the filter into local git repos.

NOTE
This downloads an online script and executes it on your machine, which is a potential security risk. You may want to inspect the script before running it.

Install helper scripts

NOTE
This has to be done once per developer machine.

They get installed into ~/bin/, and if the directory did not exist before, it will get added to PATH.

To install:

curl --silent --location \
  https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-scripts-tool.sh \
  | sh -s install --path

To update (to latest development version):

curl --silent --location \
  https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-scripts-tool.sh \
  | sh -s update --dev

To remove:

curl --silent --location \
  https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-scripts-tool.sh \
  | sh -s remove

Install diff viewer or filter

NOTE
This has to be done once per repo.

This installs the latest release of ReZipDoc into your local git repo.

Make sure you already have installed the helper scripts on your machine.

Switch to the local git repo you want to install this filter to, for example:

cd ~/src/myRepo/

As explained in How to use, you now want to use one of the following:

  1. Install the diff viewer

    rezipdoc-repo-tool.sh install --diff
  2. Install the filter

    rezipdoc-repo-tool.sh install --commit --renormalize
  3. Filter the history & install the filter

    If you filter the repo history, the freshly created, filtered repo will already have the filter installed as above.

To uninstall the diff viewer and/or filter, run:

rezipdoc-repo-tool.sh remove

Install filter manually

Only use this if you cannot use the above for some reason.

  1. Build the JAR

    Run this in bash:

    cd
    mkdir -p src
    cd src
    git clone git@github.com:hoijui/ReZipDoc.git
    cd ReZipDoc
    mvn package
    echo "Created ReZipDoc binary:"
    ls -1 $PWD/target/rezipdoc-*.jar
  2. Install the JAR

    Store rezipdoc-*.jar somewhere locally, either:

    • (global) in your home directory, for example under ~/bin/
    • (repo - tracked) in your repository, tracked, for example under /tools/
    • (repo - local) recommended in your repository, locally only, under /.git/
  3. Install the Filter(s)

    Execute these lines:

    # Install the add/commit filter
    git config --replace-all filter.reZip.clean "java -cp .git/rezipdoc-*.jar io.github.hoijui.rezipdoc.ReZip --uncompressed"
    
    # (optionally) Install the checkout filter
    git config --replace-all filter.reZip.smudge "java -cp .git/rezipdoc-*.jar io.github.hoijui.rezipdoc.ReZip --compressed"
    
    # (optionally) Install the diff filter
    git config --replace-all diff.zipDoc.textconv "java -cp .git/rezipdoc-*.jar io.github.hoijui.rezipdoc.ZipDoc"
  4. Enable the filters

    In one of these files:

    • (global) ${HOME}/.gitattributes
    • (repo - tracked) /.gitattributes
    • (repo - local) recommended /.git/info/attributes

    Assign attributes to paths:

    # This forces git to treat files as if they were text-based (for example in diffs)
    [attr]textual     diff merge text
    # This makes git re-zip ZIP files uncompressed on commit
    # NOTE See the ReZipDoc README for how to install the required git filter
    [attr]reZip       textual filter=reZip
    # This makes git visualize ZIP files as uncompressed text with some meta info
    # NOTE See the ReZipDoc README for how to install the required git filter
    [attr]zipDoc      textual diff=zipDoc
    # This combines in-history decompression and uncompressed view of ZIP files
    [attr]reZipDoc    reZip zipDoc
    
    # MS Office
    *.docx   reZipDoc
    *.xlsx   reZipDoc
    *.pptx   reZipDoc
    # OpenOffice
    *.odt    reZipDoc
    *.ods    reZipDoc
    *.odp    reZipDoc
    # Misc
    *.mcdx   reZipDoc
    *.slx    reZipDoc
    # Archives
    *.zip    reZipDoc
    # Java archives
    *.jar    reZipDoc
    # FreeCAD files
    *.fcstd  reZipDoc
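
For background on the diff.zipDoc.textconv setting from step 3: git runs the configured program with the path of the file to convert as its single argument, and diffs whatever text the program prints to stdout. The following is a minimal sketch of such a program, assuming a simple output format. It is not the actual ZipDoc implementation; the class name, the "===" header lines and the "no NUL byte means text" heuristic are illustrative assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical textconv program, not the real ZipDoc.
class ZipTextconvSketch {

    /** Heuristic assumption: content without NUL bytes is treated as text. */
    static boolean looksLikeText(byte[] data) {
        for (byte b : data) {
            if (b == 0) {
                return false;
            }
        }
        return true;
    }

    /** Prints a header line per archive entry, followed by its text or a size note. */
    static void dump(InputStream in, PrintStream out) throws IOException {
        byte[] buffer = new byte[8192];
        try (ZipInputStream zipIn = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zipIn.getNextEntry()) != null) {
                out.println("=== " + entry.getName());
                // Read the whole (decompressed) entry into memory
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zipIn.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                byte[] data = content.toByteArray();
                if (looksLikeText(data)) {
                    out.println(new String(data, StandardCharsets.UTF_8));
                } else {
                    out.println("(binary content, " + data.length + " bytes)");
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // git passes the file to convert as the only argument
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            dump(in, System.out);
        }
    }
}
```

Because the output is plain text with one header per entry, git diff can show which archive member changed and how, instead of reporting that the binary files differ.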

Filter repo history

This always creates a new copy of the repository.

NOTE
This only filters a single branch.

Make sure you have the helper scripts installed and in your PATH.

This filters the master branch of the repo at ~/src/myRepo into a new local repo ~/src/myRepo_filtered, using the original commit messages, authors and dates:

rezipdoc-history-filter.sh \
	--source ~/src/myRepo \
	--branch master \
	--orig \
	--target ~/src/myRepo_filtered

It also works with an online source:

rezipdoc-history-filter.sh \
	--source "https://github.com/case06/ZACplus.git" \
	--branch master \
	--orig \
	--target /tmp/ZACplus_filtered

After doing this, the new, filtered repo will already have the filter installed, so future commits will be filtered.

Filtering example

We are going to run a script that filters the repo of the Zinc-Oxide Open Hardware battery (ZAC+) project. The script has a header comment explaining in detail what it does.

In short, it downloads ReZipDoc helper scripts to ~/bin, adds that dir to PATH if it is not there yet, creates temporary git repos in /tmp/, and generates some command-line output.

Run it like this:

curl --silent --location \
  https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-sample-filter-session.sh \
  | sh

Caveats

As described in gitattributes, you may see unnecessary merge conflicts when you add attributes to a file that cause the repository format for that file to change. To prevent this, Git can be told to run a virtual check-out and check-in of all three stages of a file when resolving a three-way merge:

git config --add --bool merge.renormalize true

Motivation

Many popular applications, such as Microsoft Office and Libre/Open Office, save their documents as XML in compressed ZIP containers. Small changes to these documents' contents may result in big changes to their compressed binary container file. When compressed files are stored in a Git repository, these big differences make delta compression inefficient or impossible, and the repository size is roughly the sum of its revisions.

This small program acts as a Git clean filter driver. It reads a ZIP file from stdin and outputs the same ZIP content to stdout, but without compression.

pros
  • human-readable, plain-text diffs of ZIP based archives (if they contain plain-text files)
  • smaller overall repository size if the archive contents change frequently
cons
  • slower git add/git commit process
  • slower checkout process, if the smudge filter is used
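
The claim that a small content change can cause a big change in the compressed container can be tried out with the following sketch. It is an illustration only, not part of ReZipDoc: it flips a single bit in a generated XML-like text and counts how many byte positions differ before and after DEFLATE compression. The class name and the sample text are made up, and the positional byte count is only a crude stand-in for what git's delta compression actually does.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical demo, not part of ReZipDoc.
class CompressionDiffDemo {

    /** Compresses data with the default Deflater settings. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Counts the positions at which two arrays differ, plus any length difference. */
    static int differingBytes(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        int count = Math.abs(a.length - b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Generate some XML-like sample text (made-up content)
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            doc.append("<row id=\"").append(i).append("\">value ").append(i * 7).append("</row>\n");
        }
        byte[] before = doc.toString().getBytes(StandardCharsets.UTF_8);
        byte[] after = before.clone();
        after[after.length / 2] ^= 1; // flip one bit in the middle of the document

        System.out.println("differing bytes, uncompressed: " + differingBytes(before, after));
        System.out.println("differing bytes, compressed:   "
                + differingBytes(deflate(before), deflate(after)));
    }
}
```

The uncompressed versions differ in exactly one byte. How many compressed bytes differ depends on the input and the zlib version, so no particular number is claimed here.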

How it works

When adding/committing a ZIP based file, ReZip unpacks it and repacks it without compression, before adding it to the index/commit. In an uncompressed ZIP file, the archived files appear as-is in its content (together with some binary meta-info before each file). If those archived files are plain-text files, this method will play nicely with git.
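
The repacking step described above can be sketched with the standard java.util.zip classes. This is a simplified illustration under stated assumptions, not the actual ReZip code: the class and method names are made up, each entry is buffered fully in memory, and only the entry name and timestamp are carried over. The buffering is needed because ZipOutputStream requires the size and CRC-32 of a STORED entry to be set before the entry is written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch, not the real ReZip implementation.
class StoredRepackSketch {

    /** Copies every entry of the ZIP read from in to out, using the STORED method. */
    static void repackStored(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        try (ZipInputStream zipIn = new ZipInputStream(in);
             ZipOutputStream zipOut = new ZipOutputStream(out)) {
            ZipEntry source;
            while ((source = zipIn.getNextEntry()) != null) {
                // Read the decompressed content, so size and CRC are known up front
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zipIn.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                byte[] data = content.toByteArray();
                CRC32 crc = new CRC32();
                crc.update(data);

                ZipEntry target = new ZipEntry(source.getName());
                target.setMethod(ZipEntry.STORED);
                target.setSize(data.length);
                target.setCompressedSize(data.length);
                target.setCrc(crc.getValue());
                if (source.getTime() != -1) {
                    // Keep the original timestamp, so equal input gives equal output
                    target.setTime(source.getTime());
                }
                zipOut.putNextEntry(target);
                zipOut.write(data);
                zipOut.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Used like a git clean filter: ZIP in on stdin, uncompressed ZIP out on stdout
        repackStored(System.in, System.out);
    }
}
```

After repacking, the content of a plain-text entry such as content.xml appears verbatim in the archive bytes, which is what lets git's own compression and diffing work on it.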

Benefits

The main benefit of ReZip over Zippey is that the actual file stored in the repository is still a ZIP file. Thus, in many cases, it will still work as-is with the respective application (for example Open Office), even if it is obtained without going through the re-packing-with-compression smudge filter, for example when downloading the file through a web interface instead of checking it out with git.

Observations

The following are based on my experience in real-world cases. Use at your own risk. Your mileage may vary.

Simulink

  • One packed repository with ReZip was 54% of the size of the packed repository storing compressed ZIPs.
  • Another repository with 280 *.slx files and over 3000 commits was originally 281 MB and was reduced to 156 MB using this technique (55% of baseline).

MS PowerPoint

I found that the loose objects stored without this filter were about 5% smaller than the original file size (zlib on top of ZIP compression). When using the ReZip filter, the loose objects were about 10% smaller than the original files, since zlib could work more efficiently on uncompressed data. The packed repository with ReZip was only 10% smaller than the packed repository storing compressed ZIPs. I think this unremarkable efficiency improvement is due to a large number of *.png files in the presentation, which were already stored without compression in the original *.pptx.

Based on

  • ReZip For more efficient Git packing of ZIP based files
  • ZipDoc A Git textconv program to show text-based diffs of ZIP files

Similar Projects

  • png-inflate Does the same uncompressed repack for PNG image files

rezipdoc's Issues

Error: java.util.zip.ZipException: invalid entry CRC

Hi, I manually installed the filters as per [1] in an empty git repository. Then I created a simple .odt file with just a single line, committed it, made a small modification and tried to see the diff. Here is the error I got:

$ git diff
warning: CRLF will be replaced by LF in hello.odt.
The file will have its original line endings in your working directory
Exception in thread "main" java.util.zip.ZipException: invalid entry CRC (expected 0xecf40a9d but got 0x716ecc7d)
        at java.base/java.util.zip.ZipInputStream.read(ZipInputStream.java:224)
        at java.base/java.io.FilterInputStream.read(FilterInputStream.java:107)
        at io.github.hoijui.rezipdoc.Utils.transferTo(Utils.java:348)
        at io.github.hoijui.rezipdoc.ReZip.reZip(ReZip.java:248)
        at io.github.hoijui.rezipdoc.ReZip.reZip(ReZip.java:229)
        at io.github.hoijui.rezipdoc.ReZip.reZip(ReZip.java:201)
        at io.github.hoijui.rezipdoc.ReZip.main(ReZip.java:188)
error: external filter 'java -cp .git/rezipdoc-*.jar io.github.hoijui.rezipdoc.ReZip --compressed' failed 1
error: external filter 'java -cp .git/rezipdoc-*.jar io.github.hoijui.rezipdoc.ReZip --compressed' failed
Exception in thread "main" java.util.zip.ZipException: invalid entry CRC (expected 0xecf40a9d but got 0x716ecc7d)
        at java.base/java.util.zip.ZipInputStream.read(ZipInputStream.java:224)
        at java.base/java.io.FilterInputStream.read(FilterInputStream.java:107)
        at io.github.hoijui.rezipdoc.Utils.transferTo(Utils.java:348)
        at io.github.hoijui.rezipdoc.ZipDoc.transform(ZipDoc.java:153)
        at io.github.hoijui.rezipdoc.ZipDoc.transform(ZipDoc.java:125)
        at io.github.hoijui.rezipdoc.ZipDoc.main(ZipDoc.java:112)
fatal: unable to read files to diff

Java version info:

$ java --version 
openjdk 11.0.11 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-post-Debian-1)
OpenJDK 64-Bit Server VM (build 11.0.11+9-post-Debian-1, mixed mode, sharing)

System info:

$ uname -a
Linux [...] 5.10.0-6-amd64 #1 SMP Debian 5.10.28-1 (2021-04-09) x86_64 GNU/Linux

$ lsb_release -a
Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

Build Error: Failed test io.github.hoijui.rezipdoc.BinaryUtilTest

Hi, when trying to manually build as per the instructions [1] I get the test error below.

$ cd /tmp/ReZipDoc
$ mvn package
[...]
[ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.012 s <<< FAILURE! - in io.github.hoijui.rezipdoc.BinaryUtilTest
[ERROR] testManifestProperties(io.github.hoijui.rezipdoc.BinaryUtilTest)  Time elapsed: 0.009 s  <<< FAILURE!
org.junit.ComparisonFailure: 
expected:<                [Manifest-Version -> "1.0"
              Bundle-Description -> "A Git filter and textconv for converting ZIP based binary files to an uncompressed version of themselves, which works better with gits delta-compression and diffs"
                  Import-Package -> "javax.xml.namespace,javax.xml.parsers,javax.xml.transform,javax.xml.transform.dom,javax.xml.transform.stream,javax.xml.xpath,org.w3c.dom,org.xml.sax"
                     Bundle-Name -> "ReZipDoc"
                  Bundle-License -> "http://www.gnu.org/licenses/gpl-3.0.html"
                  Export-Package -> "io.github.hoijui.rezipdoc;uses:="javax.xml.parsers,javax.xml.transform,javax.xml.xpath,org.xml.sax";version="0.5.0""
                             -By -> "hoijui"
                      Created-By -> "Apache Maven Bundle Plugin"
                  Bundle-Version -> "0.5.0.SNAPSHOT"
          Bundle-ManifestVersion -> "2"
                Bnd-LastModified -> "1584876314307"
              Require-Capability -> "osgi.ee;filter:="(&(osgi.ee=JavaSE)(version=1.8))""
                            Tool -> "Bnd-2.4.1.201501161923"
                       Build-Jdk -> "1.8.0_151"
             Bundle-SymbolicName -> "io.github.hoijui.rezipdoc]"
> but was:<                [  Bundle-License -> "http://www.gnu.org/licenses/gpl-3.0.html"
                             -By -> "hoijui"
                Manifest-Version -> "1.0"
                      Created-By -> "Apache Maven Bundle Plugin"
                Bnd-LastModified -> "1584876314307"
                     Bundle-Name -> "ReZipDoc"
                       Build-Jdk -> "1.8.0_151"
              Bundle-Description -> "A Git filter and textconv for converting ZIP based binary files to an uncompressed version of themselves, which works better with gits delta-compression and diffs"
                  Import-Package -> "javax.xml.namespace,javax.xml.parsers,javax.xml.transform,javax.xml.transform.dom,javax.xml.transform.stream,javax.xml.xpath,org.w3c.dom,org.xml.sax"
                  Export-Package -> "io.github.hoijui.rezipdoc;uses:="javax.xml.parsers,javax.xml.transform,javax.xml.xpath,org.xml.sax";version="0.5.0""
          Bundle-ManifestVersion -> "2"
             Bundle-SymbolicName -> "io.github.hoijui.rezipdoc"
                  Bundle-Version -> "0.5.0.SNAPSHOT"
              Require-Capability -> "osgi.ee;filter:="(&(osgi.ee=JavaSE)(version=1.8))""
                            Tool -> "Bnd-2.4.1.201501161923]"
>
	at io.github.hoijui.rezipdoc.BinaryUtilTest.testManifestProperties(BinaryUtilTest.java:108)
[...]

Maven version info:

$ mvn --version
Apache Maven 3.6.3
Maven home: /usr/share/maven
Java version: 11.0.11, vendor: Debian, runtime: /usr/lib/jvm/java-11-openjdk-amd64
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "5.10.0-6-amd64", arch: "amd64", family: "unix"

Java version info:

$ java --version 
openjdk 11.0.11 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-post-Debian-1)
OpenJDK 64-Bit Server VM (build 11.0.11+9-post-Debian-1, mixed mode, sharing)

System info:

$ uname -a
Linux [...] 5.10.0-6-amd64 #1 SMP Debian 5.10.28-1 (2021-04-09) x86_64 GNU/Linux

$ lsb_release -a
Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

Extend to be applicable to arbitrary archive formats, not just ZIP

This would include *.tar.gz, *.tar.bz2, and any other formats that are free and have free Java libraries available.

A likely candidate for a library to use would be Apache Commons Compress, which supports:

  • compressor algorithms: bzip2, Pack200, XZ, gzip, lzma, brotli, Zstandard, Z
  • archiving algorithms: ar, arj, cpio, dump, tar, 7z and zip

Not able to install scripts.

When I try to install scripts it fails with the following:

❯ curl --silent --location  https://raw.githubusercontent.com/hoijui/ReZipDoc/master/scripts/rezipdoc-scripts-tool.sh | sh -s install --path
sh action: installing (version: ) ...
sh: 195: return: Illegal number: 1:was

Ubuntu 20.04
