octodiff's Introduction

Octodiff is a 100% managed implementation of remote delta compression. Usage is inspired by rdiff, and like rdiff the algorithm is based on rsync. Octodiff was designed to be used in Octopus Deploy, an automated deployment tool for .NET developers.

Octodiff can make deltas of files by comparing a remote file with a local file, without needing both files to exist on the same machine. It does this in three phases:

  1. Machine A reads the basis file, and computes a signature of all the chunks in the file
  2. Machine B uses the signature and the new file to produce a delta file, specifying what changes need to be made
  3. Machine A applies the delta file to the basis file, which produces an exact copy of the new file

Of course, the benefit is that instead of transferring the entire new file to machine A, we just transfer a small signature, and a delta file containing the differences. We're trading off CPU usage for potentially large bandwidth savings.

Octodiff is an executable, but can also be referenced and used as any .NET assembly.

Signatures

Usage: Octodiff signature <basis-file> [<signature-file>] [<options>]

Arguments:

      basis-file             The file to read and create a signature from.
      signature-file         The file to write the signature to.

Options:

      --chunk-size=VALUE     Maximum bytes per chunk. Defaults to 2048. Min of
                             128, max of 31744.
      --progress             Whether progress should be written to stdout

Example:

octodiff signature MyApp.1.0.nupkg MyApp.1.0.nupkg.octosig --progress

This command calculates the signature of a given file. As per the rsync algorithm, the signature is calculated by reading the file into fixed-size chunks, and then calculating a signature of each chunk. The resulting signature file contains:

  • Metadata about the signature file and algorithms used
  • Hash of the file that the signature was created from (basis file hash)
  • A list of chunk signatures, each 26 bytes long, consisting of:
    • The length of the chunk (short)
    • Rolling checksum (uint) - calculated using Adler32
    • Hash (20 bytes) - calculated using SHA1

Given that the default chunk size is 2048 bytes, and this is turned into a 26 byte signature, the resulting file is about 1.3% of the size of the original. For example, a 306 MB file creates a 3.9 MB signature file. The signature of a 300 MB file can be calculated in ~3 seconds using ~6 MB of memory on a 2013 MacBook Pro. Memory usage during signature calculation should remain constant no matter the size of the file.
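The size arithmetic above can be checked with a short sketch. The class and method names are hypothetical and not part of Octodiff; the estimate ignores the metadata and basis file hash, and assumes 1 MB means 1024 × 1024 bytes.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the Octodiff library.
public static class SignatureSizeSketch
{
    // 2 bytes chunk length + 4 bytes rolling checksum + 20 bytes SHA1 hash.
    public const int BytesPerChunkSignature = 2 + 4 + 20;

    // Estimates the size of the chunk signature list. The metadata and the
    // basis file hash add a small overhead that is ignored here.
    public static long EstimateBytes(long basisFileLength, int chunkSize = 2048)
    {
        if (chunkSize < 128 || chunkSize > 31744)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));

        // Round up so that a partial final chunk still gets a signature.
        var chunks = (basisFileLength + chunkSize - 1) / chunkSize;
        return chunks * BytesPerChunkSignature;
    }
}
```

For a 306 MB basis file, SignatureSizeSketch.EstimateBytes(306L * 1024 * 1024) gives 4,073,472 bytes, or roughly 3.9 MB, which agrees with the figure quoted above.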

Deltas

Usage: Octodiff delta <signature-file> <new-file> [<delta-file>] [<options>]

Arguments:

      signature-file         The file containing the signature from the basis
                             file.
      new-file               The file to create the delta from.
      delta-file             The file to write the delta to.

Options:

      --progress             Whether progress should be written to stdout

Example:

octodiff delta MyApp.1.0.nupkg.octosig MyApp.1.1.nupkg MyApp.1.0_to_1.1.octodelta --progress

This command creates a delta that specifies how the basis-file (using just the information in its signature file) can be turned into the new-file. First, the signature file is read into memory. Then we scan the new file, looking for chunks that we find in the signature. You can learn more about the process in the description of the rsync algorithm.
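Scanning the new file byte by byte is only practical because the weak checksum can be "rolled": when the window slides one byte, the checksum is updated in constant time instead of being recomputed. The following is a minimal sketch of an rsync-style rolling checksum with both halves kept modulo 65536. It is an illustration under those assumptions, not Octodiff's actual implementation, and the names are hypothetical.

```csharp
using System;

// Illustrative rsync-style rolling checksum; not Octodiff's own code.
public static class RollingChecksumSketch
{
    // Computes the checksum of block[offset .. offset + length) from scratch.
    // a = sum of the bytes, b = position-weighted sum, both modulo 65536.
    public static uint Calculate(byte[] block, int offset, int length)
    {
        ushort a = 0, b = 0;
        for (var i = 0; i < length; i++)
        {
            var value = block[offset + i];
            a = unchecked((ushort)(a + value));
            b = unchecked((ushort)(b + (length - i) * value));
        }
        return ((uint)b << 16) | (uint)a;
    }

    // Slides the window one byte to the right in constant time:
    // a' = a - removed + added, b' = b - length * removed + a'.
    public static uint Rotate(uint checksum, byte removed, byte added, int length)
    {
        var a = unchecked((ushort)((checksum & 0xFFFF) - removed + added));
        var b = unchecked((ushort)((checksum >> 16) - (uint)(length * removed) + a));
        return ((uint)b << 16) | (uint)a;
    }
}
```

When a rolled checksum matches one in the signature, the stronger SHA1 hash of the window is compared to confirm the match before a copy instruction is emitted.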

The delta file contains:

  • Metadata about the signature file and algorithms used
  • Hash of the file that the original signature was created from (basis file hash)
  • A series of instructions to re-create the new-file which reference the basis file.

Instructions are either copy commands (read offset X, length Y from the basis file) or data commands (add this data). Example:

  1. Copy 0x0000 to 0x8C00
  2. Data: 5C 9F D9 C7...
  3. Copy 0x8C31 to 0x93C0

The delta file uses a binary file format to keep encoding overhead to a minimum - copy instructions start with 0x60 and then the start offset and length; data commands are 0x80 followed by the length of the data and then the data to copy.

For debugging, you can use the following command to print an explanation of what is in a given delta file:

octodiff explain-delta MyApp.1.0_to_1.1.octodelta

Patching

Usage: Octodiff patch <basis-file> <delta-file> <new-file> [<options>]

Arguments:

      basis-file             The file that the delta was created for.
      delta-file             The delta to apply to the basis file
      new-file               The file to write the result to.

Options:

      --progress             Whether progress should be written to stdout
      --skip-verification    Skip checking whether the basis file is the same
                             as the file used to produce the signature that
                             created the delta.

Example:

octodiff patch MyApp.1.0.nupkg MyApp.1.0_to_1.1.octodelta MyApp.1.1.nupkg --progress

This command recreates the new-file using simply the basis-file and the delta-file.

Applying the delta is the easiest part of the process. We simply open the delta-file, and follow the instructions. When there's a copy instruction, we seek to that offset in the basis-file and copy until we hit the length. When we encounter a data instruction, we append that data. At the end of the process, we have the new-file.

Octodiff embeds a SHA1 hash of the new-file in the delta-file. After patching, Octodiff compares this hash to the SHA1 hash of the resulting patched file. If they don't match, Octodiff returns a non-zero exit code.
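The patching loop and the final hash check can be sketched as follows. The Instruction type is a hypothetical in-memory stand-in for the records Octodiff reads from its binary delta format, and the class is an illustration rather than the library's DeltaApplier.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Illustrative sketch of applying copy/data instructions; not Octodiff's own code.
public static class PatchSketch
{
    // Hypothetical instruction: a data instruction when Data is set,
    // otherwise a copy of CopyLength bytes starting at CopyOffset in the basis.
    public sealed class Instruction
    {
        public long CopyOffset;
        public long CopyLength;
        public byte[] Data;
    }

    public static void Apply(Stream basis, Instruction[] instructions, Stream output)
    {
        var buffer = new byte[4096];
        foreach (var instruction in instructions)
        {
            if (instruction.Data != null)
            {
                // Data instruction: append the literal bytes.
                output.Write(instruction.Data, 0, instruction.Data.Length);
                continue;
            }

            // Copy instruction: seek in the basis and copy until the length is reached.
            basis.Seek(instruction.CopyOffset, SeekOrigin.Begin);
            var remaining = instruction.CopyLength;
            while (remaining > 0)
            {
                var read = basis.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read == 0) throw new EndOfStreamException("Basis is shorter than the delta expects.");
                output.Write(buffer, 0, read);
                remaining -= read;
            }
        }
    }

    // Compares the SHA1 hash of the patched stream with the expected hash.
    public static bool MatchesExpectedHash(Stream patched, byte[] expectedSha1)
    {
        patched.Seek(0, SeekOrigin.Begin);
        using (var sha1 = SHA1.Create())
        {
            return sha1.ComputeHash(patched).SequenceEqual(expectedSha1);
        }
    }
}
```

Because copy instructions seek within the basis, the basis stream must be seekable; the output stream only needs to be seekable for the hash check.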

Performance

The following section isn't meant to be mathematically accurate, but to give you a rough idea of the real-world performance to expect from Octodiff. The tests were done on a Windows 8 VM, running on a 2013 MacBook Pro, with 4 cores and 8 GB of memory assigned to the VM. The machine uses an SSD, which means the I/O-bound tasks could run significantly slower on non-SSD drives. All measurements were taken using only Windows Task Manager.

Signature creation is relatively easy - we're reading the file in fixed-size chunks and computing a checksum. Memory usage should be constant no matter how big the file is - around 8.2 MB.

  • The signature for an 85 MB file can be calculated in ~832 ms
  • The signature for a 4.4 GB file can be calculated in ~36 seconds

We also compute a SHA1 hash of the entire basis file (this takes around 1/3 of the total time above). The resulting signature file size is always ~1.3% of the basis file size.

Delta creation is the most CPU and memory-intensive aspect of the whole process. First, we assume that we can fit all signatures into memory, which means we'll consume at least ~1.3% of the basis file size in memory, plus extra to store a dictionary of the chunks and buffers as we read data. Budget for about 5x the signature file size in memory (e.g., for a 57 MB signature file (a 4.4 GB basis file), expect to use 250 MB of memory).

  • Delta from an 85 MB file took 5 seconds
  • Delta from a 4.3 GB ISO took 170 seconds

Delta creation takes roughly the same amount of time whether there are many differences or none at all. If there are many differences, the resulting delta file will be much larger, so additional I/O producing it may have an impact.

Patching is the fastest part of the algorithm.

Output and exit codes

If all goes well, Octodiff produces no output. You can use the --progress switch to write progress messages to stdout.

Octodiff uses the following exit codes:

  • 0 - success
  • 1 - environmental problems
  • 2 - corrupt signature or delta file
  • 3 - internal error or unhandled situation
  • 4 - usage problem (you did something wrong, maybe passing the wrong file)
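When calling the executable from another .NET program, the exit codes listed above can be turned into readable messages. This is a small sketch: the class name is hypothetical, and Run assumes that an executable named octodiff can be found on the PATH.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper for illustration; not part of the Octodiff library.
public static class OctodiffExitCodes
{
    // Maps the documented exit codes to their descriptions.
    public static string Describe(int exitCode)
    {
        switch (exitCode)
        {
            case 0: return "success";
            case 1: return "environmental problems";
            case 2: return "corrupt signature or delta file";
            case 3: return "internal error or unhandled situation";
            case 4: return "usage problem";
            default: return "unknown exit code";
        }
    }

    // Runs the command-line tool and returns its exit code.
    // Assumption: "octodiff" resolves to the Octodiff executable on this machine.
    public static int Run(string arguments)
    {
        var startInfo = new ProcessStartInfo("octodiff", arguments) { UseShellExecute = false };
        using (var process = Process.Start(startInfo))
        {
            if (process == null) throw new InvalidOperationException("Could not start octodiff.");
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

A caller might write OctodiffExitCodes.Describe(OctodiffExitCodes.Run("signature MyApp.1.0.nupkg MyApp.1.0.nupkg.octosig")) and log the result when it is anything other than "success".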

Using OctoDiff classes within your own application

To create signature, delta and final files with the OctoDiff classes from within your own application, you can use the example below, which creates the signature and delta files and then applies the delta file to create the new file.

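// Namespaces used by this example
using System.IO;
using Octodiff.Core;
using Octodiff.Diagnostics; // ConsoleProgressReporter

// Create signature file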
var signatureBaseFilePath = @"C:\OctoDiffExample\MyPackage.1.0.0.zip";
var signatureFilePath = @"C:\OctoDiffExample\Output\MyPackage.1.0.0.zip.octosig";
var signatureOutputDirectory = Path.GetDirectoryName(signatureFilePath);
if(!Directory.Exists(signatureOutputDirectory))
	Directory.CreateDirectory(signatureOutputDirectory);
var signatureBuilder = new SignatureBuilder();
using (var basisStream = new FileStream(signatureBaseFilePath, FileMode.Open, FileAccess.Read, FileShare.Read))
using (var signatureStream = new FileStream(signatureFilePath, FileMode.Create, FileAccess.Write, FileShare.Read))
{
	signatureBuilder.Build(basisStream, new SignatureWriter(signatureStream));
}

// Create delta file
var newFilePath = @"C:\OctoDiffExample\MyPackage.1.0.1.zip";
var deltaFilePath = @"C:\OctoDiffExample\Output\MyPackage.1.0.1.zip.octodelta";
var deltaOutputDirectory = Path.GetDirectoryName(deltaFilePath);
if(!Directory.Exists(deltaOutputDirectory))
	Directory.CreateDirectory(deltaOutputDirectory);
var deltaBuilder = new DeltaBuilder();
using(var newFileStream = new FileStream(newFilePath, FileMode.Open, FileAccess.Read, FileShare.Read))
using(var signatureFileStream = new FileStream(signatureFilePath, FileMode.Open, FileAccess.Read, FileShare.Read))
using(var deltaStream = new FileStream(deltaFilePath, FileMode.Create, FileAccess.Write, FileShare.Read))
{
	deltaBuilder.BuildDelta(newFileStream, new SignatureReader(signatureFileStream, new ConsoleProgressReporter()), new AggregateCopyOperationsDecorator(new BinaryDeltaWriter(deltaStream)));
}

// Apply delta file to create new file
var newFilePath2 = @"C:\OctoDiffExample\Output\MyPackage.1.0.1.zip";
var newFileOutputDirectory = Path.GetDirectoryName(newFilePath2);
if(!Directory.Exists(newFileOutputDirectory))
	Directory.CreateDirectory(newFileOutputDirectory);
var deltaApplier = new DeltaApplier { SkipHashCheck = false };
using(var basisStream = new FileStream(signatureBaseFilePath, FileMode.Open, FileAccess.Read, FileShare.Read))
using(var deltaStream = new FileStream(deltaFilePath, FileMode.Open, FileAccess.Read, FileShare.Read))
using(var newFileStream = new FileStream(newFilePath2, FileMode.Create, FileAccess.ReadWrite, FileShare.Read))
{
	deltaApplier.Apply(basisStream, new BinaryDeltaReader(deltaStream, new ConsoleProgressReporter()), newFileStream);
}

Development

You need:

Run Build.cmd to build, test and package the project.

To release to NuGet, tag master with the next major, minor or patch number; TeamCity will do the rest.

Every successful TeamCity build for all branches will be pushed to MyGet.

octodiff's People

Contributors

adam-mccoy, akirayamamoto, antmeehan, arabasso, borland, bretkoppel, dependabot[bot], desruc, droyad, hnrkndrssn, kzu, matt-richardson, michaelongso, networm, paulstovell, stevencl840, tw17, uglybugger, yukitsune

octodiff's Issues

Reverse Diff?

Could we undo a file with octodiff that has been previously created if we have the diff files?

Support for non-seekable streams such as NetworkStream

SignatureReader, for instance, makes use of the Length and Position of the stream. If the stream is a NetworkStream, it throws an exception. To support it, I send an empty end chunk from the client, and I created a copy of the SignatureReader class with the lines that use Position and Length removed, using the last chunk as the end of the stream. BinaryDeltaReader does something similar.

Unable to load the project in Visual Studio 2017

Hi, I am trying to load the project from the .sln file in Visual Studio 2017, but the projects are not loading.

Please help me with a version that will load in Visual Studio.

Documentation for running Octodiff as a library

Hello,

I am looking into calling Octodiff from within our C# application.

The README explains how to use the command line utility, and I have been able to successfully execute the library by calling the Main function of the program, like so:

Octodiff.Program.Main(new string[] { "signature", oldFilePath, outputPath });

I can see the types like Octodiff.Core.SignatureBuilder and Octodiff.Core.DeltaApplier, but would love some documentation on how to use it.

Unable to convert to VS 2017 project format

Attempting to open the project in VS 2017 community edition gives errors.

Octodiff\Octodiff.xproj: Failed to migrate XProj project Octodiff. 'dotnet migrate --skip-backup -s -x "...\Octodiff\source\Octodiff\Octodiff.xproj" "...\Octodiff\source\Octodiff\project.json" -r "...\AppData\Local\Temp\gup2fn3o.mk3" --format-report-file-json' exited with error code 1.

Octodiff\Octodiff.xproj: Migration failed. System.ArgumentNullException: Value cannot be null. Parameter name: version
   at NuGet.Versioning.VersionRangeBase.Satisfies(NuGetVersion version, IVersionComparer comparer)
   at NuGet.Versioning.VersionRangeBase.Satisfies(NuGetVersion version)
   at Microsoft.DotNet.ProjectJsonMigration.Rules.MigratePackageDependenciesAndToolsRule.<>c__DisplayClass11_0.b__2(VersionRange p)
   at System.Linq.Enumerable.TryGetFirst[TSource](IEnumerable1 source, Func2 predicate, Boolean& found)
   at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable1 source, Func2 predicate)
   at Microsoft.DotNet.ProjectJsonMigration.Rules.MigratePackageDependenciesAndToolsRule.ToPackageDependencyInfo(ProjectLibraryDependency dependency, IDictionary2 dependencyToVersionMap)
   at Microsoft.DotNet.ProjectJsonMigration.Rules.MigratePackageDependenciesAndToolsRule.MigrateDependencies(Project project, MigrationRuleInputs migrationRuleInputs, NuGetFramework framework, IEnumerable1 dependencies, SlnFile solutionFile, ProjectItemGroupElement itemGroup)
   at Microsoft.DotNet.ProjectJsonMigration.Rules.MigratePackageDependenciesAndToolsRule.Apply(MigrationSettings migrationSettings, MigrationRuleInputs migrationRuleInputs)
   at Microsoft.DotNet.ProjectJsonMigration.DefaultMigrationRuleSet.Apply(MigrationSettings migrationSettings, MigrationRuleInputs migrationRuleInputs)
   at Microsoft.DotNet.ProjectJsonMigration.ProjectMigrator.MigrateProject(MigrationSettings migrationSettings)
   at Microsoft.DotNet.ProjectJsonMigration.ProjectMigrator.Migrate(MigrationSettings rootSettings, Boolean skipProjectReferences)
   at Microsoft.DotNet.Tools.MigrateCommand.MigrateCommand.Execute()
   at Microsoft.DotNet.Tools.Migrate.MigrateCommandCompose.Run(String[] args)
   at Microsoft.DotNet.Cli.Program.ProcessArgs(String[] args, ITelemetry telemetryClient)
   at Microsoft.DotNet.Cli.Program.Main(String[] args)

Streams that don't support Seek

I'm curious to know if you think the library can generate a signature file from a stream that does not support seeking.

If possible, this would be a great benefit to the library, as you could generate a signature in real time from a download, without storing the file itself.

Prefix NuGet package with Octopus

Octopus Deploy has recently reserved the Octopus prefix on NuGet. To avoid dependency confusion when referencing multiple NuGet package sources, this prefix is being appended to owned packages. Once Octopus.Octodiff has been published to NuGet, the existing non-prefixed Octodiff package will be marked as deprecated.

Migration will involve simply changing project package references from Octodiff to Octopus.Octodiff.

Syncing a complex file (image) causes a problem

I have one more question. I have noticed that if I use a complex file (a 1 MB png file or a jpg file), the patching does not work properly. That is, it does not create the new file.

Can you please tell me why this is happening?
If you want, I can share the delta, signature and image files so that we can resolve the issue.

Hope to hear back from you soon.

Provided example does not work

Hi, I tried the example from the "Using OctoDiff classes within your own application" section in a .NET 6 console app using the Octopus.Octodiff version 2.0.261 package. The new file is not updated with the delta. Any ideas?

rdiff compatibility?

The README says:

inspired by rdiff, and like rdiff the algorithm is based on rsync

I was curious if I'd be able to sync files between rdiff (on *nix systems) and Octodiff on Windows. I tried it out and noticed there is definitely a superficial incompatibility - they use different magic numbers - but I wasn't sure if there was a deeper incompatibility.

Other than the magic numbers are there any major differences between octodiff and rdiff files? Do you think rdiff compatibility is possible in the future?

Npm package that generates patches for octodiff

Hi everyone,
In one of our projects, the backend team uses octodiff binary patches for csv files.

As a frontend developer, I now want to send patch data in the payload. Where can I find an npm library that can generate this patch?

Reporting a vulnerability

Hello!

I hope you are doing well!

We are a security research team. Our tool automatically detected a vulnerability in this repository. We want to disclose it responsibly. GitHub has a feature called Private vulnerability reporting, which enables security researchers to privately disclose a vulnerability. Unfortunately, it is not enabled for this repository.

Can you enable it, so that we can report it?

Thanks in advance!

PS: you can read about how to enable private vulnerability reporting here: https://docs.github.com/en/code-security/security-advisories/repository-security-advisories/configuring-private-vulnerability-reporting-for-a-repository

Is this extra code?

Line 52 in SignatureBuilder.cs in WriteMetadata(Stream stream, ISignatureWriter signatureWriter)
...
var hash = HashAlgorithm.ComputeHash(stream);
...

This line does not appear to have any effect. The hash, as passed to WriteMetadata, is not used. Removing this line does not affect the resulting signature, delta or patch files, or the resulting patched files, in my testing. It seems to calculate a value (which takes a very long time on large files) and then throw that value away. What is the purpose of this line?

Adler32RollingChecksumV2 seems to give bad results

Description

When using the V2 rolling checksum algorithm, files that are identical or very slightly different result in huge deltas: the whole new file gets added as the delta.

Environment

  • repo freshly cloned from the current master branch (commit d87ee31)
  • VS2022 17.2.6 on Windows 10 x64

I had to make a small code change so the command line app would use the V2 algorithm by default:

diff --git a/source/Octodiff/Core/SupportedAlgorithms.cs b/source/Octodiff/Core/SupportedAlgorithms.cs
index 2cc2aa5..5552f13 100644
--- a/source/Octodiff/Core/SupportedAlgorithms.cs
+++ b/source/Octodiff/Core/SupportedAlgorithms.cs
@@ -52,7 +52,7 @@ namespace Octodiff.Core
 
         public virtual IRollingChecksum Default()
         {
-            return Adler32Rolling();
+            return Adler32Rolling(true);
         }
 
         public virtual IRollingChecksum Create(string algorithm)

Steps to reproduce

  • grab a random binary file; my test was kernel32.dll from windows\system32
  • create 2 copies of it: copy1.dll and copy2.dll
  • modify copy2.dll very slightly; I simply changed the first byte from 'M' to 'A'
  • run octodiff to create the deltas:
    • Octodiff.exe signature kernel32.dll signature.bin
    • Octodiff.exe delta signature.bin copy1.dll delta1.bin
    • Octodiff.exe delta signature.bin copy2.dll delta2.bin
  • observe how the delta files are very "not delta-y"

Other notes

The V1 version of the algorithm does produce expectedly small delta files.

Adler32RollingChecksum modulo operation

Hi,

In Adler32 algorithm https://en.wikipedia.org/wiki/Adler-32 the prime used in modulo operation equals 65521 (largest prime smaller than 65536).

A = 1 + D1 + D2 + ... + Dn (mod 65521)
B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)
   = n×D1 + (n−1)×D2 + (n−2)×D3 + ... + Dn + n (mod 65521)

In the sources Adler32RollingChecksum.cs you cast to ushort which effectively does a mod 65536. Is there any particular reason for that?

The zlib RFC https://tools.ietf.org/html/rfc1950 that also defines Adler32 states that

That 65521 is prime is important to avoid a possible large class of two-byte errors that leave the check unchanged.

Regards,
Grzegorz Blok

Class SignatureWriter is not public

When trying to reference the dll in my code I am unable to access the SignatureWriter class. Can we please get this class set as public in file ISignatureWriter.cs

I appreciate the support.
