ezdedup: deduplication workload, made easy.

Overview
========

The ezdedup workload is taken from the original Princeton Application Repository
for Shared-Memory Computers (PARSEC 3.0) suite: http://parsec.cs.princeton.edu
It is simplified for kernel development/performance purposes, and as such its
usage does not rely on anything from PARSEC. This makes the program usage
significantly more straightforward.

As described in Christian Bienia's 2011 Ph.D. thesis, "Benchmarking Modern
Multiprocessors", deduplication is a compression method that combines global
compression (duplicate elimination across the whole stream) with local
compression of each unique block in order to achieve high compression ratios.
The dedup workload uses a pipeline model for function-level parallelism, with
five stages:

   (1) Reads the input file from disk and determines where the data is to be
       split up, by jumping a fixed length through the buffer for each chunk.
       The resulting coarse-grained data blocks are enqueued for the next
       stages.

   (2) Identifies brief sequences in the data stream that are identical with
       sufficiently high probability (anchoring) by using a rolling hash to
       segment data based on its contents. The data is then broken up into two
       separate blocks at the determined location.

   (3) Computes a SHA1 checksum for each chunk and checks for duplicate
       blocks using a global database.

   (4) Compresses each data segment with the Ziv-Lempel algorithm and builds a
       global hash table that maps hash values to data. Every data block is
       compressed only once because the previous stage does not send duplicates
       to the compression stage.

   (5) Assembles the deduplicated output stream consisting of hash values and
       compressed data segments.

Note that stages (1) and (5) are serial. Please refer to the thesis cited
above for complete details.

Usage
=====

dedup [-cupvh] [-w gzip/bzip2/none] [-i file] [-o file] [-t number_of_threads]
-c                      compress
-u                      uncompress
-p                      preloading (for benchmarking purposes)
-w                      compression type: gzip/bzip2/none
-i file                 the input file
-o file                 the output file
-t                      number of threads per stage
-v                      verbose output
-h                      help

Examples:

o Compress a QEMU qcow2 image, with each parallel stage using two
  threads:

$> dedup -c -v -p -t 2 -i linux.qcow2 -o outfile
PARSEC Benchmark Suite
Total input size:                     2433.81 MB
Total output size:                    1108.74 MB
Effective compression factor:            2.20x

Mean data chunk size:                    0.22 KB (stddev: 4022.08 KB)
Amount of duplicate chunks:             95.99%
Data size after deduplication:        2028.62 MB (compression factor: 1.20x)
Data size after compression:           762.95 MB (compression factor: 2.66x)
Output overhead:                        31.19%

o Uncompress the output file and restore its original size:

$> dedup -u -v -p -t 2 -i outfile -o originalfile
