byrantwithyou / ezdedup
This project forked from davidlohr/ezdedup
dedup benchmark, without the hassle
ezdedup: deduplication workload, made easy.

Overview
========

The ezdedup workload is taken from the original Princeton Application
Repository for Shared-Memory Computers (PARSEC 3.0) suite:

    http://parsec.cs.princeton.edu

It is simplified for kernel development/performance purposes, and as such
its usage does not rely on anything from PARSEC. This makes the program
significantly more straightforward to use.

As described in Christian Bienia's 2011 Ph.D. thesis, "Benchmarking Modern
Multiprocessors", deduplication compresses a data stream by combining
global compression and local compression in order to achieve high
compression ratios. The dedup workload uses a pipeline model for
function-level parallelism, with five stages:

(1) Reads the input file from disk and determines the locations where the
    data is to be split up by jumping a fixed length in the buffer for
    each chunk. The resulting data blocks are enqueued for the next
    stages. These are coarse-grained chunks.

(2) Identifies brief sequences in the data stream that are identical with
    sufficiently high probability (anchoring) by using a rolling hash to
    segment data based on its contents. The data is then broken up into
    two separate blocks at the determined location.

(3) Computes a SHA1 checksum for each chunk and checks for duplicate
    blocks with the use of a global database.

(4) Compresses each data segment with the Ziv-Lempel algorithm and builds
    a global hash table that maps hash values to data. Every data block
    is compressed only once because the previous stage does not send
    duplicates to the compression stage.

(5) Assembles the deduplicated output stream consisting of hash values
    and compressed data segments.

Note that stages (1) and (5) are serial. Please refer to the thesis cited
above for complete details.
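The five stages above can be sketched in a few lines of single-threaded Python. This is only an illustration under stated assumptions, not the ezdedup implementation: the chunk size, window size, anchor mask, polynomial rolling hash, record layout, and every function name are hypothetical, and `zlib` stands in for the Ziv-Lempel stage.

```python
import hashlib
import zlib

# All constants below are assumptions chosen for illustration only.
COARSE_CHUNK = 64 * 1024   # stage 1: fixed-length jump in the buffer
WINDOW = 16                # stage 2: rolling-hash window, in bytes
ANCHOR_MASK = 0x3FF        # stage 2: cut where the low 10 hash bits are zero
BASE = 257
MOD = 1 << 32


def coarse_chunks(data):
    """Stage 1: split the input at fixed offsets (coarse-grained chunks)."""
    for start in range(0, len(data), COARSE_CHUNK):
        yield data[start:start + COARSE_CHUNK]


def anchor_split(chunk):
    """Stage 2: cut wherever a rolling hash of the last WINDOW bytes matches."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    h = 0
    start = 0
    for i, byte in enumerate(chunk):
        if i - start >= WINDOW:
            h = (h - chunk[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # take in the newest byte
        if i - start + 1 >= WINDOW and (h & ANCHOR_MASK) == 0:
            yield chunk[start:i + 1]
            start = i + 1
            h = 0
    if start < len(chunk):
        yield chunk[start:]


def dedup(data):
    """Stages 3-5: fingerprint, compress each unique segment once, assemble."""
    store = {}   # global table: SHA-1 digest -> compressed segment
    stream = []  # records: ("data", digest, payload) or ("ref", digest)
    for chunk in coarse_chunks(data):
        for segment in anchor_split(chunk):
            digest = hashlib.sha1(segment).digest()
            if digest in store:
                stream.append(("ref", digest))       # duplicate: hash only
            else:
                store[digest] = zlib.compress(segment)
                stream.append(("data", digest, store[digest]))
    return stream


def restore(stream):
    """Inverse of dedup(): rebuild the original bytes from the records."""
    store = {}
    out = bytearray()
    for record in stream:
        if record[0] == "data":
            store[record[1]] = zlib.decompress(record[2])
        out += store[record[1]]
    return bytes(out)
```

Because cut points depend only on the bytes inside the window, repeated content tends to produce identical segments, which stage (3) then recognizes by their SHA1 digest. The real workload runs stages (2) through (4) in parallel with several threads per stage; this sketch leaves that out.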
Usage
=====

dedup [-cusfvh] [-w gzip/bzip2/none] [-i file] [-o file] [-t number_of_threads]

  -c        compress
  -u        uncompress
  -p        preloading (for benchmarking purposes)
  -w        compression type: gzip/bzip2/none
  -i file   the input file
  -o file   the output file
  -t        number of threads per stage
  -v        verbose output
  -h        help

Examples:

  o Compress a qemu image with qcow2 compression, each parallel stage
    will use two threads:

    $> dedup -c -v -p -t 2 -i linux.qcow2 -o outfile
    PARSEC Benchmark Suite
    Total input size:                 2433.81 MB
    Total output size:                1108.74 MB
    Effective compression factor:        2.20x

    Mean data chunk size:                0.22 KB (stddev: 4022.08 KB)
    Amount of duplicate chunks:         95.99%
    Data size after deduplication:    2028.62 MB (compression factor: 1.20x)
    Data size after compression:       762.95 MB (compression factor: 2.66x)
    Output overhead:                    31.19%

  o Uncompress the output file and restore its original size:

    $> dedup -u -v -p -t 2 -i outfile -o originalfile
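The factors in the sample report can be reproduced from the four sizes it prints. The relations below are inferred from those sample numbers rather than taken from the dedup source, so treat them as an assumption; the function name `dedup_report` and its dictionary keys are hypothetical.

```python
def dedup_report(input_mb, dedup_mb, compressed_mb, output_mb):
    """Recompute the summary ratios from the sizes in the verbose report.

    Assumed relations (inferred from the sample output, not from the code):
      effective factor   = input size / output size
      dedup factor       = input size / size after deduplication
      compression factor = size after deduplication / size after compression
      output overhead    = share of the output that is not compressed data
    """
    return {
        "effective_factor": input_mb / output_mb,
        "dedup_factor": input_mb / dedup_mb,
        "compression_factor": dedup_mb / compressed_mb,
        "overhead_pct": 100.0 * (output_mb - compressed_mb) / output_mb,
    }


# Sizes copied from the sample run above, in MB.
report = dedup_report(2433.81, 2028.62, 1108.74 - 345.79, 1108.74)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Under these assumed relations the rounded values match the sample run: 2.20x, 1.20x, 2.66x and 31.19%.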