WBZ

A parallel implementation of the bzip2 data compressor in Python. The compression pipeline combines the Burrows–Wheeler transform (BWT) and Move-to-front (MTF) encoding to improve the subsequent Huffman compression. For now, the tool focuses on compressing .csv files and other tabular-format files.

Home Page: https://wittline.github.io/wbz/
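To illustrate why these stages help (a minimal sketch, not the wbz implementation): BWT groups similar bytes into runs, and MTF turns those runs into many small integers, giving Huffman coding the skewed symbol distribution it compresses best.

# Minimal sketch of the first two pipeline stages (illustrative only,
# not the wbz implementation).

def bwt_encode(s, sentinel="\x00"):
    # Naive O(n^2 log n) BWT: sort all rotations, keep the last column.
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def mtf_encode(s):
    # Move-to-front: recently seen symbols get small indices.
    alphabet = sorted(set(s))
    output = []
    for ch in s:
        i = alphabet.index(ch)
        output.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return output

transformed = bwt_encode("bananabananabanana")
print(transformed)               # similar characters are grouped into runs
print(mtf_encode(transformed))   # runs become zeros and small integers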

Data compression pipeline

(Figure: compression pipeline diagram)
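The final stage of the pipeline is Huffman coding. As a reference for that stage, here is one common way to build a prefix-code table with a heap (a sketch under simple assumptions, not the actual wbz code):

import heapq
from collections import Counter

def huffman_table(data):
    # Build a prefix-code table mapping each byte to a bit string.
    # Heap entries: (frequency, tie_breaker, {byte: code_so_far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_table(b"aaaaabbbc"))  # the frequent byte b'a' gets the shortest code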

How to use the tool

The tool is called WBZ. This first version focuses on compressing .csv files, with more features coming soon. The parameters are described as follows (a parsing sketch appears after the list):

python wbz.py -a encode -f 'C:\Users\...\data.csv' -cs 20000 -ch ';'

python wbz.py -a decode -f 'C:\Users\...\data.wbz' -cs 20000 -ch ';'
  • -a is the action; there are two actions: encode and decode.
  • -f is the file path; if the action is encode, make sure the chosen file is a .csv file, and if the action is decode, make sure the chosen file has the .wbz extension.
  • -cs is the chunk size; the Burrows–Wheeler transform (BWT) works on chunks of bytes, and this parameter specifies the number of bytes to be processed by each CPU.
  • -ch is the special character; each chunk encoded by the BWT contains a special character that marks the index needed for decoding. Any of the usual column-separator characters of a .csv file can work as the special character; it is recommended to pick one that is not used by your columns and does not appear in the column contents. This feature will be removed in future versions of the tool.
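For reference, a minimal sketch of how these four flags could be parsed with argparse (the flag names mirror the list above; this is not necessarily how wbz.py itself parses them):

import argparse

parser = argparse.ArgumentParser(description="WBZ-style command-line flags")
parser.add_argument("-a", choices=["encode", "decode"], required=True,
                    help="encode a .csv file or decode a .wbz file")
parser.add_argument("-f", required=True,
                    help=".csv file for encode, .wbz file for decode")
parser.add_argument("-cs", type=int, default=20000,
                    help="chunk size: bytes handed to each CPU for the BWT")
parser.add_argument("-ch", default=";",
                    help="special character used as the BWT index marker")

args = parser.parse_args()
print(args.a, args.f, args.cs, args.ch)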

The same chunk size and special character used to encode a file must be used to decode it. Keeping them as parameters lets you tune the trade-off between encoding/decoding speed and compression ratio.
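A rough sketch of this chunked, parallel design (illustrative only: the file name, chunk size, and encode_chunk helper below are hypothetical stand-ins, and wbz's actual code may be organized differently):

from multiprocessing import Pool

def encode_chunk(chunk):
    # Stand-in for the real per-chunk pipeline: BWT -> MTF -> Huffman.
    return chunk

def read_chunks(path, chunk_size):
    # Split the input file into fixed-size byte chunks.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

if __name__ == "__main__":
    # Each chunk is transformed independently, one per worker process.
    with Pool() as pool:
        encoded = pool.map(encode_chunk, read_chunks("data.csv", 20000))
    # Larger chunks tend to compress better but take longer per chunk;
    # smaller chunks parallelize more but hurt the compression ratio.

This trade-off is exactly why -cs is exposed as a parameter.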

Performance

The tests were done with three .csv files of different sizes, varying the chunk size:

  • data_1000000: one million records (61 MB)
  • data_500000: half a million records (31 MB)
  • data_250000: a quarter of a million records (16 MB)

The compression ratio improves with larger chunk sizes.

(Figure 1: compression ratio vs. chunk size)

Compression times grow logarithmically as the chunk size increases.

(Figure 2: compression time vs. chunk size)

Regardless of file size, decompression times are roughly constant and tend to decrease as the chunk size increases.

(Figure 3: decompression time vs. chunk size)

To-Do List:

  • Improve the Huffman and BWT encoding times
  • Improve the encoding of the Huffman table
  • Compression based on columns
  • Compress and decompress specific columns in the .csv file
  • Generate compressed chunks automatically for large files
  • Distributed compression and decompression

Contributing and Feedback

Any ideas or feedback about this repository? Help me improve it.

Authors

  • wittline

License

This project is licensed under the terms of the Apache License.
