WBZ

A parallel implementation of the bzip2 data compressor in Python. The compression pipeline combines the Burrows–Wheeler transform (BWT) and Move-to-front (MTF) encoding to improve the subsequent Huffman compression. For now, the tool focuses on compressing .csv files and other tabular-format files.

Home Page: https://wittline.github.io/wbz/
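To illustrate why these stages help (a minimal sketch, not the wbz implementation): BWT groups similar bytes into runs, and MTF turns those runs into many small integers, giving Huffman coding the skewed symbol distribution it compresses best.

# Minimal sketch of the first two pipeline stages (illustrative only,
# not the wbz implementation).

def bwt_encode(s, sentinel="\x00"):
    # Naive O(n^2 log n) BWT: sort all rotations, keep the last column.
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def mtf_encode(s):
    # Move-to-front: recently seen symbols get small indices.
    alphabet = sorted(set(s))
    output = []
    for ch in s:
        i = alphabet.index(ch)
        output.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return output

transformed = bwt_encode("bananabananabanana")
print(transformed)               # similar characters are grouped into runs
print(mtf_encode(transformed))   # runs become zeros and small integers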

Data compression pipeline

(Figure: compression pipeline diagram)
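The final stage of the pipeline is Huffman coding. As a reference for that stage, here is one common way to build a prefix-code table with a heap (a sketch under simple assumptions, not the actual wbz code):

import heapq
from collections import Counter

def huffman_table(data):
    # Build a prefix-code table mapping each byte to a bit string.
    # Heap entries: (frequency, tie_breaker, {byte: code_so_far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_table(b"aaaaabbbc"))  # the frequent byte b'a' gets the shortest code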

How to use the tool

The tool is called WBZ. This first version focuses on compressing .csv files, with more features coming soon. The parameters are described as follows (a parsing sketch appears after the list):

python wbz.py -a encode -f 'C:\Users\...\data.csv' -cs 20000 -ch ';'

python wbz.py -a decode -f 'C:\Users\...\data.wbz' -cs 20000 -ch ';'
  • -a is the action; there are two actions: encode and decode.
  • -f is the file path; if the action is encode, make sure the chosen file is a .csv file, and if the action is decode, make sure the chosen file has the .wbz extension.
  • -cs is the chunk size; the Burrows–Wheeler transform (BWT) works on chunks of bytes, and this parameter specifies the number of bytes to be processed by each CPU.
  • -ch is the special character; each chunk encoded by the BWT contains a special character that marks the index needed for decoding. Any of the usual column-separator characters of a .csv file can work as the special character; it is recommended to pick one that is not used by your columns and does not appear in the column contents. This feature will be removed in future versions of the tool.
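For reference, a minimal sketch of how these four flags could be parsed with argparse (the flag names mirror the list above; this is not necessarily how wbz.py itself parses them):

import argparse

parser = argparse.ArgumentParser(description="WBZ-style command-line flags")
parser.add_argument("-a", choices=["encode", "decode"], required=True,
                    help="encode a .csv file or decode a .wbz file")
parser.add_argument("-f", required=True,
                    help=".csv file for encode, .wbz file for decode")
parser.add_argument("-cs", type=int, default=20000,
                    help="chunk size: bytes handed to each CPU for the BWT")
parser.add_argument("-ch", default=";",
                    help="special character used as the BWT index marker")

args = parser.parse_args()
print(args.a, args.f, args.cs, args.ch)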

The same chunk size and special character used to encode a file must be used to decode it. Keeping them as parameters lets you tune the trade-off between encoding/decoding speed and compression ratio.
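A rough sketch of this chunked, parallel design (illustrative only: the file name, chunk size, and encode_chunk helper below are hypothetical stand-ins, and wbz's actual code may be organized differently):

from multiprocessing import Pool

def encode_chunk(chunk):
    # Stand-in for the real per-chunk pipeline: BWT -> MTF -> Huffman.
    return chunk

def read_chunks(path, chunk_size):
    # Split the input file into fixed-size byte chunks.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

if __name__ == "__main__":
    # Each chunk is transformed independently, one per worker process.
    with Pool() as pool:
        encoded = pool.map(encode_chunk, read_chunks("data.csv", 20000))
    # Larger chunks tend to compress better but take longer per chunk;
    # smaller chunks parallelize more but hurt the compression ratio.

This trade-off is exactly why -cs is exposed as a parameter.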

Performance

The tests were done with three .csv files of different sizes, varying the chunk size:

  • data_1000000: one million records (61 MB)
  • data_500000: half a million records (31 MB)
  • data_250000: a quarter of a million records (16 MB)

The compression ratio improves with larger chunk sizes.

(Figure 1: compression ratio vs. chunk size)

Compression times grow logarithmically as the chunk size increases.

(Figure 2: compression time vs. chunk size)

Regardless of file size, decompression times are roughly constant and tend to decrease as the chunk size increases.

(Figure 3: decompression time vs. chunk size)

To-Do List:

  • Improve the Huffman and BWT encoding times
  • Improve the encoding of the Huffman table
  • Compression based on columns
  • Compress and decompress specific columns in the .csv file
  • Generate compressed chunks automatically for large files
  • Distributed compression and decompression

Contributing and Feedback

Any ideas or feedback about this repository? Help me improve it.

Authors

  • wittline

License

This project is licensed under the terms of the Apache License.
