Coder Social home page Coder Social logo

yannickperrenet / huffman-coding Goto Github PK

View Code? Open in Web Editor NEW
1.0 1.0 0.0 97 KB

๐Ÿƒ The fastest Python implementation of Huffman coding by using a Finite State Machine

License: MIT License

Python 100.00%
finite-state-machine fsm huffman-coding huffman-decoder huffman huffman-compression-algorithm streams

huffman-coding's Introduction

Huffman coding

In summary, its fast, has no dependencies and works with files that don't fit into memory.

Why another implementation you ask?

  • All other Python implementation's of the Huffman coding algorithm are written for educational purposes, whereas this package is actually intended to be used.

  • It can be run as a standalone script because it is a pure Python implementation with no additional dependencies outside the standard library.

  • It works with streams exclusively, in a buffered fashion. Meaning that you can encode and decode files (or streams) that are larger than memory. The total memory consumption peaks at just 70KB.

  • In contrast to other implementations, the encoding step actually encodes the frequency table as well (which the decode step expects to read). This means that the original text does not need to be kept for decoding.

    Let's say you want to share a file over network, then it often improves performance to first compress the file before sending it over the wire. However, you want to be able to decompress/decode the received file without requiring the original (since that would destroy the entire purpose of compressing in the first place). Be sure to check out the client-server example showcasing the above by compressing and sending a file over a socket.

  • The decoding step is significantly faster (4x) by using an FSM so it can operate on an entire byte at once, instead of decoding bit-by-bit.

    This is not a novel idea though, although simply not implemented by other pacakges. Basically, I noticed that the RFC for HPACK: Header Compression for HTTP/2 included a static Huffman code which was being decoded using an FSM in Python's HPACK implementation which essentially "unrolled" the FSM by operating on a byte-level instead of bit-level. I simply had to generate a FSM dynamically based on the Huffman code (see _get_fsm_decoder()) to be able to apply the same principle to non-static Huffman codes.

Installation

git clone https://github.com/yannickperrenet/huffman-coding && cd huffman_coding
pip install .
# Or for poetry users
# poetry install

Usage

Let's use Hamlet as a test text.

# Make sure you are in the root huffman_coding directory.
wget https://gist.githubusercontent.com/provpup/2fc41686eab7400b796b/raw/b575bd01a58494dfddc1d6429ef0167e709abf9b/hamlet.txt -O hamlet.txt

CLI

# Encode a given file and write to an output file.
huffman_coding encode --input="hamlet.txt" --output="hamlet.raw"
# Decode from file and write to STDOUT.
huffman_coding decode --input="hamlet.raw"
# Encode file and decode immediately again.
huffman_coding encode --input="hamlet.txt" | ./huffman_coding.py decode

Python

Files

# NOTE: Make sure to pass `newline=""` to `open()` to prevent newline
# translation since that messes with encoding/decoding. Read more here:
# https://docs.python.org/3/library/io.html#io.TextIOWrapper
# NOTE: Disable buffering because `encode()` and `decode()` already
# work in a buffered fashion.
with (
    open("hamlet.txt", mode="r", newline="") as f_in,
    open("hamlet.raw", mode="wb", buffering=0) as f_out
):
    encode(f_in=f_in, f_out=f_out)

with (
    open("hamlet.raw", mode="rb", buffering=0) as f_in,
    open("hamlet.txt", mode="w", newline="") as f_out
):
    decode(f_in=f_in, f_out=f_out)

I/O streams

import io

with open("hamlet.txt", mode="r", newline="") as f:
    text = f.read()

text_stream = io.StringIO()
text_stream.write(text)
# Seek start of stream, otherwise successive reads won't return
# anything.
text_stream.seek(0)

byte_encoding = io.BytesIO()
encode(f_in=text_stream, f_out=byte_encoding, buffering=buffering)
byte_encoding.seek(0)

decoded_text = io.StringIO()
decode(f_in=byte_encoding, f_out=decoded_text)
decoded_text.seek(0)

Running tests

Simply run python3 tests/test_huffman_coding.py or invoke unittest through its CLI python3 -m unittest -vv (make sure that the Python you are running has the package installed, i.e. pip install . in the root directory of this project).

Resources

huffman-coding's People

Contributors

yannickperrenet avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.