saxbophone / basest-python

Arbitrary base binary-to-text encoder (any base to any base), in Python.

Home Page: https://pypi.org/project/basest/

License: Mozilla Public License 2.0

Python 98.80% Makefile 1.20%

number-base-converter encoder decoder conversion encoding decoding encodings base64 base58 base85

basest-python's Introduction

I've been programming for a little over a decade now, to a professional level for several years.

My professional experience is mostly in web dev, but I also have some Windows desktop application experience.

Projects I am particularly proud of:


github.com/saxbophone/arby	arby is a C++ library implementing arbitrary-precision arithmetic, both at runtime and compile-time! It exposes convenient-to-use class types encapsulating the arithmetic, with operator overloading and standard stream support.
github.com/saxbophone/hexago	hexago is a cross-platform screensaver written in C++. It draws pretty shrinking hexagons! It integrates with the screensaver frameworks of both macOS and Windows (with some Objective-C++ glue code for the former!)
Cross-platform C++20 project template	This is a Github project template intended for cross-platform C++20 dev. It includes an extensive CMake project config with lots of warning options enabled, and Github Actions CI config for unit testing on Linux, macOS and Windows. I use it for all my stuff and other people seem to find it useful too.
github.com/saxbophone/libsxbp	sxbp and its implementation library, libsxbp, are a pair of C projects exploring unconventional barcodes and procedural image generation. They implement a novel barcode of my own design, where binary bits are encoded by guiding the line of a right-angled spiral left or right as prescribed by the input data. Unfortunately, producing a compact-enough spiral that does not waste lots of empty space in the image it produces is a very computationally expensive process for barcodes longer than about 20 bits, but it was a fun and interesting experiment and good practice in writing a well-documented C API with callbacks and error-handling.
github.com/saxbophone/unmoving	unmoving is a C++20 baremetal library providing more convenient support for fixed-point arithmetic as used on the PlayStation. Getting a cutting-edge version of G++ to cross-compile for the PlayStation and programming within the constraints of bare metal was a fun challenge!

Other fun stuff


github.com/saxbophone/wondercard	Emulating the communications protocol used for PS1 memory cards in software
github.com/saxbophone/tr-sort	Experimental sorting algorithm which attempts to calculate the rough position each element should be
github.com/saxbophone/colour-distance	Web app for finding colours that are "n distance away from" a given colour, intended for interior design
github.com/saxbophone/triangberg	Just for fun, animated geometrically-constructed fractal-like arrangements of triangles
github.com/saxbophone/zench	C++ Z-machine interpreter, work in progress
github.com/saxbophone/galley	Galois Field arithmetic using compile-time-generated lookup tables
github.com/saxbophone/dengr	Partial reverse-engineering of the low-level data encoding of Compact Discs
github.com/saxbophone/lzw-bit	Bit-by-bit LZW compression with redundant-code-elimination

basest-python's Issues

Encoder/Decoder corruption for some larger output bases

Encountered an issue decoding symbols that were encoded from base 128 to base 255.
I have a hunch that this is because the ratios are not exact and the output base is larger than the input base.

Currently, for all cases when decoding, empty padding symbols are converted to MAX just before decoding, like in base-85. I think this might only work when the input base is larger than the output base, so a different approach may be needed for when the output base is larger.

Code for Encoder class:

from basest.encoders import Encoder


class StrictAsciiSquashEncoder(Encoder):
    input_base = 128
    output_base = 255
    input_ratio = 9
    output_ratio = 8
    # The Strict ASCII Set
    input_symbol_table = [
        s for s in
        '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f'
        '\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f'
        ' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_'
        '`abcdefghijklmnopqrstuvwxyz{|}~\x7f'
    ]
    # Bytes 0 to 254
    output_symbol_table = [
        s for s in
        '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f'
        '\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f'
        ' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_'
        '`abcdefghijklmnopqrstuvwxyz{|}~\x7f'
        '\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f'
        '\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f'
        '\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf'
        '\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf'
        '\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf'
        '\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf'
        '\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef'
        '\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe'
    ]
    padding_symbol = '\xff'

Sample decoding errors:

>>> sa = StrictAsciiSquashEncoder()
>>> 
>>> ''.join(sa.encode('slartybartfast'))
'z_\x92d$\xceW\xce\xf6\x11t\x0b\xff\xff\xff'
>>> sa.decode(''.join(sa.encode([s for s in 'slartybartfast'])))
['s', 'l', 'a', 'r', 't', 'y', 'b', 'a', 'r', 't', 'f', 'a', 's', 'v']
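
The hunch above can be checked with plain integer arithmetic, independently of basest. The sketch below is not the library's implementation. It assumes an Ascii85-style scheme: a partial window is zero-padded before encoding, the same number of trailing output symbols is dropped, and the decoder substitutes the maximum symbol value for each dropped symbol. The helper names (from_digits, to_digits, roundtrip_partial) are hypothetical.

```python
def from_digits(digits, base):
    """Interpret a big-endian list of digits as an integer."""
    value = 0
    for digit in digits:
        value = value * base + digit
    return value


def to_digits(value, base, length):
    """Return the `length` lowest-order digits of value, big-endian."""
    digits = []
    for _ in range(length):
        value, digit = divmod(value, base)
        digits.append(digit)
    return digits[::-1]


def roundtrip_partial(symbols, in_base, out_base, in_ratio, out_ratio):
    """Encode then decode a partial window, Ascii85-style (an assumption)."""
    pad = in_ratio - len(symbols)
    # Encoder: zero-pad the input, convert, drop `pad` trailing output symbols.
    value = from_digits(list(symbols) + [0] * pad, in_base)
    kept = to_digits(value, out_base, out_ratio)[:out_ratio - pad]
    # Decoder: stand in the maximum symbol for each dropped one, convert back,
    # then discard the positions that were only padding.
    restored = kept + [out_base - 1] * pad
    decoded = to_digits(from_digits(restored, out_base), in_base, in_ratio)
    return decoded[:len(symbols)]


# Larger input base (256 -> 85 at 4:5, as in Ascii85): every pair survives.
ascii85_failures = sum(
    roundtrip_partial([a, b], 256, 85, 4, 5) != [a, b]
    for a in range(256) for b in range(256)
)
# Smaller input base (128 -> 255 at 9:8, as in this issue): many pairs do not.
squash_failures = sum(
    roundtrip_partial([a, b], 128, 255, 9, 8) != [a, b]
    for a in range(128) for b in range(128)
)
print(ascii85_failures, squash_failures)
```

Under these assumptions the first count is zero and the second is not. With k padded symbols, max-padding shifts the decoded value by less than out_base^k, which stays inside the k padded input digits only when out_base is smaller than in_base. In the 128-to-255 case the truncated output also has too few symbols to identify the input uniquely.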

Create custom error classes

The library should define custom error classes that inherit from Python's built-in exception classes.

This will make testing easier and prove that exceptions are being raised by code of mine that specifically checks for error conditions and not because of default Python error-handling.
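
A minimal sketch of what this could look like. The class names (BasestError, InvalidSymbolError) and the symbol_index helper are hypothetical and are not taken from the library.

```python
class BasestError(Exception):
    """Hypothetical base class for every error the library raises."""


class InvalidSymbolError(BasestError, ValueError):
    """Hypothetical error for a symbol that is missing from an alphabet."""


def symbol_index(symbol, alphabet):
    """Look up a symbol, raising the library's own error on failure."""
    try:
        return alphabet.index(symbol)
    except ValueError:
        # Replace the generic built-in error with a library-specific one.
        raise InvalidSymbolError("unknown symbol: {!r}".format(symbol))
```

Because InvalidSymbolError also inherits from ValueError, existing callers that catch ValueError keep working, while tests can assert on the more specific class to prove the error came from an explicit check.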

Class-based Encoder system

Create a system where an Encoder base class can be inherited from and some attributes set which describe a custom encoding algorithm.

This could then be used to create example encoders/decoders for existing binary-to-text encoding systems such as base64, ascii85 etc...

Add error-handling for non-unique encoding alphabets

When using encode() and decode(), ValueError should be raised if either the input or output alphabets are not unique (or if the output alphabet contains the output padding character).
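
A rough sketch of such a check, assuming symbols are hashable. The function name validate_alphabets is hypothetical, and how it would be wired into encode() and decode() is left open.

```python
def validate_alphabets(input_alphabet, output_alphabet, padding_symbol):
    """Raise ValueError for duplicate symbols or a clashing padding symbol."""
    named = (("input", input_alphabet), ("output", output_alphabet))
    for name, alphabet in named:
        # A set drops duplicates, so a shorter set means repeated symbols.
        if len(set(alphabet)) != len(alphabet):
            raise ValueError(
                "{} alphabet contains duplicate symbols".format(name)
            )
    if padding_symbol in output_alphabet:
        raise ValueError("output alphabet contains the padding symbol")
```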

Split up unit test files

Some of these have become very large, and some contain more than one class.

These should be split into separate files, with any shared parts refactored into their own modules for import where needed.

Revise the class-based interface

This needs tidying up. There are two or three routes I could go down:

Remove the need to instantiate an encoder class by making the methods @classmethod
~~Change the paradigm to having the encoding settings set at object construction~~ decided against
Use class inheritance far more to allow customisation at the inheritance level, using mixin classes e.g. there might be a StreamEncoder base class and a MappingEncoder mixin class. This would probably require deprecating the functional interface entirely.

Generator-based Streaming Encoder and Decoder interfaces

These would most likely replace the core encoder and decoder functions and would function in some way that would allow partial output of an encoded or decoded stream once it has received enough input.

E.g. say an encoding ratio of 4 to 5 symbols is being used, then the encoding generator would output two symbols after receiving two input symbols, as two symbols' input from a ratio of 4 is enough to calculate the values of two symbols' output for a ratio of 5.

This would have a potential speed efficiency improvement for encoding and decoding streams of data, and the other more traditional encode() and decode() functions could piggy-back on its functionality and just return the result as a list when done.
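
A simplified sketch of the idea, working on numbers rather than symbols. It only yields output once a whole input window has arrived, which is coarser than the partial-window output described above. The name stream_encode and its parameters are hypothetical, and padding is not signalled: a trailing partial window is simply zero-padded.

```python
def stream_encode(numbers, in_base, out_base, in_ratio, out_ratio):
    """Lazily yield output digits as each input window is completed."""
    window = []
    for number in numbers:
        window.append(number)
        if len(window) == in_ratio:
            for digit in _convert(window, in_base, out_base, out_ratio):
                yield digit
            window = []
    if window:
        # Assumption: zero-pad the last window; a real encoder would also
        # need to tell the decoder how much padding was added.
        window += [0] * (in_ratio - len(window))
        for digit in _convert(window, in_base, out_base, out_ratio):
            yield digit


def _convert(window, in_base, out_base, out_ratio):
    """Convert one full window of digits between bases (big-endian)."""
    value = 0
    for number in window:
        value = value * in_base + number
    digits = []
    for _ in range(out_ratio):
        value, digit = divmod(value, out_base)
        digits.append(digit)
    return digits[::-1]


def encode_all(numbers, in_base, out_base, in_ratio, out_ratio):
    """Traditional interface piggy-backing on the generator."""
    return list(stream_encode(numbers, in_base, out_base, in_ratio, out_ratio))
```

For example, encode_all([77, 97, 110], 256, 64, 3, 4) gives [19, 22, 5, 46], the Base64 digit values for the bytes of "Man".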

Add validation of encoding/decoding options

Validation should be added to the functions that take options defining the parameters for a given custom encoding system, to ensure that they are given a sane configuration. This matters because:

It's not possible to encode from a smaller base to a larger one with padding (i.e. if given input data that is not the same length as the input window). This will corrupt the data and prevent the same from being retrieved verbatim.
Arguments should also be checked for type.
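
The checks above could be sketched roughly as follows. The function name validate_options is hypothetical, and the capacity rule (the output window must be able to hold every possible input window) is my assumption of what a sane configuration means here.

```python
def validate_options(input_base, output_base, input_ratio, output_ratio):
    """Raise TypeError or ValueError for an unusable encoding configuration."""
    options = (
        ("input_base", input_base, 2),
        ("output_base", output_base, 2),
        ("input_ratio", input_ratio, 1),
        ("output_ratio", output_ratio, 1),
    )
    for name, value, minimum in options:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("{} must be an integer".format(name))
        if value < minimum:
            raise ValueError("{} must be at least {}".format(name, minimum))
    # Every possible input window must fit in one output window.
    if input_base ** input_ratio > output_base ** output_ratio:
        raise ValueError("output window is too small for the input window")
```

A further check for the smaller-to-larger-base padding case described above could be added in the same place once the intended behaviour is settled.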

Decoder function

Create a decoder function that performs the exact inverse of basest.encode.

Setup Travis CI Builds

Use the Makefile for build commands.
Builds to be tested on the following CPython versions:

Add version of best_ratio() that searches based on range of output chunk sizes

Currently, basest.core.best_ratio() only allows a range of input chunk sizes to be given. This is inconvenient if the constraints are on the size of the output chunk (say, we want to know how many base-N symbols fit in 1 KiB of space).

Thus, an optional feature should be added allowing the output chunk size range to be specified instead.

It might also be possible to supply constraints for both input and output chunk sizes, but I'm not sure how feasible this is.
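
One possible shape for the output-driven search, as a standalone sketch. It does not use or assume the signature of the existing best_ratio(); the function name best_ratio_for_output_sizes and its return format are hypothetical.

```python
def best_ratio_for_output_sizes(input_base, output_base, output_sizes):
    """Return (input_size, output_size) packing the most input per output.

    Returns None if no candidate output size can hold an input symbol.
    Ties go to the first (usually smallest) output size encountered.
    """
    best = None
    for out_size in output_sizes:
        capacity = output_base ** out_size
        # Largest input window whose every value fits in this output window.
        in_size = 0
        while input_base ** (in_size + 1) <= capacity:
            in_size += 1
        if in_size == 0:
            continue
        # Compare in_size / out_size by cross-multiplying to stay in integers.
        if best is None or in_size * best[1] > best[0] * out_size:
            best = (in_size, out_size)
    return best


print(best_ratio_for_output_sizes(256, 85, range(1, 11)))  # (4, 5)
```

Constraining both sides at once could then amount to filtering the candidate input sizes inside the loop, though I have not explored that here.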

Document Encoder class

Added in #5 - needs documenting

Fix setup.py so the package can be published to PyPI

It turns out that PyPI has changed the package upload process quite a lot since I started writing this project. My setup.py script now doesn't work at all!

There is a guide available from the PyPI project on how to migrate older projects, so I should read it and apply whichever parts are relevant to my own package-publishing process.

Note: should use test PyPI to check this is working properly.

Better validation of parameters in constructors and some functions

It's currently possible to construct an Encoder instance without any arguments, but such an instance is unusable because its member variables are then set to None.

Change best_ratio() to accept multiple input bases

This is a bit of a niche need, but not inconceivable (this would allow someone to find which is the most efficient encoding combination of several different bases).

Add error-handling for invalid input sequences

When encoding or decoding an input sequence that contains unexpected symbols (or, when decoding, has the wrong length), ValueError should be raised.

Add more stress-tests

test partial input window with larger input base
test complete input window with smaller input base

Stress-tests

Write some more comprehensive stress-tests which check that the encoder and decoder functions successfully handle many different unusual output bases, including with partial input to check that padding works successfully across any output base.

Publish Package to PyPI

Prerequisites:

Choose a software license - Chosen - Mozilla Public License v2.0
Check the package works on PyPI - Use test PyPI
Build passing on all target platforms. Platforms to test: 2.7, 3.3, 3.4, 3.5

Make padding symbol optional when not required

Padding symbol is not required when:

The input ratio is 1
Input is always a multiple of the input ratio (e.g. for small base to big base encoders).

Add sample encoders

Perhaps these could be included in an examples module.

Well-known encoders I'd like to produce examples for:

Base64
Base64, URL-safe variant
Ascii85
Base85 (revised version of Ascii85 conforming to RFC 1924)
Z85 (ZeroMQ version of Ascii85)
Base91
Base32
Base16 (maybe too easy?)
Base36
Base58

With the exception of the base-85 schemes (which perform some additional kind of run-length encoding on certain output patterns), all of these should be rather trivial to implement as subclasses, and might serve as more helpful documentation and proof-of-concept to potential users.

Regarding the output transformation of the Ascii85 variants, it might be worth holding off until issue #27 is done (the pre-processing and post-processing ideas would be very helpful for such a scheme as Ascii85).

Raw encode and decode functions

These will accept and output numbers only, rather than symbols. The ordinary encode() and decode() functions would then change to be just wrappers around these, and convert to and from the different symbol sets.
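
A sketch of the split, assuming whole input windows only (no padding). The names raw_encode and symbol_encode are hypothetical and the signatures are not those of the existing functions. The Base64 alphabet is used purely as a familiar example.

```python
import string

BASE64_ALPHABET = list(
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)


def raw_encode(numbers, in_base, out_base, in_ratio, out_ratio):
    """Convert digit values between bases, one full window at a time."""
    if len(numbers) % in_ratio:
        raise ValueError("this sketch handles whole input windows only")
    output = []
    for start in range(0, len(numbers), in_ratio):
        value = 0
        for number in numbers[start:start + in_ratio]:
            value = value * in_base + number
        window = []
        for _ in range(out_ratio):
            value, digit = divmod(value, out_base)
            window.append(digit)
        output.extend(reversed(window))
    return output


def symbol_encode(symbols, in_table, out_table, in_ratio, out_ratio):
    """Thin wrapper: symbols -> numbers -> raw_encode -> symbols."""
    numbers = [in_table.index(symbol) for symbol in symbols]
    raw = raw_encode(
        numbers, len(in_table), len(out_table), in_ratio, out_ratio
    )
    return [out_table[number] for number in raw]


byte_table = list(range(256))
print("".join(symbol_encode(list(b"Man"), byte_table, BASE64_ALPHABET, 3, 4)))
# TWFu
```

A raw decode and its wrapper would mirror this with the tables and ratios swapped.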

Cleanup files after packaging fixes

I think MANIFEST.in can be removed from source control, as it appears to be auto-generated.
Change the project description to that currently used as the Github repo description: Arbitrary base binary-to-text (or anything to anything) encoder.