I tried the untagged binary format, and it just works, which is amazing. I have a few

Thoughts on untagged binary format about glaze HOT 4 CLOSED

kalradivyanshu commented on June 4, 2024

Thoughts on untagged binary format

from glaze.

Comments (4)

kalradivyanshu commented on June 4, 2024

Thank you so much for such a detailed response. I hadn't thought about the integer encoding, and I will definitely look into LZ4 compression. Thanks!

stephenberry commented on June 4, 2024

Great thoughts. However, BEVE is highly concerned with performance. If you were to write all integers as compressed integers, you would take a 10x or greater performance loss on large arrays, because they could no longer be written with a simple memcpy. Also, if you simply run a compression algorithm over your BEVE data, you gain most of the compression benefits and it becomes entirely opt-in.

BEVE is designed to be easily compressed: a value of 20 in a uint64_t means that you have 7 consecutive zero bytes. If you often have numbers like this, a compression algorithm will easily handle it.
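To make the zero-byte point concrete, here is a minimal sketch that copies the raw bytes of a uint64_t and counts the zeros. The helper names `raw_bytes` and `zero_byte_count` are made up for illustration and are not part of Glaze or BEVE.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy the raw bytes of a 64-bit value, as a fixed-width writer would emit them.
inline std::array<unsigned char, sizeof(std::uint64_t)> raw_bytes(std::uint64_t value) {
    std::array<unsigned char, sizeof(std::uint64_t)> out{};
    std::memcpy(out.data(), &value, sizeof(value));
    return out;
}

// Count how many of those bytes are zero. The count does not depend on
// endianness; only the position of the non-zero byte does.
inline std::size_t zero_byte_count(std::uint64_t value) {
    std::size_t zeros = 0;
    for (unsigned char b : raw_bytes(value)) {
        if (b == 0) ++zeros;
    }
    return zeros;
}

// Usage: the value 20 fits in one byte, so the other seven bytes are zero.
inline void zero_byte_example() {
    assert(zero_byte_count(20) == 7);
}
```

Long runs of identical bytes like these are exactly what a general-purpose compressor is good at removing.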

When it comes to headers, they are necessary if the data is to be written to a file and loaded by another program without a schema. I much prefer schema-less formats, as they are much easier to debug and allow files to be archived long term without needing to save matching schema documents. BEVE is also designed to convert directly to and from JSON, which is another reason headers are required.

I do like the idea of a header-less binary format that focuses on minimizing memory. I think it would be a good addition to Glaze. If you wanted to add this raw binary format to Glaze, I would be happy to merge it in. But, you may find that simply using a compression algorithm would solve your issues.

kalradivyanshu commented on June 4, 2024

Couldn't a compromise be compressed integers for non-array values, and sized integers for arrays?

I do get the compression argument; I just worry about the performance impact of compressing (my use case is sending a lot of data over UDP, so compressing 1.5 KB packets over and over is not super efficient). I do agree that it can work for storing data in files.

Regarding the header-less format, if I were to implement it in Glaze, where would I start? Can you give me some pointers?

stephenberry commented on June 4, 2024

Couldn't a compromise be compressed integers for non-array values, and sized integers for arrays?

It actually isn't a good compression mechanism for integers:

It has no means of storing the largest values of uint64_t and int64_t (those needing 63 or 64 bits), so these types would require another byte to indicate their type.
With smaller arrays, compression is more valuable, so using compressed integers for size indicators makes a lot of sense: it matters more to save a few bytes on an array of 3 values than on an array of 3,000 values. But using this form of compression on integers in general would make our files larger. For example, uint8_t values from 64 to 255 would require an extra byte, so we don't actually save anything for the majority of uint8_t values. The same is true for the other integer types: 75% of the time (2 bits quarter our range) we get no compression savings. The issue is made worse by the fact that we store integers in power-of-2 byte counts. To store 16384 (2^14) from a uint16_t, we would have to bump the storage integer to a uint32_t, which adds 2 bytes to 75% of our uint16_t values. So you can see that this is generally a bad compression algorithm for integers and really only makes sense for compressing the sizes of arrays and objects. A compression algorithm like LZ4 will usually (statistically) be much more efficient than compressing integers in the manner that you and BEVE have implemented.
A compression algorithm will also find patterns across numbers that sit next to each other, so it will handle a run of zeros much better than the BEVE size-indicator compression.
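The arithmetic in the points above can be checked with a short sketch. It assumes an encoding that reserves 2 bits of the stored integer as a size tag and uses 1, 2, 4, or 8 bytes, which is how this thread describes the size indicator. `tagged_size` and `count_grown` are hypothetical names, not Glaze functions.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store `value` when 2 bits are reserved for a size tag
// (assumption taken from the discussion): 1 byte leaves 6 value bits,
// 2 bytes leave 14, 4 bytes leave 30, and 8 bytes leave 62.
// Returns 0 when the value needs 63 or 64 bits and cannot be encoded.
constexpr int tagged_size(std::uint64_t value) {
    if (value < (1ull << 6))  return 1;
    if (value < (1ull << 14)) return 2;
    if (value < (1ull << 30)) return 4;
    if (value < (1ull << 62)) return 8;
    return 0;
}

// Count the values in [0, max_value] whose tagged encoding is larger than
// the native width `native_size` (in bytes). Intended for small ranges such
// as uint8_t and uint16_t; max_value must be below UINT64_MAX.
constexpr std::uint64_t count_grown(std::uint64_t max_value, int native_size) {
    std::uint64_t grown = 0;
    for (std::uint64_t v = 0; v <= max_value; ++v) {
        if (tagged_size(v) > native_size) ++grown;
    }
    return grown;
}

// Usage: 192 of the 256 uint8_t values (75%) grow from 1 byte to 2.
inline void tagged_size_example() {
    assert(count_grown(255, 1) == 192);
}
```

Under these assumptions no uint8_t value gets smaller, and three quarters of them get larger. The same 75% figure falls out for uint16_t, where every value from 16384 upward moves from 2 bytes to 4.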

I do get the compression argument, i just worry about the performance impact of compressing...

I'll note another argument for compression: if you have strings and care about size (and network performance), then you should probably be compressing your data anyway, because compressing strings saves significant space and therefore transfer time.

High speed compression algorithms will run faster than 500 MB/s, and sending less data over UDP will also improve performance. So, you will likely gain back the compression time by needing to transfer less data. I think LZ4 is probably an excellent choice for your use case.
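As a back-of-the-envelope check on that claim, the sketch below compares the time to compress one packet with the time saved on the wire. The 500 MB/s figure and the 1.5 KB packet come from this thread. The 100 Mbit/s link rate and the 2:1 compression ratio are purely illustrative assumptions, and real throughput on very small buffers may be lower than the headline number.

```cpp
#include <cassert>

// Time in microseconds to compress `bytes` at `mb_per_s` megabytes per second
// (1 MB taken as 10^6 bytes). bytes / (mb_per_s * 1e6) seconds, converted to
// microseconds, simplifies to bytes / mb_per_s.
constexpr double compress_time_us(double bytes, double mb_per_s) {
    return bytes / mb_per_s;
}

// Time in microseconds to put `bytes` on a link of `mbit_per_s` megabits per
// second. bytes * 8 bits / (mbit_per_s * 1e6) seconds, converted to
// microseconds, simplifies to bytes * 8 / mbit_per_s.
constexpr double transmit_time_us(double bytes, double mbit_per_s) {
    return bytes * 8.0 / mbit_per_s;
}

// Usage with the assumed numbers: about 3 us to compress a 1500-byte packet,
// against about 60 us saved on the wire if the packet shrinks to 750 bytes.
inline void packet_budget_example() {
    const double cost  = compress_time_us(1500.0, 500.0);
    const double saved = transmit_time_us(1500.0, 100.0) - transmit_time_us(750.0, 100.0);
    assert(saved > cost);
}
```

Whether this holds for a given system depends on the actual link speed and on how compressible the data is, so it is worth measuring with real packets.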

I would like to add some compression helpers to Glaze, to make it easier to work with BEVE and compression, and at my work I actually have the need for high speed compression as well. So, I'll be working on this in the near future. One thing to note is that if your system would allow two cores for serializing data, then we can actually run the compression algorithm in parallel with the BEVE serialization. This would mean that there would be almost zero overhead to compression, but it would use another thread. I'll write up an issue for this, because it is a feature I would like to have.

Regarding the header-less format, if I were to implement it in glaze, where do I start, can you give me some pointers?

I think the BEVE format works for everything you want, except for headers within structs and tuple-like arrays.

Thanks for getting me to consider this more, because I'm now thinking we don't need to implement a completely new format. Rather, I think we can add BEVE extensions for raw-byte objects and arrays. These wouldn't be schema-less, but they would be great where size is critical. And I think adding them to a format that is generally schema-less and allows tags is a benefit, because the user can decide how much introspection they want versus message size.

I'll make a performance note as well: if a C++ struct is_standard_layout (holds trivial types like ints, bools, and floats), then we don't have to iterate over the elements of the struct and can simply memcpy the entire struct. This will provide a significant performance improvement for these kinds of structs and is extra motivation to support this header-less format.
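A rough sketch of that whole-struct memcpy idea follows. `write_raw`, `read_raw`, and `Sample` are hypothetical names for illustration and are not Glaze API. Note that the property the C++ standard requires for a memcpy round trip is `std::is_trivially_copyable`, so this sketch checks that trait alongside the standard-layout check mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper: append the raw bytes of `value` to `buffer` in one copy.
template <class T>
void write_raw(std::vector<std::byte>& buffer, const T& value) {
    static_assert(std::is_trivially_copyable_v<T>, "memcpy requires a trivially copyable type");
    static_assert(std::is_standard_layout_v<T>, "expects a standard-layout type");
    const std::size_t offset = buffer.size();
    buffer.resize(offset + sizeof(T));
    std::memcpy(buffer.data() + offset, &value, sizeof(T));
}

// Hypothetical helper: read a T back from `buffer` starting at `offset`.
// Returns false when fewer than sizeof(T) bytes are available.
template <class T>
bool read_raw(const std::vector<std::byte>& buffer, std::size_t offset, T& out) {
    static_assert(std::is_trivially_copyable_v<T>, "memcpy requires a trivially copyable type");
    static_assert(std::is_standard_layout_v<T>, "expects a standard-layout type");
    if (offset > buffer.size() || buffer.size() - offset < sizeof(T)) return false;
    std::memcpy(&out, buffer.data() + offset, sizeof(T));
    return true;
}

// Example struct made up for this sketch.
struct Sample {
    std::int32_t id;
    float value;
    bool valid;
};

// Usage: one memcpy writes the struct, padding bytes included.
inline void raw_struct_example() {
    std::vector<std::byte> buffer;
    write_raw(buffer, Sample{7, 1.5f, true});
    assert(buffer.size() == sizeof(Sample));
}
```

The bytes written this way include padding and depend on endianness and ABI, so both ends have to agree on the struct layout. That is the trade-off of dropping headers.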

In conclusion, hold off on implementing a header-less format until I've figured out how best to add it to BEVE. In the meantime, I would recommend experimenting with LZ4 to see whether it helps you.
