Comments (4)
Thank you so much for such a detailed response. I didn't think about the integer encoding, I will definitely look into LZ4 encoding! Thanks!
from glaze.
Great thoughts. However, BEVE is highly concerned with performance. If you were to write all integers as compressed integers, you would have a 10X or greater performance loss for large arrays (because you could no longer simply memcpy them). Also, if you simply run a compression algorithm on your BEVE data, you gain most of the compression benefits and it becomes entirely opt-in.
BEVE is designed to be easily compressed: a value of 20 in a uint64_t means that you have 7 consecutive zero bytes. If you often have numbers like this, a compression algorithm will easily handle it.
When it comes to headers, they are necessary if the data is to be written to file and loaded by another program without having a schema. I much prefer schemaless formats, as they are much easier to debug and allow files to be archived long term without needing to save matching schema documents. BEVE is also designed to convert directly to/from JSON, so that's another requirement for headers.
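To make the zero-byte point concrete, here is a minimal sketch (not Glaze code; `to_bytes` and `count_zero_bytes` are hypothetical helper names). It copies an array of `uint64_t` values into a byte buffer with a single memcpy, as a fixed-width layout allows, and counts how many of the resulting bytes are zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy fixed-width integers into a byte buffer with one memcpy.
// This is the fast path that a variable-width encoding would give up.
std::vector<unsigned char> to_bytes(const std::vector<std::uint64_t>& values) {
    std::vector<unsigned char> bytes(values.size() * sizeof(std::uint64_t));
    if (!bytes.empty()) {
        std::memcpy(bytes.data(), values.data(), bytes.size());
    }
    assert(bytes.size() == values.size() * sizeof(std::uint64_t));
    return bytes;
}

// Count the zero bytes that a general-purpose compressor can squeeze out.
std::size_t count_zero_bytes(const std::vector<unsigned char>& bytes) {
    std::size_t zeros = 0;
    for (unsigned char b : bytes) {
        if (b == 0) {
            ++zeros;
        }
    }
    return zeros;
}
```

For a single `uint64_t` holding 20, `count_zero_bytes(to_bytes({20}))` reports 7 of the 8 bytes as zero, regardless of byte order.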
I do like the idea of a header-less binary format that focuses on minimizing memory. I think it would be a good addition to Glaze. If you wanted to add this raw binary format to Glaze, I would be happy to merge it in. But, you may find that simply using a compression algorithm would solve your issues.
from glaze.
Couldn't a compromise be to use compressed integers for non-array values and fixed-size integers for arrays?
I do get the compression argument, I just worry about the performance impact of compressing (my use case is sending a lot of data over UDP, so compressing 1.5 KB packets over and over is not very efficient). I do agree that it can work for storing in files.
Regarding the header-less format, if I were to implement it in glaze, where do I start, can you give me some pointers?
from glaze.
> Couldn't a compromise be to use compressed integers for non-array values and fixed-size integers for arrays?
It actually isn't a good compression mechanism for integers:
- No means of storing the larger values (63 or 64 bit) of `uint64_t` and `int64_t`, so these types would require another byte to indicate their type.
- With smaller sized arrays compression is more valuable, so using compressed integers for size indicators makes a lot of sense. It is more important to save some bytes on an array of 3 values versus an array of 3,000 values. But, using this form of compression on integers in general means that we would be making our file larger. For example, `uint8_t` values from 64 to 255 would require an extra byte. So, we don't actually save anything for the majority of `uint8_t` values. The same is true for the other integer types: 75% of the time (2 bits quarters our range) we don't get compression savings. The issue is made worse by the fact that we use power of 2 bytes to store integers. So if we were to store `16384` (2^14) in a `uint16_t` we would have to bump the storage integer to a `uint32_t`. This is adding 2 bytes to 75% of our `uint16_t` values. So, you can see that this is generally a bad compression algorithm for integers and really only makes sense for compressing sizes of arrays and objects. A compression algorithm like LZ4 will usually (statistically) be much more efficient than compressing integers in the manner that you and BEVE have implemented.
- Using a compression algorithm will also find patterns in your numbers that are next to each other. So a compression algorithm will handle a bunch of zeros much better than using the BEVE size indicator compression.
> I do get the compression argument, I just worry about the performance impact of compressing...
I'll note that another argument for compression is that if you have strings and care about size (and network performance), then you probably should be compressing your data, because compressing strings will significantly save space and therefore transfer time.
High speed compression algorithms will run faster than 500 MB/s, and sending less data over UDP will also improve performance. So, you will likely gain back the compression time by needing to transfer less data. I think LZ4 is probably an excellent choice for your use case.
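A rough way to reason about this trade-off is a back-of-the-envelope model. The sketch below is only that: the function names are hypothetical, the link bandwidth and compression ratio are inputs you would have to measure for your own system, and decompression time on the receiver is ignored.

```cpp
#include <cassert>

// Time to put `bytes` on the wire without compression.
double send_seconds(double bytes, double link_bytes_per_sec) {
    assert(link_bytes_per_sec > 0.0);
    return bytes / link_bytes_per_sec;
}

// Time to compress and then send the smaller payload.
// `ratio` is compressed size divided by original size (0 < ratio <= 1).
double compress_and_send_seconds(double bytes, double compress_bytes_per_sec,
                                 double link_bytes_per_sec, double ratio) {
    assert(compress_bytes_per_sec > 0.0 && link_bytes_per_sec > 0.0);
    assert(ratio > 0.0 && ratio <= 1.0);
    return bytes / compress_bytes_per_sec + (bytes * ratio) / link_bytes_per_sec;
}

// True when compressing first gets the data out sooner than sending it raw.
bool compression_pays_off(double bytes, double compress_bytes_per_sec,
                          double link_bytes_per_sec, double ratio) {
    return compress_and_send_seconds(bytes, compress_bytes_per_sec,
                                     link_bytes_per_sec, ratio) <
           send_seconds(bytes, link_bytes_per_sec);
}
```

In this model, compression wins whenever the ratio is below 1 minus (link speed / compression speed). With a 500 MB/s compressor and an assumed 125 MB/s link, that threshold is 0.75, so a 1500-byte packet only needs to shrink by a little more than a quarter to come out ahead.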
I would like to add some compression helpers to Glaze, to make it easier to work with BEVE and compression, and at my work I actually have the need for high speed compression as well. So, I'll be working on this in the near future. One thing to note is that if your system would allow two cores for serializing data, then we can actually run the compression algorithm in parallel with the BEVE serialization. This would mean that there would be almost zero overhead to compression, but it would use another thread. I'll write up an issue for this, because it is a feature I would like to have.
> Regarding the header-less format, if I were to implement it in glaze, where do I start, can you give me some pointers?
I think the BEVE format works for everything you want, except for headers within structs and tuple-like arrays.
Thanks for getting me to consider this more, because I'm now thinking we don't need to implement a completely new format. Rather, I think we can add BEVE extensions for raw-byte objects and arrays. These wouldn't be schema-less, but would be great for where size is critical. And, I think adding them to a format that is generally schema-less and allows tags is a benefit, because the user can decide how much introspection they want versus message size.
I'll make a performance note as well: if a C++ struct is_standard_layout (holds trivial types like ints, bools, and floats), then we don't have to iterate over the elements of the struct and can simply memcpy the entire struct. This will provide a significant performance improvement for these kinds of structs and is extra motivation to support this header-less format.
In conclusion, hold off on implementing a header-less format until I've figured out how best to add it to BEVE. In the meantime, I would recommend experimenting with LZ4 and seeing if it helps you.
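As an illustration of the memcpy idea, here is a minimal sketch. It is not how Glaze implements anything: `Sample`, `to_raw_bytes`, and `from_raw_bytes` are hypothetical names, and the sketch additionally requires `std::is_trivially_copyable`, which is the trait the language uses to say that copying an object byte by byte is valid.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical message type holding only trivial members.
struct Sample {
    std::int32_t id;
    float x;
    float y;
};

// Serialize the whole struct with one memcpy instead of visiting each member.
template <class T>
std::array<unsigned char, sizeof(T)> to_raw_bytes(const T& value) {
    static_assert(std::is_standard_layout_v<T> && std::is_trivially_copyable_v<T>,
                  "raw memcpy is only valid for trivially copyable types");
    std::array<unsigned char, sizeof(T)> out{};
    std::memcpy(out.data(), &value, sizeof(T));
    return out;
}

// Rebuild the struct from bytes produced by to_raw_bytes on the same platform.
template <class T>
T from_raw_bytes(const std::array<unsigned char, sizeof(T)>& bytes) {
    static_assert(std::is_standard_layout_v<T> && std::is_trivially_copyable_v<T>,
                  "raw memcpy is only valid for trivially copyable types");
    T value;
    std::memcpy(&value, bytes.data(), sizeof(T));
    return value;
}

// Example round trip with the hypothetical Sample type.
inline void round_trip_example() {
    const Sample original{42, 1.5f, -2.0f};
    const auto bytes = to_raw_bytes(original);
    const Sample copy = from_raw_bytes<Sample>(bytes);
    assert(copy.id == original.id && copy.x == original.x && copy.y == original.y);
}
```

Note that a raw copy like this also carries any padding bytes and the host byte order, so both ends must agree on the struct layout. That is the schema dependence mentioned above.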
from glaze.
Related Issues (20)
- Function signatures inconsistency between write_file_json and read_file_json HOT 2
- Stack overflow when returning intermediate object to serialize in_addr HOT 6
- Bug: Malformed JSON string produced HOT 10
- Build and test for 32-bit in Actions
- std::pair arrays roundtrip
- `float` member issue with `clang++-15` and `g++-12` HOT 3
- `json_test.cpp(7840): warning C4267: '=': conversion from 'size_t' to 'uint16_t', possible loss of data` HOT 1
- glz::reader/glz::writer for incremental reading/writing HOT 5
- Partial read for BEVE
- glz::raw without quotes question HOT 1
- Binary serialization of hidden members HOT 2
- error: constructor priorities are not supported 3316 | const char* argv[]) HOT 3
- gcc and msvc compilation error with explicit constructors HOT 19
- Warnings & Errors in various configurations HOT 16
- rapidjson comparison HOT 15
- How do I read an array of json objects in glaze? HOT 15
- field-based parse bifurcation HOT 1
- Add the library to nuget HOT 2
- Option to skip reading in null values
- calculate serialize size before serialization and serialize to pre-allocated memory HOT 3