/cc @mourner @springmeyer @artemp
## The problem
The geobuf format as it is currently defined can't be written as a stream. Instead, the whole data set has to be assembled in memory first and then written out. This follows directly from the way Protobuf encodes its messages. A good description of the problem can be found in the header comments of the protobuf writer of the UPB library. In short, the problem is that Protobuf uses nested messages to encode the data, and each message has a length header which is encoded as a Varint. But we can't write out the length (or even know how long the length field is, because a Varint is of variable length) before we have assembled the whole message.
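To make the problem concrete, here is a minimal hand-rolled sketch (plain C++, no protobuf library) of how a length-delimited field is put on the wire; note that not a single byte of the field can be emitted before the payload is complete:

```cpp
#include <cstdint>
#include <string>

// Append a protobuf Varint (7 bits per byte, MSB set on all but the last byte).
void write_varint(std::string& out, uint64_t value) {
    while (value >= 0x80) {
        out += static_cast<char>((value & 0x7f) | 0x80);
        value >>= 7;
    }
    out += static_cast<char>(value);
}

// Append a length-delimited field (wire type 2). The length Varint comes
// before the payload, so the payload has to be fully assembled in memory
// first -- this is exactly the streaming problem described above.
void write_length_delimited(std::string& out, int field_number,
                            const std::string& payload) {
    write_varint(out, (static_cast<uint64_t>(field_number) << 3) | 2u);
    write_varint(out, payload.size());  // needs the complete payload size
    out += payload;
}
```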
This is, of course, a major problem for a format that is intended for huge files.
## A possible solution: Remove outermost message
Because this is an inherent limitation of the Protobuf format, we have to look outside the format for a solution. We could, of course, throw away the whole Protobuf format, but that's not needed. What's needed is a wrapper around it, so that the Protobuf encoder/decoder only sees part of the data. In the simplest case we encode the data in pieces:
We remove the outermost message `Data` and then write each of the other data pieces out as its own Protobuf message. We might want to move `keys`, `dimensions`, and `precision` into a message `Header` or so. The `oneof data_type` construct doesn't work any more; instead we just have to keep reading messages until EOF. But we don't know which kind of messages will be in there (`Feature`, `FeatureCollection`, etc.), so we have to add this information to the newly introduced message `Header` in some way and then parse accordingly. This should certainly be doable and doesn't require a lot of change. But I think there is...
## A better solution: Chunking the data
This looks slightly more complicated at first, but it has many advantages, so bear with me. Let's encode the data in chunks, where each chunk gets a length field and the data:
```
CHUNK
  LENGTH (4 bytes)
  DATA (LENGTH bytes)
CHUNK
  LENGTH (4 bytes)
  DATA (LENGTH bytes)
...
```
Each DATA block is a complete Protobuf message which can be parsed on its own. This idea is, of course, not new: it is what the OpenStreetMap OSM PBF format does.
A typical DATA field will contain maybe a few thousand geometries or features. Note that this does not mean that the contents of the different chunks are somehow logically distinct. Logically this is still one data stream; the chunking is purely an encoding issue, and files with the same data split up into differently sized chunks would still represent the same data.
This format has some added advantages:
- the LENGTH field can tell us how much memory to allocate for buffering the DATA part
- reading and writing can be done in parallel, because several threads can work on encoding/decoding different chunks at the same time. In fact, that's what Libosmium does when parsing OSM PBF.
- concatenating two files is trivial: deal with the headers, then just copy data chunks
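To show how little machinery a decoder for this framing needs, here is a minimal sketch of the read loop (assuming the 4-byte length is stored in network byte order, as suggested further below). Note how the LENGTH field directly drives the buffer allocation:

```cpp
#include <cstdint>
#include <istream>
#include <stdexcept>
#include <string>
#include <vector>

// Read all chunks framed as: 4-byte big-endian length, then that many
// bytes of data. Each returned string is one complete Protobuf message
// that can be handed to the decoder (or a worker thread) on its own.
std::vector<std::string> read_chunks(std::istream& in) {
    std::vector<std::string> chunks;
    char len_buf[4];
    while (in.read(len_buf, 4)) {  // stops cleanly at EOF
        const uint32_t length =
            (static_cast<uint32_t>(static_cast<uint8_t>(len_buf[0])) << 24) |
            (static_cast<uint32_t>(static_cast<uint8_t>(len_buf[1])) << 16) |
            (static_cast<uint32_t>(static_cast<uint8_t>(len_buf[2])) <<  8) |
             static_cast<uint32_t>(static_cast<uint8_t>(len_buf[3]));
        std::string data(length, '\0');  // LENGTH tells us what to allocate
        if (!in.read(&data[0], length)) {
            throw std::runtime_error("truncated chunk");
        }
        chunks.push_back(std::move(data));
    }
    return chunks;
}
```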
Now it gets a little bit more complicated than that. (This is again based on experience with the OSM PBF format.) The first chunk should probably contain some kind of header. This could include metadata such as the `dimensions` setting and the `keys`. All other chunks contain the data itself. So chunks (OSM PBF calls them Blobs) need some kind of type identifier to differentiate between a header chunk and a data chunk. OSM PBF does this by adding another level of Protobuf encoding (see the `BlobHeader` and `Blob` messages), which seems like overkill to me: it makes the implementation rather confusing and probably slower. Instead we can just add a type field:
```
CHUNK
  HEADER (fixed size)
    TYPE (1 byte, first chunk always =META)
    LENGTH (4 bytes)
  DATA (LENGTH bytes)
CHUNK
  HEADER (fixed size)
    TYPE (1 byte, following chunks always =GEOMDATA)
    LENGTH (4 bytes)
  DATA (LENGTH bytes)
...
```
Strictly speaking we could live without that TYPE field, because the header always has to be the first chunk and all following chunks have to be data chunks, but having the type seems cleaner and gives us more flexibility. And maybe we want to have different types of data? This is something that has to be explored.
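As a sketch of what that could look like in code (the names and numeric values are placeholders of mine, not part of any spec):

```cpp
#include <cstdint>

// Hypothetical chunk types. Having an explicit TYPE byte leaves room
// for new kinds of chunks later.
enum class ChunkType : uint8_t {
    META     = 0,  // first chunk: file metadata (keys, dimensions, ...)
    GEOMDATA = 1   // all following chunks: the actual features/geometries
};

// Fixed-size view of the on-disk chunk header (before compression
// support is added below).
struct ChunkHeader {
    ChunkType type;    // 1 byte
    uint32_t  length;  // 4 bytes: size of the DATA part
};
```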
OSM PBF adds another useful feature: encoding chunks (or Blobs) with zlib or other compression formats. Each chunk can optionally be compressed, and the type of compression is noted in the chunk header. This squeezes the last bytes out of the resulting files while still allowing the file to be encoded and decoded in chunks and in parallel. To add this we need, again, a type field. And we should also store the size of the uncompressed data, because that allows us to hand the decompressor a buffer of the correct size.
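Here is a minimal sketch of how the two length fields interact with zlib, using its one-shot `compress`/`uncompress` API (a real implementation would probably want the streaming API and better error handling):

```cpp
#include <zlib.h>
#include <cstdint>
#include <stdexcept>
#include <string>

// Compress one chunk's DATA. RAW_LENGTH is data.size(); ENCODED_LENGTH
// is the size of the returned buffer.
std::string compress_chunk(const std::string& data) {
    uLongf dest_len = compressBound(data.size());
    std::string out(dest_len, '\0');
    if (compress(reinterpret_cast<Bytef*>(&out[0]), &dest_len,
                 reinterpret_cast<const Bytef*>(data.data()),
                 data.size()) != Z_OK) {
        throw std::runtime_error("compression failed");
    }
    out.resize(dest_len);
    return out;
}

// Decompress one chunk. Knowing RAW_LENGTH up front lets us hand the
// decompressor a buffer of exactly the right size.
std::string uncompress_chunk(const std::string& encoded, uint32_t raw_length) {
    std::string out(raw_length, '\0');
    uLongf dest_len = raw_length;
    if (uncompress(reinterpret_cast<Bytef*>(&out[0]), &dest_len,
                   reinterpret_cast<const Bytef*>(encoded.data()),
                   encoded.size()) != Z_OK || dest_len != raw_length) {
        throw std::runtime_error("decompression failed");
    }
    return out;
}
```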
Note that I have used 4-byte length fields in my description. This is probably enough for each chunk; in fact, chunks should not get too big, because each one has to fit into memory after all (several at a time if we encode/decode in parallel). OSM PBF has some extra constraints on the sizes of different structures, which can help implementations because fixed-size buffers can be used.
Note also that there is no overall length field for the whole file. That's important, because it allows us to stream-write the data without knowing beforehand how many features the file will contain or how big it will be. (We might want to end the file with some END chunk that marks the end of the file to guard against truncation. Optionally it could contain a checksum. This is something OSM PBF is missing, but it could be useful for detecting data corruption.)
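If we go for a checksum, a running CRC32 as provided by zlib would be one cheap option. A minimal sketch, assuming (this is not settled) that the checksum covers all bytes written before the END chunk:

```cpp
#include <zlib.h>
#include <cstdint>
#include <string>

// Maintain a running CRC32 over everything written so far; the final
// value would go into the END chunk to detect truncation or corruption.
class ChecksumWriter {
public:
    void add(const std::string& bytes) {
        crc_ = crc32(crc_, reinterpret_cast<const Bytef*>(bytes.data()),
                     static_cast<uInt>(bytes.size()));
    }
    uint32_t checksum() const { return static_cast<uint32_t>(crc_); }
private:
    uLong crc_ = crc32(0L, Z_NULL, 0);  // zlib's initial CRC value
};
```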
And while we are at it, I suggest adding a fixed 4-byte (or so) magic header that is always the same and can be used by tools such as `file` to determine the file type easily, and a fixed-size version field to future-proof the format.
This brings us to something like this:
```
MAGIC (fixed size)
VERSION (=1, fixed size)
CHUNK
  HEADER (fixed size)
    TYPE (1 byte, first chunk always =META)
    COMPRESSION_TYPE (1 byte)
    RAW_LENGTH (4 bytes)
    ENCODED_LENGTH (4 bytes)
  DATA (ENCODED_LENGTH bytes)
CHUNK
  HEADER (fixed size)
    TYPE (1 byte, following chunks always =GEOMDATA)
    COMPRESSION_TYPE (1 byte)
    RAW_LENGTH (4 bytes)
    ENCODED_LENGTH (4 bytes)
  DATA (ENCODED_LENGTH bytes)
...
CHUNK
  HEADER (fixed size)
    TYPE (1 byte, last chunk always =END)
    COMPRESSION_TYPE (1 byte)
    RAW_LENGTH (4 bytes)
    ENCODED_LENGTH (4 bytes)
  DATA (ENCODED_LENGTH bytes)
    CHECKSUM
```
Some padding might be necessary to keep the length fields on 4-byte boundaries, etc. And all length fields should probably be encoded in network byte order. Those details can be worked out.
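For illustration, a sketch of writing one chunk with the header fields in network byte order (using POSIX `htonl`; the two padding bytes after the type fields are just one possible way to keep the length fields 4-byte aligned):

```cpp
#include <arpa/inet.h>  // htonl
#include <cstdint>
#include <cstring>
#include <ostream>
#include <string>

// Write one chunk: TYPE, COMPRESSION_TYPE, 2 bytes padding, RAW_LENGTH,
// ENCODED_LENGTH, DATA. Both lengths are written in network byte order.
void write_chunk(std::ostream& out, uint8_t type, uint8_t compression,
                 uint32_t raw_length, const std::string& encoded_data) {
    char head[12] = {};  // zero-initialized, so head[2..3] are padding
    head[0] = static_cast<char>(type);
    head[1] = static_cast<char>(compression);
    const uint32_t raw_be = htonl(raw_length);
    const uint32_t enc_be = htonl(static_cast<uint32_t>(encoded_data.size()));
    std::memcpy(head + 4, &raw_be, 4);
    std::memcpy(head + 8, &enc_be, 4);
    out.write(head, sizeof(head));
    out.write(encoded_data.data(),
              static_cast<std::streamsize>(encoded_data.size()));
}
```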
Inside the DATA we'd still use the Protobuf-encoded data (nearly) as before. No big change there. For some things, such as the `keys`, we have to discuss whether they fit better in the META header or in the DATA part. In the META and END blocks we can use Protobuf, too, or any other encoding. Because that's not a lot of data, it isn't that important to pack it tightly, and using a simpler format might allow simpler access to the metadata. On the other hand, Protobuf is tried and true and allows for easy extensibility.