The 16 bits used to communicate the frame size are not necessary: the first 8 bits of each frame's header already contain the number of channels, and from that the total frame size can be calculated, because all arrays depend only on the number of channels:
sizeof(frame_header) + num_channels * (sizeof(lms_state) + sizeof(qoa_slice_t) * 256)
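The computation above can be sketched as a small helper. The constants reflect the per-frame sizes in the QOA spec (an 8-byte frame header, 16 bytes of LMS state and 256 eight-byte slices per channel); the function and macro names are just for illustration:

```c
#include <stdint.h>

/* Per the QOA spec: 8-byte frame header, 16-byte LMS state per channel
   (4 history + 4 weight values, 2 bytes each), and 256 slices of
   8 bytes per channel. */
#define QOA_FRAME_HEADER_SIZE 8
#define QOA_LMS_STATE_SIZE    16
#define QOA_SLICE_SIZE        8
#define QOA_SLICES_PER_FRAME  256

/* Total size in bytes of a full frame with the given channel count */
static uint32_t qoa_frame_size_from_channels(uint32_t num_channels) {
	return QOA_FRAME_HEADER_SIZE + num_channels *
		(QOA_LMS_STATE_SIZE + QOA_SLICE_SIZE * QOA_SLICES_PER_FRAME);
}
```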
By dropping the frame size bits, and with them the size limit imposed by an unsigned 16-bit value, we could theoretically also drop this channel limit:
#define QOA_MAX_CHANNELS 8
As a suggestion, I would propose including more metadata to improve seekability through frames:
As it stands, each frame can change the number of channels and/or the sample rate, which means you need to read every frame header to seek through an audio stream. Even if the number of channels stays constant, you can only seek to certain sample offsets; you cannot seek to a timestamp, or even calculate the timestamp after seeking, without decoding all frames in between.
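To illustrate the status quo, here is a hypothetical sketch of seeking under the current format: every frame header (channels, sample rate, samples in frame, frame size) has to be read on the way to the target sample. The function name and buffer layout are assumptions for the example:

```c
#include <stdint.h>
#include <stddef.h>

/* Walk frame headers to find the byte offset of the frame containing
   target_sample. QOA frame headers are 8 bytes: channels (8 bits),
   sample rate (24 bits), samples per channel (16 bits), frame size
   (16 bits), all big-endian. */
static size_t qoa_seek_naive(const uint8_t *bytes, size_t len,
                             uint64_t target_sample) {
	size_t p = 8; /* skip the 8-byte file header */
	uint64_t sample = 0;
	while (p + 8 <= len) {
		uint16_t fsamples = (uint16_t)((bytes[p+4] << 8) | bytes[p+5]);
		uint16_t fsize    = (uint16_t)((bytes[p+6] << 8) | bytes[p+7]);
		if (sample + fsamples > target_sample) {
			return p; /* target lies in this frame */
		}
		sample += fsamples;
		p += fsize; /* the 16-bit frame size is what makes this walk possible */
	}
	return p;
}
```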
If we use the freed 16 bits to encode additional metadata, we could include something like this in the frame header:
bits 0123456789abcdef
     tvvvvvvvvvvvvvvv

t = 0: you need to decode all frames
t = 1: the value `v` (15 bits) indicates the number of following frames that do not deviate in number of channels or sample rate
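Packing and unpacking this proposed field is trivial; a sketch, with all names hypothetical:

```c
#include <stdint.h>

/* Proposed layout for the freed 16 bits: the top bit is the 't' flag,
   the low 15 bits are 'v', the number of following frames with
   identical channel count and sample rate. */
static uint16_t qoa_pack_tv(int t, uint16_t v) {
	return (uint16_t)((t ? 0x8000 : 0) | (v & 0x7FFF));
}

static int qoa_tv_flag(uint16_t field) {
	return (field >> 15) & 1;
}

static uint16_t qoa_tv_frames(uint16_t field) {
	return field & 0x7FFF;
}
```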
For live streaming you would use t = 0, but for streams encoded ahead of time you could set t = 1 together with the number of frames for which the encoder is certain that the number of channels and sample rate aren't going to change. Even if the encoder isn't sure ahead of time, it could write the correct value after the fact, provided its output is seekable.
This way, you only need to read the first frame's header to be able to seek to any timestamp as well.
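With t = 1, every one of the following v frames shares the first frame's channel count, so all of them have the same byte size and (except possibly the last) hold the same 256 × 20 = 5120 samples per channel. Seeking then becomes a direct computation; a sketch under those assumptions, with hypothetical names:

```c
#include <stdint.h>
#include <stddef.h>

/* A QOA frame holds 256 slices of 20 samples each, per channel */
#define QOA_SAMPLES_PER_FRAME (256 * 20)

/* Byte offset of the frame containing target_sample, given that all
   frames share one size (derived from the channel count in the first
   frame's header) */
static size_t qoa_seek_direct(size_t first_frame_offset,
                              size_t frame_size,
                              uint64_t target_sample) {
	uint64_t frame_index = target_sample / QOA_SAMPLES_PER_FRAME;
	return first_frame_offset + (size_t)frame_index * frame_size;
}
```

The timestamp after seeking falls out of the same arithmetic: frame_index * 5120 divided by the sample rate.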