
Comments (17)

sunghee-hwang commented on August 11, 2024

Let me check my understanding of how SampleGroups would be used to avoid excessive numbers of timed metadata units.
In the current proposal, the timed metadata units are stored in front of each sample in mdat. If we use SampleGroups instead, my understanding is:
during encapsulation, the contents of the timed metadata are divided into SampleGroups, which are stored inside moov and/or moof instead of inside mdat. When parsing the file, the contents of the SampleGroups are merged to re-form the original timed metadata, which is placed in front of each relevant sample to form the IA bitstream passed to the decoders. Am I correct?
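A minimal sketch of that round trip, assuming each timed metadata unit and each sample is a byte string (the helper names are hypothetical; in a real file the mapping lives in the sbgp/sgpd boxes):

```python
# Sketch of the SampleGroup round trip described above. Helper names are
# hypothetical; real files carry this mapping in sbgp/sgpd boxes.

def encapsulate(samples, timed_metadata):
    """Move per-sample timed metadata out of mdat into sample groups."""
    groups = {}            # unique metadata payload -> group description index
    sample_to_group = []   # per-sample mapping (what sbgp encodes)
    for meta in timed_metadata:
        idx = groups.setdefault(meta, len(groups) + 1)
        sample_to_group.append(idx)
    # 'groups' would live in sgpd (moov/moof); mdat keeps only bare samples.
    return samples, groups, sample_to_group

def reconstruct(samples, groups, sample_to_group):
    """Re-form the IA bitstream: prepend each sample's metadata again."""
    by_index = {idx: meta for meta, idx in groups.items()}
    return [by_index[idx] + sample
            for idx, sample in zip(sample_to_group, samples)]
```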

cconcolato commented on August 11, 2024

Yes, we could specify that the Sample Group data is reinserted into the elementary stream, for example when exporting back to elementary stream syntax. Regarding the integration between the file parser and the decoder, ISOBMFF usually does not specify how that is done. One implementation could go back to the elementary stream; another could pass the Sample Group information as side information.

sunghee-hwang commented on August 11, 2024

Then it seems to me that the purpose of using SampleGroups is to save file storage. I will prepare a summary of "the overhead of timed metadata" vs. "the overhead of using SampleGroups for timed metadata".

cconcolato commented on August 11, 2024

It is not only to save storage space; it is also a design question:

  • How often is this data expected to change? If it does not change at almost every frame, why should it be in the elementary stream?
  • If the audio decoder does not consume the metadata, and it is a post-processor consuming it, why should it be in the elementary stream?
  • If system tools (packagers, inspectors, demuxers, ...) need to access this data (e.g. to generate the 'codecs' parameter, or to determine encryption boundaries), having it in the sample payload (i.e. in 'mdat') is not optimal.

cconcolato commented on August 11, 2024

It would be good if we could have a clear understanding of what is allowed to change and when. For example, we could fill in the following table:

| IAC Feature | Change possibly frame by frame | Change sometimes (but not frame by frame) | Change not foreseen at all in a track | Change requires decoder reinitialization | Change requires rendering reinitialization |
| --- | --- | --- | --- | --- | --- |
| Codec | | | | | |
| Sample rate | | | | | |
| Ambisonics use | | | | | |
| Ambisonics order | | | | | |
| Use of Ambisonics demixing | | | | | |
| Ambisonics demixing matrix | | | | | |
| Ambisonics channel mapping | | | | | |
| Ambisonics coupling | | | | | |
| Use of non-diegetic channels | | | | | |
| Count of non-diegetic channels | | | | | |
| Coupling of non-diegetic channels | | | | | |
| Layout of non-diegetic channels (number of DCG, composition of DCG) | | | | | |
| Count of non-diegetic channels in Base Channel Group | | | | | |
| Coefficients for non-diegetic channels (Matrix Downmix Tree) | | | | | |

sunghee-hwang commented on August 11, 2024

All of the features except the last one are non-timed metadata, so the proposal assumes their changes are not foreseen at all within a track.
But the last one (coefficients for non-diegetic channels) changes sometimes (but not frame by frame).
Based on the encoder guideline, it may change as often as every 18 frames (0.36 seconds) in the worst case.
Please refer to the paper for details:
(https://www.aes.org/e-lib/browse.cfm?elib=21489)

sunghee-hwang commented on August 11, 2024

> It is not only to save storage space; it is also a design question:
>
>   • How often is this data expected to change? If it does not change at almost every frame, why should it be in the elementary stream?
>   • If the audio decoder does not consume the metadata, and it is a post-processor consuming it, why should it be in the elementary stream?
>   • If system tools (packagers, inspectors, demuxers, ...) need to access this data (e.g. to generate the 'codecs' parameter, or to determine encryption boundaries), having it in the sample payload (i.e. in 'mdat') is not optimal.

Regarding the first and second points:
The timed metadata in the proposal consists of DemixingInfo() and ChannelGroupSpecificInfo().
DemixingInfo() applies to its associated frame (sample); its value can change as often as every 18 frames in the worst case, but its size is only 1 byte.
ChannelGroupSpecificInfo() applies to each Channel Group of its associated frame (sample). It contains the size of each substream and of the Channel Group, plus gain values for reconstruction; both change frame by frame.

So, we should check whether DemixingInfo() warrants the SampleGroup scheme.
Based on my calculation, the combined size of the two boxes (sbgp and sgpd) is 44 + 9 × (# of entries) bytes
(for sbgp, 20 + 8 × (# of entries); for sgpd, 24 + size of DemixingInfo() × (# of entries), DemixingInfo() being 1 byte).
If we assume for simplicity that DemixingInfo changes 3 times per second (i.e. 3 entries per second are required), this takes 71 bytes for a 1 s file, 98 bytes for a 2 s file, 125 bytes for a 3 s file, and so on.
Since DemixingInfo() in the sample payload costs just 1 byte per frame, fragmented files longer than 2 s (100 frames) save storage with the SampleGroup scheme. Naturally, the less frequently DemixingInfo() changes, the greater the storage saving.
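As a sanity check on those numbers, here is the arithmetic in a short script (assuming a 1-byte DemixingInfo(), 50 frames per second, and the box overheads quoted above):

```python
# Break-even check: 1-byte DemixingInfo() per frame in mdat vs. the
# sbgp + sgpd overhead formulas quoted above.
FPS = 50  # 20 ms frames, consistent with 18 frames = 0.36 s

def in_band_bytes(seconds):
    return seconds * FPS                               # 1 byte per frame

def sample_group_bytes(entries):
    return (20 + 8 * entries) + (24 + 1 * entries)     # sbgp + sgpd

for secs in (1, 2, 3):
    entries = 3 * secs                                 # 3 changes per second
    print(f"{secs}s: in-band {in_band_bytes(secs)} B, "
          f"sample groups {sample_group_bytes(entries)} B")
# 1s: 50 vs 71; 2s: 100 vs 98; 3s: 150 vs 125 -> break-even near 2 s
```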

Regarding the third point:
I believe the timed metadata in the proposal carries no such information, except the boundaries between substreams, which can change frame by frame. I think we need to consider this based on the scope of file parsers vs. OBU parsers, and on where the OBU parser sits relative to the decryption entity.

cconcolato commented on August 11, 2024

Thanks.

The fact that ChannelGroupSpecificInfo changes every frame is not sufficient to determine whether sample groups can be useful. It also depends on how many configurations of ChannelGroupSpecificInfo you will use. If the samples alternate between 2 configurations, sample groups are very appropriate; but if there is a large number of configurations and no pattern in how they are used, it is not a good candidate.

DemixingInfo seems to have only 8 possible values, so it's definitely a good candidate.

We should also consider whether it makes sense to use Sample Groups for one and not for the other; it makes processing a bit more complicated.

Note that sbgp is not always required. In sgpd, you can set default_group_description_index and in that case, you don't even need sbgp. Note also that sbgp can be replaced by csgp to encode sample group patterns more efficiently.
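A rough sketch of how a reader might resolve the group description for sample n under those rules, as a simplification of the actual ISOBMFF lookup (run-length sbgp entries, with the sgpd default as fallback for unmapped samples):

```python
# Simplified lookup of a sample's group description index.
# sbgp stores run-length entries (sample_count, group_description_index);
# with default_group_description_index in sgpd, sbgp can be omitted and
# unmapped samples fall back to the default. Index 0 means "no group".

def group_index_for_sample(n, sbgp_runs, default_index=0):
    """sbgp_runs: list of (sample_count, group_description_index) runs."""
    for sample_count, group_index in sbgp_runs:
        if n < sample_count:
            return group_index        # explicit mapping (0 = no group)
        n -= sample_count
    return default_index              # unmapped samples use the sgpd default
```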

cconcolato commented on August 11, 2024

If you provide a textual representation (e.g. XML, JSON) of the Timed_Metadata structure for a real stream, I can generate a real MP4 file with MP4Box for you to look at.

sunghee-hwang commented on August 11, 2024

> Thanks.
>
> The fact that ChannelGroupSpecificInfo changes every frame is not sufficient to determine whether sample groups can be useful. It also depends on how many configurations of ChannelGroupSpecificInfo you will use. If the samples alternate between 2 configurations, sample groups are very appropriate; but if there is a large number of configurations and no pattern in how they are used, it is not a good candidate.
>
> DemixingInfo seems to have only 8 possible values, so it's definitely a good candidate.
>
> We should also consider whether it makes sense to use Sample Groups for one and not for the other; it makes processing a bit more complicated.
>
> Note that sbgp is not always required. In sgpd, you can set default_group_description_index and in that case, you don't even need sbgp. Note also that sbgp can be replaced by csgp to encode sample group patterns more efficiently.

Thanks for pointing that out.
I will look into csgp to figure out the correct SampleGroup usage.

cconcolato commented on August 11, 2024

The ChannelGroupSpecificInfo(ambisonics) and ChannelGroupSpecificInfo(channel_audio) are not clear to me, but generally we should keep in the sample data the information required to parse the sample and feed the decoder(s). Anything else meant for the post-processor (downmixing instructions, gains, ...) could go into sample groups.

Some specific questions about timed metadata:

  1. Why do we need to repeat the stream count? It is already in the static metadata. Or do you envisage that, for some samples, some channel groups will have no data?
  2. Can you explain the various size-related fields? Is this similar to the self-delimited framing in Opus?

sunghee-hwang commented on August 11, 2024

> The ChannelGroupSpecificInfo(ambisonics) and ChannelGroupSpecificInfo(channel_audio) are not clear to me, but generally we should keep in the sample data the information required to parse the sample and feed the decoder(s). Anything else meant for the post-processor (downmixing instructions, gains, ...) could go into sample groups.

Let me explain ChannelGroupSpecificInfo():
The first purpose of this Info() is to let the IAC file (or OBU) parser know the boundaries between substreams (mono/stereo bitstreams) within each ChannelGroup, as well as the ChannelGroup size. This purpose is codec-dependent, so a codec may or may not need the boundaries.
Opus does not need boundaries between substreams, because every substream in a CG except the last is self-delimiting. If a following CG is present, the ChannelGroup size is required; but we could remove even that, for an optimal design, by requiring a self-delimiting structure on every substream except the last one of the frame (rather than of the CG).
AAC-LC, I think, does need the boundaries. The frame format for AAC-LC is ADTS (Audio Data Transport Stream, ISO/IEC 13818-7), which has a length field in its header; but the access unit is not the ADTS frame, only the ADTS payload. So I believe we need the boundaries for AAC-LC.
Of course, we don't need the boundaries in the timed metadata for the multiple-track case (one track per substream).
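To illustrate the two boundary schemes just described, here is a rough sketch; the field layout is hypothetical, not the actual IAC syntax:

```python
# Two ways to split one frame into its substreams (hypothetical layout).

def split_with_sizes(frame: bytes, sizes):
    """AAC-LC-style: explicit size fields are required, since raw access
    units (ADTS payloads without the header) carry no length of their own."""
    out, pos = [], 0
    for size in sizes:
        out.append(frame[pos:pos + size])
        pos += size
    return out

def split_self_delimited(frame: bytes, count, read_one):
    """Opus-style: each substream except the last is self-delimiting, so no
    external sizes are needed; read_one is a codec-supplied function that
    returns (substream, bytes_consumed)."""
    out, pos = [], 0
    for _ in range(count - 1):
        sub, used = read_one(frame, pos)
        out.append(sub)
        pos += used
    out.append(frame[pos:])          # last substream: rest of the frame
    return out
```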

The second purpose of this Info() is to let decoders know the gain values applied to the channels after demixing, for the channels related to the ChannelGroup. The gain values change frame by frame, and they are only required for Channel_Audio when audio scalability is applied (in other words, if the channel audio consists of only one layer, the BCG-only case, the gain values are not needed).

> Some specific questions about timed metadata:
>
>   1. Why do we need to repeat the stream count? It is already in the static metadata. Or do you envisage that, for some samples, some channel groups will have no data?
>   2. Can you explain the various size-related fields? Is this similar to the self-delimited framing in Opus?

  1. The stream count is duplicated, so we can remove it.
  2. I think it is better for you to refer to this updated version of the timed metadata (sorry for the confusion):
     (image of the updated Timed_Metadata structure omitted)

sunghee-hwang commented on August 11, 2024

> If you provide a textual representation (e.g. XML, JSON) of the Timed_Metadata structure for a real stream, I can generate a real MP4 file with MP4Box for you to look at.

The JSON file is uploaded to the IAC folder.

cconcolato commented on August 11, 2024

What I see in your file:

  • 3000 TimedMetadata objects, each with 1 Demixing_Info and 4 Channel_Group_Specific_Info
  • only 5 unique values for Demixing_Info
  • only 3 unique values for Recon_Gain_Flags
  • 519 unique values for Channel_Group_Size
  • 3619 unique Recon_Gain arrays

I attach an mp4 file. The audio content is garbage, don't try to listen to it, but I faked an admi (Audio demixing) sample group based on the info in your JSON file.

out.mp4

You can view the file in the usual MP4 inspection tools; it will show the admi sample group:

(screenshot omitted)

and you can see that instead of the 3000 × 1 byte in-band, the whole demixing signaling takes 29 bytes of sgpd plus 284 bytes of sbgp.
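For what it's worth, those byte counts are consistent with the overhead formulas discussed earlier, assuming the 5 unique Demixing_Info values become 5 one-byte sgpd entries and the sbgp carries 33 run-length entries:

```python
# Cross-check of the reported box sizes against the formulas quoted earlier.
assert 24 + 5 * 1 == 29       # sgpd: 5 unique 1-byte Demixing_Info entries
assert 20 + 8 * 33 == 284     # sbgp: 284 bytes imply 33 run-length entries
print(f"in-band: 3000 B, sample groups: {29 + 284} B")   # 313 B total
```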

I need to think about whether there are ways to optimize the storage of the other data not required to decode the sample (e.g. Gain).

sunghee-hwang commented on August 11, 2024

Many thanks!
Thanks to your mp4 file, I now clearly understand the usage of Sample Groups.

cconcolato commented on August 11, 2024

We agree to create a generic sample group whose payload is exactly the OBU content (with header and length). This is to be used for the demixing OBU, and possibly others in the future. We still need to discuss whether sample groups shall, should, or can be used, possibly based on the OBU type.
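A minimal sketch of such a generic entry, with placeholder OBU framing (the real header and length syntax would come from the IAMF spec, not from this sketch):

```python
# Sketch of the agreed generic sample group entry: the sgpd payload is the
# OBU bytes verbatim, header and length included. The 1-byte header and
# 2-byte length below are placeholders, not the actual IAMF OBU syntax.

def obu_group_entry(obu_type: int, payload: bytes) -> bytes:
    header = bytes([obu_type])                 # placeholder header byte
    length = len(payload).to_bytes(2, "big")   # placeholder length field
    return header + length + payload          # stored as-is in sgpd

demixing_entry = obu_group_entry(0x05, b"\x01")  # hypothetical demixing OBU
```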

tdaede commented on August 11, 2024

I am closing this as sample groups are now used as described in the spec.

