iamf's Introduction

iamf

The AOM group's official specification of the Immersive Audio Model and Formats.

The specification is written using a special syntax (mixing markup and markdown) to enable generation of cross-references, syntax highlighting, etc. The file using this syntax is index.bs.

index.bs is processed to produce an HTML version (index.html) by a tool called Bikeshed (https://github.com/tabatkins/bikeshed), which is run when content is pushed onto the main branch or when Pull Requests are made.

iamf's Issues

Consider restructuring of section 3

The goal of section 3 is to define the bitstream (or sequence, as suggested in #96). The current section has a lot of introductory text (up to and including section 3.5) and text that is hard to understand without having seen the definitions and semantics. It is also error-prone, as a lot of text is repeated in the semantics.
I suggest that the section should simply consist of the different elements forming a bitstream, each with its syntax and semantics. For example:
3.1 Sequence Definition
3.1.1 Syntax
3.1.2 Semantics
3.2 Audio OBU
3.2.1 Syntax
3.2.2 Semantics
...

Considerations about random access, synchronization, etc. should follow in a section 4.

Consider supporting multiple "concatenation rules" for Sync OBU

Please refer to the AOM document that proposed the newer definition of the Sync OBU. In that document, a "concatenation rule" was proposed, describing how a parser would know how to position the audio frames and parameters that come after a Sync OBU with respect to the timeline before the Sync OBU.

The rule proposed in the document works for "continuing" the same audio content across a Sync OBU.

Another rule that would be useful is to concatenate two separate pieces of audio content, where we would want to position new audio so that none of it overlaps the end of the previous audio. I believe the arithmetic for this rule would be just as simple as the other rule.

If we want to do this, we may want to add a few bits to the Sync OBU that can control which kind of concatenation rule should be used.
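
To illustrate why the second rule is no harder, a purely hypothetical sketch in C is given below; all names (and the notion of a "declared start") are invented for discussion and are not taken from the Sync OBU proposal.

/* Hypothetical sketch (names invented): placing the first audio frame that
 * follows a Sync OBU under two possible concatenation rules. */
enum concat_rule { CONCAT_CONTINUE, CONCAT_APPEND };

long long place_new_audio(enum concat_rule rule,
                          long long prev_end,        /* end time of the audio before the Sync OBU */
                          long long declared_start)  /* start signalled for the new audio */
{
    if (rule == CONCAT_CONTINUE)
        return declared_start;                       /* same content continues; overlap allowed */
    /* CONCAT_APPEND: separate content; shift it so nothing overlaps the previous audio */
    return declared_start < prev_end ? prev_end : declared_start;
}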

Determine if multiple tracks are needed to meet requirements

With a single track, sending a subset of the audio to a client requires the audio track to be disassembled and reassembled. With multiple tracks, the client can select the tracks it needs. Determine if the single track implementation is enough to meet the requirements.

Consider to align endianness

In Ogg Opus (RFC 7845), some of the ID header fields are defined in little-endian byte order, while the corresponding fields in the dOps box are defined in big-endian byte order.
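
As a small illustration of the mismatch, a C sketch follows; the OpusHead offset follows RFC 7845, but the dOps offset used here is illustrative only.

#include <stdint.h>

/* PreSkip is little-endian in the Ogg Opus ID header ("OpusHead", RFC 7845)
 * but big-endian in the ISOBMFF dOps box, so a remuxer must byte-swap it. */
static uint16_t read_le16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static void write_be16(uint8_t *p, uint16_t v) { p[0] = (uint8_t)(v >> 8); p[1] = (uint8_t)v; }

void copy_pre_skip(const uint8_t *opus_head, uint8_t *dops_payload)
{
    uint16_t pre_skip = read_le16(opus_head + 10);   /* offset 10 in OpusHead per RFC 7845 */
    write_be16(dops_payload + 2, pre_skip);          /* offset within dOps is illustrative only */
}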

Define Random Access point

We should define which ISOBMFF samples the stream can be seeked to (if not all of them).

In a normal audio stream, all ISOBMFF samples are random access points, though there is preroll in order to prime the decoder. We should determine if we can use the same definition, or if we need to restrict it.

OBU header optimizations

Not a priority yet, but something we probably want to address.

The current OBU header has so many bit flags that it unavoidably adds an extra byte to the header. This could become unnecessary bitrate overhead, adding 0.5 kbps or more.

With the new proposal that has been agreed on, there may be more opportunities for reducing the header. Here are a few strategies and examples of what we could do:

  • I believe we've already agreed we won't need the sync offset bit.
  • Fold the extension bit into the obu_type enum. There are currently 7 OBU types; the upper bit of obu_type could be used to indicate both "additional types" and "use extension header size".
  • The duration and redundancy bits are only needed for parameter blocks, while the trimming bits are only needed for audio blocks. Could we have flags whose meaning changes depending on the OBU type? In that case we could save 2 bits.
  • If needed later, a 3-bit "code" could potentially represent the values of 4 flags if we know there are fewer than 8 practical combinations of those flags.

At least with the current status of the spec, this would bring us back to 7 or 8 bits for all flags rather than 16.
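
As a hypothetical illustration of the type-dependent-flags idea, the whole header could fit in a single byte; all field names and widths below are invented for discussion and are not taken from the spec.

#include <stdint.h>

/* Hypothetical layout: a 3-bit obu_type (upper value reserved to signal an
 * extension header) plus 5 bits of flags whose meaning depends on obu_type,
 * e.g. trimming flags for audio frames, duration/redundancy for parameters. */
uint8_t pack_obu_header(uint8_t obu_type, uint8_t type_dependent_flags)
{
    return (uint8_t)(((obu_type & 0x07) << 5) | (type_dependent_flags & 0x1F));
}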

What should be the `codecs` parameter of the single track approach?

Current audio codecs have a MIME codecs parameter of the following forms:

  • Opus
  • mp4a.40.2

We should define what the MIME codecs parameter should be for the single track case. The purpose of the codecs parameter is for a receiver to determine if it can process the track without having to download anything (no header, no initialization segment, ...).

Do we expect that all IAC decoders will be able to process every file, or do we expect that some files won't be processable? In the latter case, the codecs parameter should contain something that lets a receiver tell the difference. Also, we should recursively convey the underlying codecs parameter, for example something like aiac.<IAC-specific-needs>.mp4a.40.2 or aiac.<IAC-specific-needs>.Opus.

Consider a new title for the specification

The specification defines multiple things:

  • an immersive audio architecture and model
  • a standalone bitstream format
  • an ISOBMFF-based format

The current title does not reflect that. We should consider a new title, such as Immersive Audio Model and Formats (IAMF).

Update section title to clarify the conformance points

According to the meeting notes for July 11 (https://docs.google.com/document/d/1f68rle2VcwObrufwcwnYuPhRyLlNunZZHdblRzMsZ4A/edit#), the introduction section and the informative annexes need to be updated to match those notes.
The purpose of this issue is just to update the introduction section and the titles of the annexes.

NOTE: After the introduction section and the annex titles have been updated to resolve this issue, follow-up issues will need to be opened to update the actual spec text to specify conformance points for parsing, decoding, rendering, etc.

Rules for the use of a new sample description entry

A common practice with ISOBMFF is to splice files, e.g. two programs, or one program and an ad. Splicing in ISOBMFF is unfortunately not as simple as with MPEG-2 TS. The splicer has to merge the moov boxes from both files into one moov. In the single-track case, this means merging the trak boxes and in particular their stsd boxes. Each codec needs to specify the rules for when sample description entries can be merged. This should be done in this case too.

Use SampleGroups to collect timed metadata

Unlike video, audio does not have keyframes (alternatively, every sample is a keyframe). Consider using SampleGroups to avoid excessive numbers of timed metadata units.

Consider using another term than IAB

Immersive Audio Bitstream (IAB) is already defined in SMPTE ST 2098-2. Consider defining a new name. I suggest using "Immersive Audio Sequence".

Improve the convention section

Section 1.1 says:

All of obu syntax is described in class which is a structure of C++ program language.

This is not correct: lots of OBU syntax descriptions use functions. Consider explaining in more detail how functions are used, e.g. by clearly indicating that the conventions are the ones used in the AV1 video specification (its entire section 4). Alternatively, consider using the MPEG-4 Syntax Description Language.

Section 1.2 defines leb128 as a function in a section called "Type". Use an explicit reference to the AV1 video specification for this function.

Section 1.3 defines the clip3 function in an unclear way. Consider using the exact definition of AV1.
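
For reference, here is a C paraphrase of the AV1 definitions being suggested as references; the AV1 specification text remains the normative source.

#include <stdint.h>

/* Clip3(x, y, z): clamp z to the inclusive range [x, y] (as defined in AV1). */
static int64_t clip3(int64_t x, int64_t y, int64_t z)
{
    if (z < x) return x;
    if (z > y) return y;
    return z;
}

/* leb128(): up to 8 bytes, 7 value bits per byte, the top bit of each byte
 * signalling that another byte follows (as defined in AV1). */
static uint64_t read_leb128(const uint8_t **p)
{
    uint64_t value = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t byte = *(*p)++;
        value |= (uint64_t)(byte & 0x7f) << (i * 7);
        if (!(byte & 0x80))
            break;
    }
    return value;
}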

Syntax incomplete on parameter blocks

We have some concrete ideas brewing to fill in this information, hopefully to be proposed within a few days.

One quirk worth mentioning sooner rather than later: it may require creating new syntax rules, i.e. new data types that declare both a self-documenting, intuitive field name (e.g. master_mix_gain) and a function/class that carries a bit more syntax (e.g. parsing the parameter ID and a default gain value).
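
To make the quirk concrete, here is a purely hypothetical C sketch; the type and field names are invented for illustration and are not proposed text.

#include <stdint.h>

/* Hypothetical data type: a "parameter definition" that bundles a parameter
 * ID with a default value, so that a field such as master_mix_gain can keep
 * one self-documenting name while still pulling in the extra parsing syntax. */
typedef struct {
    uint64_t parameter_id;   /* which parameter block stream drives this value  */
    int16_t  default_value;  /* value to use when no parameter block applies    */
} param_definition;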

Determine whether we take "OBU_Substream" or "OBU_IA_Coded_Data"

Let's assume that there are N substreams for each IA frame. Then:
  • Option 1 - OBU_IA_Coded_Data requires 3 + 2N bytes (3 bytes for OBU syntax and 2N bytes for Substream_Boundary_Info)
  • Option 2 - OBU_Substream requires 3N bytes for OBU syntax.

From an overhead point of view, Option 2 needs an additional N - 3 bytes compared to Option 1:
  • Option 1 is better if N > 3
  • the two are equal if N = 3
  • Option 2 is better if N < 3
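
For example, with N = 8 substreams, Option 1 costs 3 + 2 x 8 = 19 bytes while Option 2 costs 3 x 8 = 24 bytes; with N = 2, Option 1 costs 7 bytes while Option 2 costs 6.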

From another point of view, Option 2 is well aligned with the philosophy that the IA bitstream has been designed to reuse a conventional substream-based encoding and decoding scheme. Option 2 does not require the Substream_Boundary_Info_OBU while Option 1 does.

Proposal: parameters should not be allowed to automate anything about decoding substreams

In the current version of the spec, it is mentioned that parameters may be used to provide time-varying changes to decoding, reconstruction, rendering, and mixing.

As we've been tinkering superficially with an implementation, it has become clear that parameters should not affect the decoding of substreams. They should only be applicable to audio element OBUs and mix presentation OBUs (i.e., reconstruction, rendering, and mixing). I feel this does not significantly limit the expressive power of the concepts we have.

Improve the introduction

The introduction could be improved:

  • Some context is missing before defining what an "IA bitstream" is.
  • It overlaps with the overview in section 3.

I suggest restructuring as follows:

  • start by defining what "immersive audio" means, for example, "the combination of 3D audio signals recreating a sound experience close to that of a natural environment"
  • then indicate that this specification defines a model for representing immersive audio content based on coded audio sub-streams contributing to audio elements meant to be rendered and mixed to form one or more immersive presentations (Figure 2)
  • then indicate that this specification defines a hypothetical architecture that pre-processes the source signal to split it into substreams and to derive the metadata driving the rendering and mixing, then encodes the substreams and the metadata separately, and finally combines them to form a bitstream. You can reuse Figure 1 and the bullet list below it, but with some updates (it should not mention the IAC file)
  • then indicate that this specification defines a way to store the bitstream in container formats for applications that need it

I don't think we need to list the rest of the specification. This is error-prone, and the specification already contains a table of contents on the side.

As a result, sections 1 and 2 should be merged into a single "Introduction" section.

Determine the substream format for AAC-LC

I am currently summarizing the definition of the IA bitstream based on our discussion and trying to define the OPUS_IA bitstream and the AAC-LC_IA bitstream. I have realized that I need to know what the substream format (i.e. frame format) for AAC-LC would be.

In the conventional case, ADTS is the frame format for AAC-LC and is the input to AAC decoders.
Here is adts_frame() containing just a single raw_data_block() for AAC-LC:
adts_frame() {
    adts_fixed_header();
    adts_variable_header();
    adts_error_check();
    raw_data_block();
}
However, when it is encapsulated in an mp4 file, the esds box and the sample sizes stored inside moov are generated from the ADTS headers, and only raw_data_block() is stored as the sample data.
Here, my point is that the frame format differs from the sample format.
NOTE: In the Opus case, both are the same, as each frame is a single Opus packet as defined in RFC 6716. So that case is clear to me.

For IAC, let's assume no OBU is applied, to keep the discussion simple.
I think that the sample data (or maybe part of the sample data) associated with a substream should be raw_data_block().
Here are two options for the substream format:

  • Option1: ADTS
  • Option2: raw_data_block()
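
A minimal C sketch of what Option 2 would mean in practice, i.e. stripping the ADTS header to recover raw_data_block(); the header offsets follow ISO/IEC 14496-3, while the function name is made up here.

#include <stddef.h>
#include <stdint.h>

/* Returns a pointer to raw_data_block() inside a single adts_frame(), or NULL
 * if the buffer does not start with a valid ADTS header. */
const uint8_t *adts_payload(const uint8_t *frame, size_t frame_size, size_t *payload_size)
{
    if (frame_size < 7 || frame[0] != 0xFF || (frame[1] & 0xF0) != 0xF0)
        return NULL;                                   /* no 0xFFF syncword */
    size_t header_size = (frame[1] & 0x01) ? 7 : 9;    /* +2 bytes of CRC when protection_absent == 0 */
    size_t frame_length = ((size_t)(frame[3] & 0x03) << 11) |
                          ((size_t)frame[4] << 3) |
                          ((size_t)frame[5] >> 5);     /* 13-bit length, header included */
    if (frame_length < header_size || frame_length > frame_size)
        return NULL;
    *payload_size = frame_length - header_size;
    return frame + header_size;
}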

Consider defining a Temporal Delimiter OBU

Temporal Delimiter OBUs help simple parsers (e.g. in packagers or in players) to identify the boundaries of bitstream parts that have the same timing. The drawback is the overhead associated with them. However, their use could be made optional, which means that:

  • encoders should generate them when it is desired to simplify the tasks of packagers or of simple readers but may remove them for applications where the overhead is too costly
  • there would exist two classes of readers: those that can process a bitstream without delimiters (e.g. a full decoder) and those that cannot (e.g. simple parsers)
  • when stored in container formats that already have mechanisms to represent time boundaries, temporal delimiters should be removed upon storage and reinserted upon extraction.

Restrictions on Opus-encoded streams

If I understand the proposal correctly, when Opus is used as the codec, not all configurations should be allowed. I think the following restrictions should apply:

  • OutputChannelCount shall be 1 or 2
  • ChannelMappingFamily shall be 0

Then there may be constraints on the Opus-encoded streams grouped in the same Channel Group:

  • should PreSkip be the same?
  • should InputSampleRate be the same?
  • should OutputGain be the same?
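
A small C sketch of how the two proposed restrictions could be checked against the Ogg Opus ID header of RFC 7845; the field offsets follow that RFC, while the function name is invented here.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the substream satisfies the restrictions proposed above. */
int opus_substream_allowed(const uint8_t *id_header, size_t size)
{
    if (size < 19 || memcmp(id_header, "OpusHead", 8) != 0)
        return 0;
    uint8_t output_channel_count   = id_header[9];   /* OutputChannelCount   */
    uint8_t channel_mapping_family = id_header[18];  /* ChannelMappingFamily */
    return (output_channel_count == 1 || output_channel_count == 2) &&
           channel_mapping_family == 0;
}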

Substream ordering when skipping IDs - proposed adjustment

If I understand the current version of the spec correctly, the proposal is that when stream IDs are skipped, the order of substreams is expected to be given by the order of the audio element OBUs and the ordering of the substreams referred to by those elements.

A disadvantage of this is that it places more requirements on the ordering of OBUs in ways that could be avoided. I like the idea that descriptor OBUs are generally not forced to be in a specific order - this makes creating/consuming the format less error-prone. Furthermore, the new definition of the Sync OBU may become the "authoritative" information about which substreams and parameter blocks to expect in the following OBU sequence. I think it is not too much further to suggest that the order of entries in the Sync OBU could also tell us what substream ordering to expect when stream IDs are skipped.

So, feedback and discussion requested - could we use Sync OBUs for this purpose, instead of relying on the ordering of Audio Element OBUs?
