iamf's Introduction

iamf

The AOM group's official specification of the Immersive Audio Model and Formats.

The specification is written using a special syntax (mixing markup and markdown) to enable generation of cross-references, syntax highlighting, etc. The file using this syntax is index.bs.

index.bs is processed to produce an HTML version (index.html) by a tool called Bikeshed (https://github.com/tabatkins/bikeshed), which is run when content is pushed onto the main branch or when Pull Requests are made.

iamf's Issues

Consider restructuring of section 3

The goal of section 3 is to define the bitstream (or sequence, as suggested in #96). The current section has a lot of introductory text (up to and including section 3.5) and text that is hard to understand without having seen the definitions and semantics. It is also error-prone, as a lot of text is repeated in the semantics.
I suggest that the section should simply consist of the different elements forming a bitstream, each with its syntax and semantics. For example:
3.1 Sequence Definition
3.1.1 Syntax
3.1.2 Semantics
3.2 Audio OBU
3.2.1 Syntax
3.2.2 Semantics
...

Considerations about random access, synchronization, etc. should follow in a section 4.

Consider supporting multiple "concatenation rules" for Sync OBU

Please refer to the AOM document that proposed the newer definition of the Sync OBU. In that document, a "concatenation rule" was proposed, describing how a parser would know how to position the audio frames and parameters that come after a Sync OBU with respect to the timeline before the Sync OBU.

The rule proposed in the document works for "continuing" the same audio content across a Sync OBU.

Another rule that would be useful is to concatenate two separate pieces of audio content, where we would want to position new audio so that none of it overlaps the end of the previous audio. I believe the arithmetic for this rule would be just as simple as the other rule.

If we want to do this, we may want to add a few bits to the Sync OBU that can control which kind of concatenation rule should be used.
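
To illustrate why the second rule is no harder, a purely hypothetical sketch in C is given below; all names (and the notion of a "declared start") are invented for discussion and are not taken from the Sync OBU proposal.

/* Hypothetical sketch (names invented): placing the first audio frame that
 * follows a Sync OBU under two possible concatenation rules. */
enum concat_rule { CONCAT_CONTINUE, CONCAT_APPEND };

long long place_new_audio(enum concat_rule rule,
                          long long prev_end,        /* end time of the audio before the Sync OBU */
                          long long declared_start)  /* start signalled for the new audio */
{
    if (rule == CONCAT_CONTINUE)
        return declared_start;                       /* same content continues; overlap allowed */
    /* CONCAT_APPEND: separate content; shift it so nothing overlaps the previous audio */
    return declared_start < prev_end ? prev_end : declared_start;
}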

Determine if multiple tracks are needed to meet requirements

With a single track, sending a subset of the audio to a client requires the audio track to be disassembled and reassembled. With multiple tracks, the client can select the tracks it needs. Determine if the single track implementation is enough to meet the requirements.

Consider to align endianness

In Ogg Opus (RFC 7845), some of the ID header fields are defined in little-endian byte order, while the corresponding fields in the dOps box are defined in big-endian byte order.
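
As a small illustration of the mismatch, a C sketch follows; the OpusHead offset follows RFC 7845, but the dOps offset used here is illustrative only.

#include <stdint.h>

/* PreSkip is little-endian in the Ogg Opus ID header ("OpusHead", RFC 7845)
 * but big-endian in the ISOBMFF dOps box, so a remuxer must byte-swap it. */
static uint16_t read_le16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static void write_be16(uint8_t *p, uint16_t v) { p[0] = (uint8_t)(v >> 8); p[1] = (uint8_t)v; }

void copy_pre_skip(const uint8_t *opus_head, uint8_t *dops_payload)
{
    uint16_t pre_skip = read_le16(opus_head + 10);   /* offset 10 in OpusHead per RFC 7845 */
    write_be16(dops_payload + 2, pre_skip);          /* offset within dOps is illustrative only */
}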

Define Random Access point

We should define which ISOBMFF samples the stream can be seeked to (if not all of them).

In a normal audio stream, all ISOBMFF samples are random access points, though there is preroll in order to prime the decoder. We should determine if we can use the same definition, or if we need to restrict it.

OBU header optimizations

Not a priority yet, but something we probably want to address.

The current OBU header has so many bit flags that it unavoidably adds an extra byte to the header. This could become unnecessary bitrate overhead, adding 0.5 kbps or more.

With the new proposal that has been agreed on, there may be more opportunities for reducing the header. Here are a few strategies and examples of what we could do:

  • I believe we've already agreed we won't need the sync offset bit.
  • Fold the extension bit into the obu_type enum. There are currently 7 OBU types; the upper bit of obu_type could be used to indicate both "additional types" and "use extension header size".
  • The duration and redundancy bits are only needed for parameter blocks, while the trimming bits are only needed for audio blocks. Could we have flags whose meaning changes depending on the OBU type? In that case we could save 2 bits.
  • If needed later, a 3-bit "code" could potentially represent the values of 4 flags if we know there are fewer than 8 practical combinations of those flags.

At least with the current status of the spec, this would bring us back to 7 or 8 bits for all flags rather than 16.
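
As a hypothetical illustration of the type-dependent-flags idea, the whole header could fit in a single byte; all field names and widths below are invented for discussion and are not taken from the spec.

#include <stdint.h>

/* Hypothetical layout: a 3-bit obu_type (upper value reserved to signal an
 * extension header) plus 5 bits of flags whose meaning depends on obu_type,
 * e.g. trimming flags for audio frames, duration/redundancy for parameters. */
uint8_t pack_obu_header(uint8_t obu_type, uint8_t type_dependent_flags)
{
    return (uint8_t)(((obu_type & 0x07) << 5) | (type_dependent_flags & 0x1F));
}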

What should be the `codecs` parameter of the single track approach?

Current audio codecs have a MIME codecs parameter of the following forms:

  • Opus
  • mp4a.40.2

We should define what the MIME codecs parameter should be for the single track case. The purpose of the codecs parameter is for a receiver to determine if it can process the track without having to download anything (no header, no initialization segment, ...).

Do we expect that all IAC decoders will be able to process every file, or do we expect that some files won't be processable? In the latter case, the codecs parameter should contain something that lets a receiver tell the difference. Also, we should recursively convey the underlying codecs parameter, for example something like aiac.<IAC-specific-needs>.mp4a.40.2 or aiac.<IAC-specific-needs>.Opus.

Consider a new title for the specification

The specification defines multiple things:

  • an immersive audio architecture and model
  • a standalone bitstream format
  • an ISOBMFF-based format

The current title does not reflect that. We should consider a new title, such as Immersive Audio Model and Formats (IAMF).

Update section title to clarify the conformance points

According to the meeting notes for July 11 (https://docs.google.com/document/d/1f68rle2VcwObrufwcwnYuPhRyLlNunZZHdblRzMsZ4A/edit#), the introduction section and the informative annexes need to be updated to match those notes.
The purpose of this issue is just to update the introduction section and the titles of the annexes.

NOTE: After the introduction section and the annex titles have been updated to resolve this issue, follow-up issues will need to be opened to update the actual spec text to specify conformance points for parsing, decoding, rendering, etc.

Rules for the use of a new sample description entry

A common practice with ISOBMFF is to splice files, e.g. two programs, or one program and an ad. Splicing in ISOBMFF is unfortunately not as simple as with MPEG-2 TS. The splicer has to merge the moov boxes from both files into one moov. In the single-track case, this means merging the trak boxes and in particular their stsd boxes. Each codec needs to specify the rules for when sample description entries can be merged. This should be done in this case too.

Use SampleGroups to collect timed metadata

Unlike video, audio does not have keyframes (alternatively, every sample is a keyframe). Consider using SampleGroups to avoid excessive numbers of timed metadata units.

Consider using another term than IAB

Immersive Audio Bitstream (IAB) is already defined in SMPTE ST 2098-2. Consider defining a new name. I suggest using "Immersive Audio Sequence".

Improve the convention section

Section 1.1 says:

All of obu syntax is described in class which is a structure of C++ program language.

This is not correct: lots of OBU syntax descriptions use functions. Consider explaining in more detail how functions are used, e.g. by clearly indicating that the conventions are the ones used in the AV1 video specification (its entire section 4). Alternatively, consider using the MPEG-4 Syntax Description Language.

Section 1.2 defines leb128 as a function in a section called "Type". Use an explicit reference to the AV1 video specification for this function.

Section 1.3 defines the clip3 function in an unclear way. Consider using the exact definition of AV1.
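
For reference, here is a C paraphrase of the AV1 definitions being suggested as references; the AV1 specification text remains the normative source.

#include <stdint.h>

/* Clip3(x, y, z): clamp z to the inclusive range [x, y] (as defined in AV1). */
static int64_t clip3(int64_t x, int64_t y, int64_t z)
{
    if (z < x) return x;
    if (z > y) return y;
    return z;
}

/* leb128(): up to 8 bytes, 7 value bits per byte, the top bit of each byte
 * signalling that another byte follows (as defined in AV1). */
static uint64_t read_leb128(const uint8_t **p)
{
    uint64_t value = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t byte = *(*p)++;
        value |= (uint64_t)(byte & 0x7f) << (i * 7);
        if (!(byte & 0x80))
            break;
    }
    return value;
}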

Syntax incomplete on parameter blocks

We have some concrete ideas brewing to fill in this information, hopefully to be proposed within a few days.

One quirk worth mentioning sooner rather than later: it may require creating new syntax rules, i.e. new data types that declare both a self-documenting, intuitive field name (e.g. master_mix_gain) and a function/class that carries a bit more syntax (e.g. parsing the parameter ID and a default gain value).
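
To make the quirk concrete, here is a purely hypothetical C sketch; the type and field names are invented for illustration and are not proposed text.

#include <stdint.h>

/* Hypothetical data type: a "parameter definition" that bundles a parameter
 * ID with a default value, so that a field such as master_mix_gain can keep
 * one self-documenting name while still pulling in the extra parsing syntax. */
typedef struct {
    uint64_t parameter_id;   /* which parameter block stream drives this value  */
    int16_t  default_value;  /* value to use when no parameter block applies    */
} param_definition;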

Determine whether we take "OBU_Substream" or "OBU_IA_Coded_Data"

Let's assume that there are N substreams for each IA frame. Then:
  • Option 1 - OBU_IA_Coded_Data requires 3 + 2N bytes (3 bytes for OBU syntax and 2N bytes for Substream_Boundary_Info)
  • Option 2 - OBU_Substream requires 3N bytes for OBU syntax.

From an overhead point of view, Option 2 needs an additional N - 3 bytes compared to Option 1:
  • Option 1 is better if N > 3
  • the two are equal if N = 3
  • Option 2 is better if N < 3
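
For example, with N = 8 substreams, Option 1 costs 3 + 2 x 8 = 19 bytes while Option 2 costs 3 x 8 = 24 bytes; with N = 2, Option 1 costs 7 bytes while Option 2 costs 6.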

From another point of view, Option 2 is well aligned with the philosophy that the IA bitstream has been designed to reuse a conventional substream-based encoding and decoding scheme. Option 2 does not require the Substream_Boundary_Info_OBU while Option 1 does.

Proposal: parameters should not be allowed to automate anything about decoding substreams

In the current version of the spec, it is mentioned that parameters may be used to provide time-varying changes to decoding, reconstruction, rendering, and mixing.

As we've been tinkering superficially with an implementation, it has become clear that parameters should not affect the decoding of substreams. They should only be applicable to audio element OBUs and mix presentation OBUs (i.e., reconstruction, rendering, and mixing). I feel this does not significantly limit the expressive power of the concepts we have.

Improve the introduction

The introduction could be improved:

  • Some context is missing before defining what an "IA bitstream" is.
  • It overlaps with the overview in section 3.

I suggest restructuring as follows:

  • start by defining what "immersive audio" means, for example, "the combination of 3D audio signals recreating a sound experience close to that of a natural environment"
  • then indicate that this specification defines a model for representing immersive audio content based on coded audio sub-streams contributing to audio elements meant to be rendered and mixed to form one or more immersive presentations (Figure 2)
  • then indicate that this specification defines a hypothetical architecture that pre-processes the source signal to split it into substreams and to derive the metadata driving the rendering and mixing, then encodes the substreams and the metadata separately, and finally combines them to form a bitstream. You can reuse Figure 1 and the bullet list below it, but with some updates (it should not mention the IAC file)
  • then indicate that this specification defines a way to store the bitstream in container formats for applications that need it

I don't think we need to list the rest of the specification. This is error-prone, and the specification already contains a table of contents on the side.

As a result, sections 1 and 2 should be merged into a single "Introduction" section.

Determine the substream format for AAC-LC

I am currently summarizing the definition of the IA bitstream based on our discussion and trying to define the OPUS_IA bitstream and the AAC-LC_IA bitstream. I have realized that I need to know what the substream format (i.e. frame format) for AAC-LC would be.

In the conventional case, ADTS is the frame format for AAC-LC and is the input to AAC decoders.
Here is adts_frame() containing just a single raw_data_block() for AAC-LC:
adts_frame() {
    adts_fixed_header();
    adts_variable_header();
    adts_error_check();
    raw_data_block();
}
However, when it is encapsulated in an mp4 file, the esds box and the sample sizes stored inside moov are generated from the ADTS headers, and only raw_data_block() is stored as the sample data.
Here, my point is that the frame format differs from the sample format.
NOTE: In the Opus case, both are the same, as each frame is a single Opus packet as defined in RFC 6716. So that case is clear to me.

For IAC, let's assume no OBU is applied, to keep the discussion simple.
I think that the sample data (or maybe part of the sample data) associated with a substream should be raw_data_block().
Here are two options for the substream format:

  • Option1: ADTS
  • Option2: raw_data_block()
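
A minimal C sketch of what Option 2 would mean in practice, i.e. stripping the ADTS header to recover raw_data_block(); the header offsets follow ISO/IEC 14496-3, while the function name is made up here.

#include <stddef.h>
#include <stdint.h>

/* Returns a pointer to raw_data_block() inside a single adts_frame(), or NULL
 * if the buffer does not start with a valid ADTS header. */
const uint8_t *adts_payload(const uint8_t *frame, size_t frame_size, size_t *payload_size)
{
    if (frame_size < 7 || frame[0] != 0xFF || (frame[1] & 0xF0) != 0xF0)
        return NULL;                                   /* no 0xFFF syncword */
    size_t header_size = (frame[1] & 0x01) ? 7 : 9;    /* +2 bytes of CRC when protection_absent == 0 */
    size_t frame_length = ((size_t)(frame[3] & 0x03) << 11) |
                          ((size_t)frame[4] << 3) |
                          ((size_t)frame[5] >> 5);     /* 13-bit length, header included */
    if (frame_length < header_size || frame_length > frame_size)
        return NULL;
    *payload_size = frame_length - header_size;
    return frame + header_size;
}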

Consider defining a Temporal Delimiter OBU

Temporal Delimiter OBUs help simple parsers (e.g. in packagers or in players) to identify the boundaries of bitstream parts that have the same timing. The drawback is the overhead associated with them. However, their use could be made optional, which means that:

  • encoders should generate them when it is desired to simplify the tasks of packagers or of simple readers but may remove them for applications where the overhead is too costly
  • there would exist two classes of readers: those that can process a bitstream without delimiters (e.g. a full decoder) and those that cannot (e.g. simple parsers)
  • when stored in container formats that already have mechanisms to represent time boundaries, temporal delimiters should be removed upon storage and reinserted upon extraction.

Restrictions on Opus-encoded streams

If I understand the proposal correctly, when Opus is used as the codec, not all configurations should be allowed. I think the following restrictions should apply:

  • OutputChannelCount shall be 1 or 2
  • ChannelMappingFamily shall be 0

Then there may be constraints on the Opus-encoded streams grouped in the same Channel Group:

  • should PreSkip be the same?
  • should InputSampleRate be the same?
  • should OutputGain be the same?
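
A small C sketch of how the two proposed restrictions could be checked against the Ogg Opus ID header of RFC 7845; the field offsets follow that RFC, while the function name is invented here.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the substream satisfies the restrictions proposed above. */
int opus_substream_allowed(const uint8_t *id_header, size_t size)
{
    if (size < 19 || memcmp(id_header, "OpusHead", 8) != 0)
        return 0;
    uint8_t output_channel_count   = id_header[9];   /* OutputChannelCount   */
    uint8_t channel_mapping_family = id_header[18];  /* ChannelMappingFamily */
    return (output_channel_count == 1 || output_channel_count == 2) &&
           channel_mapping_family == 0;
}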

Substream ordering when skipping IDs - proposed adjustment

If I understand the current version of the spec correctly, the proposal is that when stream IDs are skipped, the order of substreams is expected to be given by the order of the audio element OBUs and the ordering of the substreams referred to by those elements.

A disadvantage of this is that it places more requirements on the ordering of OBUs in ways that could be avoided. I like the idea that descriptor OBUs are generally not forced to be in a specific order - this makes creating/consuming the format less error-prone. Furthermore, the new definition of the Sync OBU may become the "authoritative" information about which substreams and parameter blocks to expect in the following OBU sequence. I think it is not too much further to suggest that the order of entries in the Sync OBU could also tell us what substream ordering to expect when stream IDs are skipped.

So, feedback and discussion requested - could we use Sync OBUs for this purpose, instead of relying on the ordering of Audio Element OBUs?
