aomediacodec / iamf: Immersive Audio Model and Formats
Home Page: https://aomediacodec.github.io/iamf/
The specification contains unresolved merge-conflict markers such as
<<<<<<< HEAD
or
=======
If some audio channels are not present in certain SampleGroups, etc., we still need a way to match up streams that should go to the same decoder. There doesn't seem to be enough information in the existing specification on how to do that.
If I understand the proposal correctly, when Opus is used as the codec, not all configurations should be allowed. I think the following restrictions should apply:
- OutputChannelCount shall be 1 or 2
- ChannelMappingFamily shall be 0
Then there may be constraints on the Opus-encoded streams grouped in the same Channel Group:
- PreSkip: should it be the same?
- InputSampleRate: should it be the same?
- OutputGain: should it be the same?
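As a quick illustration (my own sketch, not from the spec), checking these proposed constraints could look like the following C snippet; the struct simply mirrors the OpusHead/dOps fields named above:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  OutputChannelCount;
    uint8_t  ChannelMappingFamily;
    uint16_t PreSkip;
    uint32_t InputSampleRate;
    int16_t  OutputGain;
} OpusConfig;  /* hypothetical container for the fields discussed above */

/* Check one substream against the proposed "shall" constraints. */
static bool opus_substream_ok(const OpusConfig *c) {
    if (c->OutputChannelCount != 1 && c->OutputChannelCount != 2)
        return false;                     /* shall be 1 or 2 */
    return c->ChannelMappingFamily == 0;  /* shall be 0 */
}

/* Check the open questions: fields that may need to match across all
   substreams grouped in the same Channel Group. */
static bool channel_group_consistent(const OpusConfig *cfg, size_t n) {
    for (size_t i = 1; i < n; i++) {
        if (cfg[i].PreSkip         != cfg[0].PreSkip)         return false;
        if (cfg[i].InputSampleRate != cfg[0].InputSampleRate) return false;
        if (cfg[i].OutputGain      != cfg[0].OutputGain)      return false;
    }
    return true;
}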
be the same?Each internal decoder includes its own original AudioSampleEntry. This is potentially large, and is also not currently backwards compatible.
Section 3 is dedicated to the encapsulation in ISOBMFF.
Now, I am summarizing the definition of the IA bitstream based on our discussion and am trying to define the OPUS_IA bitstream and the AAC-LC_IA bitstream. Suddenly, I have realized I need to know what the substream format (i.e. frame format) for AAC-LC would be.
In the conventional case, ADTS is the frame format for AAC-LC and is the input to AAC decoders.
Here is adts_frame() with just a single raw_data_block() for AAC-LC:
adts_frame() {
adts_fixed_header();
adts_variable_header();
adts_error_check();
raw_data_block();
}
However, when it is encapsulated in an mp4a file, the esds box and the sample sizes, which are stored inside moov, are generated from the ADTS headers, and only raw_data_block() is stored as the sample data.
Here, my point is that the frame format differs from the sample format.
NOTE: In the Opus case, the two are the same, since a sample is a single Opus packet as defined in RFC 6716. So that case is clear to me.
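To make the difference concrete, here is a small sketch (my own illustration, not from any spec) of how a packager would carve raw_data_block() out of an adts_frame() when generating mp4a samples; it assumes one raw_data_block() per ADTS frame, as in the example above:

#include <stddef.h>
#include <stdint.h>

/* Returns the offset of raw_data_block() inside the ADTS frame and
   writes its length to *out_len; returns 0 on a malformed header. */
static size_t adts_payload_offset(const uint8_t *f, size_t n, size_t *out_len) {
    if (n < 7 || f[0] != 0xFF || (f[1] & 0xF0) != 0xF0)
        return 0;                                   /* no ADTS syncword */
    int protection_absent = f[1] & 0x01;
    size_t header_len = protection_absent ? 7 : 9;  /* +2 bytes of CRC */
    /* aac_frame_length: 13 bits spanning bytes 3..5, counting headers */
    size_t frame_len = ((f[3] & 0x03) << 11) | (f[4] << 3) | (f[5] >> 5);
    if (frame_len < header_len || frame_len > n)
        return 0;
    *out_len = frame_len - header_len;  /* raw_data_block() size -> stsz */
    return header_len;                   /* raw_data_block() starts here */
}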
For IAC, let's assume no OBU wrapping is applied, to keep the discussion simple.
I think that the sample data (or maybe a part of the sample data) associated with a substream should be raw_data_block().
Here are two options for the substream format:
The specification defines multiple things:
The current title does not reflect that. We should consider a new title, such as Immersive Audio Model and Formats (IAMF).
If I understand the current spec version correctly, the proposal is that when stream IDs are skipped, the order of substreams is expected to be given by the order of Audio Element OBUs, and by the ordering of the substreams referred to by those elements.
A disadvantage of this is that it places more requirements on the ordering of OBUs in ways that could be avoided. I like the idea that descriptor OBUs are generally not forced to be in a specific order; this makes the format less error prone to create and consume. Furthermore, the new definition of the Sync OBU may become the "authoritative" information about which substreams and parameter blocks to expect in the following OBU sequence. I think it is not too much further to suggest that the order of entries in the Sync OBU could also tell us what ordering to expect for substreams when stream IDs are skipped.
So, feedback and discussion requested: could we use Sync OBUs for this purpose, instead of relying on the ordering of Audio Element OBUs?
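To illustrate the idea (the entry layout below is invented for discussion, not proposed syntax), the mapping could be as simple as walking the Sync OBU entries in order:

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t stream_id;    /* as listed in the Sync OBU */
    int      is_substream; /* vs. a parameter block entry */
} SyncEntry;

/* Map the k-th substream OBU that follows the Sync OBU to a stream ID. */
static int nth_substream_id(const SyncEntry *e, size_t n_entries, size_t k) {
    for (size_t i = 0; i < n_entries; i++) {
        if (e[i].is_substream && k-- == 0)
            return (int)e[i].stream_id;
    }
    return -1; /* fewer substream entries than expected */
}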
It is better to keep byte alignment among the fields within metadata OBUs, for convenient I/O writing/reading by players.
According to the meeting notes for July 11 (https://docs.google.com/document/d/1f68rle2VcwObrufwcwnYuPhRyLlNunZZHdblRzMsZ4A/edit#), we need to update the introduction section and the informative annex to match the meeting notes.
The purpose of this issue is just to update the introduction section and the titles of the annexes.
NOTE: After the introduction section and annex titles are updated by resolving this issue, follow-up issues need to be filed to update the actual spec text to specify conformance points for parsing, decoding, rendering, etc.
Also compare to existing channel layout description formats.
Let's assume that there are N substreams for each IA frame. Then:
- Option 1: OBU_IA_Coded_Data requires 3 + 2N bytes (3 bytes for OBU syntax and 2N bytes for Substream_Boundary_Info).
- Option 2: OBU_Substream requires 3N bytes for OBU syntax.
From an overhead point of view, Option 2 needs an additional N - 3 bytes compared to Option 1:
- Option 1 is better if N > 3.
- Both are the same if N = 3.
- Option 2 is better if N < 3.
From another point of view, Option 2 is well aligned with the philosophy that the IA bitstream has been designed to reuse conventional substream-based encoding and decoding schemes. Option 2 does not require Substream_Boundary_Info_OBU while Option 1 does.
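For a quick sanity check of the arithmetic above, a throwaway program like this tabulates both options:

#include <stdio.h>

int main(void) {
    for (int n = 1; n <= 6; n++) {
        int option1 = 3 + 2 * n; /* OBU syntax + Substream_Boundary_Info */
        int option2 = 3 * n;     /* OBU syntax per substream */
        printf("N=%d: Option1=%2d bytes, Option2=%2d bytes\n",
               n, option1, option2);
    }
    return 0; /* Option 1 wins for N > 3; both are 9 bytes at N = 3 */
}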
The introduction could be improved:
I suggest restructuring as follows:
I don't think we need to list the rest of the specification. This is error prone, and the specification already contains a table of contents in the sidebar.
As a result, sections 1 and 2 should be merged, and only one "Introduction" section should be used.
E.g. there is a VideoAuxillaryTrack but no corresponding AudioAuxillaryTrack.
It's on our radar, and with any luck we may have some concrete PRs coming soon.
We have some concrete ideas brewing to fill in this information, hopefully to be proposed within a few days.
One quirk worth mentioning sooner rather than later: it may require creating new syntax rules, i.e., new data types that are used to declare both a self-documenting, intuitive field name (e.g., master_mix_gain) and a function/class that has a bit more syntax (e.g., parsing the parameter ID and a default gain value).
With a single track, sending a subset of the audio to a client requires the audio track to be disassembled and reassembled. With multiple tracks, the client can select the tracks it needs. Determine if the single track implementation is enough to meet the requirements.
Please refer to the AOM document that proposed the newer definition of the Sync OBU. In that document, a "concatenation rule" was proposed, which described how a parser would know how to position the audio frames and parameters that occur after a Sync OBU with respect to the timeline before the Sync OBU.
The rule proposed in the document works for "continuing" the same audio content across a Sync OBU.
Another rule that would be useful is one for concatenating two separate pieces of audio content, where we would want to position the new audio so that none of it overlaps the end of the previous audio. I believe the arithmetic for this rule would be just as simple as for the other rule.
If we want to do this, we may want to add a few bits to the Sync OBU that can control which kind of concatenation rule should be used.
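A minimal sketch of the arithmetic, assuming the two rules described above (the names and signature are mine, for discussion only):

#include <stdint.h>

typedef enum {
    CONCAT_CONTINUE,     /* same content continues across the Sync OBU */
    CONCAT_NEW_CONTENT   /* new content; must not overlap previous audio */
} ConcatRule;

/* prev_end: end time of the last audio before the Sync OBU.
   local_start: first timestamp carried by the new frames.
   Returns the offset to add to the new frames' local timestamps. */
static int64_t concat_offset(int64_t prev_end, int64_t local_start,
                             ConcatRule rule) {
    if (rule == CONCAT_CONTINUE)
        return 0;                   /* keep the existing timeline */
    return prev_end - local_start;  /* butt-join: new audio starts
                                       exactly where the old ended */
}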
In the current version of the spec, it is mentioned that parameters may be used to provide time-varying changes to decoding, reconstruction, rendering, and mixing.
As we've been tinkering superficially with implementation, it has become clear that parameters should not affect the decoding of substreams. They should only be applicable to Audio Element OBUs and Mix Presentation OBUs (i.e., reconstruction, rendering, and mixing). I feel this does not significantly limit the expressive power of the concepts we have.
As with the title in #92, the abstract could use a bit more explanation.
I think this is covered by #105 but it was explicitly mentioned so I want to give it its own issue.
Current audio codecs have a MIME codecs parameter of the following forms:
Opus
mp4a.40.2
We should define what the MIME codecs parameter should be for the single-track case. The purpose of the codecs parameter is for a receiver to determine if it can process the track without having to download anything (no header, no initialization segment, ...).
Do we expect that all IAC decoders will be able to process every file, or do we expect that some files won't be processable? In the latter case, the codecs parameter should contain something. Also, we should recursively convey the underlying codecs parameter, for example something like aiac.<IAC-specific-needs>.mp4a.40.2 or aiac.<IAC-specific-needs>.Opus.
Section 1.1 says:
"All of obu syntax is described in class which is a structure of C++ program language."
This is not correct: lots of the OBU syntax uses functions. Consider explaining a bit more how functions are used, e.g. clearly indicating that the conventions are the ones used in the AV1 video specification (its entire section 4). Alternatively, consider using the MPEG-4 Syntax Description Language.
Section 1.2 defines leb128 as a function in a section called "Type". Use an explicit reference to the AV1 video specification for this function.
Section 1.3 defines the clip3 function in an unclear way. Consider using the exact definition from AV1.
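For reference, here is what those two AV1 definitions look like when paraphrased into C (my paraphrase of the AV1 spec; the normative text should remain the reference):

#include <stddef.h>
#include <stdint.h>

/* Clip3(x, y, z): clamp z to the range [x, y]. */
static int64_t Clip3(int64_t x, int64_t y, int64_t z) {
    return (z < x) ? x : (z > y) ? y : z;
}

/* leb128(): read an unsigned integer of up to 8 groups of 7 bits,
   least-significant group first; the high bit of each byte is a
   continuation flag. Returns bytes consumed, or 0 on error. */
static size_t leb128(const uint8_t *buf, size_t n, uint64_t *value) {
    *value = 0;
    for (size_t i = 0; i < 8 && i < n; i++) {
        uint8_t b = buf[i];
        *value |= (uint64_t)(b & 0x7F) << (i * 7);
        if (!(b & 0x80))
            return i + 1;
    }
    return 0; /* truncated, or longer than 8 bytes */
}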
In Ogg Opus (RFC 7845), some of the ID header fields are defined in little-endian order, while the corresponding fields in the dOps box are defined in big-endian order.
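In practice this means a remuxer copying, e.g., InputSampleRate or OutputGain between the two containers has to byte-swap the 16/32-bit fields; a trivial sketch:

#include <stdint.h>

static uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

static uint16_t swap16(uint16_t v) {
    return (uint16_t)((v >> 8) | (v << 8));
}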
Unlike video, audio does not have keyframes (or alternatively, every sample is a keyframe). Consider using SampleGroups to avoid excessive numbers of timed metadata units.
Compare the fixed list of downmixing modes to those provided in the Opus-in-ISOBMFF spec and other ISOBMFF specifications.
A common practice in ISOBMFF is to splice files, like 2 programs, or one program and an ad. Splicing in ISOBMFF is not as simple as with MPEG-2 TS, unfortunately. The splicer has to merge the moov boxes from both files into one moov. In the single-track case, this means merging the trak boxes and in particular the stsd boxes. Each codec needs to specify the rules for when sample description entries can be merged. This should be done in this case too.
Not a priority yet, but something we probably want to address.
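As a starting point for such rules (purely my own sketch of the obvious baseline, not a proposal), a splicer could collapse two stsd entries only when they are byte-identical, and otherwise keep both and remap sample_description_index for samples coming from the second file:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const unsigned char *bytes; /* serialized sample entry */
    size_t size;
} SampleEntry;

/* Two sample description entries are mergeable only if byte-identical. */
static bool same_entry(const SampleEntry *a, const SampleEntry *b) {
    return a->size == b->size && memcmp(a->bytes, b->bytes, a->size) == 0;
}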
The current OBU header has so many bit flags that it unavoidably adds a byte to the header. This could become an unnecessary bitrate overhead, adding at least 0.5 kbps.
With the new proposal that has been agreed on, there may be more opportunities for reducing the header. Here are a few strategies and examples of what we could do:
At least with the current status of the spec, this would bring us back to 7 or 8 bits for all flags rather than 16.
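As a back-of-the-envelope check of the overhead claim (my assumptions: one extra header byte per audio substream OBU and 20 ms frames, i.e. 50 OBUs per second per substream):

#include <stdio.h>

int main(void) {
    double frames_per_sec = 50.0;  /* assumed 20 ms frame duration */
    for (int substreams = 1; substreams <= 4; substreams++) {
        double kbps = frames_per_sec * substreams * 8.0 / 1000.0;
        printf("%d substream(s): %.1f kbps of extra flag bytes\n",
               substreams, kbps);
    }
    return 0;
}

Under those assumptions it works out to about 0.4 kbps per substream, so a handful of substreams easily reaches the figure quoted above.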
Immersive Audio Bitstream (IAB) is already defined in SMPTE ST 2098-2. Consider defining a new name. I suggest using "Immersive Audio Sequence".
Temporal Delimiter OBUs help simple parsers (e.g. in packagers or in players) to identify boundaries of bitstream parts that have the same timing. The drawback is the overhead associated with it. However, their use could be made optional, which means that:
The goal of section 3 is to define the bitstream (or sequence, as suggested in #96). The current section has lots of introductory text (up to and including section 3.5) or text that is hard to understand without having seen the definitions and semantics. It is also error-prone because lots of text is repeated alongside the semantics.
I suggest that the section should simply consist of the semantics of the different elements forming a bitstream, followed by their syntax. For example:
3.1 Sequence Definition
3.1.1 Syntax
3.1.2 Semantics
3.2 Audio OBU
3.2.1 Syntax
3.2.2 Semantics
...
Considerations about random access, synchronization, etc. should follow in a section 4.
The Opus in ISOBMFF specification makes an edit list mandatory. We should do something similar, either in the bitstream, or as an edit list.
We should define which ISOBMFF samples the stream can be seeked to (if not all of them).
In a normal audio stream, all ISOBMFF samples are random access points, though there is preroll in order to prime the decoder. We should determine if we can use the same definition, or if we need to restrict it.
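If we do adopt the usual audio model, the seek procedure is trivial arithmetic (a sketch under that assumption; preroll here is the decoder's pre-roll requirement counted in ISOBMFF samples):

#include <stddef.h>

/* Seek sketch: to present audio from sample `target`, start decoding
   at the returned sample and discard output for [start, target). */
static size_t preroll_start(size_t target, size_t preroll) {
    return target > preroll ? target - preroll : 0;
}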
The description of audio OBUs should be self-contained within the IAC spec.
Particularly for descriptor OBUs - codec configs, elements, and mix presentations - sub-versions could be a way for us to incrementally improve these descriptors without having to bump a full version. Would this be useful?