The MP4-AT file format supports storing auxiliary tracks that are useful for post-capture editing and composition (for example, a depth map video track) alongside playable media data in an ISOBMFF/MP4 structure.
The goal of the format is to store auxiliary tracks such that the tracks are hidden from clients not implementing this spec. This prevents clients from interpreting auxiliary tracks as playable data.
Dependencies
The following are normative references for this specification:
- Key words for use in RFCs to Indicate Requirement Levels
- ISO/IEC 14496-12:2022 ISO base media file format (ISOBMFF/MP4)
- ISO/IEC 14496-10:2022 Coding of audio-visual objects Part 10: Advanced video coding (AVC)
- ISO/IEC 23008-2:2023 High efficiency coding and media delivery in heterogeneous environments Part 2: High efficiency video coding (HEVC)
- VP9 Video Codecs (VP9)
- AV1 Bitstream & Decoding Process Specification (AV1)
- Dynamic depth 1.0 spec
Introduction
The use of "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" is per the IETF standard defined in RFC 2119.
MP4-AT file format
The MP4-AT file format consists of primary tracks and auxiliary tracks to
enable various editing operations. The primary tracks (for example, a video
track that has had a bokeh effect applied to it) are written in the MP4 file as
usual, whereas the auxiliary tracks are written in an Auxiliary Tracks MP4.
The Auxiliary Tracks MP4 is another MP4 compliant container, and is placed
inside the axte (Auxiliary Tracks Extension) box. The axte box is
recommended to be the last box in the file, which makes it convenient to remove
auxiliary data by truncating the file.
This format is backward compatible: players that don't support the rest of this format will read and play the primary video tracks when loading the file.
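The truncation described above can be sketched as follows. This is a minimal illustration, assuming the whole file fits in memory; the function names iter_top_level_boxes and strip_trailing_axte are hypothetical and not part of this specification. The sketch only drops a trailing axte box and does not update any metadata stored elsewhere in the file.

```python
import struct


def iter_top_level_boxes(data: bytes):
    """Yield (box_type, start, size) for each top-level ISOBMFF box."""
    pos = 0
    while pos + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit largesize follows the type field.
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:
            # A size of 0 means the box extends to the end of the file.
            size = len(data) - pos
        if size < header:
            raise ValueError("malformed box size at offset %d" % pos)
        yield box_type.decode("latin-1"), pos, size
        pos += size


def strip_trailing_axte(data: bytes) -> bytes:
    """Return the file bytes without the last box if that box is 'axte'."""
    boxes = list(iter_top_level_boxes(data))
    if boxes and boxes[-1][0] == "axte":
        return data[: boxes[-1][1]]
    return data
```

If the axte box is not the last box in the file, this sketch leaves the data unchanged; removing it would then require rewriting the file rather than truncating it.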
The file has a moov.meta box with mdta handler that contains the following
metadata. The metadata may appear in any order.
Metadata key | Type indicator | Value |
| 78 (big-endian 64-bit unsigned integer) | The file offset (in bytes) of the |
| 78 (big-endian 64-bit unsigned integer) | The length (in bytes) of the |
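As an illustration of the type indicator used above, the following sketch decodes a metadata value carried as a big-endian 64-bit unsigned integer. The function name decode_uint64_value is hypothetical, and the sketch assumes the raw value bytes have already been extracted from the metadata item.

```python
import struct

TYPE_BE_UINT64 = 78  # type indicator: big-endian 64-bit unsigned integer


def decode_uint64_value(type_indicator: int, payload: bytes) -> int:
    """Decode a metadata value whose type indicator is 78."""
    if type_indicator != TYPE_BE_UINT64:
        raise ValueError("unexpected type indicator: %d" % type_indicator)
    if len(payload) != 8:
        raise ValueError("a 64-bit value needs exactly 8 bytes")
    return struct.unpack(">Q", payload)[0]
```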
Auxiliary tracks extension (axte) box
Syntax
The axte box is described using the semantics of the box defined in
ISO/IEC 14496-12:2022, 4.2:
aligned(8) class AuxiliaryTracksExtensionBox extends Box('axte') {
    bit(8) data[];
}
where the data field contains the Auxiliary Tracks MP4.
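A minimal sketch of serializing and parsing this box, assuming the standard ISOBMFF box header (32-bit size and four-character type, with a 64-bit largesize when the 32-bit size field is 1). The helper names wrap_axte and unwrap_axte are hypothetical.

```python
import struct


def wrap_axte(aux_mp4: bytes) -> bytes:
    """Wrap an Auxiliary Tracks MP4 in an 'axte' box."""
    size = 8 + len(aux_mp4)
    if size <= 0xFFFFFFFF:
        return struct.pack(">I4s", size, b"axte") + aux_mp4
    # Too large for a 32-bit size: write size=1 and a 64-bit largesize.
    return struct.pack(">I4sQ", 1, b"axte", 16 + len(aux_mp4)) + aux_mp4


def unwrap_axte(box: bytes) -> bytes:
    """Return the data field (the Auxiliary Tracks MP4) of an 'axte' box."""
    size, box_type = struct.unpack_from(">I4s", box, 0)
    if box_type != b"axte":
        raise ValueError("not an axte box")
    header = 8
    if size == 1:
        (size,) = struct.unpack_from(">Q", box, 8)
        header = 16
    elif size == 0:
        # Size 0 means the box runs to the end of the enclosing data.
        size = len(box)
    return box[header:size]
```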
Payload
The axte box's payload is an Auxiliary Tracks MP4.
The Auxiliary Tracks MP4 has the usual MP4 structure.
The Auxiliary Tracks MP4 contains sample metadata for all auxiliary tracks.
All auxiliary track sample payloads must be stored either in the Auxiliary Tracks MP4's mdat box, or in the outer MP4's mdat box (but not both).
In the former case, auxiliary.tracks.interleaved must be set to 0 (see "Static metadata" below) and the sample offsets in the axte.moov box are relative to the start of the Auxiliary Tracks MP4. This makes the Auxiliary Tracks MP4 self-contained, which means the Auxiliary Tracks MP4 can be read standalone without any references to the outer MP4.
In the latter case, auxiliary.tracks.interleaved must be set to 1 (see "Static metadata" below) and the sample offsets in the axte.moov box are relative to the start of the file and the sample payloads of the primary and auxiliary tracks may be interleaved. The axte.mdat box can be absent in this case.
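The two offset conventions above can be sketched as a single lookup. This is an illustration only: the function name is hypothetical, and aux_mp4_start is assumed to be the file offset of the first byte of the axte box's payload (that is, the start of the Auxiliary Tracks MP4).

```python
def absolute_sample_offset(stored_offset: int, interleaved: int,
                           aux_mp4_start: int) -> int:
    """Map a sample offset from axte.moov to an offset in the outer file."""
    if interleaved == 0:
        # Offsets are relative to the start of the Auxiliary Tracks MP4.
        return aux_mp4_start + stored_offset
    if interleaved == 1:
        # Offsets are already relative to the start of the file.
        return stored_offset
    raise ValueError("reserved auxiliary.tracks.interleaved value")
```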
Static metadata
The Auxiliary Tracks MP4 contains a moov.meta box with mdta handler that
contains the following metadata. The metadata may appear in any order.
Metadata key | Type indicator | Value |
auxiliary.tracks.interleaved (Optional) | 75 (8-bit unsigned integer) | 0: Indicates samples are not interleaved and are in the axte.mdat box. 1: Indicates samples are interleaved in the primary video track's mdat box. All other values are reserved and must not be used. Absence of this metadata indicates default value 0. |
auxiliary.tracks.map | 0 (reserved) | Binary format: |
The order of track types in the auxiliary.tracks.map indicates their order in
the Auxiliary Tracks MP4's payload.
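The rules for auxiliary.tracks.interleaved (an 8-bit value, default 0 when absent, all values other than 0 and 1 reserved) can be sketched as below. The function name parse_interleaved is hypothetical, and the sketch assumes the raw value bytes have already been read from the metadata item, with None standing for an absent key.

```python
from typing import Optional


def parse_interleaved(raw: Optional[bytes]) -> bool:
    """Return True if auxiliary track samples are interleaved in the outer file."""
    if raw is None:
        # Absence of the metadata indicates the default value 0.
        return False
    if len(raw) != 1:
        raise ValueError("expected a single 8-bit unsigned integer")
    flag = raw[0]
    if flag not in (0, 1):
        raise ValueError("reserved value: %d" % flag)
    return flag == 1
```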
Auxiliary track types
The Auxiliary Tracks MP4 may contain the following video and metadata tracks useful for editing.
Sharp video track
A video at full resolution without editable effects applied. The video track may be stored at a different resolution than the primary video track. The sharp video track may use any common video codec, and may be in standard or high dynamic range.
Depth video track
The depth video track provides the depth information encoded as a standard grayscale video. This is to allow decoding and encoding depth tracks on devices that don't have any special decoding or encoding support for depth. The depth video track may use H.264/AVC, H.265/HEVC, VP9, AV1 or any other common video codec. The depth video track can be 8-bit or 10-bit and linear- or inverse-encoded (refer to the Dynamic depth 1.0 spec).
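As a rough illustration of the two encodings, the sketch below converts one depth sample back to a distance using the RangeLinear and RangeInverse formulas from the Dynamic depth 1.0 spec. Several things here are assumptions rather than requirements of this format: the near and far parameters are taken as given, the sample is assumed to use the full code range of its bit depth, and the function name decode_depth is hypothetical.

```python
def decode_depth(code: int, bit_depth: int, near: float, far: float,
                 inverse: bool) -> float:
    """Convert a grayscale depth code to a distance in the units of near/far."""
    if bit_depth not in (8, 10):
        raise ValueError("depth video is 8-bit or 10-bit")
    max_code = (1 << bit_depth) - 1
    if not 0 <= code <= max_code:
        raise ValueError("code out of range for bit depth")
    dn = code / max_code  # normalized depth in [0, 1] (assumes full range)
    if inverse:
        # RangeInverse: dn = far * (d - near) / (d * (far - near)), solved for d.
        return (far * near) / (far - dn * (far - near))
    # RangeLinear: dn = (d - near) / (far - near), solved for d.
    return dn * (far - near) + near
```

Both encodings map the minimum code to near and the maximum code to far; they differ in how the intermediate codes are distributed across that range.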
Timed depth metadata track
The timed depth metadata track contains normalizing values to calculate depth, and a focal table that can be used to calculate the blur radius for a bokeh effect.
Sample mime type | |
Sample syntax | Binary format (all integers little endian): |
Translucent video track
A video track storing the alpha value (transparency) for each pixel in the corresponding frame. The minimum value indicates full transparency, while the maximum value indicates full opacity. Values in between represent varying levels of translucency on a linear scale, and compositing uses the normal blending mode with non-pre-multiplied color values. Similar to the depth video track, this track should also be encoded as a standard grayscale video.
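The compositing rule above (normal blending with non-pre-multiplied color) can be sketched for a single pixel as follows. The sketch assumes an opaque background, color components given as floats, and an alpha sample that uses the full code range of its bit depth; the function name blend_pixel is hypothetical.

```python
from typing import Sequence, Tuple


def blend_pixel(fg: Sequence[float], bg: Sequence[float], alpha_code: int,
                bit_depth: int = 8) -> Tuple[float, ...]:
    """Blend a foreground pixel over an opaque background pixel."""
    max_code = (1 << bit_depth) - 1
    if not 0 <= alpha_code <= max_code:
        raise ValueError("alpha code out of range for bit depth")
    # Minimum code is fully transparent, maximum code is fully opaque.
    alpha = alpha_code / max_code
    # Non-pre-multiplied color: the foreground is scaled by alpha here.
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))
```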
Example use cases
Storing a playable rendered bokeh video in a primary track, with auxiliary video tracks for the original (pre-blurring) sharp color data and a depth map, and an auxiliary timed metadata track with depth metadata reflecting the focus point at each frame. The auxiliary tracks can then be used in a video editor to modify the focus subject and re-render a high quality bokeh video track.
Storing a pre-rendered translucent 'sticker' video, for example, an animated emoji video on a white background in a primary video track, with an auxiliary video track containing an alpha map. The auxiliary track can then be used by a compositor to blend the sticker with a background using translucency information from the auxiliary track.