MP4 With Auxiliary Tracks Extension (MP4-AT) File Format 0.9

The MP4-AT file format supports storing auxiliary tracks that are useful for post-capture editing and composition (for example, a depth map video track) alongside playable media data in an ISOBMFF/MP4 structure.

The goal of the format is to store auxiliary tracks such that the tracks are hidden from clients not implementing this spec. This prevents clients from interpreting auxiliary tracks as playable data.

Dependencies

The following are normative references for this specification:

Introduction

The use of "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" follows the definitions in IETF RFC 2119.

MP4-AT file format

The MP4-AT file format consists of primary tracks and auxiliary tracks to enable various editing operations. The primary tracks (for example, a video track that has had a bokeh effect applied to it) are written in the MP4 file as usual, whereas the auxiliary tracks are written in an Auxiliary Tracks MP4. The Auxiliary Tracks MP4 is another MP4-compliant container, and is placed inside the axte (Auxiliary Tracks Extension) box. It is recommended that the axte box be the last box in the file, so that auxiliary data can be removed by truncating the file.
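The truncation described above can be sketched as follows. This is a minimal illustration, not part of the format: the function names are hypothetical, the file is handled as an in-memory byte string, and the box header layout (32-bit size, 4-byte type, optional 64-bit largesize) is the standard ISOBMFF one. The sketch leaves the outer moov.meta metadata untouched.

```python
import struct


def top_level_boxes(data: bytes):
    """Yield (offset, size, box_type) for each top-level box in an MP4 byte string."""
    pos = 0
    while pos + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, pos)
        if size == 1:
            # A size of 1 means a 64-bit largesize follows the type field.
            (size,) = struct.unpack_from(">Q", data, pos + 8)
        elif size == 0:
            # A size of 0 means the box extends to the end of the file.
            size = len(data) - pos
        if size < 8 or pos + size > len(data):
            raise ValueError(f"malformed box at offset {pos}")
        yield pos, size, box_type
        pos += size


def strip_trailing_axte(data: bytes) -> bytes:
    """Return data without its axte box, provided axte is the last top-level box."""
    boxes = list(top_level_boxes(data))
    if not boxes or boxes[-1][2] != b"axte":
        # No axte box, or it is not last: truncation alone cannot remove it.
        return data
    return data[: boxes[-1][0]]
```

If the axte box is not the last box, the sketch returns the input unchanged instead of attempting to rewrite offsets.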

This format is backward compatible: players that don't support the rest of this format will read and play the primary video tracks when loading the file.

Line diagram demonstrating the arrangement of elements in an MP4-AT file

The file has a moov.meta box with an mdta handler that contains the following metadata. The metadata may appear in any order.

Metadata key	Type indicator	Value
`auxiliary.tracks.offset`	78 (big endian 64-bit unsigned integer)	The file offset (in bytes) of the `axte` box
`auxiliary.tracks.length`	78 (big endian 64-bit unsigned integer)	The length (in bytes) of the `axte` box
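As an illustration of how a reader might use these two values, the sketch below decodes them as big-endian 64-bit unsigned integers and slices the axte box out of the file. It assumes the raw 8-byte values have already been extracted from moov.meta (that step is not shown), and the helper names are hypothetical.

```python
import struct

TYPE_BE_UINT64 = 78  # type indicator listed in the table above


def encode_u64(value: int) -> bytes:
    """Encode a value as a big-endian 64-bit unsigned integer."""
    return struct.pack(">Q", value)


def decode_u64(raw: bytes) -> int:
    """Decode a big-endian 64-bit unsigned integer."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    return struct.unpack(">Q", raw)[0]


def slice_axte_box(file_bytes: bytes, offset_raw: bytes, length_raw: bytes) -> bytes:
    """Return the bytes of the axte box located by the offset and length values."""
    offset = decode_u64(offset_raw)
    length = decode_u64(length_raw)
    if offset + length > len(file_bytes):
        raise ValueError("axte box extends past the end of the file")
    box = file_bytes[offset : offset + length]
    # Bytes 4..8 of any ISOBMFF box header hold the four-character box type.
    if box[4:8] != b"axte":
        raise ValueError("no axte box at the recorded offset")
    return box
```

Checking the box type at the recorded offset is a cheap guard against stale metadata, for example after the file has been edited by a tool that does not implement this format.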

Auxiliary tracks extension (axte) box

Syntax

The axte box is described using the semantics of the box defined in ISO/IEC 14496-12:2022, section 4.2:

aligned(8) class AuxiliaryTracksExtensionBox extends Box('axte') {
  bit(8) data[];
}

where the data field contains the Auxiliary Tracks MP4.
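A reader could extract the data field along these lines. This is a sketch under the standard ISOBMFF box header rules (a size of 1 signals a 64-bit largesize, a size of 0 means the box runs to the end of the file); the function name is hypothetical.

```python
import struct


def read_axte_payload(buf: bytes, offset: int = 0) -> bytes:
    """Return the data field (the Auxiliary Tracks MP4) of the axte box at offset."""
    size, box_type = struct.unpack_from(">I4s", buf, offset)
    header_size = 8
    if size == 1:
        # 64-bit largesize follows the 4-byte type field.
        (size,) = struct.unpack_from(">Q", buf, offset + 8)
        header_size = 16
    elif size == 0:
        # Box extends to the end of the buffer.
        size = len(buf) - offset
    if box_type != b"axte":
        raise ValueError(f"expected axte box, found {box_type!r}")
    if size < header_size or offset + size > len(buf):
        raise ValueError("axte box size is inconsistent with the buffer")
    return buf[offset + header_size : offset + size]
```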

Payload

The axte box's payload is an Auxiliary Tracks MP4. The Auxiliary Tracks MP4 has the usual MP4 structure.

Line diagram demonstrating the arrangement of elements in the Auxiliary Tracks MP4

The Auxiliary Tracks MP4 contains sample metadata for all auxiliary tracks. All auxiliary track sample payloads must be stored either in the Auxiliary Tracks MP4's mdat box, or in the outer MP4's mdat box (but not both).

In the former case, auxiliary.tracks.interleaved must be set to 0 (see "Static Metadata" below) and the sample offsets in the axte.moov box are relative to the start of the Auxiliary Tracks MP4. This makes the Auxiliary Tracks MP4 self-contained: it can be read standalone, without any references to the outer MP4.

In the latter case, auxiliary.tracks.interleaved must be set to 1 (see "Static Metadata" below) and the sample offsets in the axte.moov box are relative to the start of the file. The sample payloads of the primary and auxiliary tracks may be interleaved, and the axte.mdat box can be absent.
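The two offset conventions can be summarized in a small helper. This is only a sketch: the function and parameter names are hypothetical, and aux_mp4_start is assumed to be the file offset of the first byte of the Auxiliary Tracks MP4 (that is, the axte box offset plus the size of the axte box header).

```python
def absolute_sample_offset(stored_offset: int, interleaved: int, aux_mp4_start: int) -> int:
    """Convert a sample offset from axte.moov into an offset from the start of the file."""
    if interleaved == 0:
        # Offsets are relative to the start of the Auxiliary Tracks MP4.
        return aux_mp4_start + stored_offset
    if interleaved == 1:
        # Offsets are already relative to the start of the outer file.
        return stored_offset
    raise ValueError(f"reserved auxiliary.tracks.interleaved value: {interleaved}")
```

For example, with the Auxiliary Tracks MP4 starting at byte 5000, a stored offset of 100 resolves to byte 5100 when the flag is 0 and to byte 100 when the flag is 1.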

Static metadata

The Auxiliary Tracks MP4 contains a moov.meta box with an mdta handler that contains the following metadata. The metadata may appear in any order.

Metadata key	Type indicator	Value
`auxiliary.tracks.interleaved` (optional)	75 (8-bit unsigned integer)	0: Indicates samples are not interleaved and are in the `axte.mdat` box. 1: Indicates samples are interleaved in the primary video track's mdat box. All other values are reserved and must not be used. Absence of this metadata indicates the default value 0.
`auxiliary.tracks.map`	0 (reserved)	Binary data in the format described below

Binary format of `auxiliary.tracks.map`:

1 byte version = 1
1 byte track count = n
n bytes track types from the following set
- 0 = Sharp video
- 1 = Depth video (linear)
- 2 = Depth video (inverse)
- 3 = Timed depth metadata
- 4 = Translucent video
- 5-127 = Reserved for future use
- 128-255 = Custom track types

The order of track types in the auxiliary.tracks.map indicates their order in the Auxiliary Tracks MP4's payload.
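The binary layout of auxiliary.tracks.map (version byte, track count byte, then one type byte per track) could be parsed as in the sketch below. The function name and the returned labels are illustrative choices, not defined by the format.

```python
TRACK_TYPES = {
    0: "sharp video",
    1: "depth video (linear)",
    2: "depth video (inverse)",
    3: "timed depth metadata",
    4: "translucent video",
}


def parse_tracks_map(raw: bytes) -> list:
    """Return one label per auxiliary track, in the order stored in the map."""
    if len(raw) < 2:
        raise ValueError("map is too short to hold version and track count")
    version, count = raw[0], raw[1]
    if version != 1:
        raise ValueError(f"unsupported auxiliary.tracks.map version: {version}")
    types = raw[2 : 2 + count]
    if len(types) != count:
        raise ValueError("map is shorter than its track count")
    labels = []
    for track_type in types:
        if track_type in TRACK_TYPES:
            labels.append(TRACK_TYPES[track_type])
        elif track_type <= 127:
            labels.append(f"reserved ({track_type})")
        else:
            labels.append(f"custom ({track_type})")
    return labels
```

For instance, the bytes 01 03 00 01 03 describe three tracks: sharp video, linear depth video, and timed depth metadata, in that order.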

Auxiliary track types

The Auxiliary Tracks MP4 may contain the following video and metadata tracks useful for editing.

Sharp video track

A video at full resolution without editable effects applied. The video track may be stored at a different resolution than the primary video track. The sharp video track may use any common video codec, and may be in standard or high dynamic range.

Depth video track

The depth video track provides the depth information encoded as a standard grayscale video. This is to allow decoding and encoding depth tracks on devices that don't have any special decoding or encoding support for depth. The depth video track may use H.264/AVC, H.265/HEVC, VP9, AV1 or any other common video codec. The depth video track can be 8-bit or 10-bit and linear- or inverse-encoded (refer to the Dynamic depth 1.0 spec).

Timed depth metadata track

The timed depth metadata track contains normalizing values to calculate depth, and a focal table that can be used to calculate the blur radius for a bokeh effect.
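As a rough illustration of how the normalizing values might be applied, the sketch below converts a decoded depth sample into a distance. It rests on assumptions not stated in this document: the formulas are the linear and inverse range mappings as understood from the Dynamic depth 1.0 spec, and the sample is assumed to use the full code range (0 to 255 for 8-bit, 0 to 1023 for 10-bit). The function name is hypothetical.

```python
def depth_from_sample(value: int, bit_depth: int, near: float, far: float, inverse: bool) -> float:
    """Map a grayscale depth sample to a distance using the near and far values."""
    max_value = (1 << bit_depth) - 1
    # Assumption: full-range samples, so the normalized value spans 0.0 to 1.0.
    normalized = value / max_value
    if inverse:
        # Inverse encoding (assumed form): more code values are spent on near distances.
        return (far * near) / (far - normalized * (far - near))
    # Linear encoding (assumed form): code values are spread evenly between near and far.
    return normalized * (far - near) + near
```

Under both mappings a sample of 0 yields the near distance and the maximum sample yields the far distance; they differ only in how intermediate values are distributed.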

Sample mime type

application/x-depth-metadata

Sample syntax

Binary format (all ints little endian):

Near distance (16-bit float)
Far distance (16-bit float)
Focal table entry count (16-bit int)
Focal table entries (one per entry counted above)
- Entry distance (16-bit float)
- Entry radius (16-bit float)
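A sample in this layout could be decoded as sketched below. The sketch assumes that the entry count is unsigned, that the 16-bit floats are IEEE 754 half-precision values stored little endian like the integers, and that the entries are packed back to back after the count; none of these details are confirmed by this document. The function name is hypothetical.

```python
import struct


def parse_depth_metadata_sample(sample: bytes) -> dict:
    """Decode one timed depth metadata sample into near, far, and a focal table."""
    # "<eeH": little-endian half float, half float, unsigned 16-bit int (6 bytes).
    near, far, count = struct.unpack_from("<eeH", sample, 0)
    expected = 6 + 4 * count
    if len(sample) < expected:
        raise ValueError(f"sample has {len(sample)} bytes, expected at least {expected}")
    # Each entry is a (distance, radius) pair of half floats (4 bytes).
    focal_table = [struct.unpack_from("<ee", sample, 6 + 4 * i) for i in range(count)]
    return {"near": near, "far": far, "focal_table": focal_table}
```

The `e` format code for half-precision floats is available in Python's struct module from version 3.6.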

Translucent video track

A video track storing the alpha value (transparency) for each pixel in the corresponding frame. A minimum value indicates fully transparent, while the maximum value indicates full opacity. Values in between represent varying levels of translucency on a linear scale, and compositing uses the normal blending mode with non-pre-multiplied color values. Similar to the depth video track, this track should also be encoded as a standard grayscale video.
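The compositing rule described above (normal blending, non-pre-multiplied color, linear alpha scale) can be sketched per pixel as follows. The function name is hypothetical, and the sketch assumes the alpha sample uses the full code range for its bit depth; the color space in which blending takes place is not specified here and is left to the caller.

```python
def blend_pixel(fg, bg, alpha_value: int, bit_depth: int = 8):
    """Blend a non-pre-multiplied foreground color over a background color."""
    max_value = (1 << bit_depth) - 1
    # Minimum sample value is fully transparent, maximum is fully opaque.
    alpha = alpha_value / max_value
    # Normal blend mode: out = alpha * foreground + (1 - alpha) * background.
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))
```

Because the color values are not pre-multiplied, the foreground is scaled by alpha at blend time; with pre-multiplied color that multiplication would be skipped.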

Example use cases

- Storing a playable rendered bokeh video in a primary track, with auxiliary video tracks for the original (pre-blurring) sharp color data and a depth map, and an auxiliary timed metadata track with depth metadata reflecting the focus point at each frame. The auxiliary tracks can then be used in a video editor to modify the focus subject and re-render a high quality bokeh video track.
- Storing a pre-rendered translucent 'sticker' video, for example, an animated emoji video on a white background in a primary video track, with an auxiliary video track containing an alpha map. The auxiliary track can then be used by a compositor to blend the sticker with a background using translucency information from the auxiliary track.