MP4 File Format Series 1 - Overview

License: CC 4.0 BY-SA

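This article examines the MP4 file format, covering its core concepts (the movie container, tracks, and samples) as well as its physical structure, temporal structure, interleaving, and composition. The MP4 format does not constrain the physical format of the media: media data may be spread across different files, with metadata describing its layout and timing. File data is stored in boxes, and the metadata defines where the media data is and how large it is. A file can hold multiple tracks, such as video, audio, and hint tracks. The temporal structure is independent of the physical order, and edit lists and time scales allow timing to be handled flexibly. Interleaving is done per chunk rather than per sample, to optimize storage and reading. Finally, the article shows how to parse an MP4 file to find the media data for a given time.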

Overview and Introduction

Core Concepts

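In the MP4 file format, all content lives in a container called a movie. A movie can consist of multiple tracks. Each track is a timed sequence of media, for example a sequence of video frames. Each time unit in a track is a sample, which can be a frame of video or audio. Samples are arranged in time order. Note that one audio frame can be broken into several audio samples, so audio is usually measured in samples rather than frames. In the MP4 format definition, the word sample denotes a timed frame or unit of data. Each track has one or more sample descriptions. Every sample in the track is tied to a sample description by reference. The sample description defines how the sample is decoded, for example which compression algorithm is used.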

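Unlike other multimedia file formats, the MP4 file format keeps apart several concepts that are often tied together elsewhere; understanding these distinctions is the key to understanding the format.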

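The physical format of the file does not constrain the format of the media itself. For example, many file formats split the media data into frames, with headers or other data immediately attached to each video frame (for example, MPEG-2). The MP4 file format does not.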

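Neither the physical layout of the file nor the placement of the media data is tied to the time order of the media. Video frames need not appear in the file in time order. This means that the file must carry structures that describe the placement of the media and its timing.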

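All data in an MP4 file is encapsulated in boxes (formerly called atoms). All metadata (the data describing the media), including the data that defines the placement and timing of the media, is contained in structured boxes. The MP4 file format defines the format of these boxes. The metadata refers to the media data (for example, video frames). The media data can be in the same file, inside one or more boxes, or in other files; the metadata can refer to other files by URL, and the placement of the media data within those referenced files is described entirely by the metadata in the primary file. The other files need not be in MP4 format; they might not contain a single box, for example.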
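As an illustration of the box structure, here is a minimal Python sketch that walks the boxes in a byte buffer. It assumes the usual box header: a 32-bit big-endian size followed by a four-character type, where a size of 1 means a 64-bit size follows the type and a size of 0 means the box runs to the end of the data. The name `iter_boxes` and the synthetic bytes are illustrative only and are not taken from the sample file.

```python
import struct

def iter_boxes(data, start=0, end=None):
    """Yield (type, payload_offset, payload_size) for each box in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:                      # 64-bit size follows the type
            if pos + 16 > end:
                break
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:                    # box extends to the end of the data
            size = end - pos
        if size < header or pos + size > end:
            break                          # malformed or truncated box; stop
        yield box_type.decode("latin-1"), pos + header, size - header
        pos += size

# Synthetic example: an empty 'free' box followed by a 'moov' holding an empty 'mvhd'.
mvhd = struct.pack(">I4s", 8, b"mvhd")
moov = struct.pack(">I4s", 8 + len(mvhd), b"moov") + mvhd
sample = struct.pack(">I4s", 8, b"free") + moov

top_level = [(box_type, size) for box_type, _, size in iter_boxes(sample)]
children = list(iter_boxes(sample, 16, 24))   # payload range of the 'moov' box
```

A container box such as moov is read by calling the same function again on its payload range, which is how the nested dump further down could be produced.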

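There are many kinds of tracks, of which three are the most important. A video track contains video samples; an audio track contains audio samples; a hint track is somewhat different: it describes how a streaming server should assemble the media data in the file into packets that conform to a streaming protocol. If the file is only played back locally, hint tracks can be ignored; they matter only for streaming.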

Physical structure of the media

The boxes of the sample table define the layout of the media data. They include the data reference, the sample size table, the sample to chunk table, and the chunk offset table. Together, these tables give the location and size in the file of every sample in a track.

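The data reference allows media to be located in secondary media files. A movie can therefore be assembled from several different files in a media library without copying them all into a new file. This is very helpful for video editing, for example.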

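To save space, these tables are compact. In addition, interleaving is not done sample by sample; instead, several samples from a single track are grouped together, then another set of samples is grouped, and so on. Such a run of consecutive samples from one track is called a chunk. Each chunk has an offset in the file, measured from the beginning of the file, and within a chunk the samples are stored contiguously.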

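So if a chunk contains two samples, the position of the second sample is the chunk offset plus the size of the first sample. The chunk offset table gives the offset of each chunk, and the sample to chunk table gives the mapping from sample numbers to chunk numbers.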
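The offset rule above can be sketched in a few lines of Python. The function name and the sample sizes of 300 and 450 bytes are hypothetical, chosen only to illustrate the arithmetic.

```python
def sample_file_offset(chunk_offset, sizes_in_chunk, index_in_chunk):
    """Offset of a sample = chunk offset + sizes of the samples before it in the chunk."""
    return chunk_offset + sum(sizes_in_chunk[:index_in_chunk])

# Hypothetical chunk at byte 20472 holding two samples of 300 and 450 bytes.
first = sample_file_offset(20472, [300, 450], 0)    # starts where the chunk starts
second = sample_file_offset(20472, [300, 450], 1)   # chunk offset + size of sample 1
```

This works only because samples are contiguous inside a chunk; it says nothing about the space between chunks.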

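Note that there may be dead space between chunks, regions that no media data refers to, but there is no dead space inside a chunk. So if some media data becomes unnecessary during editing, it can simply be left in place and unreferenced, with no need to delete it. Similarly, if the media is stored in a secondary file whose format is not MP4, the headers or other structures of that foreign file can simply be skipped.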

Temporal structure of the media

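Time in the file can be understood through a few structures. The movie and each track have a timescale. It defines a time axis by stating how many ticks there are per second. Choosing this number well allows exact timing. For an audio track it is normally the audio sampling rate. For a video track the situation is slightly more complicated and the value must be chosen with care. For example, a media timescale of 30000 with media sample durations of 1001 exactly defines NTSC video timing (usually quoted, imprecisely, as 29.97 frames per second) and provides 19.9 hours of time in 32 bits.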
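The NTSC figures can be checked with a short calculation. The sketch below assumes that the 19.9-hour figure corresponds to a signed 32-bit tick counter (2^31 ticks); that reading is an assumption made here to reproduce the number, not something stated in the text.

```python
from fractions import Fraction

timescale = 30000          # ticks per second
sample_duration = 1001     # ticks per video frame

# Exact frame rate as a fraction: 30000/1001, roughly 29.97 frames per second.
frame_rate = Fraction(timescale, sample_duration)

# Hours representable if the tick count is held in a signed 32-bit value (assumption).
hours_signed_32 = (2 ** 31) / timescale / 3600

print(float(frame_rate), hours_signed_32)
```

Keeping the rate as a fraction shows why a timescale of 30000 is used: the frame duration is an exact integer number of ticks, so no rounding error accumulates.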

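The temporal structure of a track is affected by an edit list, which serves two purposes: it maps portions of the track's timeline into the overall movie (possibly reusing them), and it inserts blank time, that is, empty edits. Note in particular that if a track does not start at the beginning of the presentation, the first edit in its edit list must be an empty edit.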

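The overall duration of each track is defined in the headers, which gives a summary of the track, and each sample has a defined duration. The exact presentation time of a sample, its time-stamp, is the sum of the durations of the preceding samples.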
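A minimal sketch of that rule in Python, assuming the durations are stored as run-length pairs of (sample count, duration per sample), as in a time-to-sample table. The function name and the example values are hypothetical.

```python
from itertools import accumulate

def decode_times(stts_entries):
    """Expand (sample_count, sample_delta) runs into per-sample start times."""
    durations = [delta for count, delta in stts_entries for _ in range(count)]
    # Running sum starting at 0; the last value is the total duration, not a start time.
    starts = list(accumulate(durations, initial=0))
    return starts[:-1]

# Three samples of 40 ticks followed by two samples of 20 ticks (made-up values).
times = decode_times([(3, 40), (2, 20)])
```

Each entry in `times` is the sum of the durations of all earlier samples, so the first sample always starts at 0.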

Interleave

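The temporal and physical structures of the file can be aligned, meaning that the media data is stored in the container in time order. In addition, if the media data of several tracks is contained in the same file, it can be interleaved. Typically, to make reading a single track's media data easy while keeping the tables compact, interleaving is done at a sensible time interval (for example, 1 second) rather than sample by sample. This reduces the number of chunks and therefore the size of the chunk offset table.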

Composition

If several audio tracks are contained in the same file, they can be mixed for playback, controlled by an overall track volume and a left/right balance.

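Similarly, video tracks can be composed according to their layer order (from back to front) and their composition mode. In addition, each track can be transformed by a matrix, and so can the movie as a whole. This supports both simple operations (for example, scaling the picture or correcting a 90° rotation) and more complex ones (for example, shearing or arbitrary rotation).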

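This composition method is deliberately simple and serves only as a default; more powerful methods are defined in other MPEG-4 documents (for example, MPEG-4 BIFS).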

Darwin Streaming Server ships with some good tools that help with analyzing the MP4 file format.

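Parsing a file byte by byte yourself, however, gives a better understanding of the format, so this article analyzes the file structure byte by byte. The example file is sample_100kbit.mp4, which is included with DSS.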

The time scale of the audio track, found in the mdhd (media header) atom, is the sampling rate of the audio.

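The movie atom defines the data of a movie. Its type is 'moov', and it is a container atom that must contain at least one of three atoms: a movie header atom ('mvhd'), a compressed movie atom ('cmov'), or a reference movie atom ('rmra'). An uncompressed movie atom must contain at least one of a movie header atom and a reference movie atom. It may also contain other atoms, such as a clipping atom ('clip'), one or more track atoms ('trak'), a color table atom ('ctab'), and a user data atom ('udta'). The movie header atom defines the time scale, the duration, and the display characteristics of the whole movie. A track atom defines the information for one track of the movie. A track is a unit of media in the movie that can be manipulated independently; a sound track, for example, is one track.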

Compressed movie atoms and reference movie atoms are rarely used and are outside the scope of this article, which deals mainly with uncompressed movie atoms.

moov atom format

Field	Length (bytes)	Description
Size	4	Number of bytes in this movie atom
Type	4	moov

Parsing part of the bytes of the actual sample_100kbit.mp4 gives the following result.

The moov atom mainly contains these child atoms: a movie header atom (mvhd), an audio track atom (trak), and a video track atom (trak).

Dumping sample_100kbit_nohint.mp4 meta-information...
type ftyp
  majorBrand = mp42
  minorVersion = 1 (0x00000001)
  <table entries suppressed>
type moov
  type mvhd
   version = 0 (0x00)
   flags = 0 (0x000000)
   creationTime = 3250080355 (0xc1b84a63)
   modificationTime = 3250080355 (0xc1b84a63)
   timeScale = 600 (0x00000258)
   duration = 42000 (0x0000a410)
   rate = 1.000000
   volume = 1.000000
   reserved1 = <70 bytes>
   00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00
   00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00
   00 00 00 00 00 00 00 00 00 00 40 00 00 00 00 00
   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   00 00 00 00 00 00
   nextTrackId = 3 (0x00000003)
  type trak
   type tkhd
    version = 0 (0x00)
    flags = 1 (0x000001)
    creationTime = 3250080339 (0xc1b84a53)
    modificationTime = 3250080355 (0xc1b84a63)
    trackId = 1 (0x00000001)
    reserved1 = <4 bytes> 00 00 00 00
    duration = 42000 (0x0000a410)
    reserved2 = <12 bytes> 00 00 00 00 00 00 00 00 00 00 00 00
    volume = 1.000000
    reserved3 = <38 bytes>
    00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 40 00 00 00
    width = 0.000000
    height = 0.000000
   type edts
    type elst
     version = 0 (0x00)
     flags = 0 (0x000000)
     entryCount = 1 (0x00000001)
     <table entries suppressed>
   type mdia
    type mdhd
     version = 0 (0x00)
     flags = 0 (0x000000)
     creationTime = 3250080355 (0xc1b84a63)
     modificationTime = 3250080355 (0xc1b84a63)
     timeScale = 8000 (0x00001f40)
     duration = 560128 (0x00088c00)
     language = 5575 (0x15c7)
     reserved = <2 bytes> 00 00
    type hdlr
     version = 0 (0x00)
     flags = 0 (0x000000)
     reserved1 = <4 bytes> 00 00 00 00
     handlerType = soun
     reserved2 = <12 bytes> 00 00 00 00 00 00 00 00 00 00 00 00
     name = 苹果声频媒体处理程序
    type minf
     type smhd
      version = 0 (0x00)
      flags = 0 (0x000000)
      reserved = <4 bytes> 00 00 00 00
     type dinf
      type dref
       version = 0 (0x00)
       flags = 0 (0x000000)
       entryCount = 1 (0x00000001)
       type url
        version = 0 (0x00)
        flags = 1 (0x000001)
        location = (null)
     type stbl
      type stsd
       version = 0 (0x00)
       flags = 0 (0x000000)
       entryCount = 1 (0x00000001)
       type mp4a
        reserved1 = <6 bytes> 00 00 00 00 00 00
        dataReferenceIndex = 1 (0x0001)
        soundVersion = 0 (0x0000)
        reserved2 = <6 bytes> 00 00 00 00 00 00
        channels = 2 (0x0002)
        sampleSize = 16 (0x0010)
        packetSize = 0 (0x0000)
        timeScale = 8000 (0x00001f40)
        reserved3 = <2 bytes> 00 00
        type esds
         version = 0 (0x00)
         flags = 0 (0x000000)
         ESID = 0 (0x0000)
         streamDependenceFlag = 0 (0x0) <1 bits>
         URLFlag = 0 (0x0) <1 bits>
         OCRstreamFlag = 0 (0x0) <1 bits>
         streamPriority = 0 (0x00) <5 bits>
         decConfigDescr
          objectTypeId = 64 (0x40)
          streamType = 5 (0x05) <6 bits>
          upStream = 0 (0x0) <1 bits>
          reserved = 1 (0x1) <1 bits>
          bufferSizeDB = 6144 (0x001800) <24 bits>
          maxBitrate = 20000 (0x00004e20)
          avgBitrate = 20000 (0x00004e20)
          decSpecificInfo
           info = <2 bytes> 15 90
          profileLevelIndicationIndexDescr
         slConfigDescr
          predefined = 2 (0x02)
         ipiPtr
         ipIds
         ipmpDescrPtr
         langDescr
         qosDescr
         regDescr
         extDescr
      type stts
       version = 0 (0x00)
       flags = 0 (0x000000)
       entryCount = 1 (0x00000001)
       <table entries suppressed>
      type stsc
       version = 0 (0x00)
       flags = 0 (0x000000)
       entryCount = 33 (0x00000021)
       <table entries suppressed>
      type stsz
       version = 0 (0x00)
       flags = 0 (0x000000)
       sampleSize = 0 (0x00000000)
       sampleCount = 547 (0x00000223)
       <table entries suppressed>
      type stco
       version = 0 (0x00)
       flags = 0 (0x000000)
       entryCount = 282 (0x0000011a)
       <table entries suppressed>
  type trak
   type tkhd
    version = 0 (0x00)
    flags = 1 (0x000001)
    creationTime = 3250080340 (0xc1b84a54)
    modificationTime = 3250080355 (0xc1b84a63)
    trackId = 2 (0x00000002)
    reserved1 = <4 bytes> 00 00 00 00
    duration = 42000 (0x0000a410)
    reserved2 = <12 bytes> 00 00 00 00 00 00 00 00 00 00 00 00
    volume = 0.000000
    reserved3 = <38 bytes>
    00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 40 00 00 00
    width = 192.000000
    height = 242.000000
   type tapt
    data = <60 bytes>
    00 00 00 14 63 6c 65 66 00 00 00 00 00 c0 00 00
    00 f2 00 00 00 00 00 14 70 72 6f 66 00 00 00 00
    00 c0 00 00 00 f2 00 00 00 00 00 14 65 6e 6f 66
    00 00 00 00 00 c0 00 00 00 f2 00 00
   type edts
    type elst
     version = 0 (0x00)
     flags = 0 (0x000000)
     entryCount = 1 (0x00000001)
     <table entries suppressed>
   type mdia
    type mdhd
     version = 0 (0x00)
     flags = 0 (0x000000)
     creationTime = 3250080355 (0xc1b84a63)
     modificationTime = 3250080355 (0xc1b84a63)
     timeScale = 600 (0x00000258)
     duration = 42000 (0x0000a410)
     language = 5575 (0x15c7)
     reserved = <2 bytes> 00 00
    type hdlr
     version = 0 (0x00)
     flags = 0 (0x000000)
     reserved1 = <4 bytes> 00 00 00 00
     handlerType = vide
     reserved2 = <12 bytes> 00 00 00 00 00 00 00 00 00 00 00 00
     name = 苹果视频媒体处理程序
    type minf
     type vmhd
      version = 0 (0x00)
      flags = 1 (0x000001)
      reserved = <8 bytes> 00 00 00 00 00 00 00 00
     type dinf
      type dref
       version = 0 (0x00)
       flags = 0 (0x000000)
       entryCount = 1 (0x00000001)
       type url
        version = 0 (0x00)
        flags = 1 (0x000001)
        location = (null)
     type stbl
      type stsd
       version = 0 (0x00)
       flags = 0 (0x000000)
       entryCount = 1 (0x00000001)
       type mp4v
        reserved1 = <6 bytes> 00 00 00 00 00 00
        dataReferenceIndex = 1 (0x0001)
        reserved2 = <16 bytes> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        width = 192 (0x00c0)
        height = 242 (0x00f2)
        reserved3 = <14 bytes> 00 48 00 00 00 48 00 00 00 00 00 00 00 01
        compressorName =
        reserved4 = <4 bytes> 00 18 ff ff
        type esds
         version = 0 (0x00)
         flags = 0 (0x000000)
         ESID = 0 (0x0000)
         streamDependenceFlag = 0 (0x0) <1 bits>
         URLFlag = 0 (0x0) <1 bits>
         OCRstreamFlag = 0 (0x0) <1 bits>
         streamPriority = 31 (0x1f) <5 bits>
         decConfigDescr
          objectTypeId = 32 (0x20)
          streamType = 4 (0x04) <6 bits>
          upStream = 0 (0x0) <1 bits>
          reserved = 1 (0x1) <1 bits>
          bufferSizeDB = 28125 (0x006ddd) <24 bits>
          maxBitrate = 75000 (0x000124f8)
          avgBitrate = 75000 (0x000124f8)
          decSpecificInfo
           info = <32 bytes>
           00 00 01 b0 f0 00 00 01 b5 0e e0 40 c0 cf 00 00
           01 00 00 00 01 20 00 84 40 fa 28 30 20 f2 a2 1f
          profileLevelIndicationIndexDescr
         slConfigDescr
          predefined = 2 (0x02)
         ipiPtr
         ipIds
         ipmpDescrPtr
         langDescr
         qosDescr
         regDescr
         extDescr
      type stts
       version = 0 (0x00)
       flags = 0 (0x000000)
       entryCount = 1 (0x00000001)
       <table entries suppressed>
      type stss
       version = 0 (0x00)
       flags = 0 (0x000000)
       entryCount = 35 (0x00000023)
       <table entries suppressed>
      type stsc
       version = 0 (0x00)
       flags = 0 (0x000000)
       entryCount = 140 (0x0000008c)
       <table entries suppressed>
      type stsz
       version = 0 (0x00)
       flags = 0 (0x000000)
       sampleSize = 0 (0x00000000)
       sampleCount = 1050 (0x0000041a)
       <table entries suppressed>
      type stco
       version = 0 (0x00)
       flags = 0 (0x000000)
       entryCount = 280 (0x00000118)
       <table entries suppressed>
type free
type free
type mdat

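The bitrate is the file size divided by the duration. Suppose the source file is 17 MiB and lasts 4 minutes 43 seconds. First convert MiB to bits: 17 (MiB) * 1024 * 1024 * 8 = 142606336 (b). Then convert the duration to seconds: 4 minutes 43 seconds = 283 seconds. Finally divide the size by the duration: (142606336 / 283) / 1000 = 503.909..., which rounds to about 504 (Kbps). Two points deserve attention. The first is the difference between MiB and MB: the former uses multiples of 1024, the latter multiples of 1000. The second is why the size is not converted from MiB to Kib and divided by 283 seconds directly: network transfer rates are normally expressed in multiples of 1000, so unless the file size is converted all the way from MiB to bits first, the K in Kib (1024-based) and the K in Kbps (1000-based) disagree and the computed bitrate is inaccurate.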
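The same calculation as a small Python sketch; the function name is illustrative.

```python
def average_bitrate_kbps(size_mib, duration_seconds):
    """Average bitrate in kbit/s (k = 1000) for a file size given in MiB (1024-based)."""
    bits = size_mib * 1024 * 1024 * 8      # MiB -> bytes -> bits
    return bits / duration_seconds / 1000  # bits per second -> kilobits per second

kbps = average_bitrate_kbps(17, 4 * 60 + 43)
print(round(kbps))
```

Converting all the way to bits before dividing by 1000 keeps the 1024-based file size and the 1000-based rate unit from being mixed.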

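When a movie or a track is played, the corresponding media handler must parse the data stream correctly and fetch the media data for a given time. For video media, the media handler may have to parse several atoms to find the size and location of the sample for a given time. The steps are as follows: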

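1. Determine the time in the media time coordinate system.

2. Check the time-to-sample atom to determine the number of the sample at that time.

3. Check the sample-to-chunk atom to find the chunk that contains the sample.

4. Read the offset of that chunk from the chunk offset atom.

5. Use the sample size atom to find the sample's offset within the chunk and the sample's size.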

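For example, to find the video data at second 1, the process is as follows:

1. The video data at second 1 is at time 600 in this movie's time coordinate system.

2. Check the time-to-sample atom: each sample has a duration of 40, so 600 / 40 = 15 samples precede this time and the sample wanted is number 15 + 1 = 16.

3. Check the sample-to-chunk atom: this sample is the first sample of chunk 5, and that chunk holds 4 samples.

4. Check the chunk offset atom: the offset of chunk 5 is 20472.

5. Because sample 16 is the first sample of chunk 5, the sample size atom does not need to be consulted; the chunk offset, 20472, is the offset of the sample. If it were the second sample of the chunk, the size of the preceding sample in the chunk would be read from the sample size atom and added to the chunk offset to get the actual position.

6. Once the position is known, the data can be read, decoded, and played.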
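The lookup above can be sketched in Python as follows. The tables are simplified and hypothetical: `stts` holds (sample count, duration) pairs, `stsc` holds only (first chunk, samples per chunk) pairs and omits the sample description index, and every value except the duration of 40 and the chunk 5 offset of 20472 is made up so that the result matches the worked example.

```python
def sample_for_time(stts, media_time):
    """1-based number of the sample covering media_time; stts is [(count, delta), ...]."""
    number, start = 1, 0
    for count, delta in stts:
        span = count * delta
        if media_time < start + span:
            return number + (media_time - start) // delta
        number, start = number + count, start + span
    raise ValueError("time is past the end of the track")

def chunk_for_sample(stsc, chunk_count, sample):
    """Return (chunk number, first sample in that chunk), both 1-based."""
    first_sample = 1
    for i, (first_chunk, per_chunk) in enumerate(stsc):
        # An entry applies until the next entry's first chunk, or to the last chunk.
        last_chunk = stsc[i + 1][0] - 1 if i + 1 < len(stsc) else chunk_count
        for chunk in range(first_chunk, last_chunk + 1):
            if sample < first_sample + per_chunk:
                return chunk, first_sample
            first_sample += per_chunk
    raise ValueError("sample is not covered by the table")

def locate(stts, stsc, stco, stsz, media_time):
    """Return (sample number, file offset, size) for the sample at media_time."""
    sample = sample_for_time(stts, media_time)
    chunk, first_sample = chunk_for_sample(stsc, len(stco), sample)
    # Skip the samples that precede ours inside the same chunk.
    offset = stco[chunk - 1] + sum(stsz[first_sample - 1:sample - 1])
    return sample, offset, stsz[sample - 1]

# Hypothetical tables chosen to match the worked example.
stts = [(1050, 40)]                        # every sample lasts 40 ticks
stsc = [(1, 4), (4, 3), (5, 4)]            # chunks 1-3: 4 samples, chunk 4: 3, chunk 5: 4
stco = [1000, 6000, 11000, 16000, 20472]   # only the chunk 5 offset comes from the text
stsz = [500] * 19                          # made-up sample sizes
result = locate(stts, stsc, stco, stsz, 600)
```

With these tables, time 600 resolves to sample 16 at offset 20472, and time 640 resolves to the second sample of the same chunk, one sample size further into the file.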

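Finding a key frame is very similar to finding a sample, except that the sync sample atom is needed to determine the sample number of the key frame:

1. Determine the sample number for the given time.
2. Check the sync sample atom to find the key frame that precedes this sample number.
3. Check the sample-to-chunk atom to find the chunk that contains that sample.
4. Read the offset of that chunk from the chunk offset atom.
5. Use the sample size atom to find the sample's offset within the chunk and the sample's size.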
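The key-frame step can be sketched with a binary search. This sketch assumes the key frame wanted is the nearest one at or before the requested sample, so that decoding can start there, and that the sync sample atom has already been read into a sorted list of 1-based sample numbers. The function name and the list values are hypothetical.

```python
from bisect import bisect_right

def preceding_sync_sample(sync_samples, sample):
    """Nearest sync sample at or before `sample`; sync_samples is sorted and 1-based.

    Pass None when the track has no sync sample table, in which case every
    sample is treated as a sync sample.
    """
    if sync_samples is None:
        return sample
    i = bisect_right(sync_samples, sample)
    if i == 0:
        raise ValueError("no sync sample at or before the requested sample")
    return sync_samples[i - 1]

# Hypothetical table with a key frame every 30 samples.
key = preceding_sync_sample([1, 31, 61, 91], 45)
```

The returned sample number then goes through the same chunk and offset lookup as any other sample.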

Seeking is implemented mainly with the child boxes of the sample table box, and the effect of the edit list must also be taken into account.

A track can be sought to a time T with the following steps. Note that T is expressed in the time scale defined in the movie header box:

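1. If the track has an edit list, iterate over the edits to find the one that contains T. Express the start time of that edit in the movie time scale to get EST, and subtract EST from T to get T', the duration into the edit; at this point T' is still in the movie time scale. Then convert T' to the time scale of the track's media to get T''. Adding T'' to the media start time of the edit gives T''', the time point in the time scale of the track's media.
2. The track's time-to-sample table gives the timing of every sample in the track; use it to find the sample N_T that corresponds to T'''.
3. Sample N_T may not be a random access point, in which case other tables are needed to find the nearest random access point. One is the sync sample table, which identifies the samples that are random access points. With this table, the sync sample nearest to the specified time can be found. If the table is absent, every sample is a synchronization point and the problem becomes easier. The other is the shadow sync box, which lets content authors define special samples that are not sent over the network during normal delivery but can serve as additional random access points. This improves random access without affecting the bitrate of normal delivery. The table maps non-random-access samples to random access samples. To find the nearest shadow sync sample before a given sample, consult this table. In short, using the sync sample and shadow sync tables, one can seek to the nearest access point sample N_ap before N_T.
4. Once the access point sample N_ap has been found, use the sample-to-chunk table to determine which chunk contains it.
5. Once the chunk is known, use the chunk offset table to find where the chunk starts.
6. Use the data in the sample-to-chunk table and the sample size table to find the position of N_ap within the chunk, and add the start of the chunk to get the position of N_ap in the file.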
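The edit list step can be sketched in Python as follows. It assumes each edit is a (segment duration in movie ticks, media start time in media ticks) pair, that a media start time of -1 marks an empty edit, and that the media rate is 1. The function name and the example edits are hypothetical; the time scales of 600 and 8000 are the movie and audio values from the dump.

```python
def movie_to_media_time(edits, t, movie_timescale, media_timescale):
    """Map movie time t to media time via edits [(segment_duration, media_time), ...].

    Returns None when t falls inside an empty edit (media_time == -1).
    """
    edit_start = 0                               # EST, in movie ticks
    for segment_duration, media_time in edits:
        if t < edit_start + segment_duration:
            if media_time == -1:
                return None                      # blank time: no media is presented
            t_prime = t - edit_start             # T', offset into the edit, movie ticks
            t_media = t_prime * media_timescale // movie_timescale   # T'', media ticks
            return media_time + t_media          # T''', time on the media timeline
        edit_start += segment_duration
    raise ValueError("t is past the end of the edit list")

# Hypothetical edit list: half a second of blank time, then the media from its start.
edits = [(300, -1), (41700, 0)]
media_time = movie_to_media_time(edits, 900, 600, 8000)
```

The result is the time to look up in the time-to-sample table; the remaining steps reuse the sample, sync sample, chunk, and offset lookups.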