Chapter 1 - syncing audio and video

Be the player - a young JS developer writing a new MSE video player.

Before we move on to coding a transcoding example, let’s talk about timing, or how a video player knows the right time to play a frame.

In the last example, we saved some frames that can be seen here:

[Images: frame 0, frame 1, frame 2, frame 3, frame 4, frame 5]

When we’re designing a video player, we need to play each frame at a given pace; otherwise the video would be unpleasant to watch, because it would play either too fast or too slow.

Therefore we need to introduce some logic to play each frame smoothly. To that end, each frame has a presentation timestamp (PTS), an increasing number expressed in units of a timebase. The timebase is a rational number whose denominator, known as the timescale, is divisible by the frame rate (fps).

This is easier to understand with some examples, so let’s simulate a few scenarios.

For fps=60/1 and timebase=1/60000, each PTS increases by timescale / fps = 1000, so the real time (PTS_TIME) of each frame could be (supposing it started at 0):

  • frame=0, PTS = 0, PTS_TIME = 0
  • frame=1, PTS = 1000, PTS_TIME = PTS * timebase ≈ 0.0167
  • frame=2, PTS = 2000, PTS_TIME = PTS * timebase ≈ 0.0333

For almost the same scenario, but with a timebase equal to 1/60, each PTS increases by timescale / fps = 1:

  • frame=0, PTS = 0, PTS_TIME = 0
  • frame=1, PTS = 1, PTS_TIME = PTS * timebase ≈ 0.0167
  • frame=2, PTS = 2, PTS_TIME = PTS * timebase ≈ 0.0333
  • frame=3, PTS = 3, PTS_TIME = PTS * timebase = 0.050

For fps=25/1 and timebase=1/75, each PTS increases by timescale / fps = 3, and the PTS time could be:

  • frame=0, PTS = 0, PTS_TIME = 0
  • frame=1, PTS = 3, PTS_TIME = PTS * timebase = 0.04
  • frame=2, PTS = 6, PTS_TIME = PTS * timebase = 0.08
  • frame=3, PTS = 9, PTS_TIME = PTS * timebase = 0.12
  • frame=24, PTS = 72, PTS_TIME = PTS * timebase = 0.96
  • frame=4064, PTS = 12192, PTS_TIME = PTS * timebase = 162.56
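The arithmetic in the three scenarios above can be sketched in a few lines of plain C. This is only an illustration: the `rational` struct and the `pts_step` and `pts_time` helpers are hypothetical names made up for this example. In libav itself the equivalent type is `AVRational`, and `av_q2d` converts one to a `double`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a rational number such as a frame rate or a timebase. */
typedef struct {
    int num;
    int den;
} rational;

/* Number of timebase ticks between two consecutive frames: (1 / fps) / timebase.
 * Integer division, so this assumes the timescale is divisible by the frame rate. */
int64_t pts_step(rational fps, rational timebase)
{
    assert(fps.num > 0 && fps.den > 0);
    assert(timebase.num > 0 && timebase.den > 0);
    return ((int64_t)fps.den * timebase.den) / ((int64_t)fps.num * timebase.num);
}

/* PTS_TIME in seconds: PTS * timebase. */
double pts_time(int64_t pts, rational timebase)
{
    assert(timebase.den != 0);
    return (double)pts * timebase.num / timebase.den;
}
```

With `rational fps = {25, 1}` and `rational tb = {1, 75}`, `pts_step(fps, tb)` gives 3 and `pts_time(72, tb)` gives 0.96, matching the last scenario.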

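Now, with the pts_time, we can find a way to render each frame in sync with the audio pts_time or with a system clock. FFmpeg’s libav provides this information through its API, for example in AVStream->r_frame_rate and AVStream->time_base, both of which appear in the log below.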
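To make the system-clock idea a bit more concrete, here is a minimal sketch in plain C. It does not call libav, and `frame_delay` and its parameters are hypothetical names chosen for this example. It assumes the player keeps a reference clock, in seconds since playback started, which could be the system clock or the audio pts_time.

```c
#include <assert.h>

/* Seconds to wait before showing a frame.
 *   pts_time:   presentation time of the frame, in seconds
 *   clock_time: seconds elapsed on the reference clock since playback started
 * A frame that is already late gets a delay of 0, meaning "show it right away". */
double frame_delay(double pts_time, double clock_time)
{
    assert(clock_time >= 0.0); /* the reference clock is assumed to start at 0 */
    double delay = pts_time - clock_time;
    return delay > 0.0 ? delay : 0.0;
}
```

For instance, a frame with pts_time 0.08 arriving when the clock reads 0.05 should wait roughly 0.03 seconds, while a frame with pts_time 0.04 arriving at 0.10 is already late and gets no delay. A real player would likely also decide what to do with very late frames, which this sketch leaves out.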

Just out of curiosity, the frames we saved were sent in DTS order (frames: 1,6,4,2,3,5) but played in PTS order (frames: 1,2,3,4,5,6). Also, notice how cheap B-Frames are in comparison to P or I-Frames.

  LOG: AVStream->r_frame_rate 60/1
  LOG: AVStream->time_base 1/60000
  ...
  LOG: Frame 1 (type=I, size=153797 bytes) pts 6000 key_frame 1 [DTS 0]
  LOG: Frame 2 (type=B, size=8117 bytes) pts 7000 key_frame 0 [DTS 3]
  LOG: Frame 3 (type=B, size=8226 bytes) pts 8000 key_frame 0 [DTS 4]
  LOG: Frame 4 (type=B, size=17699 bytes) pts 9000 key_frame 0 [DTS 2]
  LOG: Frame 5 (type=B, size=6253 bytes) pts 10000 key_frame 0 [DTS 5]
  LOG: Frame 6 (type=P, size=34992 bytes) pts 11000 key_frame 0 [DTS 1]
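The difference between the two orders can be reproduced with a small C sketch that sorts the six logged frames first by DTS and then by PTS. The `frame_info` struct, the `logged_frames` array, and the two sort helpers are hypothetical names for this illustration only; the values are copied from the log above, and no libav call is involved.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record holding the values printed in the log. */
typedef struct {
    int     number; /* frame number as logged */
    char    type;   /* 'I', 'P' or 'B' */
    int64_t pts;
    int64_t dts;
} frame_info;

/* The six frames from the log, in the order they were printed. */
const frame_info logged_frames[6] = {
    {1, 'I',  6000, 0},
    {2, 'B',  7000, 3},
    {3, 'B',  8000, 4},
    {4, 'B',  9000, 2},
    {5, 'B', 10000, 5},
    {6, 'P', 11000, 1},
};

static int compare_i64(int64_t x, int64_t y)
{
    return (x > y) - (x < y);
}

static int by_dts(const void *a, const void *b)
{
    const frame_info *fa = a, *fb = b;
    return compare_i64(fa->dts, fb->dts);
}

static int by_pts(const void *a, const void *b)
{
    const frame_info *fa = a, *fb = b;
    return compare_i64(fa->pts, fb->pts);
}

/* Order in which the frames are sent and decoded. */
void sort_by_dts(frame_info *frames, size_t count)
{
    assert(frames != NULL || count == 0);
    qsort(frames, count, sizeof *frames, by_dts);
}

/* Order in which the frames are shown. */
void sort_by_pts(frame_info *frames, size_t count)
{
    assert(frames != NULL || count == 0);
    qsort(frames, count, sizeof *frames, by_pts);
}
```

Sorting a copy of `logged_frames` with `sort_by_dts` yields frames 1,6,4,2,3,5, and sorting it with `sort_by_pts` yields 1,2,3,4,5,6. The I-frame and the P-frame are decoded first, presumably because the B-frames between them depend on both.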