ffmpeg threads: why does slice threading affect realtime encoding with ffmpeg x264 so much?

This article examines a problem encountered when encoding an X11 screen capture in realtime with ffmpeg and libx264. Disabling the sliced-threads parameter reduced per-frame encode time from 12 ms to 2 ms, though CPU usage rose from 20% to 40%. It explains the difference between frame-level and slice-level threading and how each behaves in realtime encoding scenarios.


I'm using ffmpeg with libx264 to encode a 720p screen captured from X11 in realtime at 30 fps.

When I use the -tune zerolatency parameter, the average encode time per frame can be as high as 12 ms with the baseline profile.

After studying the ffmpeg x264 source code, I found that the key parameter leading to such a long encode time is sliced-threads, which is enabled by -tune zerolatency. After disabling it with -x264-params sliced-threads=0, the encode time drops to as low as 2 ms.

With sliced-threads disabled, CPU usage is about 40%, versus only 20% when it is enabled.

Can someone explain the details of sliced threads? Especially in realtime encoding (assume no frames are buffered for encoding; each frame is encoded as soon as it is captured).
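For reference, a command line reproducing the setup described above might look like the following. The display name (:0.0), the veryfast preset, and the MPEG-TS output are assumptions for illustration; only the resolution, frame rate, profile, tune, and sliced-threads override come from the question.

```sh
# Hypothetical reproduction of the question's setup: 720p X11 capture at
# 30 fps, baseline profile, zerolatency tuning, with slice-based threading
# explicitly disabled via -x264-params. Requires a running X server.
ffmpeg -f x11grab -video_size 1280x720 -framerate 30 -i :0.0 \
       -c:v libx264 -profile:v baseline -preset veryfast \
       -tune zerolatency -x264-params sliced-threads=0 \
       -f mpegts out.ts
```

Leaving out sliced-threads=0 (or passing sliced-threads=1) restores the behavior that -tune zerolatency selects by default.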

Solution

The documentation shows that frame-based threading has better throughput than slice-based. It also notes that the latter doesn't scale well due to parts of the encoder that are serial.

Speedup vs. number of encoding threads for the veryfast preset (non-realtime):

x264 --preset veryfast --tune psnr --crf 30

threads      speedup           psnr
           slice    frame    slice     frame
  1:       1.00x    1.00x    +0.000    +0.000
  2:       1.41x    2.29x    -0.005    -0.002
  3:       1.70x    3.65x    -0.035    +0.000
  4:       1.96x    3.97x    -0.029    -0.001
  5:       2.10x    3.98x    -0.047    -0.002
  6:       2.29x    3.97x    -0.060    +0.001
  7:       2.36x    3.98x    -0.057    -0.001
  8:       2.43x    3.98x    -0.067    -0.001
  9:                3.96x              +0.000
 10:                3.99x              +0.000
 11:                4.00x              +0.001
 12:                4.00x              +0.001

The main difference seems to be that frame threading adds frame latency, as it needs several different frames to work on at once, while with slice-based threading all threads work on the same frame. In realtime encoding, frame threading would have to wait for more frames to arrive to fill the pipeline, unlike offline encoding where the input is already available.

Normal threading, also known as frame-based threading, uses a clever staggered-frame system for parallelism. But it comes at a cost: as mentioned earlier, every extra thread requires one more frame of latency. Slice-based threading has no such issue: every frame is split into slices, each slice encoded on one core, and then the result slapped together to make the final frame. Its maximum efficiency is much lower for a variety of reasons, but it allows at least some parallelism without an increase in latency.
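A quick back-of-envelope sketch of that latency cost, assuming (per the description above) that each extra frame thread adds exactly one frame of latency; the function name is mine, not from x264:

```python
# Latency added by frame-based threading at a given frame rate, under the
# stated assumption: (threads - 1) extra frames of delay.
FPS = 30
FRAME_MS = 1000 / FPS  # ~33.3 ms per frame at 30 fps

def frame_threading_latency_ms(threads):
    """Extra latency from frame-based threading: (threads - 1) frames."""
    return (threads - 1) * FRAME_MS

print(round(frame_threading_latency_ms(1)))  # 0   -- single thread, no delay
print(round(frame_threading_latency_ms(4)))  # 100 -- 3 extra frames at 30 fps
```

So at 30 fps, even a modest 4 frame threads would add roughly 100 ms of latency, which is why zerolatency tuning avoids frame threading entirely.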

Sliceless threading: example with 2 threads.

Start encoding frame #0. When it's half done, start encoding frame #1. Thread #1 now only has access to the top half of its reference frame, since the rest hasn't been encoded yet. So it has to restrict the motion search range. But that's probably ok (unless you use lots of threads on a small frame), since it's pretty rare to have such long vertical motion vectors. After a little while, both threads have encoded one row of macroblocks, so thread #1 still gets to use motion range = +/- 1/2 frame height. Later yet, thread #0 finishes frame #0, and moves on to frame #2. Thread #0 now gets motion restrictions, and thread #1 is unrestricted.
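The staggered scheme above can be illustrated with a toy model (my own simplification, not x264 code): thread B starts frame 1 once thread A is halfway through frame 0, and thereafter both advance one macroblock row per step, so thread B's downward motion search is bounded by thread A's progress in the reference frame.

```python
# Toy model of "sliceless" frame threading with 2 threads: how many rows of
# the reference frame lie below thread B's current row and are already
# encoded, i.e. its available downward motion range.
FRAME_ROWS = 45  # 720p: 720 / 16 = 45 macroblock rows

def downward_motion_range(step):
    """Encoded reference rows ahead of thread B's current row in frame 1."""
    rows_a = min(FRAME_ROWS, FRAME_ROWS // 2 + step)  # thread A's head start
    rows_b = min(FRAME_ROWS, step)                    # thread B's progress
    return rows_a - rows_b

print(downward_motion_range(0))   # 22 -- about half the frame height
print(downward_motion_range(10))  # 22 -- gap stays constant at equal speed
```

As the quoted text says, the restriction is roughly +/- half the frame height while both threads run at the same speed, and it shifts to the other thread once thread A finishes and moves on to frame 2.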

Therefore it makes sense to enable sliced-threads with -tune zerolatency, as you need to send each frame as soon as possible rather than encode frames efficiently (performance- and quality-wise).

Conversely, using too many threads can hurt performance, as the overhead of maintaining them can exceed the potential gains.
