I once asked on #ffmpeg@libera if the GPU could be used to encode h264, and apparently yes, but it's not really worth it compared to CPU.
I don't know much about video compression. Does that mean that a codec like h264 is not parallelizable?
> the GPU could be used to encode h264, and apparently yes, but it's not really worth it compared to CPU.
It depends on what you're going for. If you're trying to do the absolute highest fidelity for archiving a Blu-ray disc, AMD Epyc reigns supreme. That's because you need a lot of flexibility to really dial in the quality settings. Pirates over at PassThePopcorn obsess over minute differences in quality that I absolutely cannot notice with my eyes, and I'm glad they do! Their encodes look gorgeous. This quality can't be achieved with the silicon of hardware-accelerated encoders, and, due to driver limitations (not silicon limitations), it also can't be achieved by CUDA cores / execution engines / etc. on GPUs.
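For a rough sense of what "dialing in quality" on the CPU looks like (filenames and exact settings here are just illustrative placeholders, not a tuned recipe), a quality-first x265 encode is something like:

    # Slow, CPU-only, quality-first encode: slower preset + low CRF = better
    # fidelity per bit, at the cost of a lot of CPU time across all cores
    ffmpeg -i bluray_remux.mkv -c:v libx265 -preset veryslow -crf 16 -c:a copy archive.mkv

Real archival encodes layer a lot of extra tuning on top of that, which is exactly the flexibility the fixed-function encoders don't give you.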
But if you're okay with a small amount of quality loss, the optimal move for the highest # of simultaneous encodes or the fastest FPS is to skip CPU and GPU "general compute" entirely: hardware-accelerated encoding can get you 8-30 simultaneous 1080p encodes on a very cheap Intel iGPU using QSV/VAAPI. This means using special sections of silicon whose sole purpose is H264/H265/etc encoding, or cropping / scaling / color adjustments. The "hardware accelerators" I'm talking about are generally present in the CPU/iGPU/GPU/SOC, but are not general purpose - they can't be used for CUDA/ROCm/etc. Either they're being used for your video pipeline specifically, or they're not being used at all.
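As a minimal sketch of handing the encode to that fixed-function block via VAAPI (device path, input file, and bitrate are placeholders; adjust for your setup):

    # CPU decodes, the iGPU's dedicated video block does the H264 encode
    ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mp4 \
           -vf 'format=nv12,hwupload' -c:v h264_vaapi -b:v 5M output.mp4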
I'm doing this now for my startup, and we've tuned it so it uses 0% of the CPU and 0% of the Render/3D engine of the iGPU (the most "general purpose" section of the GPU), leaving those completely free for ML models, and uses only the Video Engine and Video Enhance engines.
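A hedged sketch of what that kind of pipeline looks like with ffmpeg and VAAPI (paths, resolution, and bitrate are placeholders; the exact filters depend on your hardware and drivers):

    # Decode, scale, and encode all happen on the iGPU's video engines;
    # decoded frames stay in GPU memory, and scaling typically runs on the
    # Video Enhance (VPP) block rather than the 3D engine
    ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
           -hwaccel_output_format vaapi -i input.mp4 \
           -vf 'scale_vaapi=w=1280:h=720' -c:v h264_vaapi -b:v 4M output.mp4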
For something like Frigate NVR, that's perfect. You can support a large # of cameras on cheap hardware, and your encoding/streaming tasks don't load any of the silicon used for YOLO, beyond contributing to the overall thermal load.
Video encoding is a very deep topic. You need benchmarks, and you need to understand not just "CPU vs GPU" but which parts of the GPU you're using. There's an incredible amount of optimization you can do for your specific task if you take the time to truly understand your video pipeline at the systems level.
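On Intel hardware, one way to see that breakdown (assuming an Intel iGPU and the intel-gpu-tools package) is:

    # Shows per-engine utilization (Render/3D, Blitter, Video, VideoEnhance),
    # so you can check that the video engines do the work while the 3D engine idles
    sudo intel_gpu_top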
Common video codecs are often hardware accelerated, and that acceleration frequently sits on the CPU/SoC side, since there are a lot of systems without dedicated GPUs that still play video, like notebooks and smartphones. So in the end it's less about whether it's parallelizable and more about whether a GPU implementation beats dedicated hardware, and the answer to that should almost always be no.
P.S.: In video decoding, speed is only relevant up to a certain point, namely: "Can I decode the next frame(s) in time to show it/them without stuttering?" Once that has been achieved, other factors such as power draw become more important.
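If you want to put a rough number on that (input file is a placeholder; -benchmark just reports the time used), something like this compares hardware-assisted vs software decode:

    # Decode as fast as possible, discard the frames, compare speed and CPU time
    ffmpeg -benchmark -hwaccel vaapi -i input.mp4 -f null -
    ffmpeg -benchmark -i input.mp4 -f null -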
I think it's mostly because most CPUs you'd pair with a GPU already have dedicated h264 encoder blocks of their own, which are way more efficient both energy-wise and speed-wise.
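If you're curious what your own build and hardware actually expose, ffmpeg can list it (output obviously varies per machine):

    # Hardware acceleration methods compiled into this ffmpeg build
    ffmpeg -hwaccels
    # H264 encoders available, software and hardware (QSV, VAAPI, NVENC, ...)
    ffmpeg -encoders | grep -i h264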
This is literally what the article is about. It answers your questions.
A GPU's job is to take inputs at some resolution, transform them, and then output them at that resolution. H.264/H.265 (and really, any playback format) needs a fundamentally different workflow: it needs to take as many frames as your framerate is set to, store the first frame as a full frame, and then store N-1 diffs describing only which pixels changed between successive frames. That's something GPUs are terrible at. You could certainly use the GPU to calculate the full frame diff, but then you still need to send it back to the CPU, or to dedicated encoding hardware, to turn that into an actual concise diff description. At that point, you might as well make the CPU or hardware encoder do the whole job; you're not saving any appreciable time by sending the data over to the GPU first, only to get it back in a form where you still have to go over every pixel afterwards.
One of the choke points of all modern video codecs that aim for high compression ratios is arithmetic entropy coding: CABAC for h264 and h265, 16-symbol arithmetic coding for AV1. There is no way to parallelize that AFAIK: decoding the next symbol depends on the previous one. All you can do is a bit of speculative decoding, but that doesn't go very deep. Even when implemented in hardware, arithmetic decoding is hard to parallelize.
This is especially a choke point when you use these codecs at high quality settings. The prediction and filtering steps later in the decoding pipeline are relatively easy to parallelize.
High-throughput codecs like ProRes don't use arithmetic coding but a much simpler, table-based coding scheme.
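You can get a feel for that trade-off with x264 itself (filenames are placeholders): turning CABAC off falls back to the simpler table-driven CAVLC, which typically decodes with less CPU but produces noticeably larger files at the same quality settings.

    # Same source, CABAC on (default) vs off; compare file sizes and decode CPU time
    ffmpeg -i input.mp4 -c:v libx264 -crf 20 -x264-params cabac=1 cabac.mp4
    ffmpeg -i input.mp4 -c:v libx264 -crf 20 -x264-params cabac=0 cavlc.mp4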