Optimized Video Encoding with FFmpeg on AWS Graviton Processors
November 15, 2022
If you have not tried video encoding on Graviton lately, now is the time to give it another look. Recent FFmpeg improvements, contributed by AWS and others in the open source community, have increased the performance of fully loaded video workloads on Graviton processors.
Measured on Amazon Elastic Compute Cloud (Amazon EC2) C7g instances, offline video encoding showed a 63% performance boost for H.264 and 60% for H.265. Encoding video on C7g cost 29% less for H.264 and 18% less for H.265 compared to C6i, the latest x86-based Amazon EC2 instance (both using on-demand instance pricing). In our tests, this makes C7g the fastest, most cost-effective, and most energy-efficient compute optimized cloud instance for video encoding.
When the AWS Graviton2 instances were introduced, they provided 40% better price performance for many workloads, compared to similar x86 Amazon EC2 instances. Graviton3 features an additional 25% improved performance over Graviton2. Video processing and transcoding have been growing in importance, and Graviton is well suited for this workload. AWS engineers and the open source community have worked on video encoding tools, such as FFmpeg and the codec libraries, to further optimize them for Graviton. You can get these improvements today by building the development branch of FFmpeg from GitHub, or use FFmpeg version 5.2 when it is released.
One of the common use cases for video in the cloud is batch transcoding multiple videos concurrently on the same instance. This optimizes for the best throughput and price. Another popular use case is transcoding a single input stream to multiple output formats optimized for different viewing resolutions. Both of these cases require optimizing performance for concurrent processing. For the following benchmarks we scale down the incoming 4K stream and encode multiple target resolutions for each input. Each target resolution serves devices and networks with different capabilities at their native resolution: 1080p, 720p, 480p, 360p, and 160p.
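The one-input, many-outputs pattern described above can be sketched as a single FFmpeg invocation that splits the decoded video and scales each copy. This is a minimal illustration, not the command used for the benchmarks: the target heights come from the text, while the function name, output file names, and default preset are assumptions made for the example. The function only builds the argument list; it does not run FFmpeg.

```python
# Sketch: build (not run) an FFmpeg command that scales one input to several heights.
# The heights come from the article; file names and encoder settings are assumptions.
TARGET_HEIGHTS = [1080, 720, 480, 360, 160]


def build_ladder_command(input_path, codec="libx264", preset="medium",
                         heights=TARGET_HEIGHTS):
    """Return an argv list for one FFmpeg process with one output per height."""
    n = len(heights)
    # The split filter duplicates the decoded video so each copy is scaled separately.
    split_labels = "".join(f"[s{i}]" for i in range(n))
    chains = [f"[0:v]split={n}{split_labels}"]
    for i, h in enumerate(heights):
        # scale=-2:h keeps the aspect ratio and rounds the width to an even number.
        chains.append(f"[s{i}]scale=-2:{h}[v{i}]")
    cmd = ["ffmpeg", "-y", "-i", input_path, "-filter_complex", ";".join(chains)]
    for i, h in enumerate(heights):
        # Options placed before an output file name apply to that output only.
        cmd += ["-map", f"[v{i}]", "-c:v", codec, "-preset", preset,
                f"out_{h}p.mp4"]
    return cmd
```

To try it, pass the list to `subprocess.run(build_ladder_command("input_4k.mp4", codec="libx265"), check=True)` on a machine where FFmpeg is installed; `input_4k.mp4` is a placeholder name.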
We tested encoding the target videos into H.264 and H.265 using the x264 and x265 open source libraries. The H.264 or AVC (Advanced Video Coding) standard was first published in 2004 and enjoys broad compatibility. Devices including mobile phones, tablets, personal computers, smart TVs, and others generally have support for hardware accelerated H.264 decoding. The H.265 or HEVC (High Efficiency Video Coding) standard was first published in 2013 and has better compression at a given level of quality than H.264, but hardware accelerated decoding is not as widely deployed, and patent and licensing restrictions have prevented some companies from adopting it in their software. Most video use cases therefore need more than one format: H.265 to provide the best quality on devices that can play it, and H.264 for devices without H.265 decoding support.
Offline (batch) encoding
Speed: The following diagram shows the encoding speed in frames per second (FPS) for a sample workload. It was tested comparing FFmpeg 4.2 with the development branches of FFmpeg and x265 that include the latest optimizations.
Cost: The cost of encoding on the latest Graviton instance, C7g, is compared with the latest Amazon EC2 x86 based instances, C6i and C6a, showing better performance and a reduction of 18-29% in cost compared to C6i.
Lower is better. Normalized so that cost of x264, preset ultrafast on c6i is equal to one.
The results show the total cost to transcode 1 million input frames in parallel jobs to five output sizes. Each value is a mean of results for three different input files tested. 1 million frames is about 4 hours and 37 minutes at 60 frames per second.
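The arithmetic behind these figures can be sketched as follows. This is an illustration only: the function names are made up for the example, and any frame rates or hourly prices you pass in are your own inputs, not measured values or AWS prices.

```python
def frames_to_duration(frames, fps):
    """Convert a frame count at a given frame rate to (hours, minutes, seconds)."""
    seconds = frames / fps
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return int(hours), int(minutes), secs


def cost_per_million_frames(encode_fps, price_per_hour):
    """Cost to process 1 million source frames at a sustained source-frame rate."""
    hours = 1_000_000 / encode_fps / 3600
    return hours * price_per_hour


def normalize(costs, baseline_key):
    """Scale every cost so that the baseline entry equals 1.0."""
    base = costs[baseline_key]
    return {name: value / base for name, value in costs.items()}
```

For example, `frames_to_duration(1_000_000, 60)` gives 4 hours, 37 minutes, and roughly 47 seconds of video. `normalize` mirrors the way the cost chart is scaled to a single baseline configuration.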
Live stream encoding
For a live streaming use case, we can measure the maximum number of streams for which an instance can maintain full frame rate while transcoding to 3 output sizes. The results below are the number of streams the instance was able to sustain divided by the cost per hour, resulting in 15-35% lower overall cost on C7g vs. C6i. This makes the C7g instance the most cost effective AWS compute instance type for transcoding streaming video.
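The streams-per-dollar metric described above, and the cost reduction derived from it, can be written out as a short sketch. The function names are hypothetical, and the numbers in the usage note are placeholders chosen to make the arithmetic easy to follow, not benchmark results or AWS prices.

```python
def streams_per_dollar(max_streams, price_per_hour):
    """Streams sustained at full frame rate per dollar of hourly instance cost."""
    return max_streams / price_per_hour


def relative_cost_reduction(candidate_spd, baseline_spd):
    """Fractional drop in cost per stream-hour when moving from baseline to candidate."""
    # Cost per stream-hour is the reciprocal of streams per dollar, so the
    # ratio of costs is baseline_spd / candidate_spd.
    return 1.0 - baseline_spd / candidate_spd
```

With placeholder inputs of 8 streams on a baseline instance and 10 streams on a candidate instance, both at 2.00 per hour, the metric is 4 versus 5 streams per dollar, and `relative_cost_reduction(5.0, 4.0)` is 0.2, a 20% lower cost per stream.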
The aarch64 version of the scaling functions initially used the reference implementations written in C. After rewriting these C functions in aarch64 assembly, the performance improved significantly. Video scaling is a component of FFmpeg which consistently takes a high percentage of compute time; most encode jobs will include a scaling step, since it is necessary to create multiple outputs to support different device resolutions, both for offline and live streams. All of these changes have been contributed upstream into FFmpeg. See the table below for some of the changes AWS contributed since the 2019 release of FFmpeg version 4.2. In Figure 6, below, are the sample code changes and their effects on the encoding performance on Graviton.
| Function name | Speed up | Commit |
Through a series of optimizations to the horizontal and vertical scaling functions, as detailed in the pull requests listed here, AWS engineers were able to improve performance for a variety of input cases. With these and other optimizations applied to FFmpeg and to x265, Graviton instances perform better than comparable Amazon EC2 x86 based instances. Comparing C7g instances to C6i instances for the mainline branch of FFmpeg, C7g shows higher performance in every category.
To benchmark FFmpeg we used three different test files, each 10 seconds long. One was a high bitrate test with complex motion and lots of high frequency detail changes, another was a mostly still scene with a low bitrate, and a third was a moderate bitrate scene from the open source Tears of Steel film. We transcoded each clip into the five target sizes using multiple parallel jobs intended to simulate a service transcoding many sources in parallel. To increase the stability of the measurements, we also executed multiple iterations of these parallel jobs sequentially. The total time to execute these jobs was then used to calculate frames per second and cost per frame. Results are measured in frames per second and use the number of source frames transcoded, rather than the output frames, since the output consists of many different sizes. All input files were 4K in size and H.264 encoded. We tested with the following software versions: FFmpeg, 2022-08-23; x264, 2022-06-01; x265, 2022-09-12.
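The measurement loop described above (parallel jobs, repeated sequentially, timed as a whole, and counted in source frames) can be sketched as below. This is not the harness used for the published results. The function names and parameters are assumptions, and `run_job` stands in for whatever launches one transcode, for example a call to `subprocess.run` with an FFmpeg command.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def benchmark(run_job, frames_per_job, parallel_jobs, iterations):
    """Run `parallel_jobs` copies of run_job at once, `iterations` times in sequence.

    Returns source frames per second measured over the whole run.
    """
    start = time.perf_counter()
    for _ in range(iterations):
        # Threads are enough here when each job just waits on an external process.
        with ThreadPoolExecutor(max_workers=parallel_jobs) as pool:
            futures = [pool.submit(run_job) for _ in range(parallel_jobs)]
            for future in futures:
                future.result()  # re-raise any job failure
    elapsed = time.perf_counter() - start
    # Count source frames, not output frames, since outputs come in several sizes.
    total_source_frames = frames_per_job * parallel_jobs * iterations
    return total_source_frames / elapsed


def cost_per_frame(fps, price_per_hour):
    """Instance cost attributed to one source frame at the measured rate."""
    return price_per_hour / 3600 / fps
```

Counting source frames keeps the result comparable across runs that produce different numbers of outputs. The value of `frames_per_job` depends on the clip's frame rate, which you would need to supply for your own files.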
Graviton2 and Graviton3 processors are cost efficient and fast for running video transcoding. With the latest improvements to FFmpeg and codecs, the advantage has only grown. In order to achieve these results for yourself, the first step is to ensure you are running an optimized build from the latest code. Pre-built binaries are available at https://github.com/BtbN/FFmpeg-Builds/releases, from a third party that maintains builds using the latest source code. VT1 and GPU instances can also be a compelling option, especially for live video, but have less flexibility for getting the best quality at a given bit rate than software encoders. If a software encoder is right for your workload, Graviton is a great option.
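One way to check which build you are running is to look at the first line printed by `ffmpeg -version`. The sketch below parses that line. It assumes release builds report a dotted version (optionally prefixed with `n`, as in FFmpeg's git tags) and that other builds, such as development snapshots, do not. The function names are made up for this example, and the banner formats are an assumption that may not hold for every distribution.

```python
import platform
import re


def parse_release_version(banner):
    """Return (major, minor) for a release build, or None for other builds.

    `banner` is the first line printed by `ffmpeg -version`.
    """
    # Assumption: release banners look like "ffmpeg version 5.1.2 ..." or
    # "ffmpeg version n5.1.2-...". Anything else yields None.
    match = re.match(r"ffmpeg version n?(\d+)\.(\d+)", banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_arm64():
    """True when Python reports a 64-bit Arm machine, such as a Graviton instance."""
    return platform.machine().lower() in ("aarch64", "arm64")
```

A result of `None` means the banner is not a plain release number, so the build date or commit has to be checked by hand. A tuple can be compared directly against the release you are targeting, for example `parse_release_version(banner) >= (5, 2)`.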
There is still more work to do for FFmpeg, especially if you are using HDR content with 10 or 12 bit color depth. If you are, and even if you are not, be sure to keep up to date with FFmpeg and codec releases. If you find use cases where FFmpeg on Graviton does not meet expectations, please open an issue on the Graviton Technical Guide to let us know about it. We will continue to add more performance improvements to make Graviton the most cost effective and efficient general purpose processor for video encoding.