Detailed Comparative Analysis of VP8 and H.264

Yousef O. Sharrab and Nabil J. Sarhan
Electrical and Computer Engineering Department & Wayne State Media Research Lab
Wayne State University, Detroit, MI 48202
Email: {yousef.sharrab, nabil}@wayne.edu

Abstract: VP8 has recently been offered by Google as an open video compression format in an attempt to compete with the widely used H.264 video compression standard. This paper describes the major differences between VP8 and H.264 and provides detailed comparative evaluations through extensive experiments. We use 29 raw video sequences, offering a wide spectrum of resolutions and content characteristics, with the resolution ranging from 176x144 (QCIF) to 3840x2160 (2160p). To ensure a fair study, we use 3 coding presets in H.264, each with three types of tuning, and 7 presets in VP8. The presets cover a variety of achieved quality or complexity levels. The performance metrics include accuracy of bitrate handling, encoding speed, decoding speed, and perceptual video quality.

Keywords: Comparative Analysis, Decoding Speed, Encoding Speed, H.264, Perceptual Video Quality, Video Codecs, Video Compression, VP8.

I. INTRODUCTION

The H.264 video compression standard currently enjoys broad support from video websites, applications, and hardware platforms and devices, including TVs, smart phones, and digital cameras. In addition, it has numerous popular implementations, including JM, X264, and FFmpeg. The widespread use of H.264 has recently faced a major challenge, with Google releasing VP8 as a royalty-free open video compression format in an attempt to compete with H.264 and gradually replace its usage on YouTube [1]. The effectiveness of VP8 compared with H.264 will be one of the deciding factors in whether VP8 becomes the video compression format of choice in the future. Given the broad reach of video compression, it is imperative to rigorously evaluate the effectiveness of VP8 and compare it with H.264.

Unfortunately, little work has compared the effectiveness of H.264 and VP8 [2], [3], and that work is highly limited. A commercial report [2] has recently compared the performance of VP8 and H.264. Since most of the video sequences it used were previously compressed using other codecs, they cannot be used reliably to draw any conclusions because of the inevitable bias introduced in re-compression tests. Study [3] has compared VP8 and H.264 in terms of video quality only, using only one metric, only three CIF sequences (with 352x288 resolution), and basic encoding settings.

This work was supported in part by U.S. NSF grant CNS-0834537.

This paper describes the major differences between VP8 and H.264 in features and operation. It also provides detailed comparative evaluations through more than 1,300 experiments, designed to reflect real-life situations by carefully choosing the encoding parameters, the video test sequences, and the proper metrics. We use 29 raw video sequences, offering a wide spectrum of resolutions and content characteristics, with the resolution ranging from 176x144 (QCIF) to 3840x2160 (2160p) and the content varying greatly in the level of detail and motion speed. To ensure a fair study, we use 3 coding presets in H.264, each with three types of metric tuning, and 7 presets in VP8. These presets cover a variety of achieved quality or complexity levels. The bitrate for each sequence is varied over a wide range that is suitable for that sequence. For H.264, we use X264, which is the best implementation according to the results in [2]. The performance metrics include perceptual video quality, encoding speed, decoding speed, and accuracy of bitrate handling. For perceptual video quality, we use two metrics: Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity index (SSIM) [4]. The experiments with metric tuning in H.264 are "No Tuning", "SSIM Tuning", and "PSNR Tuning".
Although we discuss all the results, the conclusions are based on "No Tuning" because the other two options may be unfair to VP8.

The rest of the paper is organized as follows. Section II provides a preliminary analysis of H.264 and VP8. Subsequently, Section III discusses the performance evaluation methodology. Finally, Section IV presents and analyzes the main results.

II. PRELIMINARY ANALYSIS

Video data contains spatial and temporal redundancy. Therefore, similarities can be encoded by considering only the differences (residuals) within a frame or between frames. The first frame of a sequence or a random access point is typically intra-coded. Each block of pixels in an intra-frame is predicted using previously encoded neighboring blocks. For all remaining frames of a sequence or between random access points, inter-coding is usually used, employing block motion compensation to predict blocks from other previously decoded frames. The residuals of the intra- and inter-prediction are then transformed to the frequency domain using the Integer Discrete Cosine Transform (Integer DCT). Subsequently, the transform coefficients are quantized, thereby reducing the overall precision of the coefficients and possibly eliminating high-frequency coefficients. The quantized transform coefficients are entropy coded and transmitted together with any motion vectors.

In the YUV colorspace, each pixel is represented by three components: Y, U, and V. The Y component determines the brightness, which is referred to as Luminance or Luma, whereas the U and V components determine the color itself, referred to as Chrominance or Chroma. Since the human eye is more sensitive to luminance, the Chroma components can be sampled at a much lower rate.

An H.264 encoder can choose from many different intra and inter modes when coding a macroblock. Rate-Distortion Optimization (RDO) mode selection is an algorithm for choosing the best coding mode for each macroblock, based on the bitrate and distortion cost. To select the best encoding mode for a macroblock, the RDO algorithm examines all possible combinations. The rest of the section compares the characteristics of H.264 and VP8.

TABLE I
COMPARING H.264 AND VP8 CHARACTERISTICS

| Encoding Step | H.264 | VP8 |
|---|---|---|
| Intra-Prediction | Uses 9 prediction modes for each 4 × 4 Luma block and for each 8 × 8 Luma block (high profiles only). Uses 4 prediction modes for each 16 × 16 Luma block and each 8 × 8 Chroma block. | Uses 4 common intra-prediction modes, shared by 4 × 4 Luma, 16 × 16 Luma, and 8 × 8 Chroma blocks. |
| Inter-Prediction | Supports up to 16 reference frames. | Supports up to three reference frames. |
| | Partition types are 16 × 16, 16 × 8, and 8 × 16, and each 8 × 8 can be 8 × 8, 8 × 4, 4 × 8, or 4 × 4. | Partition types are 16 × 16, 16 × 8, 8 × 16, 8 × 8, and 4 × 4. |
| | Chroma motion vectors are calculated directly from Luma motion vectors. | Chroma motion vectors are calculated from Luma vectors by averaging the motion vectors within a macroblock. |
| | Interpolation filter uses qpel, six-tap Luma, and bilinear Chroma. | Interpolation filter uses qpel, six-tap Luma, and mixed four/six-tap Chroma. |
| | Supports B-frames. | Does not support B-frames. |
| Transform | Uses integer DCT. | Uses DCT. |
| Quantization | Has built-in adaptive quantization. | Adaptive quantization is not a core feature. |
| Entropy Coding | Uses an adaptive arithmetic coder. | Uses a non-adaptive arithmetic coder. |
| Loop Filter | Its complexity is between VP8's fast and normal modes. | Has two modes: fast and normal. |

Intra-Prediction

Intra-prediction is used to predict the content of a block from its surroundings without referring to other frames. With H.264, 9 modes are possible for intra-prediction of each 4 × 4 Luma block and of each 8 × 8 Luma block: DC mode and 8 directional modes.
In DC mode, each block is predicted by the mean of the pixel samples in the upper row and left column. The directional modes include Vertical, Horizontal, and 6 other angular modes, whereby the samples are extrapolated vertically, horizontally, or at a specified angle, respectively. (The 8 × 8 Luma block size is only available in the high profile.) In addition, 4 prediction modes are available for each 16 × 16 Luma block (used for regions of the frame with less spatial detail) and for each 8 × 8 Chroma block. These modes are DC, Horizontal, Vertical, and Planar. Planar fits a linear plane function to the upper and left-hand samples.

VP8 uses four intra-prediction modes for each of its three block types: 4 × 4 Luma, 16 × 16 Luma, and 8 × 8 Chroma. These modes are DC (DC_PRED), Horizontal (H_PRED), Vertical (V_PRED), and TrueMotion (TM_PRED) prediction. TM_PRED is similar to Planar in H.264.

In intra-mode selection in H.264, the number of mode combinations for one 16 × 16 pixel macroblock (MB), without considering the high profile, is $N_8 \times (16 \times N_4 + N_{16})$, where $N_8$, $N_4$, and $N_{16}$ represent the number of modes of an 8 × 8 Chroma block, a 4 × 4 Luma block, and a 16 × 16 Luma block, respectively. To select the best mode for one MB in intra-prediction, the encoder performs 4 × (16 × 9 + 4) = 592 RDO calculations [5]. For VP8, the same expression yields 4 × (16 × 4 + 4) = 272 RDO calculations. Therefore, without optimization, the complexity of VP8 intra-prediction is less than half that of H.264.

Inter-Prediction

Inter-prediction is the process of estimating the content of a block by referring to previously encoded frames. The main components of inter-prediction are reference frames and motion vectors.
The reference frame is a previously encoded frame used to predict blocks in another frame, whereas the motion vectors indicate the displacement between the position of the current block in a frame and the corresponding prediction block in the reference frame [6]. The sum of absolute differences (SAD) is a distortion measure commonly used to find, among the blocks of previously encoded reference frames, the best match for the current block in the frame to be encoded. $SAD(V_x, V_y)$ is defined as the SAD for block A of size $P \times Q$ located at $(x, y)$ inside the current frame compared to block B located at a displacement of $(V_x, V_y)$ relative to block A in the reference frame. It is found by summing the absolute differences between each pixel in block A and the corresponding pixel in block B. In the Full Search (FS) algorithm, if a maximum displacement of $S$ pixels in a frame is allowed, $(2S + 1)^2$ locations have to be searched to find the best match for the current block in the reference frame.

VP8 supports up to 3 reference frames: the previous frame, the alternate reference frame, and the golden frame. In contrast, H.264 supports up to 16 reference frames. Study [7] demonstrates by experiments that using more than three reference frames is exceedingly unlikely to provide a significant benefit in perceptual quality, while significantly increasing the implementation complexity and power consumption.

VP8 and H.264 have similar partitioning structures. VP8 uses the following pixel partition types: 16 × 16, 16 × 8, 8 × 16, 8 × 8, and 4 × 4. By contrast, H.264 uses 16 × 16, 16 × 8, and 8 × 16, and each 8 × 8 can be further divided into 8 × 8, 8 × 4, 4 × 8, or 4 × 4. Dropping the 8 × 4 and 4 × 8 partitions in VP8 is unlikely to be of significant consequence [7], [6]. For motion vectors, both VP8 and H.264 support variable-size partitions.
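
The SAD measure and the Full Search procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the motion search of either encoder: the function names `sad` and `full_search`, the row-major list-of-lists frame layout, and the choice to skip candidates that fall outside the reference frame are all assumptions made for the example.

```python
def sad(cur, ref, x, y, vx, vy, p, q):
    """SAD between the P x Q block at (x, y) in `cur` and the block
    displaced by (vx, vy) in `ref`. Frames are indexed as frame[row][col]."""
    total = 0
    for j in range(q):          # rows of the block
        for i in range(p):      # columns of the block
            total += abs(cur[y + j][x + i] - ref[y + vy + j][x + vx + i])
    return total


def full_search(cur, ref, x, y, p, q, s):
    """Exhaustively test every displacement in [-s, s] x [-s, s].

    Returns ((best_sad, vx, vy), tested), where `tested` is the number of
    candidate positions actually evaluated. When every candidate lies
    inside the frame, tested == (2 * s + 1) ** 2.
    """
    height, width = len(ref), len(ref[0])
    best = None
    tested = 0
    for vy in range(-s, s + 1):
        for vx in range(-s, s + 1):
            # Assumption: candidates partly outside the frame are skipped.
            inside = (0 <= x + vx and x + vx + p <= width and
                      0 <= y + vy and y + vy + q <= height)
            if not inside:
                continue
            tested += 1
            cost = sad(cur, ref, x, y, vx, vy, p, q)
            if best is None or cost < best[0]:
                best = (cost, vx, vy)
    return best, tested
```

The quadratic growth of the candidate count with S is the reason practical encoders usually replace the exhaustive scan with faster search patterns.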
Both VP8 and H.264 use quarter-pixel precision (Qpel) motion vectors with a six-tap interpolation filter for Luma pixels. Qpel uses interpolation to track motion in a video frame more accurately by increasing the precision to 1/4 of a pixel. H.264 predicts the Chroma motion vectors directly from the Luma vectors, whereas VP8 predicts them from Luma vectors by averaging the motion vectors within a macroblock. A further distinction is that H.264 uses the simpler bilinear prediction for Chroma, whereas VP8 uses a mixed four/six-tap interpolation.

TABLE III
CHARACTERISTICS OF STANDARD SEQUENCE CATEGORIES

| Format | Resolution | Category | Bitrate (Kbps) |
|---|---|---|---|
| QCIF | 176x144 | Video Conferences | 20-800 |
| CIF | 352x288 | Video Conferences | 20-800 |
| 4CIF | 704x576 | SDTV | 100-2000 |
| 720p | 1280x720 | HDTV | 500-3000 |
| 1080p | 1920x1080 | HDTV | 500-3000 |
| 2160p | 3840x2160 | Quad HDTV | 2000-8000 |

H.264 supports three types of frames: I-frames (intra-coded frames), P-frames (predictive inter-frames), and B-frames (bi-predictive inter-frames). I-frames are predicted using intra-prediction only, P-frames are predicted from previous frames, and B-frames are predicted from both previous and future frames. B-frames can achieve up to 20% compression benefit at the expense of implementation complexity [7], [6]. VP8 does not support B-frames, but tries to compensate for that by the intelligent use of both the golden and the alternate reference frames [7].

Transformation and Quantization

H.264 uses integer DCT. This transform results in lower compression rates, but greatly simplifies the transform process, which can be implemented using only additions, subtractions, and shifts. VP8 uses an accurate version of DCT, which requires many multiplication operations. For quantization, H.264 uses a built-in adaptive quantization algorithm that performs macroblock-level quantization to improve video quality. In contrast, VP8 does not use macroblock-level quantization [6].
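
To make the claim about additions, subtractions, and shifts concrete, the following Python sketch implements a 4 × 4 forward integer transform with a butterfly structure and checks it against the equivalent matrix product. The core matrix is the one commonly given for the H.264 4 × 4 transform; it is not stated in this paper, so treat it, along with the names `CORE_MATRIX`, `forward_1d`, `forward_4x4`, and `forward_4x4_matrix`, as assumptions for illustration. The scaling that normally accompanies the transform is folded into quantization and is omitted here.

```python
# Assumed 4x4 forward core matrix (not given in the paper itself).
CORE_MATRIX = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def forward_1d(x0, x1, x2, x3):
    """1-D butterfly: only additions, subtractions, and shifts by one bit."""
    s0, s1 = x0 + x3, x1 + x2
    d0, d1 = x0 - x3, x1 - x2
    return (s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1))


def forward_4x4(block):
    """2-D transform: apply the 1-D butterfly to rows, then to columns."""
    rows = [forward_1d(*row) for row in block]
    cols = [forward_1d(*col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]   # transpose back to row-major


def forward_4x4_matrix(block):
    """Reference version: C * X * C^T with explicit multiplications."""
    c = CORE_MATRIX
    cx = [[sum(c[i][k] * block[k][j] for k in range(4)) for j in range(4)]
          for i in range(4)]
    return [[sum(cx[i][k] * c[j][k] for k in range(4)) for j in range(4)]
            for i in range(4)]
```

Both functions return the same coefficients for any integer block; the butterfly version simply avoids the multiplications that a floating-point DCT would need.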

Entropy Coding

Entropy coding is the lossless compression of all the information produced by the other encoding processes, such as DCT coefficients, prediction modes, and motion vectors. The arithmetic coders are similar in VP8 and H.264, but H.264 uses an adaptive arithmetic coder, whereas VP8 uses a non-adaptive arithmetic coder [6].

Loop Filter

The loop filter is used to smooth the block edges. VP8 and H.264 use similar loop filters, but the filter in VP8 can be configured in a fast mode or a normal mode. The complexity of the filter in H.264 is between the fast and normal modes of VP8.

III. PERFORMANCE EVALUATION METHODOLOGY

For VP8, we use vpx codec version v0.9.5 for both encoding (using vpxenc) and decoding (using vpxdec). v0.9.0 (ivfenc for encoding and ivfdec for decoding). For H.264, we use X264 (r1688) for encoding and FFmpeg (SUN-r24758) for decoding. We divide the test sequences into four groups based on the resolution, as shown in Table III. A description of each of the used sequences is shown in Table II.

TABLE V
VP8 PRESET PARAMETERS

| Preset | Parameters |
|---|---|
| Best | -p 2 --i420 -w 176 -h 144 --target-bitrate=18 --fpf=tmp.fpf --threads=4 --best --end-usage=0 --auto-alt-ref=1 -v --minsection-pct=5 --maxsection-pct=800 --lag-in-frames=16 --kf-min-dist=0 --kf-max-dist=999999 --token-parts=2 --static-thresh=0 --min-q=0 --max-q=63 |
| Good0 | Same as Best, but --good --cpu-used=0 instead of --best |
| Good1 | Same as Good0, but --cpu-used=1 |
| Good2 | Same as Good0, but --cpu-used=2 |
| Good3 | Same as Good0, but --cpu-used=3 |
| Good4 | Same as Good0, but --cpu-used=4 |
| Good5 | Same as Good0, but --cpu-used=5 |

The performance metrics include perceptual video quality, encoding speed, decoding speed, and accuracy of bitrate handling. Bitrate handling captures the accuracy of the encoder in achieving the desired bitrate. For perceptual video quality, we use two metrics: Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity index (SSIM) [4].
The PSNR between an original frame $A$ and the corresponding encoded frame $B$ is given by

$$PSNR(\mathrm{dB}) = 10 \times \log_{10} \frac{MAX^2}{MSE}, \quad (1)$$

where $MSE$ and $MAX$ represent the Mean-Square Error and the maximum possible pixel value of the image, respectively. When each pixel is represented with 8 bits, $MAX = 255$. $MSE$ is given by

$$MSE = \frac{1}{n \times m} \sum_{i=1}^{m} \sum_{j=1}^{n} (A_{ij} - B_{ij})^2, \quad (2)$$

where $m$ and $n$ represent the width and the height of the image in pixels, respectively.

Despite the widespread usage of PSNR, the non-linear behavior of the human visual system prevents it from accurately capturing perceptual video quality. SSIM, however, correlates better with perceived quality. SSIM compares the distortion in three image components: luminance, contrast, and structure.

$$SSIM(A, B) = \frac{(2 M_A M_B + C_1)(2 S_{AB} + C_2)}{(M_A^2 + M_B^2 + C_1)(S_A^2 + S_B^2 + C_2)}, \quad (3)$$

where $M_A$, $M_B$, $S_A^2$, $S_B^2$, and $S_{AB}$ are the average of A, the average of B, the variance of A, the variance of B, and the covariance of A and B, respectively. $C_1$ and $C_2$ are two constants that stabilize the division when the denominator is weak.

Since the human eye is more responsive to brightness than to color, we use only the Luma (Y) component in the YUV colorspace to determine both PSNR and SSIM. The raw video sequences and decoded videos are all stored in the YV12 format. Videos in the y4m format are converted to the YV12 format using FFmpeg. We develop Java scripts to automate the measurements of the encoding and decoding times and the bitrate of the encoded video.

TABLE II
CHARACTERISTICS OF THE USED STANDARD VIDEO SEQUENCES

| Sequence Name | Dur. (sec.) | # Frames | Resolution | Description |
|---|---|---|---|---|
| Foreman | 12 | 300 | QCIF, CIF | A talking man's face with very rich details, followed by a construction site |
| Salesman | 17 | 449 | QCIF | A sitting man engaged in moderate gestures without much movement |
| News | 12 | 300 | QCIF, CIF | Two sitting television announcers, with a television in the back showing moving pictures |
| Mobile | 12 | 300 | QCIF | A moving calendar with text and a detailed photo of a ship |
| Highway | 80 | 2000 | QCIF | A high-motion video of a highway |
| Stefan | 3 | 90 | CIF | A fast-moving player, with many viewers and much detail |
| Paris | 42 | 1065 | CIF | A man and a woman sitting at a table and talking |
| MotherDaughter | 12 | 300 | CIF | A mother and her daughter speaking (similar to video conferences) |
| City | 24 | 600 | 4CIF | A part of a city with many crowded buildings, which move due to a moving camera |
| Crew | 24 | 600 | 4CIF | Tens of men walking and waving their hands, most of them wearing the same color |
| Harbour | 24 | 600 | 4CIF | Ships moving in two directions in a harbour |
| Soccer | 24 | 600 | 4CIF | A soccer game with players running in different directions |
| Ice | 19 | 480 | 4CIF | People skiing on the ice without much detail in the background |
| DucksTakeOff | 20 | 500 | 720p, 2160p | Ducks taking off from the sea, without much detail in the background or on the flying ducks |
| InToTree | 5 | 500 | 720p, 2160p | A building, a lake, and many trees, with the whole scene moving due to a moving camera |
| OldTownCross | 5 | 500 | 720p, 2160p | An old city building with much detail, moving due to a moving camera |
| ParkJoy | 20 | 500 | 720p, 2160p | Seven people running in a park with much moving detail |
| Mobcal | 20 | 504 | 720p | A moving calendar with text and a detailed photo of a ship |
| BlueSky | 8 | 645 | 1080p | A moving picture of a big tree with the sky as background |
| PedestrianArea | 15 | 375 | 1080p | Many people crossing a street in two directions |
| Riverbed | 10 | 250 | 1080p | A river showing the movement of clear water through which the stones are visible |
| RushHour | 20 | 500 | 1080p | Many cars in both directions of a street, and people walking in the middle of a city |
| Station2 | 12 | 313 | 1080p | A train station with many moving trains |

TABLE IV
X264 PRESET PARAMETERS (--keyint 500 IS COMMON FOR ALL PRESETS)

| Preset | Tuning | HD Preset Parameters | QCIF, CIF, 4CIF Preset Parameters |
|---|---|---|---|
| High Quality | None | --preset slow --ref 4 --pass 1 | --preset slow --pass 1 |
| High Quality | None | --preset slow --ref 4 --pass 2 | --preset slow --pass 2 |
| High Quality | PSNR | --preset slow --ref 4 --tune PSNR --pass 1 | --preset slow --tune PSNR --pass 1 |
| High Quality | PSNR | --preset slow --ref 4 --tune PSNR --pass 2 | --preset slow --tune PSNR --pass 2 |
| High Quality | SSIM | --preset slow --ref 4 --tune SSIM --pass 1 | --preset slow --tune SSIM --pass 1 |
| High Quality | SSIM | --preset slow --ref 4 --tune SSIM --pass 2 | --preset slow --tune SSIM --pass 2 |
| High Speed | None | --preset faster --subme 3 --trellis 0 --weightp 0 --ref 1 | --preset fast --ref 1 |
| High Speed | PSNR | --preset faster --subme 3 --trellis 0 --weightp 0 --ref 1 --tune PSNR | --preset fast --tune PSNR --ref 1 |
| High Speed | SSIM | --preset faster --subme 3 --trellis 0 --weightp 0 --ref 1 --tune SSIM | --preset fast --tune SSIM --ref 1 |
| Normal | None | --preset faster --weightp 0 --subme 3 --pass 1 | --preset fast --pass 1 --weightp 0 --subme 5 |
| Normal | None | --preset faster --weightp 0 --subme 3 --pass 2 | --preset fast --pass 2 --weightp 0 --subme 5 |
| Normal | PSNR | --preset faster --weightp 0 --subme 3 --tune PSNR --pass 1 | --preset fast --tune PSNR --pass 1 --weightp 0 --subme 5 |
| Normal | PSNR | --preset faster --weightp 0 --subme 3 --tune PSNR --pass 2 | --preset fast --tune PSNR --pass 2 --weightp 0 --subme 5 |
| Normal | SSIM | --preset faster --tune SSIM --weightp 0 --subme 3 --pass 1 | --preset fast --tune SSIM --pass 1 --weightp 0 --subme 5 |
| Normal | SSIM | --preset faster --tune SSIM --weightp 0 --subme 3 --pass 2 | --preset fast --tune SSIM --pass 2 --weightp 0 --subme 5 |

The encoding parameters for both X264 and VP8 are chosen to cover high speed, normal quality, and high quality, thereby considering the tradeoff between speed and quality. The encoding parameters also cover a range of bitrates suitable for each sequence. The encoding parameters of X264 are chosen based on the recommendation of the developers, as described in [2]. Similarly, the VP8 parameters are chosen from the WebM website. Tables IV and V detail the values of the used parameters. The encoded videos have the extensions mp4 and WebM for H.264 and VP8, respectively. The encoding and decoding speeds are measured in frames per second.
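
The PSNR and SSIM definitions in Equations (1) to (3) can be illustrated with a short Python sketch operating on the Luma plane of a single frame. This is not the Matlab code used in the study. It applies Equation (3) once over the whole frame rather than over local windows, uses population (divide-by-N) statistics, and takes $C_1 = (0.01 \times MAX)^2$ and $C_2 = (0.03 \times MAX)^2$; those constants and the function names `psnr` and `global_ssim` are assumptions, since the paper does not state them.

```python
import math


def _flatten(frame):
    """Turn a 2-D list of Luma samples into a flat list."""
    return [sample for row in frame for sample in row]


def psnr(a, b, max_value=255):
    """Equation (1), with the MSE of Equation (2). Returns dB."""
    fa, fb = _flatten(a), _flatten(b)
    mse = sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa)
    if mse == 0:
        return math.inf          # identical frames
    return 10 * math.log10(max_value ** 2 / mse)


def global_ssim(a, b, max_value=255, k1=0.01, k2=0.03):
    """Equation (3) evaluated once over the whole frame.

    Assumptions: single global window, population variance/covariance,
    and constants C1 = (k1 * MAX)^2, C2 = (k2 * MAX)^2.
    """
    fa, fb = _flatten(a), _flatten(b)
    n = len(fa)
    mean_a = sum(fa) / n
    mean_b = sum(fb) / n
    var_a = sum((x - mean_a) ** 2 for x in fa) / n
    var_b = sum((y - mean_b) ** 2 for y in fb) / n
    cov_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(fa, fb)) / n
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    numerator = (2 * mean_a * mean_b + c1) * (2 * cov_ab + c2)
    denominator = (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator
```

A per-sequence score would then be obtained by averaging the per-frame values over all frames, which is a common convention but again an assumption about the exact procedure.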
Bitrate handling is measured by dividing the target bitrate (which is an input to the encoder) by the accomplished bitrate (which is the bitrate of the encoded video). The accomplished bitrate is computed by dividing the file size in Kbits by the sequence display time in seconds. We use Matlab scripts to determine the PSNR and SSIM metrics. The following computer configuration is used for the main tests: 64-bit Windows 7 Professional, Intel Core 2 Duo CPU E7500 at 2.93 GHz, and 4 GB RAM.

IV. RESULT PRESENTATION AND ANALYSIS

Let us now analyze the results of comparing VP8 and H.264. To keep the figures less crowded, not all preset results are shown in all figures. Specifically, the results of VP8 presets "Good0", "Good1", "Good3", and "Good4" are shown only in Figure 6. Furthermore, H.264 presets "High Quality PSNR Tuning", "High Speed PSNR Tuning", and "Normal PSNR Tuning" are shown only in Figure 7, and "High Quality SSIM Tuning", "High Speed SSIM Tuning", and "Normal SSIM Tuning" are shown only in Figures 6 and 7.

Perceptual Video Quality

Figures 1 and 2 compare VP8 and H.264 in terms of PSNR and SSIM, respectively. H.264 performs better than VP8 in all resolutions up to and including 720P, but VP8 achieves better results for 1080P and 2160P. Interestingly, H.264 preset "High Quality" and VP8 preset "Best" (the presets with higher computation complexity) perform poorly in terms of both perceptual video quality and encoding speed (discussed next) in HD resolutions.

Encoding Speed

Figure 3 compares H.264 and VP8 in terms of encoding speed. H.264 is superior to VP8 in most resolutions. This behavior occurs despite the reduced feature complexity of VP8 (discussed in Section II) and can be attributed to the implementation.

Decoding Speed

Figure 4 compares the decoding speeds of VP8 and H.264. VP8 and H.264 either have close performance or VP8 performs better, especially with preset "Best" at lower bitrates.
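
The bitrate and speed measures defined in Section III reduce to simple ratios, sketched below in Python. The function names are hypothetical, and the conversion assumes 1 Kbit = 1000 bits and a file size given in bytes; the paper does not say which convention its measurement scripts use.

```python
def accomplished_bitrate_kbps(file_size_bytes, duration_seconds):
    """File size in Kbits divided by the sequence display time in seconds.

    Assumption: 1 Kbit = 1000 bits.
    """
    kbits = file_size_bytes * 8 / 1000
    return kbits / duration_seconds


def bitrate_handling_ratio(target_kbps, file_size_bytes, duration_seconds):
    """Target bitrate divided by accomplished bitrate.

    A value of 1.0 means the encoder hit the target exactly; values below
    1.0 mean the encoded video used more bits than requested.
    """
    return target_kbps / accomplished_bitrate_kbps(file_size_bytes,
                                                   duration_seconds)


def speed_fps(num_frames, elapsed_seconds):
    """Encoding or decoding speed in frames per second."""
    return num_frames / elapsed_seconds
```

For example, a 12-second sequence encoded to a 150,000-byte file has an accomplished bitrate of 100 Kbit/s under these assumptions, so a 95 Kbit/s target gives a ratio of 0.95.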

Bitrate Handling

Figure 5 compares H.264 and VP8 in terms of bitrate handling, as measured by the ratio of the target to the accomplished bitrate. H.264 performs significantly better than VP8 in bitrate handling in all resolutions higher than 4CIF. It can achieve the desired bitrates much more accurately for this wide range of resolutions. VP8 performs slightly better only in 4CIF and lower resolutions.

Encoding Speed and Bitrate Tradeoff at Fixed Perceptual Quality

Figure 6 compares the VP8 and H.264 encoding speeds and accomplished bitrates at fixed perceptual video quality. In all resolutions excluding HD, H.264 achieves higher encoding speeds and lower bitrates than VP8 at the same quality. In HD resolutions, VP8 yields lower bitrates but lower encoding speeds than H.264 at the same quality.

Comparing the presets of VP8 under QCIF, CIF, and 4CIF, preset "Best" (the highest computation complexity) achieves the best quality and the worst encoding speed, followed by presets "Good0", "Good1", "Good2", "Good3", "Good4", and "Good5". As expected, preset "Good5" (the lowest computation complexity) produces the worst quality and the highest encoding speed among all the VP8 presets. For HD resolutions, presets "Good1", "Good2", and "Good3" perform better in both quality and encoding speed. For H.264, generally speaking, preset "High Quality" achieves the highest quality and the lowest encoding speed, especially for lower resolutions, whereas preset "High Speed" yields the lowest quality and the highest encoding speed, and preset "Normal" falls in the middle for lower resolutions but performs best for HD resolutions.

Impact of SSIM and PSNR Tunings in H.264

Figure 7 demonstrates the impact of metric tuning in H.264 encoding. As expected, H.264 performs better in the SSIM metric when it is configured with SSIM tuning and performs better in the PSNR metric when it is configured with PSNR tuning.
The difference is large and makes the comparison of H.264 with VP8 unfair if H.264 is tuned. Thus, we generally assume no such tuning in our conclusions.

V. CONCLUSIONS

We have compared the implementation complexities of VP8 and H.264, and have shown that VP8 has simpler intra- and inter-prediction algorithms. Encoding simplicity leads to faster encoding and lower power consumption at the encoder. In addition, we have analyzed and compared the performance of H.264 and VP8 through extensive experiments. The main results can be summarized as follows.

• In terms of perceptual video quality, H.264 performs better than VP8 in all resolutions up to and including 720P, but VP8 achieves better results for 1080P and 2160P.
• H.264 is superior to VP8 for most resolutions in terms of encoding speed. This behavior occurs despite the reduced complexity of VP8 and can be attributed to the implementation.
• VP8 performs better than H.264 in decoding speed for certain resolutions.
• H.264 performs significantly better than VP8 in bitrate handling in all resolutions higher than 4CIF. It can achieve the desired bitrates much more accurately for this wide range of resolutions.
• Surprisingly, faster encoding presets perform better in terms of both perceptual quality and encoding speed with both H.264 and VP8 under HD resolutions.

These results demonstrate that H.264 generally performs better than VP8 in terms of perceptual video quality, encoding speed, and bitrate handling. The results in terms of encoding speed are surprising considering that VP8 seeks to reduce the implementation complexity by providing more limited features. The gap between the two in terms of encoding speed is being reduced over time due to improvements in the implementation of VP8.

REFERENCES

[1] H. Wilson, "Open letter to Google: free VP8, and use it on YouTube," 2010-03-12.
[2] D. D. Vatolin, D. D. Kulikov, and A. Parshin, "Sixth MPEG-4 AVC/H.264 video codecs comparison - short version."
[3] P. Seeling, F. H. Fitzek, G. Ertli, A. Pulipaka, and M. Reisslein, "Video network traffic and quality comparison of VP8 and H.264 SVC," in Proceedings of the 3rd workshop on Mobile video delivery. Conference on Human Factors in Computing Systems, 2010.
[4] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
[5] J. Kim, D. Kim, and J. Jeong, "Complexity reduction algorithm for intra mode selection in H.264/AVC video coding," Advanced Concepts for Intelligent Vision Systems (ACIVS'06), vol. 4179, pp. 454-465, 2006.
[6] J. Garrett-Glaser, "The first in-depth technical analysis of VP8," http://x264dev.multimedia.cx/archives/377.
[7] J. Bankoski, P. Wilkins, and Y. Xu, "Technical overview of VP8, an open source video codec for the web."

[Fig. 1. Comparing VP8 and H.264 in Perceptual Video Quality Using the PSNR Metric. Panels: (a) QCIF and CIF, (b) 4CIF, (c) 720P, (d) 1080P, (e) 2160P. Axes: PSNR (dB) versus Accomplished Bit Rate (Kbit/s). Curves: VP8 Best, Good2, and Good5 presets; H.264 High Quality, High Speed, and Normal presets with no tuning.]

[Fig. 2. Comparing VP8 and H.264 in Perceptual Video Quality Using the SSIM Metric. Panels: (a) QCIF and CIF, (b) 4CIF, (c) 720P, (d) 1080P, (e) 2160P. Axes: SSIM versus Accomplished Bit Rate (Kbit/s). Curves: same six presets as Fig. 1.]

[Fig. 3. Comparing VP8 and H.264 in Encoding Speed. Panels: (a) QCIF and CIF, (b) 4CIF, (c) 720P, (d) 1080P, (e) 2160P. Axes: Encoding Speed (frame/second) versus Accomplished Bit Rate (Kbit/s). Curves: same six presets as Fig. 1.]

[Fig. 4. Comparing VP8 and H.264 in Decoding Speed. Panels: (a) QCIF and CIF, (b) 4CIF, (c) 720P, (d) 1080P, (e) 2160P. Axes: Decoding Speed (frame/second) versus Accomplished Bit Rate (Kbit/s). Curves: same six presets as Fig. 1.]

[Fig. 5. Comparing VP8 and H.264 in Bitrate Handling. Panels: (a) QCIF and CIF, (b) 4CIF, (c) 720P, (d) 1080P, (e) 2160P. Axes: Ratio = Target / Accomplished Bit Rate versus Accomplished Bit Rate (Kbit/s). Curves: same six presets as Fig. 1.]

[Fig. 6. Comparing VP8 and H.264 in Encoding Speed and Bitrate at Fixed Quality. Panels: (a) Quality, Y-SSIM=0.97, QCIF and CIF; (b) Quality, Y-SSIM=0.97, 4QCIF; (c) Quality, Y-SSIM=0.95, 720p; (d) Quality, Y-SSIM=0.95, 1080p; (e) Quality, Y-SSIM=0.980, 2160p. Axes: Encoding Speed (Frame/sec) versus Accomplished Bit Rate (Kbit/s). Curves: VP8 Best and Good0 to Good5 presets; H.264 High Quality, High Speed, and Normal presets with no tuning and with SSIM tuning.]

[Fig. 7. Impact of Tuning Effect in H.264 [2160P]. Panels: (a) Quality, Y-PSNR; (b) Quality, Y-SSIM. Axes: PSNR or SSIM versus Accomplished Bit Rate (Kbit/s). Curves: H.264 High Quality, High Speed, and Normal presets, each with no tuning, PSNR tuning, and SSIM tuning.]