Video Quality Testing for the IP and Internet Domains
With the proliferation in the amount of programming and the number of devices and screen sizes today, testing the output of various video delivery protocols and equipment is a crucial step in deciding which combination yields the best downstream video quality — and therefore which systems to invest in. It’s a never-ending process for satellite, broadcast, cable, and IPTV network technologists, system architects, and engineers, because techniques, equipment, standards, and formats are always evolving.
Full-reference testing is the most accurate method for assessing changes in video quality between the source video and the downstream version. The basic idea is to take a short video clip, send it through a system to be tested, and then compare the system’s output to the original (see Figure 1). If the system causes any differences in any of the video frames, those differences can be measured using a variety of objective metrics to yield numerical results that are tied to accepted, standardized databases derived from human-vision-based studies. In other words, the number will tell you, based on the scale you’re using, how close the processed signal is to the original, and the results will very closely approximate what human viewers would judge the quality to be when viewing the same video sequence on their own screens.
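The comparison described above can be sketched in a few lines of Python. This is only an illustration, under stated assumptions: frames are represented as small lists of luma values, and PSNR is used as a simple stand-in metric. It is not one of the perceptual metrics discussed in this paper, and the function names are hypothetical.

```python
import math


def psnr(reference, processed, peak=255):
    """Peak signal-to-noise ratio (dB) between two equal-sized luma frames.

    Frames are lists of rows of integer pixel values. PSNR is a simple
    stand-in here, not the perceptual metric the paper relies on.
    """
    if len(reference) != len(processed) or any(
        len(r) != len(p) for r, p in zip(reference, processed)
    ):
        raise ValueError("frames must have identical dimensions")
    squared_error = sum(
        (a - b) ** 2
        for ref_row, proc_row in zip(reference, processed)
        for a, b in zip(ref_row, proc_row)
    )
    if squared_error == 0:
        return math.inf  # identical frames: no measurable difference
    pixel_count = sum(len(row) for row in reference)
    mse = squared_error / pixel_count
    return 10 * math.log10(peak ** 2 / mse)


def score_clip(reference_frames, processed_frames):
    """Return one score per frame pair, reference versus processed."""
    if len(reference_frames) != len(processed_frames):
        raise ValueError("clips must have the same number of frames")
    return [psnr(r, p) for r, p in zip(reference_frames, processed_frames)]


# Tiny 2x2 example: only the second frame is altered by the system under test.
reference_clip = [[[100, 100], [100, 100]], [[50, 60], [70, 80]]]
processed_clip = [[[100, 100], [100, 100]], [[52, 60], [70, 78]]]
scores = score_clip(reference_clip, processed_clip)
```

The first frame is untouched and scores as infinite PSNR, while the altered second frame receives a finite score. A perceptual metric would replace `psnr` but keep the same frame-by-frame structure.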
Today most traditional television signals are maintained at specific resolutions and frame rates from the start of production through transmission to the end viewer. This consistency makes reference-based testing in the traditional television domain relatively easy to do because you can use the original, unprocessed source video as the basis for comparison.
However, performing reference-based testing in the IP domain is not so simple. The advent of IPTV and Internet-enabled devices means the video coming out of the system will often be delivered at a lower resolution/frame rate than the source material in order to accommodate equipment other than televisions.
To complicate the matter further, it is becoming more and more economical to deploy delivery systems that adapt to the conditions of the network and the requirements of end devices. These adaptive systems make the testing process even more complex because it means that content providers must be prepared to deliver multiple profiles (resolutions, bit rates, frame rates) for every asset, with understood levels of quality for every instance of the delivery chain and end-device type.
In this scenario, full-reference testing is still the best way to assess video quality. The method applies broadly, not just to functional testing of an adaptive-bit-rate (ABR) service, but to lab testing that helps determine the optimal combination of bit rates, frame rates, resolutions, and equipment for the various profiles. (For example, full-reference testing can be especially useful when trying to judge the quality of the new HEVC encoders.)
Figure 1: Encoder B Reference @ 15Mbps vs. Encoder B Test @ 350kbps
The Challenge of Full Reference Testing of IP Content
Full-reference quality testing requires the source and processed sequences to be at the same resolution and frame rate; otherwise the measurement is neither accurate nor consistent as tests are created at multiple downstream profiles. This is harder to achieve for IP content than for traditional television content because you first have to reduce the resolution and/or frame rate of the source material, which is most often HD quality, in order to create an acceptable reference. Only then does it make sense to compare it to the encoded/transcoded material that the end user will see.
Many operators have opted to offer a multiprofile adaptive streaming service in which they perfect a fixed set of delivery profiles that will satisfy most of the target devices at any given time — devices that themselves must adapt to variances in network conditions. For example, some services might have at the ready three low-bit-rate delivery profiles for cell phones, three in the mid range for tablets, and two high-end profiles for televisions. The idea is to provide multiple profiles so that any end device can adapt its playback for any change in network conditions. All of the six lower-resolution profiles are created from an original 1080i or 720p source. In order to test those profiles, you must somehow compare the HD source video to the lower-resolution video that will ultimately come out of an encoder on the viewer’s end.
To create the content streams for IP delivery, you must scale (and, in the case of a 1080i source, deinterlace) the original HD video for each profile. And in order to test those streams, you must create similarly processed streams to use as references during the test.
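As a rough sketch of the bookkeeping this implies, the Python below models an eight-profile ladder (three phone, three tablet, two television) and derives a matching reference specification for each profile. All resolutions, frame rates, and bit rates are hypothetical values chosen for illustration. The 15 Mbps reference rate only echoes the figure captions in this paper and is not a recommendation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Profile:
    name: str
    width: int
    height: int
    fps: float
    kbps: int


# Hypothetical ladder: 3 phone, 3 tablet, and 2 television profiles.
LADDER = [
    Profile("phone-1", 416, 234, 15.0, 200),
    Profile("phone-2", 416, 234, 30.0, 350),
    Profile("phone-3", 640, 360, 30.0, 600),
    Profile("tablet-1", 768, 432, 30.0, 900),
    Profile("tablet-2", 960, 540, 30.0, 1500),
    Profile("tablet-3", 1280, 720, 30.0, 2500),
    Profile("tv-1", 1280, 720, 60.0, 4500),
    Profile("tv-2", 1920, 1080, 30.0, 7000),
]

REFERENCE_KBPS = 15000  # assumed "highest possible" bit rate for references


def reference_for(profile):
    """Same resolution and frame rate as the profile, at the reference bit rate."""
    return replace(profile, name=profile.name + "-ref", kbps=REFERENCE_KBPS)


def comparable(a, b):
    """Two streams can be compared only if resolution and frame rate match."""
    return (a.width, a.height, a.fps) == (b.width, b.height, b.fps)


references = {p.name: reference_for(p) for p in LADDER}
```

Each profile is paired with its own reference rather than with the HD source, so `comparable(profile, references[profile.name])` holds for every entry in the ladder.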
Figure 2: An Example Number of Streams Per Program
Using Multiple Tools Skews Measurements
Challenges arise when deinterlacing and/or scaling the video with separate tools that do not match the deinterlacing and scaling algorithms in the encoder under test. Because the target encoder (or, in the case of ABR services, the transcoder) has its own set of processes, using separate deinterlacing and scaling tools on the front end can create additional artifacts that exacerbate the differences between the reference material and the downstream output, which could result in a worse score on the chosen video quality index.
A New Methodology for Video Quality Measurement for ABR
There is a new methodology for measuring video quality in the IP domain. It is based on the simple concept of reducing resolutions, but, more importantly, using the target device to create the reference video sequence so as to minimize the artifacts and/or degradations that result. This method, while still being refined, is a proven, mathematically accurate means of objectively comparing different profiles and equipment.
In full-reference testing, the idea is always to use the most pristine reference possible in order to get a true sense of the differences based on the metric and the scoring index that you’re using. When testing IP streams, the best way to arrive at a pristine reference is to use the same encoder to create the reference as you do to create the downstream deliverables. When you reduce the resolution using the target processing device to match the test profile’s resolution at the highest possible bit rate, you minimize the differences created in picture artifacts when scaling and deinterlacing an image. While this process compromises the video quality of the source slightly, it’s the only way to ensure that the reference and the low-bit-rate versions are being deinterlaced and/or scaled according to the same algorithms, and the best way to get a quality measurement that is as true as it can be to the index that’s being used.
Creating the references and test streams in this way is part of a new methodology for measuring quality of adaptive IP streams. From there it’s a good idea to use the same measurement system and scoring index — such as the Multi-Scale Structural Similarity (MS-SSIM) algorithm and the Differential Mean Opinion Score (DMOS) scale — throughout the entire testing process. (Following these steps ensures repeatable results that aren’t skewed by unnecessary artifacts.) In the new methodology, a video engineer with a trained eye then considers the resulting score while visually comparing the reference and the processed stream side by side.
Some operators have already employed this methodology using these steps:
- Generate a mezzanine-quality reference for each profile using the highest possible bit rate and optimal encoding parameters.
- Generate each profile test signal using the application’s encoding parameters.
- Calculate quality using MS-SSIM on the DMOS scale, comparing each test profile signal to the appropriate reference profile signal.
- Analyze results on the DMOS scale against visual comparisons of source and downstream profile test segments.
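The first three steps above can be sketched as a simple loop. In this Python outline, `encode` and `measure` are hypothetical callables standing in for the device under test and the measurement tool; they are not real APIs, and the stub versions do no video processing. The fourth step, the visual comparison, remains a human task and is not modeled.

```python
def run_profile_tests(source, profiles, encode, measure, reference_kbps):
    """Apply steps 1-3 to every profile and collect one result per profile.

    encode(source, width, height, kbps) and measure(reference, test) are
    hypothetical placeholders supplied by the caller.
    """
    results = []
    for profile in profiles:
        width, height = profile["width"], profile["height"]
        # Step 1: reference at the profile's resolution, highest bit rate.
        reference = encode(source, width, height, reference_kbps)
        # Step 2: test signal using the application's encoding parameters.
        test = encode(source, width, height, profile["kbps"])
        # Step 3: score the test signal against its own reference.
        score = measure(reference, test)
        results.append({"profile": profile["name"], "score": score})
    return results


# Stand-in callables so the sketch runs; they only record what was requested.
encode_calls = []


def stub_encode(source, width, height, kbps):
    encode_calls.append((width, height, kbps))
    return {"source": source, "width": width, "height": height, "kbps": kbps}


def stub_measure(reference, test):
    # Enforce the matching-format rule; return a placeholder, not a real score.
    if (reference["width"], reference["height"]) != (test["width"], test["height"]):
        raise ValueError("reference and test must share a resolution")
    return None


profiles = [
    {"name": "phone-low", "width": 416, "height": 234, "kbps": 350},
    {"name": "tablet-mid", "width": 960, "height": 540, "kbps": 1500},
]
results = run_profile_tests(
    "clip.ts", profiles, stub_encode, stub_measure, reference_kbps=15000
)
```

Passing the encoder and the measurement tool in as parameters keeps the loop identical when a different device or metric is under test.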
The process is repeated for each stream that needs testing, creating a new version of the reference for each profile or device under test, and making incremental adjustments to the variables one at a time in order to test the effect on quality. Given the number of combinations of encoders, bit rates, frame rates, and resolutions in a lab setting — compounded by the fact that there are multiple types of source video — operators need to create a significant number of references and processed test streams.
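To give a feel for how quickly the numbers grow, the following Python sketch counts references and test streams for a small, entirely hypothetical lab matrix. The encoder names, content types, resolutions, frame rates, and bit rates are assumptions for illustration, not values from any operator.

```python
from itertools import product

# Hypothetical lab matrix; every value here is an illustrative assumption.
encoders = ["encoder-a", "encoder-b"]
source_types = ["action", "crowd", "flashing-lights", "animation"]
resolutions = [(416, 234), (960, 540), (1280, 720)]
frame_rates = [15, 30]
bit_rates_kbps = [350, 700, 1500]

# One reference per encoder, source type, resolution, and frame rate.
references = list(product(encoders, source_types, resolutions, frame_rates))

# One test stream per reference and bit rate.
test_streams = list(
    product(encoders, source_types, resolutions, frame_rates, bit_rates_kbps)
)

print(len(references), "references,", len(test_streams), "test streams")
# prints "48 references, 144 test streams"
```

Even this modest matrix produces 144 measurements, each of which also needs a visual check.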
The first steps in the process (creating the various references and setting up the tests) are manageable, but repeating the test manually for every permutation and then visually comparing the results is a very involved process that can be nearly impossible for one person, or even a lab full of people, to complete in a reasonable time frame.
To manage the testing phase, the methodology calls for an automated tool to measure video quality in the many different combinations of bit rates, frame rates, and lower-than-broadcast resolutions, one that relies on an algorithm that yields numerical results based on human-perception scales. At the same time, the solution automatically creates measurement charts and synchronized, side-by-side picture comparisons that let experienced video engineers view the differences between the reference and the processed video — because even with the most accurate human-vision estimation metrics available today, the decisions still come down to a visual check of the video content.
Figure 3: Full-reference testing with ClearView provides measurement results and a side-by-side view of source and target delivery video.
Video Clarity’s ClearView Video Analysis Solution
Video Clarity’s ClearView A/V quality analyzer records the references created at high bit rates at the target resolutions from the device under test. The user then matches them up with the corresponding profiles created by that same device for testing. This process can be done manually or automatically in a scripted routine to align, measure, and log the results.
Figure 3 shows the process.
An HD video source feeds an uncompressed signal to the encoder’s input. The HD video could come from a file or from any number of other sources, such as live video streaming from one or multiple HD cameras in a basketball arena.
The output of the transcoder goes into ClearView, where it is captured in real time and simultaneously decoded to form one sequence in a set of full-reference test sequences. There might be multiple sources of video for a given sequence (e.g., basketball action, a crowd shot, flashing lights, and animation), so it’s necessary to create a reference for each of those types and compare it against multiple bit rates for each. The idea is to accommodate changing network conditions between the source server and the encoded stream or endpoint streaming device (e.g., tablet, handheld, or set-top with HD output).
Additional outputs are then captured and added to the sequence set by making incremental changes in the transcoder bit rate. More sets of full-reference test sequences can be generated by using different source content and repeating the process of capturing and storing the encoder output sequences.
Then ClearView performs several types of measurements on a complete set of full-reference test sequences. Using the MS-SSIM algorithm, ClearView presents results using the associated MS-SSIM score as well as the DMOS scale. (It can also test using the Sarnoff Just Noticeable Differences [JND] metric, MOVIE Temporal, and MOVIE Spatial, as well as the overall MOVIE metric, and present results on their native scales.) ClearView automatically aligns the sequences, measures their quality, creates comparison charts, and places the uncompressed reference and processed test streams side by side on HDTV or UHDTV monitors for visual inspection.
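The paper does not describe how ClearView aligns sequences internally. As a generic illustration of temporal alignment only, the Python sketch below slides a reference over a captured stream and picks the frame offset with the smallest mean absolute difference. It assumes each frame has been reduced to a single number (for example, average luma); the function name and sample values are hypothetical.

```python
def best_offset(reference_sig, test_sig, max_offset):
    """Find the frame offset into test_sig that best matches reference_sig.

    Both arguments are lists with one number per frame (e.g. average luma).
    Returns (offset, cost), where cost is the mean absolute difference.
    """
    length = len(reference_sig)
    best, best_cost = None, float("inf")
    for offset in range(max_offset + 1):
        window = test_sig[offset:offset + length]
        if len(window) < length:
            break  # not enough captured frames left to compare
        cost = sum(abs(a - b) for a, b in zip(reference_sig, window)) / length
        if cost < best_cost:
            best, best_cost = offset, cost
    return best, best_cost


# Hypothetical signatures: the capture starts three frames before the clip.
reference_sig = [10, 40, 20, 80, 30]
captured_sig = [5, 5, 5, 11, 39, 21, 79, 30, 0]
offset, cost = best_offset(reference_sig, captured_sig, max_offset=4)
```

Here the search settles on an offset of three frames, after which frame-by-frame measurement can proceed on matching pairs.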
The graph in Figure 3 shows an estimated trend of many sequence score averages that would be expected if quality is measured by using MS-SSIM on the DMOS scale, where a lower score indicates better picture quality. Blue and red lines show a potential differing quality average between two encoders. In general, quality will increase as bit rates increase, signified by downward-sloping curves on the graph.
For any given type of source content, the encode process will perform differently at most points along its quality curve, that is, at different bit rates.
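One way to use such curves is to look for the lowest bit rate at which each encoder reaches a target score. The Python sketch below assumes a lower DMOS value means better quality, as stated above. The curve points are invented purely for illustration and are not measurements from any encoder.

```python
def lowest_rate_meeting(curve, target_dmos):
    """Return the lowest bit rate (kbps) whose DMOS is at or below the target.

    curve is a list of (kbps, dmos) points; lower DMOS means better quality.
    Returns None if no point on the curve meets the target.
    """
    passing = [kbps for kbps, dmos in curve if dmos <= target_dmos]
    return min(passing) if passing else None


def improves_with_bit_rate(curve):
    """True if DMOS never rises as the bit rate increases."""
    ordered = sorted(curve)
    return all(later[1] <= earlier[1] for earlier, later in zip(ordered, ordered[1:]))


# Invented points for two hypothetical encoders, for illustration only.
encoder_a = [(350, 42.0), (700, 30.0), (1500, 18.0), (3000, 9.0)]
encoder_b = [(350, 48.0), (700, 33.0), (1500, 22.0), (3000, 10.0)]

rate_a = lowest_rate_meeting(encoder_a, target_dmos=20.0)
rate_b = lowest_rate_meeting(encoder_b, target_dmos=20.0)
```

With these made-up points, the first encoder reaches the target at 1500 kbps and the second only at 3000 kbps, which is the kind of difference the two curves in the graph are meant to expose.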
Figure 4 depicts a single example of an actual DMOS test comparing one piece of reference video processed at a single bit rate from two different encoders. The numerical results of this test are plotted on the graph below the images, yielding a frame-for-frame comparison of the two codecs under test. For purposes of this paper and an understanding of the visual differences of these two encoded frame sets, a single frame of both encodes is shown as a comparison. By overlaying the results of a given piece of video going through different encoders, the graph demonstrates how one encoder compares to another over a given number of frames.
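A frame-for-frame comparison of this kind can be summarized numerically before the visual check. The Python sketch below assumes per-frame DMOS values are already available for both encoders, with lower values meaning better quality. The sample numbers are made up for illustration and the function name is hypothetical.

```python
from statistics import mean


def compare_per_frame(scores_a, scores_b):
    """Summarize two per-frame DMOS series (lower is better) for the same clip."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both encoders must be scored on the same frames")
    pairs = list(zip(scores_a, scores_b))
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "frames_a_better": sum(1 for a, b in pairs if a < b),
        "frames_b_better": sum(1 for a, b in pairs if b < a),
        # Index of the frame where each encoder scored worst (highest DMOS).
        "worst_frame_a": scores_a.index(max(scores_a)),
        "worst_frame_b": scores_b.index(max(scores_b)),
    }


# Made-up per-frame scores for two hypothetical encoders at one bit rate.
encoder_a_scores = [20.0, 22.0, 35.0, 21.0]
encoder_b_scores = [24.0, 23.0, 30.0, 26.0]
summary = compare_per_frame(encoder_a_scores, encoder_b_scores)
```

The worst-frame indices point an engineer to the frames most worth inspecting side by side.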
Different types of content will create differing results from each encoder; therefore, multiple sources must be encoded and tested at multiple bit rates in order to understand how well each encode process performs.
Figure 4: Encoder A Reference @ 15Mbps vs. Encoder A Test @ 350kbps; Encoder B Reference @ 15Mbps vs. Encoder B Test @ 350kbps
Distributing low-resolution video via IP is a major part of any entertainment delivery operation today, and many operators are using ABR services to do it. The nature of ABR services means there will be downstream deliverables in multiple profiles derived from a single HD source — all of which require test and measurement to verify their quality. The full-reference test method is generally considered the best test method in this case, and the way to ensure the most accurate test result is to create the reference sequences and the lower-resolution test streams using the same encoding device so that they are identically formatted. From there, a testing tool can handle aligning the reference and test streams and provide repetitive measurements, generating numerical quality scores as well as visual comparisons. This methodology gives operators the truest possible picture of how their downstream IP deliverables match up to the HD source video, which in turn helps them make better decisions about applied processing and equipment investments.
- “A Proposed VQM Methodology for ABR Networks.” Pierre Costa, LMTS, and Priyadarshini Anjanappa, AT&T Laboratories. Presented to the Video Services Forum, April 15, 2014.
- “Achieving Maximum Accuracy in Video Quality Measurement” white paper
- “Analyzing 4K Video Quality” white paper
- ClearView Data Sheet
Video Clarity would like to recognize and thank Pierre Costa and Priyadarshini Anjanappa of AT&T Laboratories for their contributions to this white paper.