How Many Humans Does it Take to Judge Video Quality?
In order to reach your home PC or TV, video sequences go through many stages:
- acquired via a camera
- edited in your favorite video editor
A number of video quality compromises occur at each step. How can you tell if the video quality is good enough? One obvious way is to solicit the opinion of a group of human observers. After all, the video sequences are meant for humans.
What Causes Artifacts (Video Errors)?
Camera optics and editing affect the video sequence in minimal ways. These effects can easily be minimized if care is taken to keep the camera in focus and to edit the video sequences at the same resolution. To simplify this paper, the video sequences after editing are assumed to be perfect.
In an ideal world, the home has enough bandwidth to receive the pure, uncompressed video sequence. To transmit a standard NTSC video, the transmission medium needs to handle 720(w)*480(h)*30(fps)*2(bytes) = 21MB/s, and that does not include the audio or any ancillary data (like subtitles). The total rate is closer to 27MB/s. To exacerbate the problem, HDTV is six times bigger (160MB/s). A good home connection is around 1MB/s (8Mbps). So, how do you send 160MB/s down a 1MB/s pipe?
Figure 1: Squeeze the Video through the Pipe
To send 160MB/s down a 1MB/s pipe, data compression must be used. In the case of High Definition, the compression ratio must be about 160:1.
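The arithmetic above can be checked with a short sketch. The function names are illustrative, 2 bytes per pixel assumes 8-bit 4:2:2 sampling, and MB is taken to mean one million bytes.

```python
def raw_video_rate(width, height, fps, bytes_per_pixel=2):
    """Return the uncompressed video rate in bytes per second."""
    return width * height * fps * bytes_per_pixel


def compression_ratio(source_rate, pipe_rate):
    """Return how many times the source must shrink to fit the pipe."""
    return source_rate / pipe_rate


sd_active = raw_video_rate(720, 480, 30)  # active NTSC picture only
hd_total = 160_000_000                    # HD figure quoted in the text, bytes/s
home_pipe = 1_000_000                     # 1MB/s (8Mbps) home connection

print(f"SD active video: {sd_active / 1e6:.1f} MB/s")
print(f"HD compression needed: {compression_ratio(hd_total, home_pipe):.0f}:1")
```

The first figure covers only the visible picture, which is why the text quotes a higher total once audio and ancillary data are included.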
The best video quality at a given bit-rate is the goal of video compression. The idea is to take advantage of the human eye and brain, which can fill in details automatically based on experience. Generally, the brain can compensate for blurring and loss of color fidelity in a video sequence. Compression techniques take advantage of this innate power.
Compression algorithms apply many different techniques to achieve the desired bit-rate reduction. Generally, the following techniques generate the largest effects:
- The color data is reduced, a technique called chrominance subsampling, by converting from RGB to Y’CbCr. Y’CbCr uses 2 bytes per pixel (in 4:2:2) or 12 bits per pixel (in 4:2:0 or 4:1:1), compared to 3 bytes per pixel in RGB (4:4:4).
- The compression algorithm removes spatially redundant pixel data by grouping the pixels into blocks (8×8, 4×8, etc.). The blocks are then converted to the frequency domain and quantized. The effect of quantization is a loss of brightness and color detail.
- The blocks are combined into macroblocks, normally 16×16 pixels, as a basis for interframe compression. A frame whose macroblocks are stored in full is termed the reference frame (or I frame). The next frame records the motion of the macroblocks, and not the macroblocks themselves, unless a matching macroblock cannot be found.
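The per-pixel sizes quoted for chrominance subsampling can be derived from the J:a:b notation. This is a minimal sketch that assumes 8-bit samples and the usual reading of the notation (a region J pixels wide and 2 rows high, with a chroma samples in the first row and b in the second); the function name is illustrative.

```python
def bits_per_pixel(j, a, b, bit_depth=8):
    """Average bits per pixel for J:a:b chroma subsampling."""
    luma_samples = 2 * j            # one Y' sample for every pixel in the region
    chroma_samples = 2 * (a + b)    # Cb and Cr each contribute a + b samples
    return bit_depth * (luma_samples + chroma_samples) / (2 * j)


for scheme in [(4, 4, 4), (4, 2, 2), (4, 2, 0), (4, 1, 1)]:
    label = ":".join(str(n) for n in scheme)
    print(f"{label} -> {bits_per_pixel(*scheme):.0f} bits per pixel")
```

Going from 4:4:4 to 4:2:0 halves the data before any block-based compression is applied.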
Due to these techniques several artifacts can occur:
- Some colors cannot be reproduced.
- The differences in the luminance or chrominance values at block boundaries cause visible artifacts to form. To reduce these artifacts, the algorithms blur the video sequences to make the edges softer. Of course, this is just another form of artifact.
- If a reference frame is lost, then all data between it and the next reference frame is unusable.
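The loss behind the block artifacts comes from the quantization step described earlier. The following is a minimal sketch, not any particular codec: it assumes an orthonormal 8×8 DCT-II and one uniform step size for every coefficient, whereas real encoders use per-frequency quantization tables. The test block and all names are hypothetical.

```python
import math

N = 8  # block size assumed for this sketch


def _scale(k):
    """Normalization factor that makes the DCT orthonormal."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def _basis(n, k):
    """DCT-II basis value for sample index n and frequency index k."""
    return math.cos((2 * n + 1) * k * math.pi / (2 * N))


def dct_2d(block):
    """Forward 2-D DCT of an N x N block of pixel values."""
    return [[_scale(u) * _scale(v)
             * sum(block[x][y] * _basis(x, u) * _basis(y, v)
                   for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def idct_2d(coeffs):
    """Inverse 2-D DCT, returning pixel values."""
    return [[sum(_scale(u) * _scale(v) * coeffs[u][v]
                 * _basis(x, u) * _basis(y, v)
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]


def quantize(coeffs, step):
    """Snap each coefficient to the nearest multiple of step (the lossy part)."""
    return [[round(c / step) * step for c in row] for row in coeffs]


def max_error(a, b):
    """Largest absolute per-pixel difference between two blocks."""
    return max(abs(a[x][y] - b[x][y]) for x in range(N) for y in range(N))


# Hypothetical test block: a smooth diagonal gradient.
block = [[16 * x + 8 * y for y in range(N)] for x in range(N)]

for step in (1, 16, 64):
    restored = idct_2d(quantize(dct_2d(block), step))
    print(f"step={step:3d}  max pixel error={max_error(block, restored):.2f}")
```

Because each block is quantized independently, neighboring blocks can round in different directions, which is one way the visible boundaries arise.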
On the transmission side, more errors can occur. The Internet can re-order or lose packets randomly. Cloud cover can affect satellite transmissions, and squirrels can eat away at your underground cables. Transmission algorithms can detect and conceal these errors. The concealment results in frozen video sequences, which the audience won’t notice if the freeze is short enough. If the error lasts too long, the resulting transmission errors cause a mix of frozen images and green blocks.
So How Do You Judge the Video Quality?
Since artifacts can enter the system at various stages, how do you find the cause of the errors? You can view the video sequence after each stage and judge whether the quality is good. This is a very time-consuming and resource-intensive process. Moreover, what does “good” mean? A better term is probably “adequate” or “good enough”.
The Video Quality Experts Group (VQEG) created a specification for subjective video quality testing and submitted it to the governing body as the ITU-R BT.500 Recommendation. This recommendation describes methods for subjective video quality analysis in which a group of human testers analyzes the video sequence and grades the picture quality. The grades are combined, correlated, and reported as a Mean Opinion Score (MOS).
The main idea can be summarized as follows:
- Choose video test sequences (known as SRC)
- Create a controlled test environment (known as HRC)
- Choose a test method (DSCQS or SSCQE)
- Invite a sufficient number of human testers (20 or more)
- Carry out the testing
- Calculate the average score and scale
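The final step of this list can be sketched as follows. This is a minimal illustration: the 1.96 factor assumes a normal approximation for a 95% confidence interval, and the scores are hypothetical, not measured data.

```python
import math
import statistics


def mean_opinion_score(scores):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    mos = statistics.mean(scores)
    if len(scores) < 2:
        return mos, 0.0
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mos, half_width


# Hypothetical scores from 20 testers on a 5-point scale.
scores = [4, 3, 4, 5, 3, 4, 4, 3, 2, 4, 5, 4, 3, 4, 4, 3, 4, 5, 3, 4]
mos, ci = mean_opinion_score(scores)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean shows how much the result depends on the size of the panel, which is one reason 20 or more testers are invited.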
Since compression is at least 160:1, the video quality will not be perfect. Comparing the original sequence to the compressed sequence and simply reporting that they are different is not the goal. The goal is to say that the video quality after compression is good enough – not perfect.
Choosing Video Test Sequences
VQEG offers a set of standard and high-definition video sequences that are generally hard to compress. These can be used, but they may not be typical of your test environment. If a news channel is being tested for video quality, then the content is mainly faces, graphs and some scenery. On the other hand, a sports channel has quick-moving action and crowds of people waving and chanting. Thus, care must be taken to define a set of video sequences that reflects an adequate diversity in video content.
Regardless, the video sequence should be scaled to the display screen size so that the display graphics are not re-scaling the video sequence. Further, compression should take place on the scaled video sequence to create a fair comparison.
Creating a Test Environment
The hardware environment under test is quite likely the video encoder, transmission medium, set-top box/PC, and a display. Each display device should be calibrated, and if multiple displays are used, they must be the same model. The human testers stand at a preset distance from the display (usually 3 times the height of the display).
Choose a Test Method
The two basic subjective methods for picture quality assessment are:
- the DSCQS (Double Stimulus Continuous Quality Scale) method
- the SSCQE (Single Stimulus Continuous Quality Evaluation) method
The two methods basically differ in that one method (DSCQS) makes use of a reference video signal and the other (SSCQE) does not. The human testers assess the video sequence or compare the video sequence to the reference video sequence and issue quality scores from 0 (Bad) to Max (Excellent). Max is 5, 7, 9, or 100 depending on the type of test.
When the perceived difference between the reference and compressed video sequences is small, people prefer the DSCQS method. The human testers are shown a small subset of the video sequence (typically 10 seconds) and they can easily judge small differences in video quality. Long sequences are harder to evaluate as the human testers become fatigued seeing 2 sets of video sequences for each test.
The SSCQE method requires the human testers to watch 20-30 minutes of a single video sequence. This method is ideal when measuring the experience of the home user, but it is harder to judge. After all, how good is good enough? Program content tends to significantly influence the SSCQE scores. Another interesting point is that human memory is not symmetrical: humans are quick to criticize degradation in video quality but slow to reward improvements.
Inviting Human Testers and Training
You must brief the human testers about the goals of the experiment and have them tested for vision problems.
Reference video sequences are shown to set target expectations. The following two figures show an uncompressed image next to a compressed image. Figure 3 shows the images zoomed to clearly show the artifacts. In this example the mean opinion score (MOS) is 3.5 on a 5-point scale. In other words, the video quality is above average.
Figure 2: Comparing Images: Blocking Errors
Figure 3: Comparing Images: Transmission Errors
Calculate the Average Score and Scale
Although subjective testing conforms to a specification (the ITU-R BT.500 Recommendation), successive tests will produce different scores. The human testers’ subjective scale, expectations, and experience influence the score. Thus, even when you apply the same subjective test methodology, you will see some gain and shift in the scores. To compensate, you must use a secondary test rating that eliminates data far from the average (for example, from inexperienced human testers) and linearly scales the scores (higher or lower) using regression techniques to normalize the data.
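One way to picture this secondary rating is the sketch below. It is an assumption-laden illustration, not the screening procedure defined in ITU-R BT.500: testers are dropped when their mean score lies more than a chosen number of standard deviations from the panel average, and a least-squares line supplies the gain and shift that map one session onto another. All names, thresholds, and data are hypothetical.

```python
import statistics


def screen_testers(panel, max_z=2.0):
    """Keep testers whose mean score is within max_z deviations of the panel."""
    means = {name: statistics.mean(scores) for name, scores in panel.items()}
    values = list(means.values())
    center = statistics.mean(values)
    spread = statistics.stdev(values)  # needs at least two testers
    if spread == 0:
        return dict(panel)
    return {name: scores for name, scores in panel.items()
            if abs(means[name] - center) <= max_z * spread}


def fit_gain_and_shift(scores, reference_scores):
    """Least-squares fit of reference = gain * score + shift."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(reference_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(scores, reference_scores))
    gain = sxy / sxx
    return gain, mean_y - gain * mean_x


def normalize(scores, gain, shift):
    """Apply the fitted gain and shift to a list of scores."""
    return [gain * s + shift for s in scores]


# Hypothetical MOS values for the same clips in two test sessions.
session_a = [2.1, 3.0, 3.8, 4.4]
session_b = [1.6, 2.4, 3.1, 3.7]
gain, shift = fit_gain_and_shift(session_b, session_a)
print(f"gain={gain:.2f} shift={shift:.2f}")
print([round(s, 2) for s in normalize(session_b, gain, shift)])
```

Fitting on clips that appear in both sessions lets the remaining scores from the second session be compared on the first session's scale.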
How about using an Algorithm to Measure Video Quality Objectively?
Objective Video Quality Measurement seeks to determine the quality of the video sequences algorithmically. While algorithms can track the sum of differences between the reference and compressed video sequences, this is not the goal. The goal is to check if the video quality is good enough after compression. To date, video quality algorithms try to produce scores that correlate well with the subjective scores of human testers. This is a little unintuitive. How do you objectively score a subjective test?
Much like the subjective testing methods defined above, video quality assessment algorithms can be classified into 3 categories:
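Agreement between an algorithm and human testers is commonly summarized with a correlation coefficient. The sketch below computes the Pearson correlation between an objective metric and MOS values; the data is hypothetical and the function name is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: objective metric output and MOS for five clips.
metric = [28.1, 31.4, 34.0, 36.2, 40.5]
mos = [1.8, 2.6, 3.1, 3.9, 4.6]
print(f"correlation = {pearson(metric, mos):.3f}")
```

A value near 1 means the metric ranks and spaces the clips much as the testers did; it says nothing about whether the metric's absolute numbers match the MOS scale.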
- Full Reference (FR) methods compare the reference (perfect) video sequence to the processed (distorted) video sequence. Full Reference is generally used as a tool when designing video processing algorithms and when assessing video quality equipment vendors. Examples of Full Reference algorithms include Sarnoff JND, PSNR, and SSIM.
- No Reference (NR) methods estimate the distortion level in the video sequence with no knowledge about the original sequence. No Reference algorithms can be used in any settings, but the algorithms are inflexible in making accurate quality predictions.
- Reduced Reference (RR) methods use partial reference information to judge the quality of the distorted signal. If a side channel is present, then Reduced Reference can be used to monitor video quality.
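Of the Full Reference examples named above, PSNR is the simplest to show. This is a minimal sketch that assumes 8-bit samples (peak value 255) and frames supplied as flat lists of luma values of equal length; the pixel data is hypothetical.

```python
import math


def psnr(reference, processed, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-size frames."""
    mse = sum((r - p) ** 2 for r, p in zip(reference, processed)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(peak ** 2 / mse)


# Hypothetical 4-pixel frames.
reference = [52, 55, 61, 66]
processed = [54, 55, 60, 63]
print(f"PSNR = {psnr(reference, processed):.2f} dB")
```

PSNR needs both frames pixel for pixel, which is why Full Reference methods fit lab testing better than monitoring in the home.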
What is Video Clarity?
Video Clarity provides a framework for video quality testing. Video Clarity’s ClearView system combines a video server, video recorder, and file decoder with multiple viewing modes and objective metrics.
ClearView captures video content from virtually any source: file, digital, or analog, such as SDI, HD-SDI, DVI, VGA, HDMI, component, composite, or S-video. Regardless of the input, ClearView converts the video sequence to fully uncompressed 4:2:2 Y’CbCr or 4:4:4 RGB. This allows different compression algorithms to be compared and scored relative to each other.
ClearView includes many objective metrics to mimic the human visual system. The most famous metrics are Sarnoff’s JND and PSNR. The video sequences are normalized and aligned both spatially and temporally, and the metrics return an automated, repeatable, and inherently objective pass/fail score, which can be computed on any single video frame or series of frames. Moreover, the scores along with the original video sequences can be shown in multiple viewing modes with VCR-like controls. This allows the operator to see why the objective metric scored the video sequences as it did.
Figure 4: Score the Video and See Why!
Have you ever wanted to compare your H.264/VC-1 & MPEG-2 algorithms relative to each other? Now you can. You can even measure video delay and audio/video lip-sync.
Read more case studies at – http://www.videoclarity.com/videoqualityanalysiscasestudies.