Abstract: The diversity of video delivery pipelines poses a grand challenge to the evaluation of adaptive bitrate (ABR) streaming algorithms and objective quality-of-experience (QoE) models. Here we introduce the largest subject-rated database of its kind to date, namely WaterlooSQoE-IV, consisting of 1350 adaptive streaming videos created from diverse source contents, video encoders, network traces, ABR algorithms, and viewing devices. We collect human opinions for each video with a series of carefully designed subjective experiments. Subsequent data analysis and testing/comparison of ABR algorithms and QoE models using the database lead to a series of novel observations and interesting findings, in terms of the effectiveness of subjective experiment methodologies, the interactions between user experience and source content, viewing device and encoder type, the heterogeneities in the bias and preference of user experiences, the behaviors of ABR algorithms, and the performance of objective QoE models. Most importantly, our results suggest that a better objective QoE model, or a better understanding of human perceptual experience and behaviour, is the most dominant factor in improving the performance of ABR algorithms, as opposed to advanced optimization frameworks, machine learning strategies or bandwidth predictors, on which the majority of ABR research has focused in the past decade. On the other hand, our performance evaluation of 11 QoE models shows only a moderate correlation between state-of-the-art QoE models and subjective ratings, implying room for improvement in both QoE modeling and ABR algorithms. The database is publicly available at https://ece.uwaterloo.ca/~zduanmu/waterloosqoe4/.
What problem does this paper attempt to address?
This paper addresses how to fairly evaluate adaptive bitrate (ABR) video streaming algorithms and objective quality-of-experience (QoE) models. It constructs a large-scale subject-rated database, WaterlooSQoE-IV, and uses it to compare ABR algorithms across network conditions, video encoders, and viewing devices, and to analyze how these factors affect users' perceived QoE. It also examines the effectiveness of current ABR algorithms and objective QoE models, and how better QoE models could improve ABR performance.
### Main Contributions
1. **Constructing a large-scale video database**:
- A large database named WaterlooSQoE-IV was constructed, containing 1,350 adaptive streaming videos created from different source contents, encoders, network traces, ABR algorithms and viewing devices.
- For fair comparison, each ABR algorithm was optimized on an independent set of training videos.
2. **Large-scale subjective experiments**:
- A large-scale subjective experiment was carried out in a controlled laboratory environment. The Mean Opinion Score (MOS) on three viewing devices was collected and its reliability was verified.
- Additional experiments were carried out to align the MOS obtained in different viewing sessions.
3. **In-depth analysis of influencing factors**:
- An interesting relationship between viewing conditions and the perception of re-buffering and quality adaptation was discovered, which was not observed in previous studies.
- Two types of heterogeneity in users' QoE perception were identified.
4. **Extensive evaluation of ABR algorithms**:
- Unexpectedly, recent algorithms that use advanced optimization schemes were not necessarily superior to simple rate-based algorithms, whereas using perception-driven objective functions can significantly improve performance.
5. **Comprehensive evaluation of objective QoE models**:
- Eleven QoE models were calibrated on an independent training data set, their correlation with human opinions was evaluated, and the implementation of the models was made public.
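
Correlation between a QoE model's predictions and human opinions is commonly summarized with the Pearson linear correlation coefficient (PLCC) and the Spearman rank correlation coefficient (SRCC). The summary above does not say which coefficients the paper reports, so the following is only a minimal sketch under that assumption; the function names and the numbers are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical model predictions and MOS values for five videos.
predicted = [62.0, 48.5, 71.2, 55.0, 80.3]
mos = [60.0, 52.0, 68.0, 50.0, 85.0]

print(f"PLCC = {pearson(predicted, mos):.3f}")
print(f"SRCC = {spearman(predicted, mos):.3f}")
```

SRCC depends only on the ordering of the videos, so it is unaffected by any monotonic rescaling of a model's output, while PLCC also measures how linear the relationship is.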
### Research Background
With the approval of the Dynamic Adaptive Streaming over HTTP (DASH) standard in 2011, video distribution service providers began to shift from traditional connection-oriented video transmission protocols to DASH, because DASH can traverse network address translation and firewalls, reliably transmit video packets, flexibly respond to fluctuating network conditions, and reduce the workload of servers. ABR algorithms optimize the QoE of viewers by adaptively selecting the bitrate of downloaded media segments, but how to fairly evaluate the performance of these algorithms has become a key issue. Existing evaluation methods usually rely on objective indicators such as average bitrate, re-buffering time, join time and the number of bitrate switches, but these indicators often ignore the diversity of source contents, video codecs, viewing devices and personal preferences. Therefore, subjective evaluation is the most direct and reliable method for evaluating ABR technologies.
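
To make "adaptively selecting the bitrate" concrete, here is a generic throughput-based selection rule. It is a sketch of the general idea only, not the Rate-Based implementation evaluated in the paper; the bitrate ladder, the safety factor and the function names are all hypothetical.

```python
def harmonic_mean(samples):
    """Harmonic mean, which damps the effect of short throughput spikes."""
    return len(samples) / sum(1.0 / s for s in samples)

def select_bitrate(ladder_kbps, recent_throughput_kbps, safety=0.9):
    """Pick the highest ladder rung that fits a discounted throughput estimate.

    Falls back to the lowest rung when even that exceeds the estimate.
    """
    estimate = safety * harmonic_mean(recent_throughput_kbps)
    feasible = [b for b in sorted(ladder_kbps) if b <= estimate]
    return feasible[-1] if feasible else min(ladder_kbps)

# Hypothetical five-rung ladder and throughput measured over the last
# three segment downloads (all values in kbps).
ladder = [235, 750, 1750, 3000, 5800]
throughput = [2000, 4000, 4000]

next_bitrate = select_bitrate(ladder, throughput)
print(next_bitrate)
```

A rule like this looks only at bandwidth. Buffer-based and optimization-based algorithms also consider buffer occupancy or an explicit QoE objective.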
### Database Construction
- **Source Videos**: Five high-quality 4K Creative Commons-licensed videos were selected, covering various content types such as screen content, video games, movies, natural scenes and sports.
- **Encoding Configurations**: Each video was encoded into 13 different bitrate levels using H.264 and HEVC encoders.
- **Network Traces**: Multiple existing network data sets were used to generate nine network traces that simulate real-world network conditions.
- **ABR Algorithms**: A variety of ABR algorithms including Rate-Based, Buffer-Based, FastMPC, Pensieve and RDOS were evaluated.
- **Viewing Devices**: Three commonly used viewing devices, namely Full High Definition (FHD) monitors, smartphones and Ultra High Definition (UHD) TVs, were considered.
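
The factor counts listed above are consistent with the database size if the design is a full factorial over all five factors. The enumeration below checks this. It assumes the five named ABR algorithms are the complete set (the text says "including") and uses placeholder trace names.

```python
from itertools import product

# Factor levels taken from the list above; trace names are placeholders.
sources = ["screen content", "video game", "movie", "natural scene", "sports"]
encoders = ["H.264", "HEVC"]
traces = [f"trace_{i}" for i in range(1, 10)]  # nine network traces
abr_algorithms = ["Rate-Based", "Buffer-Based", "FastMPC", "Pensieve", "RDOS"]
devices = ["FHD monitor", "smartphone", "UHD TV"]

# One test condition per combination of factor levels.
conditions = list(product(sources, encoders, traces, abr_algorithms, devices))

print(len(conditions))  # 5 * 2 * 9 * 5 * 3 = 1350
```

The 13 bitrate levels do not enter the product because they form the ladder that each ABR algorithm chooses from during playback. They are not a separate test condition.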
### Subjective Tests
- **Test Methods**: An improved Single-Stimulus (SS) method was adopted with an auxiliary task: participants pressed a key whenever a re-buffering event occurred and gave an overall QoE score at the end of each video playback.
- **Experimental Process**: The subjective experiment ran for 8 weeks in the Image and Visual Computing Subjective Testing Laboratory at the University of Waterloo. A total of 97 participants took part, with separate tests on FHD monitors, smartphones and UHD TVs.
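
The per-video Mean Opinion Score is the average of the overall QoE scores that participants give to that video. The sketch below computes a MOS with a normal-approximation 95% confidence interval. The rating scale, the example scores and the function name are assumptions for illustration, and the paper's own screening and session-alignment steps are not reproduced here.

```python
import math

def mos_with_ci(scores, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    n = len(scores)
    mean = sum(scores) / n
    # Unbiased sample variance of the individual ratings.
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean, half_width

# Hypothetical ratings from five participants for one video.
ratings = [70, 80, 75, 85, 90]

mos_value, ci = mos_with_ci(ratings)
print(f"MOS = {mos_value:.1f} +/- {ci:.2f}")
```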
Together, the database and experiments provide a substantial body of subject-rated data and a reference point for future QoE research and ABR algorithm optimization.