PPO-ABR: Proximal Policy Optimization based Deep Reinforcement Learning for Adaptive BitRate streaming

Mandan Naresh, Paresh Saxena, Manik Gupta
DOI: https://doi.org/10.48550/arXiv.2305.08114
2023-05-14
Abstract: Providing a high Quality of Experience (QoE) for video streaming in 5G and beyond 5G (B5G) networks is challenging due to the dynamic nature of the underlying network conditions. Several Adaptive Bit Rate (ABR) algorithms have been developed to improve QoE, but most of them are designed based on fixed rules and are unsuitable for a wide range of network conditions. Deep Reinforcement Learning (DRL) based Asynchronous Advantage Actor-Critic (A3C) methods have recently demonstrated promise in their ability to generalise to diverse network conditions, but they still have limitations. One specific issue with A3C methods is the lag between each actor's behavior policy and the central learner's target policy. Consequently, suboptimal updates emerge when the behavior and target policies fall out of synchronization. In this paper, we address the problems faced by vanilla A3C by integrating an on-policy multi-agent DRL method into the existing video streaming framework. Specifically, we propose a novel system for ABR generation: Proximal Policy Optimization-based DRL for Adaptive Bit Rate streaming (PPO-ABR). Our proposed method improves the overall video QoE by maximizing sample efficiency using a clipped probability ratio between the new and the old policies on multiple epochs of minibatch updates. Experiments on real network traces demonstrate that PPO-ABR outperforms state-of-the-art methods for different QoE variants.
Multimedia
What problem does this paper attempt to address?
This paper addresses the challenge of providing a high Quality of Experience (QoE) for video streaming in 5G and beyond-5G (B5G) networks. It points out that most traditional Adaptive Bit Rate (ABR) algorithms are built on fixed rules and adapt poorly to widely varying network conditions. Recent Deep Reinforcement Learning (DRL) methods based on Asynchronous Advantage Actor-Critic (A3C) have shown potential under variable network conditions, but they still have limitations. In particular, each actor's behavior policy lags behind the central learner's target policy, and updates become suboptimal when the two are out of sync.

To solve these problems, the paper proposes a new system, Proximal Policy Optimization-based DRL for Adaptive Bit Rate streaming (PPO-ABR). The system improves sample efficiency by running multiple epochs of minibatch updates on the same data, using a clipped probability ratio to limit how far the new policy can move from the old one. Experimental results on real network traces show that PPO-ABR outperforms existing state-of-the-art methods and improves the overall QoE of video streaming.

The main contributions of the paper are:
- Proposing PPO-ABR, an improved DRL method for optimizing ABR selection in video streaming.
- Addressing the desynchronization between the behavior policy and the target policy in A3C by clipping the probability ratio.
- Experimentally verifying the superior performance of PPO-ABR under different QoE metrics, especially when dealing with rapidly changing network conditions.
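
The clipped probability ratio and the repeated minibatch passes mentioned above can be illustrated with a small sketch. This is not the authors' implementation: it follows the general form of the PPO clipped surrogate objective, and the function names (`clipped_surrogate`, `minibatch_indices`), the clip range `clip_eps=0.2`, and the batch settings are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np


def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO-style clipped surrogate objective (to be maximized).

    new_logp / old_logp: log-probabilities of the chosen bitrate actions
    under the new and old policies. clip_eps is an assumed value.
    """
    new_logp = np.asarray(new_logp, dtype=float)
    old_logp = np.asarray(old_logp, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Probability ratio between the new and the old policy.
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the element-wise minimum removes any incentive to push the
    # ratio outside [1 - clip_eps, 1 + clip_eps].
    return float(np.mean(np.minimum(unclipped, clipped)))


def minibatch_indices(n_samples, batch_size, n_epochs, seed=0):
    """Yield shuffled index batches, revisiting the same samples each epoch."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        perm = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            yield perm[start:start + batch_size]


if __name__ == "__main__":
    old_logp = np.log(np.full(8, 0.25))   # old policy: uniform over 4 bitrates
    new_logp = np.log(np.full(8, 0.50))   # new policy doubled each probability
    adv = np.ones(8)                      # hypothetical positive advantages
    for idx in minibatch_indices(n_samples=8, batch_size=4, n_epochs=2):
        print(idx, clipped_surrogate(new_logp[idx], old_logp[idx], adv[idx]))
```

In a full training loop, the value returned by `clipped_surrogate` would be computed from a differentiable policy network and maximized by gradient ascent on each minibatch. Because the clipping bounds how much a single batch of experience can change the policy, the same collected samples can be reused over several epochs.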