Overcoming Delayed Feedback Via Overlook Decision Making

YaLou Yu, Bo Xia, Minzhi Xie, Xueqian Wang, Zhiheng Li, Yongzhe Chang
DOI: https://doi.org/10.1109/smc53992.2023.10394201
2023-01-01
Abstract: Reinforcement learning is one of the most general paradigms for solving sequential decision-making problems, but it typically assumes that action selection and environmental feedback are instantaneous. This assumption rarely holds in real-world systems, where delays are ubiquitous and can degrade the performance of reinforcement learning algorithms. The most common solution to a fixed-delay problem is to design a forward dynamics model that predicts the current state by recursively iterating over many steps; the predicted state is then used as the agent's observation for its next decision. However, errors accumulate during this iterative process, which makes long-term prediction inaccurate and in turn degrades the agent's decisions. Motivated by the goal of reducing cumulative errors, we propose a new algorithm named Multi-step Prediction model with Delayed Observation (MPDO), which aims to accurately predict future states over longer horizons for better decision making. Our approach has two parts: a multi-step prediction model and policy training based on the proximal policy optimization (PPO) algorithm. Our model needs only a small amount of data to learn the dynamics quickly, and its prediction accuracy and iteration speed are higher than those of traditional methods. Experiments on Gym and MuJoCo show that MPDO achieves higher performance than other state-of-the-art methods across different tasks and delays, which verifies the effectiveness of our method.
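The cumulative-error problem described in the abstract can be illustrated with a toy sketch. This is not the paper's MPDO implementation: the 1-D dynamics `true_step`, the biased model `learned_step`, and the helper names are all hypothetical, chosen only to show how a small one-step error compounds when a one-step model is iterated across the delay.

```python
# Toy illustration only (assumed dynamics, not the paper's method): recover the
# current state from a delayed observation by iterating a one-step model.

def true_step(state, action):
    """Hypothetical 1-D environment dynamics: s' = 0.9 * s + a."""
    return 0.9 * state + action


def learned_step(state, action, bias=0.05):
    """Hypothetical imperfect one-step model: true dynamics plus a small bias."""
    return true_step(state, action) + bias


def recursive_predict(step_fn, delayed_state, pending_actions):
    """Roll a one-step function forward through the actions taken
    since the delayed observation was produced."""
    state = delayed_state
    for action in pending_actions:
        state = step_fn(state, action)
    return state


def prediction_error(delay, delayed_state=1.0, action=0.1):
    """Absolute gap between the model rollout and the true state after `delay` steps."""
    actions = [action] * delay
    truth = recursive_predict(true_step, delayed_state, actions)
    guess = recursive_predict(learned_step, delayed_state, actions)
    return abs(guess - truth)


# Error of the recursive rollout for delays of 1, 5, and 10 steps.
errors = [prediction_error(d) for d in (1, 5, 10)]
```

In this sketch the per-step bias is fixed, yet the rollout error grows with the delay because each step inherits the previous step's error. A multi-step predictor, as the abstract proposes, would instead map the delayed state and the pending actions to the current state without feeding its own intermediate predictions back in.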