Proximal Policy Optimization with Adaptive Exploration

Andrei Lixandru
2024-05-08
Abstract: Proximal Policy Optimization with Adaptive Exploration (axPPO) is introduced as a novel learning algorithm. This paper investigates the exploration-exploitation tradeoff within the context of reinforcement learning and aims to contribute new insights into reinforcement learning algorithm design. The proposed adaptive exploration framework dynamically adjusts the exploration magnitude during training based on the recent performance of the agent. Our proposed method outperforms standard PPO algorithms in learning efficiency, particularly when significant exploratory behavior is needed at the beginning of the learning process.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the exploration-exploitation trade-off in Reinforcement Learning (RL). It proposes a new algorithm, Proximal Policy Optimization with Adaptive Exploration (axPPO), which aims to improve learning efficiency by dynamically adjusting the intensity of exploration. The method is reported to do best in the early stages of learning, when extensive exploration is required. Standard PPO encourages exploration through an entropy coefficient, but that coefficient stays constant throughout training, so the amount of exploration cannot respond to how well the agent is doing. axPPO instead adjusts the entropy coefficient based on the agent's recent performance. Experimental results show that axPPO outperforms standard PPO across various environment settings, especially when the entropy coefficient is high. This suggests that tying the exploration strategy to agent performance can make learning more effective. More research and more extensive comparative analysis are needed to fully understand the capabilities and limitations of axPPO.
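The idea of a performance-dependent entropy coefficient can be illustrated with a minimal sketch. The summary above does not state the exact scaling rule used by axPPO, so the rule below is an assumption chosen only for illustration: the coefficient shrinks linearly as the mean of recent episode returns approaches a reference return. The names AdaptiveEntropyCoef, target_return, window, and ppo_objective are hypothetical, and the sketch assumes a positive reference return.

```python
from collections import deque


class AdaptiveEntropyCoef:
    """Hypothetical schedule, not the paper's formula: scales a base entropy
    coefficient by how far recent returns fall short of a reference return."""

    def __init__(self, base_coef=0.01, target_return=200.0, window=10):
        self.base_coef = base_coef          # coefficient used by fixed-entropy PPO
        self.target_return = target_return  # assumed positive reference return
        self.returns = deque(maxlen=window) # most recent episode returns

    def update(self, episode_return):
        """Record the return of a finished episode."""
        self.returns.append(episode_return)

    def value(self):
        """Return the current entropy coefficient, in [0, base_coef]."""
        if not self.returns:
            # No performance data yet: fall back to the fixed coefficient.
            return self.base_coef
        mean_return = sum(self.returns) / len(self.returns)
        # Assumed rule: explore more when recent returns are far below target.
        shortfall = 1.0 - mean_return / self.target_return
        return self.base_coef * min(1.0, max(0.0, shortfall))


def ppo_objective(clip_term, value_loss, entropy, entropy_coef, value_coef=0.5):
    """Standard PPO objective (to be maximized) with an entropy bonus.
    Only entropy_coef differs between the fixed and adaptive variants."""
    return clip_term - value_coef * value_loss + entropy_coef * entropy
```

In a training loop, update would be called at the end of each episode and value would be read once per policy update, with the result passed as entropy_coef. Passing a constant instead recovers the fixed-coefficient behavior described for standard PPO.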