Modeling a Continuous Locomotion Behavior of an Intelligent Agent Using Deep Reinforcement Technique

Stephen Dankwa, Wenfeng Zheng
DOI: https://doi.org/10.1109/CCET48361.2019.8989177
2019-01-01
Abstract: In this work, we applied the Twin-Delayed DDPG (TD3) algorithm to a challenging virtual Artificial Intelligence task: training a HalfCheetah robot, as an intelligent agent, to run across a field. TD3 is a recent breakthrough Deep Reinforcement Learning model that combines state-of-the-art techniques in Artificial Intelligence, including continuous Double Deep Q-Learning, Policy Gradient, and Actor-Critic. These Deep Reinforcement Learning approaches can train an intelligent agent to interact with an environment with automatic feature engineering, that is, with minimal domain knowledge. TD3 builds on the Deep Deterministic Policy Gradient (DDPG) algorithm. In our implementation of the TD3 model, the Actor and the Critics each used a two-layer feedforward neural network with 400 and 300 hidden nodes, respectively, with Rectified Linear Units (ReLU) as the activation function between layers and a final tanh unit following the output of the Actor. Overall, we developed six (6) neural networks. The Critic received both the state and the action as input to the first layer. The parameters of both the Actor and Critic networks were updated using the Adam optimizer. The TD3 algorithm was implemented using the pybullet continuous control environment, interfaced through OpenAI Gym. The idea behind TD3 is to reduce overestimation bias; the techniques that address this bias in Deep Q-Learning with discrete actions are ineffective in an Actor-Critic setting. After training for 500,000 iterations, the agent achieved a maximum average reward over the evaluation time-steps of approximately 1891.
Twin-Delayed Deep Deterministic Policy Gradient (TD3) markedly improved both the learning speed and the performance of DDPG on a challenging continuous control task.
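The architecture described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' code. It assumes the six networks are one Actor, two Critics, and a target copy of each, as in the standard TD3 formulation. The state and action sizes (26 and 6), the action bound, the weight initialization, and the hyperparameters `gamma`, `noise_std`, and `noise_clip` are placeholder assumptions that the abstract does not state. The Adam update step is omitted.

```python
import copy
import math
import random

STATE_DIM, ACTION_DIM = 26, 6  # assumed sizes, not given in the abstract
MAX_ACTION = 1.0               # assumed action bound


def init_mlp(sizes, rng):
    """Create one weight matrix and one bias vector per layer."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        bound = 1.0 / math.sqrt(n_in)  # assumed initialization scheme
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x, final_tanh=False):
    """Apply ReLU after each hidden layer and an optional tanh on the output."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    if final_tanh:
        x = [MAX_ACTION * math.tanh(v) for v in x]
    return x


def actor(layers, state):
    return forward(layers, state, final_tanh=True)


def critic(layers, state, action):
    # State and action are concatenated as input to the first layer.
    return forward(layers, state + action)[0]


def build_networks(seed=0):
    """Return the six networks: three online networks and their targets."""
    rng = random.Random(seed)
    actor_sizes = [STATE_DIM, 400, 300, ACTION_DIM]
    critic_sizes = [STATE_DIM + ACTION_DIM, 400, 300, 1]
    nets = {
        "actor": init_mlp(actor_sizes, rng),
        "critic_1": init_mlp(critic_sizes, rng),
        "critic_2": init_mlp(critic_sizes, rng),
    }
    # Each target network starts as a copy of its online counterpart.
    for name in ("actor", "critic_1", "critic_2"):
        nets[name + "_target"] = copy.deepcopy(nets[name])
    return nets


def td3_target(nets, reward, next_state, done, rng,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Critic target: smoothed target action, then the smaller of two Q-values."""
    next_action = actor(nets["actor_target"], next_state)
    smoothed = []
    for a in next_action:
        noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
        smoothed.append(max(-MAX_ACTION, min(MAX_ACTION, a + noise)))
    q1 = critic(nets["critic_1_target"], next_state, smoothed)
    q2 = critic(nets["critic_2_target"], next_state, smoothed)
    # Taking the minimum counters the overestimation bias of a single critic.
    return reward + gamma * (1.0 - done) * min(q1, q2)


nets = build_networks()
rng = random.Random(1)
state = [rng.uniform(-1.0, 1.0) for _ in range(STATE_DIM)]
print(len(nets), td3_target(nets, 1.0, state, 0.0, rng))
```

Taking the minimum of the two target Critics is the step aimed at overestimation bias. In a full training loop, both Critics would be regressed toward this target, and the Actor and the target networks would be updated less often than the Critics, which is the "delayed" part of the name.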