Multi-USV Task Planning Method Based on Improved Deep Reinforcement Learning
Jing Zhang, Jia Ren, Yani Cui, Delong Fu, Jingyu Cong
DOI: https://doi.org/10.1109/jiot.2024.3363044
IF: 10.6
2024-01-01
IEEE Internet of Things Journal
Abstract: A safe and reliable task planning method is a prerequisite for the collaborative execution of ocean observation data collection tasks by multiple unmanned surface vessels (multi-USVs). Deep reinforcement learning (DRL) combines the powerful nonlinear function-fitting capabilities of deep neural networks with the decision-making and control abilities of reinforcement learning, providing a novel approach to solving the multi-USV task planning problem. However, when applied to multi-USV task planning, DRL faces challenges such as a vast exploration space, long training times, and an unstable training process. To this end, this article proposes a multi-USV task planning method based on improved DRL. The proposed method draws on the idea of a value decomposition network, breaking down the multi-USV task planning problem into two subproblems: 1) task allocation and 2) autonomous collision avoidance. A separate state space, action space, and reward function is designed for each subproblem. Based on this, a deep neural network maps the state space of each subproblem to the action space of each USV, and the strategy generated by the deep neural network is assessed with the corresponding reward function. This integrates task allocation and path planning into a comprehensive task planning framework. The deep neural networks consist of Actor networks and Critic networks. During the training phase of the Critic networks, different methods are used to train different Critic networks to improve the convergence speed of the algorithm. An improved temporal difference error method is applied to train the Critic network that evaluates autonomous collision avoidance strategies, which improves the autonomous collision avoidance ability of the USVs.
At the same time, to improve the efficiency of task allocation, hierarchical mechanisms and regional division mechanisms are introduced to construct subsystem task planning models, which further decompose the task planning problem. A combination of successor features and an improved temporal difference error method is applied to train another Critic network that evaluates the subsystems' task allocation schemes and collaborative motion trajectories, aiming to enhance the allocation efficiency of the subsystems. Furthermore, transfer learning is employed to merge the subsystem task planning, which then serves as a constraint to direct the exploration and assessment of both the cluster task allocation schemes and the cluster collaborative motion trajectories. This enables rapid and accurate learning of task allocation within the multi-USV cluster. During the training phase of the Actor network, the experience replay method and the target network technique are introduced to enhance the proximal policy optimization algorithm. This facilitates distributed joint training of the Actor network, thereby improving the accuracy of the algorithm. Simulation results validate the effectiveness and superiority of this method.
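For readers unfamiliar with the building blocks named in the abstract, the following minimal Python sketch shows the standard, textbook forms of three of them: the one-step temporal difference error used to train a Critic, the clipped surrogate objective of proximal policy optimization used to train an Actor, and a soft target-network update. It is not the improved method proposed in the article, whose details the abstract does not give. All function names, parameter names, and default values (`gamma`, `clip_eps`, `tau`) are assumptions chosen for illustration.

```python
from typing import List, Sequence


def td_error(reward: float, value: float, next_value: float,
             gamma: float = 0.99, done: bool = False) -> float:
    """Standard one-step TD error: r + gamma * V(s') - V(s).

    This is the textbook form, not the article's improved variant.
    """
    # No bootstrapping from the next state when the episode has ended.
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


def ppo_clipped_objective(ratios: Sequence[float],
                          advantages: Sequence[float],
                          clip_eps: float = 0.2) -> float:
    """Mean PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    if len(ratios) != len(advantages) or len(ratios) == 0:
        raise ValueError("ratios and advantages must be non-empty and equal length")
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        # Restrict the probability ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)


def soft_update(target: Sequence[float], online: Sequence[float],
                tau: float = 0.005) -> List[float]:
    """Move target-network parameters a fraction tau toward the online ones."""
    if len(target) != len(online):
        raise ValueError("parameter lists must have equal length")
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]
```

In this sketch, parameters are plain lists of floats to keep the example dependency-free; a real implementation would operate on the tensors of a deep learning framework and would sample the transitions from a replay buffer.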