Decentralized Reinforcement Learning for Multi-Target Search and Detection by a Team of Drones

Roi Yehoshua, Juan Heredia-Juesas, Yushu Wu, Christopher Amato, Jose Martinez-Lorenzo
DOI: https://doi.org/10.48550/arXiv.2103.09520
2021-03-17
Abstract: Targets search and detection encompasses a variety of decision problems such as coverage, surveillance, search, observing and pursuit-evasion along with others. In this paper we develop a multi-agent deep reinforcement learning (MADRL) method to coordinate a group of aerial vehicles (drones) for the purpose of locating a set of static targets in an unknown area. To that end, we have designed a realistic drone simulator that replicates the dynamics and perturbations of a real experiment, including statistical inferences taken from experimental data for its modeling. Our reinforcement learning method, which utilized this simulator for training, was able to find near-optimal policies for the drones. In contrast to other state-of-the-art MADRL methods, our method is fully decentralized during both learning and execution, can handle high-dimensional and continuous observation spaces, and does not require tuning of additional hyperparameters.
Robotics, Machine Learning, Multiagent Systems
What problem does this paper attempt to address?
The paper addresses how to coordinate a team of drones (a multi-agent system) so that it efficiently searches for and detects multiple static targets in an unknown, large-scale environment. It proposes a method based on multi-agent deep reinforcement learning (MADRL) that lets the drone team complete the task autonomously, without centralized control.

### Background and Challenges of the Problem

1. **Multi-target Search and Detection in Complex Environments**
   - Target search and detection spans several decision problems, such as coverage, surveillance, search, observation, and pursuit-evasion.
   - In practice, military and emergency response teams often need to locate missing persons or survivors in disaster scenarios.
2. **Limitations of Existing Methods**
   - Traditional methods usually partition the surveillance area into cells (such as Voronoi cells) and design a path-planning algorithm for each cell.
   - These methods require direct communication between drones, cope poorly with drone failures during the mission, and cannot guarantee that the final solution is optimal.

### Solutions Proposed in the Paper

1. **Multi-agent Deep Reinforcement Learning (MADRL) Method**
   - The paper proposes a fully decentralized MADRL method called Decentralized Advantage Actor-Critic (DA2C).
   - The method is fully decentralized during both learning and execution, can handle high-dimensional and continuous observation spaces, and does not require tuning additional hyperparameters.
2. **Design of the Simulator**
   - A realistic drone simulator is developed for training and evaluating the reinforcement learning models.
   - The simulator replicates the dynamics and perturbations of a real experiment, including statistical inferences drawn from experimental data.

### Main Contributions

1. **Decentralization**
   - Unlike other MADRL methods, this method does not require any communication during either learning or execution, which improves the robustness and adaptability of the system.
2. **Efficiency**
   - The experimental results show that the method finds near-optimal policies within a relatively short training time and significantly outperforms both a random policy and a collision-free policy.
3. **Scalability**
   - The paper studies how the numbers of drones and targets affect task performance. Adding drones improves the target detection success rate, but the gain levels off beyond a certain team size.

### Formula Representation

The key formulas used in the paper are:

- **Value Function (Expected Discounted Reward)**
  \[ V^\pi(s)=\mathbb{E}\left[\sum_{t = 0}^{h - 1}\gamma^tR(\vec{a}_t,s_t)\mid s,\pi\right] \]
  where \(V^\pi(s)\) is the expected discounted reward obtained by starting from state \(s\) and following policy \(\pi\) over a horizon of \(h\) steps, \(\gamma\) is the discount factor, and \(\vec{a}_t\) is the joint action at step \(t\).
- **Policy Gradient Theorem**
  \[ \nabla_\theta J(\theta)=\mathbb{E}_{s,a\sim\pi}[Q^\pi(s,a)\nabla_\theta\log\pi_\theta(a\mid s)] \]
- **Policy Gradient with Baseline**
  \[ \nabla_\theta J(\theta)=\mathbb{E}_{s,a\sim\pi}[(Q^\pi(s,a)-b(s))\nabla_\theta\log\pi_\theta(a\mid s)] \]
  where \(b(s) = V^\pi(s)\) is the baseline function, so that \(Q^\pi(s,a)-b(s)\) is the advantage of action \(a\) in state \(s\).
- **Loss Function**
  \[ L=\lambda_\pi L_\pi+\lambda_v L_v-\lambda_H\mathbb{E}_{s\sim\pi}[H(\pi(\cdot\mid s))] \]
  where \(L_\pi\) is the policy loss, \(L_v\) is the value loss, \(H\) is the entropy of the policy, and \(\lambda_\pi\), \(\lambda_v\), \(\lambda_H\) weight the three terms.
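
To make the formulas above concrete, the following is a minimal numeric sketch of how the discounted return, the baseline-subtracted advantage, and the combined loss fit together for one agent over one episode. It is not the paper's implementation: the function names, the Monte Carlo return used as the estimate of \(Q^\pi\), the squared-error value loss, and the default weights (`gamma=0.99`, `lambda_pi=1.0`, `lambda_v=0.5`, `lambda_h=0.01`) are all illustrative assumptions.

```python
import math


def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for each step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def entropy(probs):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def a2c_loss(action_probs, actions, values, rewards,
             gamma=0.99, lambda_pi=1.0, lambda_v=0.5, lambda_h=0.01):
    """Evaluate L = lambda_pi * L_pi + lambda_v * L_v - lambda_h * mean entropy.

    action_probs[t] is pi(.|s_t), actions[t] the index of the action taken,
    values[t] the critic's estimate V(s_t). The weights are assumed values,
    not taken from the paper.
    """
    n = len(rewards)
    returns = discounted_returns(rewards, gamma)
    # Advantage: sampled return (standing in for Q) minus the baseline b(s) = V(s).
    advantages = [g - v for g, v in zip(returns, values)]
    # Policy term: minimizing -A * log pi(a|s) follows the baseline policy gradient.
    # With automatic differentiation, A would be treated as a constant here.
    policy_loss = -sum(
        adv * math.log(probs[a])
        for adv, probs, a in zip(advantages, action_probs, actions)
    ) / n
    # Value term: mean squared error between the critic and the sampled return.
    value_loss = sum(adv * adv for adv in advantages) / n
    mean_entropy = sum(entropy(probs) for probs in action_probs) / n
    total = lambda_pi * policy_loss + lambda_v * value_loss - lambda_h * mean_entropy
    return total, policy_loss, value_loss, mean_entropy


if __name__ == "__main__":
    # Two-step toy episode with a uniform policy over two actions.
    probs = [[0.5, 0.5], [0.5, 0.5]]
    total, l_pi, l_v, h = a2c_loss(probs, actions=[0, 1], values=[0.0, 0.0],
                                   rewards=[0.0, 1.0], gamma=1.0)
    print(total, l_pi, l_v, h)
```

In a decentralized setting such as the one the summary describes, each drone would evaluate a loss of this kind on its own experience only. The entropy term enters with a negative sign, so minimizing the loss rewards policies that keep exploring.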