Patrol Agent: an Autonomous UAV Framework for Urban Patrol Using on Board Vision Language Model and on Cloud Large Language Model

Zihao Yuan, Fangfang Xie, Tingwei Ji
DOI: https://doi.org/10.1109/icrcv62709.2024.10758606
2024-01-01
Abstract: Unmanned Aerial Vehicles (UAVs) used for urban patrols typically require human control or supervision. To increase the autonomy of UAVs in this context, we propose the Patrol Agent, which can patrol a fixed area and identify and track a target autonomously, without any human intervention. The Patrol Agent employs a Vision Language Model (VLM) for accurate visual information, an object detection model for coarse detection of the target, and a Large Language Model (LLM) deployed in the cloud for analysis and action decisions. During patrols, the agent uses a lightweight VLM to generate captions of the scenes it observes. These captions are sent to the cloud LLM for further analysis, and the LLM returns the danger level of the scene, the appropriate actions to take, and the detailed reasons behind these actions. When identifying and tracking a target, the agent activates the VLM only when the object detection model detects an object corresponding to the target. This approach conserves computing resources and increases onboard operational speed. The proposed agent can identify and track targets without requiring fine-tuning data or human intervention. It outperforms Visual Question Answering (VQA) models in patrol and uses fewer computing resources than agents that rely solely on a VLM for tracking.
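The detector-gated use of the VLM described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names Detection, gated_vlm_check, detect, vlm_confirm, and min_confidence are hypothetical, and the detector and VLM are passed in as plain callables so that no particular model or library is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Detection:
    """One detector output (hypothetical structure, for illustration only)."""
    label: str         # class name reported by the object detector
    confidence: float  # detector score in [0, 1]


def gated_vlm_check(
    frame: object,
    target_label: str,
    detect: Callable[[object], List[Detection]],
    vlm_confirm: Callable[[object], bool],
    min_confidence: float = 0.5,  # assumed threshold, not taken from the paper
) -> bool:
    """Return True only if the detector flags the target and the VLM confirms it.

    The cheap detector runs on every frame; the costlier VLM is invoked
    only when the detector reports a candidate matching the target class.
    """
    candidates = [
        d for d in detect(frame)
        if d.label == target_label and d.confidence >= min_confidence
    ]
    if not candidates:
        return False  # no candidate: the VLM is never called for this frame
    return vlm_confirm(frame)
```

In this sketch the saving comes from skipping vlm_confirm on every frame where the detector finds no matching candidate; how the real system combines the two models beyond that gating step is not specified in the abstract.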