Leveraging YOLO-World and GPT-4V LMMs for Zero-Shot Person Detection and Action Recognition in Drone Imagery

Christian Limberg, Artur Gonçalves, Bastien Rigault, Helmut Prendinger
2024-04-02
Abstract: In this article, we explore the potential of zero-shot Large Multimodal Models (LMMs) in the domain of drone perception. We focus on person detection and action recognition tasks and evaluate two prominent LMMs, namely YOLO-World and GPT-4V(ision), using a publicly available dataset captured from aerial views. Traditional deep learning approaches rely heavily on large and high-quality training datasets. However, in certain robotic settings, acquiring such datasets can be resource-intensive or impractical within a reasonable timeframe. The flexibility of prompt-based LMMs and their exceptional generalization capabilities have the potential to revolutionize robotics applications in these scenarios. Our findings suggest that YOLO-World demonstrates good detection performance. GPT-4V struggles with accurately classifying action classes but delivers promising results in filtering out unwanted region proposals and in providing a general description of the scenery. This research represents an initial step in leveraging LMMs for drone perception and establishes a foundation for future investigations in this area.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper explores whether Large Multimodal Models (LMMs) can perform zero-shot person detection and action recognition on drone imagery. The researchers evaluate two LMMs, YOLO-World and GPT-4V, on two tasks:

1. **Person Detection**: YOLO-World is used for zero-shot person detection in images captured by drones.
2. **Action Recognition**: GPT-4V is used to classify each detected person region by the activity being performed.

Traditional deep learning methods require large, high-quality training datasets, which can be resource-intensive or impractical to obtain in some robotic application scenarios. The researchers therefore turn to prompt-driven LMMs, which generalize well without task-specific training data. The results indicate that YOLO-World performs well in person detection. GPT-4V struggles to classify action classes accurately, but gives promising results in filtering out unwanted region proposals and in providing a general description of the scene. The study is an initial step toward applying LMMs in drone perception and provides a foundation for further research.
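The two-stage idea described above (an open-vocabulary detector proposes person regions, then an LMM either labels or rejects each region) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `detect` and `classify` callables are hypothetical stand-ins for YOLO-World and GPT-4V, and the `min_score` and `pad` values, the `Detection` type, and the rule that an answer outside the action list counts as a rejection are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    box: Box          # padded crop region passed to the classifier
    score: float      # detector confidence
    action: str       # action label returned by the classifier


def pad_box(box: Box, pad: int, width: int, height: int) -> Box:
    """Grow a box by `pad` pixels on each side, clamped to the image."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - pad), max(0, y1 - pad),
            min(width, x2 + pad), min(height, y2 + pad))


def detect_and_classify(
    image_size: Tuple[int, int],
    detect: Callable[[], Sequence[Tuple[Box, float]]],
    classify: Callable[[Box, Sequence[str]], Optional[str]],
    actions: Sequence[str],
    min_score: float = 0.25,   # assumed threshold, not from the paper
    pad: int = 10,             # assumed context padding, not from the paper
) -> List[Detection]:
    """Run stage 1 (region proposals) and stage 2 (label or reject)."""
    width, height = image_size
    kept: List[Detection] = []
    for box, score in detect():
        if score < min_score:
            continue  # drop low-confidence proposals before the costly stage
        crop = pad_box(box, pad, width, height)
        label = classify(crop, actions)
        if label is None or label not in actions:
            continue  # classifier rejected the proposal (assumed convention)
        kept.append(Detection(crop, score, label))
    return kept


if __name__ == "__main__":
    # Stub stages so the sketch runs without any model or network access.
    def fake_detect():
        return [((5, 5, 50, 80), 0.9), ((200, 200, 300, 300), 0.8)]

    def fake_classify(box, actions):
        return "walking" if box[0] == 0 else None

    for det in detect_and_classify((640, 480), fake_detect, fake_classify,
                                   ["walking", "sitting"]):
        print(det)
```

Keeping both stages behind plain callables means the filtering logic can be exercised with stubs, and a real detector or LMM client can be plugged in later without changing it.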