VisionGPT: LLM-Assisted Real-Time Anomaly Detection for Safe Visual Navigation

Hao Wang, Jiayou Qin, Ashish Bastola, Xiwen Chen, John Suchanek, Zihao Gong, Abolfazl Razi
2024-03-19
Abstract: This paper explores the potential of Large Language Models (LLMs) in zero-shot anomaly detection for safe visual navigation. With the assistance of the state-of-the-art real-time open-world object detection model YOLO-World and specialized prompts, the proposed framework can identify anomalies within camera-captured frames that include any possible obstacles, and then generate concise, audio-delivered descriptions that emphasize abnormalities to assist safe visual navigation in complex circumstances. Moreover, the framework leverages the advantages of LLMs and the open-vocabulary object detection model to achieve dynamic scenario switching, which allows users to transition smoothly from scene to scene and addresses a limitation of traditional visual navigation. Furthermore, this paper explores the performance contribution of different prompt components, provides a vision for future improvement in visual accessibility, and paves the way for LLMs in video anomaly detection and vision-language understanding.
Computer Vision and Pattern Recognition, Human-Computer Interaction
What problem does this paper attempt to address?
This paper addresses real-time anomaly detection in visual navigation, with a particular focus on safe navigation assistance for visually impaired users. Specifically, it combines a lightweight object detection model with large language models (LLMs) to perform zero-shot anomaly detection and to generate concise audio descriptions that alert users to potential obstacles or dangers. The main problems the paper attempts to solve are:

1. **Object Detection Challenges in Dynamic Environments**:
   - In complex urban environments, traditional object detection models (such as YOLOv8) struggle with dynamic scenes, particularly when custom category labels must be defined. As a result, long-tail cases are handled poorly.
   - The paper proposes a zero-shot approach that uses LLMs together with an open-vocabulary object detection model (such as YOLO-World) to address these challenges.
2. **The Need for Real-Time Vision-Language Understanding**:
   - For visually impaired users, walking safely on streets, sidewalks, and other public spaces requires real-time vision-language understanding. This means not only recognizing objects, but also understanding the scene and issuing timely warnings.
   - The paper meets this need by integrating LLMs with a real-time object detection model, so that users receive immediate safety alerts.
3. **Dynamic Scene Switching and Interest Setting**:
   - Traditional visual navigation systems are limited when switching between different scenes. The proposed framework can dynamically adjust its object detection categories, switch scenes according to user needs, and let users interact with the LLM module to set priority tasks (for example, finding the nearest bench).
4. **Low-Latency Real-Time Feedback**:
   - Real-time visual navigation requires low latency in order to respond quickly in complex scenes. The paper optimizes the system to operate with low latency so that feedback arrives in real time.
5. **Safety-Oriented Applications of Vision-Language Understanding**:
   - Although past research has explored LLMs for visual assistance and navigation, few studies have focused specifically on safety. The paper fills this gap by using vision-language understanding to improve the safety of navigation.

### Summary

By combining real-time object detection with large language models, this paper proposes a new framework for safe visual navigation assistance for visually impaired users in dynamic environments. The framework not only identifies potential obstacles and dangers, but also generates personalized scene descriptions and safety notifications, helping users navigate safely in complex environments.
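
To make the scene-switching and prompting ideas more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical scene-to-vocabulary table (`SCENE_VOCABULARY`), a hypothetical `Detection` record, and an illustrative prompt wording; none of these names or strings come from the paper. It shows how a class list might be rebuilt when the scene or the user's priority changes, and how detections might be turned into a short text prompt for an LLM.

```python
from dataclasses import dataclass

# Hypothetical scene-to-vocabulary table; the class names are illustrative only.
SCENE_VOCABULARY = {
    "sidewalk": ["person", "bicycle", "pole", "bench", "trash can"],
    "crosswalk": ["car", "bus", "traffic light", "bicycle", "person"],
}


@dataclass
class Detection:
    """One detector output (assumed format, not the paper's data structure)."""
    label: str
    confidence: float
    box: tuple[float, float, float, float]  # normalized (x_min, y_min, x_max, y_max)


def classes_for_scene(scene, user_interests=()):
    """Return the detection vocabulary for a scene, user-requested classes first."""
    base = SCENE_VOCABULARY.get(scene, [])
    return list(user_interests) + [c for c in base if c not in user_interests]


def horizontal_zone(box):
    """Map a box to a coarse direction based on its horizontal center."""
    center_x = (box[0] + box[2]) / 2
    if center_x < 1 / 3:
        return "left"
    if center_x > 2 / 3:
        return "right"
    return "ahead"


def build_prompt(scene, detections, min_confidence=0.3):
    """Format confident detections as a text prompt (wording is an assumption)."""
    kept = [d for d in detections if d.confidence >= min_confidence]
    lines = [
        f"- {d.label} ({horizontal_zone(d.box)}, confidence {d.confidence:.2f})"
        for d in kept
    ]
    listing = "\n".join(lines) if lines else "- nothing detected"
    return (
        f"Scene: {scene}\n"
        f"Detected objects:\n{listing}\n"
        "Report only objects that could endanger a pedestrian, in one short sentence."
    )


if __name__ == "__main__":
    # The user asks to find a bench, so "bench" moves to the front of the class list.
    print(classes_for_scene("sidewalk", user_interests=["bench"]))
    frame = [
        Detection("bicycle", 0.90, (0.40, 0.20, 0.60, 0.80)),
        Detection("pole", 0.10, (0.00, 0.00, 0.10, 0.10)),  # dropped: low confidence
    ]
    print(build_prompt("sidewalk", frame))
```

In a full system, the list returned by `classes_for_scene` would be handed to the open-vocabulary detector whenever the scene or the user's request changes, and the prompt would be sent to the LLM whose reply is then read aloud. Keeping the prompt to a compact text listing rather than raw frames is one plausible way to keep LLM calls short, in line with the low-latency goal described above.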