Abstract: This paper explores the potential of Large Language Models (LLMs) in zero-shot anomaly detection for safe visual navigation. With the assistance of the state-of-the-art real-time open-world object detection model YOLO-World and specialized prompts, the proposed framework can identify anomalies within camera-captured frames that include any possible obstacles, then generate concise, audio-delivered descriptions emphasizing abnormalities to assist in safe visual navigation in complex circumstances. Moreover, the framework leverages the advantages of LLMs and the open-vocabulary object detection model to achieve dynamic scenario switching, which allows users to transition smoothly from scene to scene and addresses a limitation of traditional visual navigation. Furthermore, this paper explores the performance contribution of different prompt components, provides a vision for future improvement in visual accessibility, and paves the way for LLMs in video anomaly detection and vision-language understanding.

What problem does this paper attempt to address?

This paper aims to solve the problem of real-time anomaly detection in visual navigation, in particular providing safe navigation assistance for visually impaired users. Specifically, it combines a lightweight, real-time object detection model with large language models (LLMs) to achieve zero-shot anomaly detection and to generate concise audio descriptions that alert users to potential obstacles or dangers. The main problems the paper attempts to solve are:

1. **Object Detection Challenges in Dynamic Environments**:
   - In complex urban environments, traditional object detection models (such as YOLOv8) struggle with dynamic scenes, especially when custom category labels must be developed. As a result, long-tail cases are handled poorly.
   - The paper proposes a zero-shot approach that uses LLMs together with an open-vocabulary object detection model (such as YOLO-World) to address these challenges.
2. **The Need for Real-Time Vision-Language Understanding**:
   - For visually impaired users, walking safely on streets, sidewalks, and other public spaces requires real-time vision-language understanding. This includes not only recognizing objects but also understanding the scene and issuing alerts in a timely manner.
   - The paper meets this need by integrating LLMs with a real-time object detection model so that users receive immediate safety prompts.
3. **Dynamic Scene Switching and Interest Setting**:
   - Traditional visual navigation systems are limited when switching between different scenes. The proposed framework can dynamically adjust the object detection categories, switch scenes according to user needs, and let users interact with the LLM module to set priority tasks (for example, finding the nearest bench).
4. **Low-Latency Real-Time Feedback**:
   - Real-time visual navigation requires very low latency in order to respond quickly in complex scenes. The paper optimizes the system to operate with very low latency so that feedback arrives in real time.
5. **Safety Applications of Vision-Language Understanding**:
   - Although past research has explored LLMs for visual assistance and navigation, few studies have focused specifically on safety. The paper addresses this gap by using vision-language understanding to improve the safety of navigation.

### Summary

By combining real-time object detection with large language models, this paper proposes a new framework for providing safe visual navigation assistance to visually impaired users in dynamic environments. The framework not only identifies potential obstacles and dangers but also generates personalized scene descriptions and safety notifications, helping users navigate safely in complex environments.
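The scene switching and prompt-driven anomaly description summarized above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the scene table `SCENE_CLASSES`, the `Detection` record, and the helpers `classes_for_scene`, `horizontal_zone`, and `build_prompt` are all hypothetical names introduced here, and the prompt wording is an assumption. The sketch only shows how a detector vocabulary might be chosen per scene (plus a user-set interest such as "bench") and how detections might be turned into text for an LLM.

```python
from dataclasses import dataclass

# Hypothetical scene-to-vocabulary table (an assumption for illustration;
# the class lists used in the paper are not given in this summary).
SCENE_CLASSES = {
    "sidewalk": ["person", "bicycle", "pole", "bench"],
    "crosswalk": ["car", "bus", "traffic light", "person"],
}


@dataclass
class Detection:
    label: str         # class name reported by the detector
    confidence: float  # detector score in [0, 1]
    cx: float          # normalized horizontal box center, 0 = left edge, 1 = right edge
    cy: float          # normalized vertical box center, 0 = top, 1 = bottom


def classes_for_scene(scene, user_interest=None):
    """Return the detection vocabulary for a scene, plus an optional user-set target."""
    classes = list(SCENE_CLASSES.get(scene, []))
    if user_interest and user_interest not in classes:
        classes.append(user_interest)
    return classes


def horizontal_zone(cx):
    """Map a normalized x-coordinate to a coarse direction relative to the camera."""
    if cx < 1 / 3:
        return "left"
    if cx > 2 / 3:
        return "right"
    return "ahead"


def build_prompt(scene, detections, min_conf=0.3):
    """Format confident detections as a text prompt asking an LLM for a short alert."""
    lines = [
        f"- {d.label} ({horizontal_zone(d.cx)}, confidence {d.confidence:.2f})"
        for d in detections
        if d.confidence >= min_conf
    ]
    listing = "\n".join(lines) if lines else "- nothing detected"
    return (
        f"Scene: {scene}\n"
        f"Detected objects:\n{listing}\n"
        "In one short sentence suitable for audio, report only objects "
        "that could endanger a pedestrian."
    )


if __name__ == "__main__":
    frame = [Detection("pole", 0.90, 0.5, 0.8), Detection("bicycle", 0.55, 0.1, 0.6)]
    print(classes_for_scene("crosswalk", user_interest="bench"))
    print(build_prompt("sidewalk", frame))
```

In a real pipeline, the list returned by `classes_for_scene` would be handed to an open-vocabulary detector before each scene change, and the string from `build_prompt` would be sent to the LLM whose reply is read aloud. Both of those steps are left out here because they depend on the specific detector and LLM service.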

VisionGPT: LLM-Assisted Real-Time Anomaly Detection for Safe Visual Navigation

L3MVN: Leveraging Large Language Models for Visual Target Navigation

Vision-Language Models Assisted Unsupervised Video Anomaly Detection

Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection

Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs

Multimodal Large Language Model for Visual Navigation

VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation

Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models

Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Safety Alignment for Vision Language Models

Zero-Shot Vision-and-Language Navigation with Collision Mitigation in Continuous Environment

Human-Free Automated Prompting for Vision-Language Anomaly Detection: Prompt Optimization with Meta-guiding Prompt Scheme

VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection

ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Vision and Language Navigation in the Real World via Online Visual Language Mapping

VLAI: Exploration and exploitation based on visual-language aligned information for robotic object goal navigation

ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time

Harnessing Large Language Models for Training-free Video Anomaly Detection

NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events