Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving

Hao Zhou, Zhanning Gao, Maosheng Ye, Zhili Chen, Qifeng Chen, Tongyi Cao, Honggang Qi
2024-11-20
Abstract: In light of the dynamic nature of autonomous driving environments and stringent safety requirements, general MLLMs combined with CLIP alone often struggle to represent driving-specific scenarios accurately, particularly in complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: Affinity hint to emphasize instance-level structure by strengthening token-wise connections, Semantic hint to incorporate high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs, and Question hint to align visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, enriching visual representations and enhancing multimodal reasoning for autonomous driving VQA tasks. Extensive experiments confirm the effectiveness of the HoP framework, showing it significantly outperforms previous state-of-the-art methods across all key metrics.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the weak performance of multimodal large language models (MLLMs) on visual question answering (VQA) tasks in autonomous driving. Specifically, existing MLLMs that rely on CLIP as the visual encoder face the following challenges in driving-specific scenarios:

1. **Insufficient understanding of instance-level structure**: Visual encoders such as CLIP do not capture the relationships between instances well in driving scenes, which limits the model's understanding of complex interactions and long-tail cases.
2. **Insufficient domain-specific semantic information**: General-purpose visual encoders struggle to represent semantics specific to autonomous driving and tend to overlook small but crucial elements, such as distant vehicles, pedestrians, and traffic signs.
3. **Insufficient question relevance**: In VQA tasks, current MLLMs process visual and textual information separately, so the model has difficulty focusing on the image regions relevant to the question, which makes its responses less context-aware.

To address these problems, the paper proposes the "Hints of Prompt (HoP)" framework, which enhances the visual representation with three types of hints:

- **Affinity hint**: Strengthens the understanding of instance-level structure by reinforcing the connections between tokens.
- **Semantic hint**: Introduces specific instances and their category information to add driving-related context.
- **Question hint**: Guides the model to focus on the image regions relevant to the question.

These hints are integrated through a simple Hint Fusion module, and the MLLM then generates the answer from the enriched visual representation. Experimental results show that the HoP framework significantly improves VQA performance on multiple benchmark datasets, especially in complex driving scenarios.
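To make the data flow concrete, the following is a minimal sketch of how three per-token hints might be computed and fused. It is not the authors' implementation: the summary above does not specify how each hint is produced or how the Hint Fusion module combines them. The sketch assumes cosine-similarity weighting for the affinity hint, a one-hot class vector per token for the semantic hint, softmax relevance to a question embedding for the question hint, and plain concatenation for fusion. All function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def affinity_hint(tokens):
    """Assumed form: each token becomes a similarity-weighted mix of all tokens."""
    hints = []
    for t in tokens:
        weights = softmax([cosine(t, other) for other in tokens])
        hints.append([
            sum(w * other[d] for w, other in zip(weights, tokens))
            for d in range(len(t))
        ])
    return hints


def question_hint(tokens, question):
    """Assumed form: scale each token by its softmax relevance to the question."""
    weights = softmax([cosine(t, question) for t in tokens])
    return [[w * x for x in t] for w, t in zip(weights, tokens)]


def fuse(tokens, *hints):
    """Assumed fusion: concatenate each token with its hint vectors."""
    fused = []
    for i, t in enumerate(tokens):
        row = list(t)
        for h in hints:
            row.extend(h[i])
        fused.append(row)
    return fused


# Toy inputs: three 2-d visual tokens and a 2-d question embedding.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [1.0, 0.0]
# Hypothetical one-hot classes per token: (vehicle, pedestrian, traffic sign).
semantic = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

fused = fuse(
    tokens,
    affinity_hint(tokens),
    semantic,
    question_hint(tokens, question),
)
```

In this toy setup each fused token keeps its original features and gains the three hint vectors, so the downstream model sees instance structure, category context, and question relevance side by side. A learned fusion (for example, attention or a projection layer) would be a natural alternative to concatenation, but which one the paper uses is not stated here.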