Abstract: In light of the dynamic nature of autonomous driving environments and stringent safety requirements, general MLLMs combined with CLIP alone often struggle to represent driving-specific scenarios accurately, particularly in complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: Affinity hint to emphasize instance-level structure by strengthening token-wise connections, Semantic hint to incorporate high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs, and Question hint to align visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, enriching visual representations and enhancing multimodal reasoning for autonomous driving VQA tasks. Extensive experiments confirm the effectiveness of the HoP framework, showing it significantly outperforms previous state-of-the-art methods across all key metrics.

What problem does this paper attempt to address?

This paper addresses the weak performance of multimodal large language models (MLLMs) on visual question answering (VQA) tasks in autonomous driving. Specifically, existing MLLMs combined with CLIP face the following challenges in driving scenarios:

1. **Insufficient understanding of instance-level structure**: Visual encoders such as CLIP do not capture the relationships between instances in driving scenes well, which limits the model's understanding of complex interactions and long-tail cases.
2. **Insufficient domain-specific semantic information**: General-purpose visual encoders struggle to represent semantics specific to autonomous driving and tend to overlook small but crucial elements, such as distant vehicles, pedestrians, and traffic signs.
3. **Insufficient question relevance**: In VQA tasks, current MLLMs process visual and textual information separately, so the model has difficulty focusing on the image regions relevant to the question, which weakens context-aware responses.

To address these problems, the paper proposes the "Hints of Prompt (HoP)" framework, which enhances visual representation with three types of hints:

- **Affinity hint**: Improves the understanding of instance-level structure by strengthening the connections between tokens.
- **Semantic hint**: Introduces specific instances and their category information to add driving-related context.
- **Question hint**: Guides the model to focus on the image regions relevant to the question.

These hints are integrated through a simple Hint Fusion module, and the MLLM then generates the answer. Experimental results show that the HoP framework significantly improves VQA performance on multiple benchmark datasets, especially in complex driving scenarios.
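To make the data flow concrete, here is a minimal sketch of how three hints might be combined with visual tokens. The summary above does not specify how each hint is computed or how the Hint Fusion module works, so everything below is an assumption for illustration: the function names (`affinity_hint`, `question_hint`, `fuse`), the cosine-similarity affinity, the dot-product question weighting, and fusion by element-wise sum are all hypothetical stand-ins, not the paper's method. The semantic hint is passed in as precomputed per-token vectors, since producing it would require a driving-specific perception model.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affinity_hint(tokens):
    """Assumed affinity hint: each token becomes a weighted average of all
    tokens, weighted by softmax over token-to-token cosine similarity."""
    out = []
    for t in tokens:
        w = softmax([cosine(t, u) for u in tokens])
        out.append([sum(wi * u[d] for wi, u in zip(w, tokens))
                    for d in range(len(t))])
    return out


def question_hint(tokens, question):
    """Assumed question hint: scale each token by its softmax relevance
    (dot product) to a single question embedding."""
    w = softmax([dot(t, question) for t in tokens])
    return [[wi * x for x in t] for wi, t in zip(w, tokens)]


def fuse(visual, hints):
    """Hypothetical fusion: element-wise sum of visual tokens and each hint."""
    fused = [list(t) for t in visual]
    for hint in hints:
        for i, t in enumerate(hint):
            for d, x in enumerate(t):
                fused[i][d] += x
    return fused


# Toy inputs: three 2-D visual tokens, placeholder semantic vectors, one question vector.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic = [[0.0, 0.0], [0.5, 0.5], [0.0, 0.0]]  # placeholder, not a real detector output
question = [1.0, 0.0]

fused = fuse(visual, [affinity_hint(visual), semantic, question_hint(visual, question)])
```

In this toy setup the fused output keeps the shape of the visual tokens (one vector per token), so it could be handed to a downstream model in place of the original tokens. A real fusion module would more likely be learned than a fixed sum.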

Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
