Abstract: Recent multimodal large language models (MLLMs) such as GPT-4o and GPT-4V have shown great potential in autonomous driving. In this paper, we propose a cross-domain few-shot in-context learning method based on an MLLM for enhancing traffic sign recognition (TSR). We first construct a traffic sign detection network based on Vision Transformer Adapter and an extraction module to extract traffic signs from the original road images. To reduce the dependence on training data and improve the performance stability of cross-country TSR, we introduce a cross-domain few-shot in-context learning method based on the MLLM. To enhance the MLLM's fine-grained recognition of traffic signs, the proposed method generates description texts from template traffic signs. These description texts contain key information about the shape, color, and composition of traffic signs, which helps the MLLM distinguish fine-grained traffic sign categories. By using the description texts, our method reduces the cross-domain differences between template and real traffic signs. Our approach requires only simple and uniform textual indications, without the need for large-scale traffic sign images and labels. We perform comprehensive evaluations on the German traffic sign recognition benchmark dataset, the Belgium traffic sign dataset, and two real-world datasets taken from Japan. The experimental results show that our method significantly enhances TSR performance.
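As a rough illustration of the description-text idea in the abstract, the sketch below assembles a text prompt from per-sign descriptions covering shape, color, and composition. It is a minimal sketch under stated assumptions, not the authors' implementation: the `SignDescription` class, the `build_prompt` function, the prompt wording, and the example signs are all hypothetical, and the descriptions are written by hand here, whereas the paper generates them from template traffic signs. No MLLM is called; the code only builds the text that would accompany a cropped sign image.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SignDescription:
    """Hypothetical container for one template sign's description text."""
    label: str
    shape: str
    color: str
    composition: str

    def to_text(self) -> str:
        # One uniform sentence per sign: shape, color, composition.
        return (
            f"{self.label}: shape is {self.shape}; "
            f"color is {self.color}; composition is {self.composition}."
        )


def build_prompt(descriptions: Sequence[SignDescription]) -> str:
    """Assemble a few-shot style text prompt from description texts.

    The wording is an assumption made for illustration only.
    """
    lines = [
        "You are given a cropped traffic sign image.",
        "Reference descriptions of candidate signs:",
    ]
    lines += [f"- {d.to_text()}" for d in descriptions]
    labels = ", ".join(d.label for d in descriptions)
    lines.append(f"Answer with exactly one label from: {labels}")
    return "\n".join(lines)


# Hand-written example descriptions (illustrative, not taken from the paper).
examples = [
    SignDescription("stop", "octagon", "red with white text", "the word STOP"),
    SignDescription("yield", "inverted triangle", "white with red border", "no symbol"),
]
prompt = build_prompt(examples)
print(prompt)
```

Keeping every description in the same sentence pattern mirrors the abstract's point that only simple and uniform textual indications are needed.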

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses cross-domain few-shot learning for traffic sign recognition (TSR). Specifically, the researchers propose a cross-domain few-shot in-context learning method based on a multimodal large language model (MLLM) to improve TSR performance. The main problems are:

1. **Data Dependency**: Traditional TSR methods require a large amount of country-specific training data, which may be limited in practice.
2. **Cross-Domain Differences**: Traffic signs differ visually between countries, so existing TSR methods perform unstably when applied across countries.
3. **Fine-Grained Recognition**: There are many traffic sign categories with similar features, so recognition must be fine-grained.

### Solutions

To address these problems, the paper proposes the following:

1. **Traffic Sign Detection Network**: A traffic sign detection network based on Vision Transformer Adapter (ViT-Adapter) is constructed to extract traffic signs from raw road images.
2. **Cross-Domain Few-Shot In-Context Learning**: A cross-domain few-shot in-context learning method based on the MLLM is introduced, which uses generated description texts to reduce the cross-domain differences between template traffic signs and real traffic signs.
3. **Description Text Generation**: Description texts containing key information such as shape, color, and composition are generated from template traffic signs to enhance the MLLM's fine-grained recognition of traffic signs.

### Experimental Results

The researchers conducted experiments on the German Traffic Sign Recognition Benchmark (GTSRB), the Belgium Traffic Sign Dataset (BTSD), and two real-world road datasets from Japan (Sapporo City Road Dataset and Yokohama City Road Dataset). The results show that the proposed method significantly outperforms traditional methods and CNN-based methods in terms of Top-1, Top-5, and Top-10 accuracy. Notably, the GPT-4o-based variant achieved excellent Top-1 accuracy across all datasets, supporting the effectiveness and application potential of the method.

### Conclusion

By combining a multimodal large language model with few-shot learning, the paper proposes an effective cross-domain traffic sign recognition method. The method reduces the dependency on large-scale training data and improves performance stability across countries and regions.
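For readers unfamiliar with the Top-1, Top-5, and Top-10 accuracy metrics mentioned in the experimental results, the following minimal sketch shows how Top-k accuracy is commonly computed from ranked candidate labels. The function name `top_k_accuracy` and the example predictions are hypothetical and chosen for illustration; they are not results from the paper, and the paper's own evaluation code may differ.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    labels: Sequence[str],
    k: int,
) -> float:
    """Fraction of samples whose true label is among the first k ranked candidates."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = sum(
        1
        for ranked, label in zip(ranked_predictions, labels)
        if label in ranked[:k]
    )
    return hits / len(labels)


# Hypothetical ranked outputs for three cropped signs (illustrative only).
ranked = [
    ["stop", "yield", "no_entry"],
    ["speed_limit_30", "speed_limit_50", "speed_limit_80"],
    ["no_parking", "no_entry", "no_stopping"],
]
truth = ["stop", "speed_limit_50", "no_stopping"]

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
top3 = top_k_accuracy(ranked, truth, k=3)
print(top1, top2, top3)
```

Top-1 counts only the highest-ranked candidate, so it is the strictest of the three metrics; larger k credits a prediction whenever the true label appears anywhere in the first k candidates.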

Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition

Think Twice Before Recognizing: Large Multimodal Models for General Fine-grained Traffic Sign Recognition

Traffic Sign Recognition from Digital Images by Using Deep Learning

An Attention Based YOLOv5 Network for Small Traffic Sign Recognition

The Improved Framework for Traffic Sign Recognition Using Guided Image Filtering

Traffic sign recognition based on deep learning

TSCLIP: Robust CLIP Fine-Tuning for Worldwide Cross-Regional Traffic Sign Recognition

Road Traffic Sign Detection Method Based on RTS R-CNN Instance Segmentation Network

Exploring Explainable Artificial Intelligence Techniques for Interpretable Neural Networks in Traffic Sign Recognition Systems

MambaTSR: You only need 90k parameters for traffic sign recognition

Sustainable and Transferable Traffic Sign Recognition for Intelligent Transportation Systems

Traffic signs detection and recognition systems by light-weight multi-stage network

Traffic Signs Detection and Recognition System using Deep Learning

A feature‐enhanced hybrid attention network for traffic sign recognition in real scenes

Traffic Sign Detection by ROI Extraction and Histogram Features-Based Recognition

Revolutionizing Traffic Sign Recognition: Unveiling the Potential of Vision Transformers

Enhancing traffic sign recognition (TSR) by classifying deep learning models to promote road safety

Traffic Sign Recognition With Lightweight Two-Stage Model in Complex Scenes

Study on Traffic Sign Recognition by Optimized Lenet-5 Algorithm

Traffic Sign Recognition with Deep Learning: Vegetation Occlusion Detection in Brazilian Environments

Real-Time Traffic Sign Recognition Using Deep Learning