Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition

Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
2024-07-08
Abstract: Recent multimodal large language models (MLLM) such as GPT-4o and GPT-4v have shown great potential in autonomous driving. In this paper, we propose a cross-domain few-shot in-context learning method based on the MLLM for enhancing traffic sign recognition (TSR). We first construct a traffic sign detection network based on Vision Transformer Adapter and an extraction module to extract traffic signs from the original road images. To reduce the dependence on training data and improve the performance stability of cross-country TSR, we introduce a cross-domain few-shot in-context learning method based on the MLLM. To enhance MLLM's fine-grained recognition ability of traffic signs, the proposed method generates corresponding description texts using template traffic signs. These description texts contain key information about the shape, color, and composition of traffic signs, which can stimulate the ability of MLLM to perceive fine-grained traffic sign categories. By using the description texts, our method reduces the cross-domain differences between template and real traffic signs. Our approach requires only simple and uniform textual indications, without the need for large-scale traffic sign images and labels. We perform comprehensive evaluations on the German traffic sign recognition benchmark dataset, the Belgium traffic sign dataset, and two real-world datasets taken from Japan. The experimental results show that our method significantly enhances the TSR performance.
Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses cross-domain few-shot learning in traffic sign recognition (TSR). Specifically, the researchers propose a cross-domain few-shot in-context learning method based on a multimodal large language model (MLLM) to improve TSR performance. The main problems are:

1. **Data Dependency**: Traditional TSR methods require a large amount of country-specific training data, which is often limited in practice.
2. **Cross-Domain Differences**: Traffic signs differ visually from country to country, so existing TSR methods perform unstably when applied across countries.
3. **Fine-Grained Recognition**: Traffic signs fall into many categories with similar features, so distinguishing them requires fine-grained recognition.

### Solutions

To address these problems, the paper proposes the following:

1. **Traffic Sign Detection Network**: A detection network based on Vision Transformer Adapter (ViT-Adapter), together with an extraction module, extracts traffic signs from the original road images.
2. **Cross-Domain Few-Shot In-Context Learning**: A cross-domain few-shot in-context learning method based on the MLLM uses description texts to reduce the cross-domain differences between template traffic signs and real traffic signs.
3. **Description Text Generation**: Description texts covering the shape, color, and composition of each sign are generated from template traffic signs to strengthen the MLLM's fine-grained recognition of traffic signs.

### Experimental Results

The researchers ran experiments on the German Traffic Sign Recognition Benchmark (GTSRB), the Belgium Traffic Sign Dataset (BTSD), and two real-world road datasets from Japan (the Sapporo City Road Dataset and the Yokohama City Road Dataset). The results show that the proposed method significantly outperforms traditional methods and CNN-based methods in Top-1, Top-5, and Top-10 accuracy. Notably, with GPT-4o as the MLLM, the method achieved strong Top-1 accuracy across all datasets, which supports its effectiveness and its potential for practical application.

### Conclusion

By combining a multimodal large language model with few-shot in-context learning, the paper proposes an effective cross-domain traffic sign recognition method. The method reduces the dependence on large-scale training data and improves performance stability across countries and regions.
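
The description-text idea summarized above can be illustrated with a minimal sketch. It is not the authors' prompt or code: the `TemplateSign` record, the `describe` and `build_prompt` helpers, and the prompt wording are all hypothetical, chosen only to show how uniform texts about shape, color, and composition might be assembled into a few-shot text prompt. The call to a multimodal model, and the attachment of the cropped sign image, are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateSign:
    """Hypothetical record for one template traffic sign (not from the paper)."""
    label: str
    shape: str
    color: str
    composition: str


def describe(sign: TemplateSign) -> str:
    """Render one uniform description text covering shape, color, and composition."""
    return (
        f"{sign.label}: shape is {sign.shape}; color is {sign.color}; "
        f"composition is {sign.composition}."
    )


def build_prompt(templates: list[TemplateSign], top_k: int = 5) -> str:
    """Assemble a text prompt listing every candidate category description.

    The wording is an assumption for illustration. The cropped sign image
    would be passed to the multimodal model separately.
    """
    lines = [
        "You are given a cropped traffic sign image.",
        "Candidate categories and their descriptions:",
    ]
    # Number the candidates so the model can refer to them unambiguously.
    lines += [f"{i}. {describe(t)}" for i, t in enumerate(templates, start=1)]
    lines.append(
        f"Return the {top_k} most likely category names, most likely first."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative templates only; real template sets depend on the country.
    templates = [
        TemplateSign("Stop", "octagon", "red with white text",
                     "the word STOP in the center"),
        TemplateSign("Yield", "inverted triangle", "white with red border",
                     "no inner symbol"),
    ]
    print(build_prompt(templates, top_k=2))
```

Asking for a ranked list of `top_k` names mirrors the Top-1, Top-5, and Top-10 accuracy metrics reported in the experiments. Because the descriptions are plain text, switching to another country's signs would, under this sketch, only require a different list of templates.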