Language-Enhanced Latent Representations for Out-of-Distribution Detection in Autonomous Driving

Zhenjiang Mao, Dong-You Jhong, Ao Wang, Ivan Ruchkin
2024-05-03
Abstract: Out-of-distribution (OOD) detection is essential in autonomous driving to determine when learning-based components encounter unexpected inputs. Traditional detectors typically use encoder models with fixed settings, thus lacking effective human interaction capabilities. With the rise of large foundation models, multimodal inputs offer the possibility of taking human language as a latent representation, thus enabling language-defined OOD detection. In this paper, we use the cosine similarity of image and text representations encoded by the multimodal model CLIP as a new representation to improve the transparency and controllability of latent encodings used for visual anomaly detection. We compare our approach with existing pre-trained encoders that can only produce latent representations that are meaningless from the user's standpoint. Our experiments on realistic driving data show that the language-based latent representation performs better than the traditional representation of the vision encoder and helps improve the detection performance when combined with standard representations.
Computer Vision and Pattern Recognition, Machine Learning, Robotics
What problem does this paper attempt to address?
This paper addresses a key problem in autonomous driving: how to detect out-of-distribution (OOD) data effectively. Traditional OOD detection methods usually rely on encoder models with fixed settings. These models cannot interact meaningfully with humans and cannot be adjusted to the specific needs of users. With the rise of large-scale foundation models, multimodal input makes it possible to use human language as a latent representation, thereby enabling language-defined OOD detection.

The paper proposes computing the cosine similarity between image and text representations and using it as the latent encoding for visual anomaly detection, which makes that encoding more transparent and controllable. According to the authors, this improves detection performance while giving users more insight into, and control over, what the system treats as anomalous.

The main contributions of the paper are:
1. A new language-guided OOD detection technique that gives end users more transparency and control.
2. Extensive experiments on photo-realistic simulation data that evaluate the performance of different language encodings in OOD detection.

With this method, users can specify the phenomena they care about. For example, a driver can state that the vehicle is expected to see a clear, bright and open road, and any deviation from this description should be regarded as an OOD input. This makes anomaly detection more flexible and transparent, which matters especially from the end user's perspective.
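
The idea of using image-text cosine similarities as a latent representation can be illustrated with a minimal sketch. It is not the authors' implementation: the small hand-written vectors stand in for embeddings that would normally come from CLIP's image and text encoders, the prompts other than "a clear, bright and open road" are hypothetical, and the z-score detector at the end is an assumed placeholder for whatever detector is actually applied to the representation.

```python
import numpy as np

def cosine_similarity_features(image_emb, text_embs):
    """Return one cosine similarity per text prompt for a single image embedding."""
    image_emb = np.asarray(image_emb, dtype=float)
    text_embs = np.asarray(text_embs, dtype=float)
    image_unit = image_emb / np.linalg.norm(image_emb)
    text_units = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return text_units @ image_unit

def fit_reference(features):
    """Per-prompt mean and standard deviation over in-distribution features."""
    features = np.asarray(features, dtype=float)
    return features.mean(axis=0), features.std(axis=0) + 1e-8

def ood_score(feature, mean, std):
    """Assumed detector: largest absolute z-score across the prompt dimensions."""
    return float(np.max(np.abs((feature - mean) / std)))

# Stand-in text embeddings, one row per prompt (hypothetical values).
prompts = ["a clear, bright and open road", "a foggy road", "a road at night"]
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

# Stand-in image embeddings for in-distribution (clear road) frames.
in_dist_images = np.array([
    [1.0, 0.10, 0.05, 0.20],
    [0.9, 0.12, 0.04, 0.10],
    [1.1, 0.08, 0.06, 0.30],
    [1.0, 0.11, 0.05, 0.15],
])
in_dist_features = np.array(
    [cosine_similarity_features(img, text_embs) for img in in_dist_images]
)
mean, std = fit_reference(in_dist_features)

# A held-out clear-road frame and a foggy frame (both hypothetical).
typical_features = cosine_similarity_features([1.0, 0.10, 0.05, 0.18], text_embs)
foggy_features = cosine_similarity_features([0.2, 1.0, 0.10, 0.20], text_embs)

typical_score = ood_score(typical_features, mean, std)
foggy_score = ood_score(foggy_features, mean, std)

for name, value in zip(prompts, foggy_features):
    print(f"{name}: {value:.3f}")
print("typical score:", round(typical_score, 2))
print("foggy score:", round(foggy_score, 2))
```

Each dimension of the resulting feature vector corresponds to one user-written prompt, so a high score can be traced back to the prompt whose similarity shifted. That traceability is the sense in which such a representation is more transparent than an opaque encoder output.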