Is CLIP the main roadblock for fine-grained open-world perception?

Lorenzo Bianchi, Fabio Carrara, Nicola Messina, Fabrizio Falchi
2024-04-04
Abstract: Modern applications increasingly demand flexible computer vision models that adapt to novel concepts not encountered during training. This necessity is pivotal in emerging domains like extended reality, robotics, and autonomous driving, which require the ability to respond to open-world stimuli. A key ingredient is the ability to identify objects based on free-form textual queries defined at inference time - a task known as open-vocabulary object detection. Multimodal backbones like CLIP are the main enabling technology for current open-world perception solutions. Despite performing well on generic queries, recent studies highlighted limitations on the fine-grained recognition capabilities in open-vocabulary settings - i.e., for distinguishing subtle object features like color, shape, and material. In this paper, we perform a detailed examination of these open-vocabulary object recognition limitations to find the root cause. We evaluate the performance of CLIP, the most commonly used vision-language backbone, against a fine-grained object-matching benchmark, revealing interesting analogies between the limitations of open-vocabulary object detectors and their backbones. Experiments suggest that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space. Therefore, we try to understand whether fine-grained knowledge is present in CLIP embeddings but not exploited at inference time due, for example, to the unsuitability of the cosine similarity matching function, which may discard important object characteristics. Our preliminary experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts, paving the way towards the development of backbones inherently able to process fine-grained details. The code for reproducing these experiments is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the limited fine-grained recognition capabilities of current Open-Vocabulary Object Detection (OVD) models. Specifically, it focuses on the poor performance of the multimodal model CLIP in handling fine-grained attributes such as color, shape, and material.

### Background and Motivation

Modern applications (such as extended reality, robotics, and autonomous driving) require flexible computer vision models that can adapt to new concepts not encountered during training. Open-Vocabulary Object Detection (OVD) is a key task for achieving this goal: it requires models to identify objects based on free-form text queries supplied at inference time. However, although multimodal models like CLIP perform well on generic queries, recent studies have shown that they are limited in fine-grained recognition.

### Research Questions

The paper mainly explores two key questions:

1. **Is the embedding space of CLIP the main reason for the lack of fine-grained understanding in open-vocabulary object detection?**
2. **If the embedding space of CLIP is indeed problematic, is it because the embedding space itself lacks fine-grained information, or because existing matching methods (such as cosine similarity) cannot effectively extract this information?**

### Methods

To answer these questions, the authors adopted the following methods:

1. **CLIP Fine-Grained Evaluation**: The Fine-Grained Open-Vocabulary Object Detection (FG-OVD) benchmark dataset is used to evaluate CLIP's performance in fine-grained object recognition. Performance is assessed by computing the cosine similarity between the embeddings of cropped object images and the embeddings of the candidate texts.
2. **Latent Space Characteristics and Matching Methods**: Under the hypothesis that the embedding space of CLIP does contain fine-grained information, the authors attempt to learn a custom similarity function \( S(v, t) \) to test whether more complex matching methods can extract this information. The variants considered include linear projection layers, multilayer perceptrons (MLP), and multi-head attention (MHA).

### Experimental Results

1. **Comparison of CLIP and OWLv2**: CLIP's performance on fine-grained tasks follows a pattern similar to that of CLIP-based open-vocabulary object detectors (such as OWLv2), although CLIP's overall performance is lower. This indicates that the challenge of fine-grained recognition lies more in image-text alignment than in object localization.
2. **Effectiveness of Linear Projection**: After fine-tuning on the fine-grained dataset, the authors found that a simple linear projection layer significantly improves fine-grained matching without significantly affecting performance on coarse-grained tasks. This suggests that the embedding space of CLIP does contain fine-grained information, but that existing matching methods do not exploit it effectively.
3. **Effect of Nonlinear Methods**: Although nonlinear methods (such as MLP and MHA) perform slightly better on fine-grained tasks, they degrade performance on coarse-grained tasks. This indicates that linear methods strike a better balance between coarse-grained and fine-grained tasks.

### Conclusion

Through detailed experimental analysis, the paper reveals the limitations of CLIP in fine-grained object recognition and shows that fine-grained matching can be improved through simple linear projections. These findings provide new insights for developing multimodal backbones that handle fine-grained details.
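
As a rough illustration of the two matching strategies described under Methods, the following Python sketch compares plain cosine similarity with a similarity computed after a linear re-projection. It is not the authors' code. The two-dimensional toy embeddings, the hand-picked matrix `W` (the paper learns its projection from data), and the helper names `project`, `projected_similarity`, and `best_caption` are all assumptions made for this example, as is the choice to apply cosine similarity after the projection.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def project(weights: Matrix, vec: Vector) -> List[float]:
    """Apply a linear map: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def projected_similarity(v: Vector, t: Vector, w_v: Matrix, w_t: Matrix) -> float:
    """Hypothetical S(v, t): cosine similarity after separate linear projections."""
    return cosine_similarity(project(w_v, v), project(w_t, t))


def best_caption(v: Vector, captions: Sequence[Vector],
                 similarity: Callable[[Vector, Vector], float]) -> int:
    """Index of the caption embedding that scores highest against `v`."""
    scores = [similarity(v, t) for t in captions]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (assumed): dimension 0 encodes the object category,
# dimension 1 a weakly expressed attribute such as color.
image = [1.0, 0.1]
caption_pos = [1.0, 0.1]    # correct attribute
caption_neg = [1.0, -0.1]   # hard negative: same object, wrong attribute

# Hand-picked projection that amplifies the attribute dimension.
# In the paper the projection is learned; this matrix is only illustrative.
W = [[1.0, 0.0],
     [0.0, 10.0]]


def projected(v: Vector, t: Vector) -> float:
    """Projected similarity using the same toy matrix for both modalities."""
    return projected_similarity(v, t, W, W)


baseline_margin = (cosine_similarity(image, caption_pos)
                   - cosine_similarity(image, caption_neg))
projected_margin = projected(image, caption_pos) - projected(image, caption_neg)

print(f"baseline margin:  {baseline_margin:.4f}")
print(f"projected margin: {projected_margin:.4f}")
```

In this toy setup both similarity functions rank the correct caption first, but the margin over the hard negative is much larger after projection. This mirrors the summary's point that the fine-grained information can be present in the embedding yet only weakly expressed under plain cosine similarity.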