OCC-MLLM: Empowering Multimodal Large Language Model For the Understanding of Occluded Objects

Wenmo Qiu, Xinhan Di
2024-10-02
Abstract: Existing large-scale visual-language multimodal models struggle to understand occluded objects. Current state-of-the-art multimodal models, which rely on universal visual encoders, fail to describe occluded objects satisfactorily. A further challenge is the scarcity of datasets containing image-text pairs with a large number of occluded objects. We therefore introduce a novel multimodal model that applies a newly designed visual encoder to understand occluded objects in RGB images. We also introduce a large-scale visual-language pair dataset for training large-scale visual-language multimodal models to understand occluded objects. We begin our experiments with a comparison against state-of-the-art models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to address the deficiencies of existing large-scale visual-language multimodal models in understanding and describing occluded objects. Specifically, current state-of-the-art multimodal models cannot provide satisfactory results when describing occluded objects through general visual encoders, and there is a lack of image-text pair datasets that contain a large number of occluded objects. Therefore, the authors propose a new multimodal model, OCC-MLLM, as well as a large-scale visual-language pair dataset, to improve the model's ability to understand occluded objects.

### Main problems

1. **Deficiencies in understanding occluded objects**: Existing multimodal models perform poorly when dealing with occluded objects, especially when describing these objects.
2. **Limitations of datasets**: The lack of image-text pair datasets that contain a large number of occluded objects limits the training and performance improvement of the model.

### Solutions

1. **Proposing the OCC-MLLM model**: This model better understands occluded objects by designing a new visual encoder module. Specifically, it includes:
   - **Dual-visual-encoder module**: Combining the CLIP model and the 3D model to extract ordinary visual features and visual features of occluded objects, respectively.
   - **Visual embedding of occluded objects**: Using the 3D reconstruction method to generate 2D representations of occluded objects and extracting visual embeddings through the CLIP model.
2. **Constructing a large-scale dataset**: A dataset containing 600,000 image-text pairs was created for training and evaluating the model. This dataset pays special attention to the description of occluded objects.

### Experiments and results

- **Comparative experiments**: The GPT4v and MiniGPT4-V2 models were tested on the proposed OCC-HO dataset, and the results showed that these models have low accuracy in describing occluded objects.
- **SDF encoder experiments**: By pre-training and fine-tuning the SDF encoder, the accuracy of the model in describing occluded objects was significantly improved, especially in terms of identifying object categories.

### Future work

- **Further fine-tuning the SDF encoder**: Continue to optimize the performance of the SDF encoder on other tasks (such as shape, length, and thickness judgment).
- **Integrating classic large models**: Combine the SDF encoder with classic large language models to provide more comprehensive descriptions of occluded objects.
- **Application of the dual-visual-encoder module**: Combine the SDF encoder and the CLIP encoder and apply them to classic multimodal large language models to further improve model performance.

Through these methods, the paper hopes to achieve better understanding and description of occluded objects in multimodal models.
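
To make the dual-visual-encoder idea from the Solutions section more concrete, here is a minimal sketch of how two visual embeddings, one for ordinary features and one for occluded-object features, might be merged before being passed to a language model. This summary does not say how OCC-MLLM fuses the two streams, so the weighted sum, the function name `fuse_embeddings`, the `occlusion_weight` parameter, and the toy vectors are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Sequence


def fuse_embeddings(
    general: Sequence[float],
    occluded: Sequence[float],
    occlusion_weight: float = 0.5,
) -> List[float]:
    """Blend two same-length embeddings with a convex combination.

    `general` stands in for the output of a general-purpose encoder
    (e.g. CLIP) and `occluded` for the output of an occlusion-aware
    encoder. Both the fusion rule and the weight are hypothetical.
    """
    if len(general) != len(occluded):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= occlusion_weight <= 1.0:
        raise ValueError("occlusion_weight must lie in [0, 1]")
    return [
        (1.0 - occlusion_weight) * g + occlusion_weight * o
        for g, o in zip(general, occluded)
    ]


# Toy stand-ins for the two encoder outputs (made-up numbers).
general_features = [1.0, 0.0, 2.0]
occluded_features = [0.0, 1.0, 2.0]
fused = fuse_embeddings(general_features, occluded_features, occlusion_weight=0.25)
```

A convex combination keeps the fused vector in the same space and dimension as its inputs, which is why it is used here. Concatenation followed by a learned projection would be another plausible choice.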