Abstract: Existing large-scale visual-language multimodal models have a gap in their understanding of occluded objects. Current state-of-the-art multimodal models that rely on universal visual encoders fail to describe occluded objects satisfactorily. Another challenge is the scarcity of datasets containing image-text pairs with a large number of occluded objects. We therefore introduce a novel multimodal model that applies a newly designed visual encoder to understand occluded objects in RGB images. We also introduce a large-scale visual-language pair dataset for training large-scale visual-language multimodal models to understand occluded objects. We begin our experiments with a comparison against state-of-the-art models.

What problem does this paper attempt to address?

This paper aims to address the deficiencies of existing large-scale visual-language multimodal models in understanding and describing occluded objects. Specifically, current state-of-the-art multimodal models cannot provide satisfactory results when describing occluded objects through general visual encoders, and there is a lack of image-text pair datasets that contain a large number of occluded objects. Therefore, the authors propose a new multimodal model, OCC-MLLM, as well as a large-scale visual-language pair dataset, to improve the model's ability to understand occluded objects.

### Main problems

1. **Deficiencies in understanding occluded objects**: Existing multimodal models perform poorly when dealing with occluded objects, especially when describing these objects.
2. **Limitations of datasets**: The lack of image-text pair datasets that contain a large number of occluded objects limits the training and performance improvement of the model.

### Solutions

1. **Proposing the OCC-MLLM model**: This model better understands occluded objects by designing a new visual encoder module. Specifically, it includes:
   - **Dual-visual-encoder module**: Combining the CLIP model and the 3D model to extract ordinary visual features and visual features of occluded objects, respectively.
   - **Visual embedding of occluded objects**: Using the 3D reconstruction method to generate 2D representations of occluded objects and extracting visual embeddings through the CLIP model.
2. **Constructing a large-scale dataset**: A dataset containing 600,000 image-text pairs was created for training and evaluating the model. This dataset pays special attention to the description of occluded objects.

### Experiments and results

- **Comparative experiments**: The GPT4v and MiniGPT4-V2 models were tested on the proposed OCC-HO dataset, and the results showed that these models have low accuracy in describing occluded objects.
- **SDF encoder experiments**: By pre-training and fine-tuning the SDF encoder, the accuracy of the model in describing occluded objects was significantly improved, especially in terms of identifying object categories.

### Future work

- **Further fine-tuning the SDF encoder**: Continue to optimize the performance of the SDF encoder on other tasks (such as shape, length, and thickness judgment).
- **Integrating classic large models**: Combine the SDF encoder with classic large language models to provide more comprehensive descriptions of occluded objects.
- **Application of the dual-visual-encoder module**: Combine the SDF encoder and the CLIP encoder and apply them to classic multimodal large language models to further improve model performance.

Through these methods, the paper hopes to achieve better understanding and description of occluded objects in multimodal models.
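The dual-visual-encoder idea described in the summary can be illustrated with a small sketch. This is not the authors' implementation: both encoders are hypothetical stubs (`clip_encoder_stub`, `occlusion_encoder_stub`), and the fusion step (concatenation followed by a linear projection) is an assumption, since the summary does not say how the two feature streams are combined. It only shows the data flow of a general branch and an occlusion-aware branch feeding one joint visual embedding.

```python
import random
from typing import List

Vector = List[float]
Image = List[List[float]]


def clip_encoder_stub(image: Image, dim: int = 4) -> Vector:
    """Hypothetical stand-in for a general visual encoder such as CLIP."""
    flat = [pixel for row in image for pixel in row]
    mean = sum(flat) / len(flat)
    return [mean * (i + 1) for i in range(dim)]


def occlusion_encoder_stub(image: Image, dim: int = 4) -> Vector:
    """Hypothetical stand-in for the occlusion-aware (3D/SDF-based) encoder."""
    flat = [pixel for row in image for pixel in row]
    spread = max(flat) - min(flat)
    return [spread / (i + 1) for i in range(dim)]


def make_projection(out_dim: int, in_dim: int, seed: int = 0) -> List[Vector]:
    """Random weights standing in for a learned projection layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def fuse(general: Vector, occluded: Vector, weights: List[Vector]) -> Vector:
    """Assumed fusion: concatenate both embeddings, then project linearly."""
    joint = general + occluded
    return [sum(w * x for w, x in zip(row, joint)) for row in weights]


# Toy 2x2 grayscale "image"; a real pipeline would pass RGB tensors.
image = [[0.0, 0.5], [1.0, 0.5]]
general = clip_encoder_stub(image)
occluded = occlusion_encoder_stub(image)
projection = make_projection(out_dim=3, in_dim=len(general) + len(occluded))
visual_embedding = fuse(general, occluded, projection)
print(len(visual_embedding))
```

In a real system the fused embedding would be passed to the language model as visual input; other fusion choices (for example, keeping the two embeddings as separate token sequences) are equally plausible given what the summary states.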

OCC-MLLM: Empowering Multimodal Large Language Model For the Understanding of Occluded Objects

OCC-MLLM-Alpha: Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

Probing Multimodal Large Language Models for Global and Local Semantic Representations

Ovis: Structural Embedding Alignment for Multimodal Large Language Model

On the Hidden Mystery of OCR in Large Multimodal Models

InfMLLM: A Unified Framework for Visual-Language Tasks

Explaining Multi-modal Large Language Models by Analyzing their Vision Perception

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

MAMO: Fine-Grained Vision-Language Representations Learning with Masked Multimodal Modeling

Dense Connector for MLLMs

Improving Visual Storytelling with Multimodal Large Language Models

Towards More Unified In-context Visual Understanding

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

Face-MLLM: A Large Face Perception Model

CompCap: Improving Multimodal Large Language Models with Composite Captions

Multimodal Large Language Models: A Survey

Unified Generative and Discriminative Training for Multi-modal Large Language Models

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey