OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan
2024-10-01
Abstract: Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using LLM to connect each specialist, our work aims at end-to-end training on one encoder, one decoder, and one LLM. The code and model have been released for further research.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper aims to bridge the gap between image-level, object-level, and pixel-level understanding and reasoning in current multimodal models. Specifically:

1. **Limitations of existing models**:
   - **Universal segmentation methods**: They perform well in pixel-level image and video understanding, but they lack reasoning ability and cannot be controlled by text instructions.
   - **Large vision-language multimodal models**: They perform well in vision-based dialogue and reasoning, but they lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction.
2. **Objectives**:
   - **Unify understanding and reasoning at three levels**: Integrate image-level, object-level, and pixel-level understanding and reasoning tasks into one model.
   - **Flexibility**: Accept various visual and text prompts to support flexible user interaction.
   - **Simplify system architecture**: Use one visual encoder, one decoder, and one large language model (LLM), instead of relying on multiple specialized models or complex system designs.
3. **Solutions**:
   - Propose the OMG-LLaVA framework, which combines strong pixel-level visual understanding with reasoning abilities.
   - Use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens that are provided to the LLM.
   - The LLM is responsible for understanding the user's text instructions and generating text responses and pixel-level segmentation results based on the visual information.
   - Propose a perception prior embedding module to better combine perception priors with image features.
4. **Contributions**:
   - Handle image-level, object-level, and pixel-level understanding and reasoning tasks in a single model.
   - Match or exceed the performance of specialized methods on multiple benchmarks.
   - Provide an end-to-end training framework instead of using an LLM to connect multiple specialized models.

### Formula summary

- **Perception prior embedding**:
  \[ MS = \text{Softmax}(M \odot S, \text{dim} = -1) \]
  \[ T_{pv} = MS \cdot Q + F \]
- **Pre-training loss**:
  \[ L_{\text{pretrain}} = L_{\text{text}} + L_{\text{reg}}, \quad L_{\text{reg}} = (T_{ov} - P_t(P_v(T_{ov})))^2 \]
- **Instruction tuning loss**:
  \[ L_{\text{instruction}} = L_{\text{text}} + L_{\text{mask}}, \quad L_{\text{mask}} = \alpha L_{CE} + \beta L_{DICE} \]

Through these methods, OMG-LLaVA handles visual understanding and reasoning at multiple levels while keeping a simple and elegant design.
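To make the perception prior embedding and the mask loss from the formula summary more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The summary does not define the tensor shapes, so the sketch assumes that M is a per-pixel mask matrix (HW × N_q), S a per-query score vector (N_q), Q the object queries (N_q × C), and F the image features (HW × C). The binary cross-entropy and Dice forms, the default weights `alpha` and `beta`, and all function names are likewise illustrative assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def perception_prior_embedding(masks, scores, queries, features):
    """Compute T_pv = Softmax(M * S, dim=-1) . Q + F, one pixel (row) at a time.

    Assumed shapes (not stated in the summary):
    masks HW x Nq, scores Nq, queries Nq x C, features HW x C.
    """
    tokens = []
    for mask_row, feat_row in zip(masks, features):
        # MS: how strongly each object query is assigned to this pixel.
        weights = softmax([m * s for m, s in zip(mask_row, scores)])
        # MS . Q: weighted mix of the object queries for this pixel.
        prior = [
            sum(w * query[c] for w, query in zip(weights, queries))
            for c in range(len(feat_row))
        ]
        # + F: add the pixel's own image feature.
        tokens.append([p + f for p, f in zip(prior, feat_row)])
    return tokens


def mask_loss(probs, targets, alpha=1.0, beta=1.0, eps=1e-6):
    """L_mask = alpha * L_CE + beta * L_DICE over flattened mask pixels.

    probs are predicted foreground probabilities, targets are 0/1 labels.
    """
    # Clip probabilities so the logarithms stay finite.
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
    ce = -sum(
        t * math.log(p) + (1 - t) * math.log(1.0 - p)
        for p, t in zip(clipped, targets)
    ) / len(clipped)
    overlap = sum(p * t for p, t in zip(probs, targets))
    dice = 1.0 - (2.0 * overlap + eps) / (sum(probs) + sum(targets) + eps)
    return alpha * ce + beta * dice


if __name__ == "__main__":
    # Two pixels, two object queries, two feature channels (toy values).
    masks = [[1.0, 0.0], [0.0, 1.0]]
    scores = [5.0, 5.0]
    queries = [[1.0, 0.0], [0.0, 1.0]]
    features = [[0.1, 0.2], [0.3, 0.4]]
    print(perception_prior_embedding(masks, scores, queries, features))
    print(mask_loss([0.9, 0.1], [1, 0]))
```

In this sketch each pixel token is its image feature plus a softmax-weighted blend of the object queries, so pixels covered by a high-scoring mask are pulled toward that object's query. A real implementation would use batched tensor operations rather than Python loops.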