Abstract: Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using an LLM to connect each specialist, our work aims at end-to-end training on one encoder, one decoder, and one LLM. The code and model have been released for further research.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to bridge the gap between image-level, object-level, and pixel-level understanding and reasoning in current multimodal models. Specifically:

1. **Limitations of existing models**:
   - **Universal segmentation methods**: Although they perform well in pixel-level image and video understanding, they lack reasoning ability and cannot be controlled by text instructions.
   - **Large vision-language multimodal models**: Although they perform well in vision-based dialogue and reasoning, they fall short in pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction.
2. **Objectives**:
   - **Unify understanding and reasoning at three levels**: Integrate image-level, object-level, and pixel-level understanding and reasoning tasks into one model.
   - **Flexibility**: Accept various visual and text prompts and support flexible user interaction.
   - **Simplify system architecture**: Use one visual encoder, one decoder, and one large language model (LLM), instead of relying on multiple specialized models or complex system designs.
3. **Solutions**:
   - Propose the OMG-LLaVA framework, which combines strong pixel-level visual understanding with reasoning abilities.
   - Use a universal segmentation method as the visual encoder to integrate image information, perception priors, and visual prompts into visual tokens that are provided to the LLM.
   - The LLM is responsible for understanding the user's text instructions and generating text responses and pixel-level segmentation results based on the visual information.
   - Propose a perception prior embedding module to better combine perception priors with image features.
4. **Contributions**:
   - Handle image-level, object-level, and pixel-level understanding and reasoning tasks simultaneously in one model.
   - Match or exceed the performance of specialized methods on multiple benchmarks.
   - Provide an end-to-end training framework instead of relying on an LLM to connect multiple specialized models.

### Formula summary

- **Perception prior embedding formula**:
  \[ MS = \text{Softmax}(M \odot S, \text{dim} = -1) \]
  \[ T_{pv} = MS \cdot Q + F \]
- **Pre-training loss function**:
  \[ L_{\text{pretrain}} = L_{\text{text}} + L_{\text{reg}}, \quad L_{\text{reg}} = (T_{ov} - P_t(P_v(T_{ov})))^2 \]
- **Instruction tuning loss function**:
  \[ L_{\text{instruction}} = L_{\text{text}} + L_{\text{mask}}, \quad L_{\text{mask}} = \alpha L_{CE} + \beta L_{DICE} \]

Through these methods, OMG-LLaVA handles visual understanding and reasoning tasks at multiple levels while maintaining a simple and elegant design.
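To make the perception prior embedding and mask loss formulas above more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation. The summary does not define the symbols, so the sketch assumes that `M` holds per-pixel mask logits of shape `(HW, Nq)`, `S` holds per-query confidence scores of shape `(1, Nq)`, `Q` holds object query embeddings of shape `(Nq, C)`, and `F` holds image features of shape `(HW, C)`. It also assumes that `L_CE` is a per-pixel binary cross-entropy and that `L_DICE` is the usual soft Dice loss. The default values for `alpha` and `beta` are placeholders, not values taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def perception_prior_embedding(M, S, Q, F):
    """Compute T_pv = Softmax(M * S, dim=-1) @ Q + F.

    Assumed shapes (not stated in the summary):
      M: (HW, Nq) mask logits, S: (1, Nq) query scores,
      Q: (Nq, C) object queries, F: (HW, C) image features.
    """
    MS = softmax(M * S, axis=-1)  # per-pixel weights over queries, rows sum to 1
    return MS @ Q + F             # weighted query mix added to the image features


def mask_loss(logits, target, alpha=1.0, beta=1.0, eps=1e-6):
    """L_mask = alpha * L_CE + beta * L_DICE.

    Assumes binary cross-entropy and soft Dice on sigmoid probabilities;
    alpha and beta defaults are placeholders, not the paper's values.
    """
    prob = 1.0 / (1.0 + np.exp(-logits))
    ce = -(target * np.log(prob + eps)
           + (1.0 - target) * np.log(1.0 - prob + eps)).mean()
    dice = 1.0 - (2.0 * (prob * target).sum() + eps) / (prob.sum() + target.sum() + eps)
    return alpha * ce + beta * dice


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    HW, Nq, C = 6, 3, 4  # toy sizes chosen only for illustration
    T_pv = perception_prior_embedding(
        rng.normal(size=(HW, Nq)),
        rng.uniform(size=(1, Nq)),
        rng.normal(size=(Nq, C)),
        rng.normal(size=(HW, C)),
    )
    print(T_pv.shape)
```

Under these assumptions, each pixel token is the image feature plus a softmax-weighted mixture of object queries, so pixels covered by a confident mask receive more of that query's embedding.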

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

ViLLa: Video Reasoning Segmentation with Large Language Model

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

HyperSeg: Towards Universal Visual Segmentation with Large Language Model

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

InfMLLM: A Unified Framework for Visual-Language Tasks

From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

Empowering Segmentation Ability to Multi-modal Large Language Models

Visually-Augmented Language Modeling

Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

LLaVA-OneVision: Easy Visual Task Transfer

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step