Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

Yang Jiao,Shaoxiang Chen,Zequn Jie,Jingjing Chen,Lin Ma,Yu-Gang Jiang

2024-05-28

Abstract: Large Multimodal Models (LMMs) are a hot research topic in computer vision and have also demonstrated remarkable potential across multiple disciplines. A recent trend is to further extend and enhance the perception capabilities of LMMs. Current methods follow the paradigm of adapting visual task outputs to the format of the language model, which is the main component of an LMM. This adaptation makes such LMMs convenient to develop with minimal modifications; however, it overlooks the intrinsic characteristics of diverse visual tasks and hinders the learning of perception capabilities. To address this issue, we propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement. We decouple the LMM's learning of perception capabilities into task-agnostic and task-specific stages. Lumen first promotes fine-grained vision-language concept alignment, which is the fundamental capability for various visual tasks. Thus, the output of the task-agnostic stage is a shared representation for all the tasks we address in this paper. The task-specific decoding is then carried out by flexibly routing the shared representation to lightweight task decoders with negligible training effort. Comprehensive experimental results on a series of vision-centric and VQA benchmarks indicate that our Lumen model not only matches or surpasses the performance of existing LMM-based approaches on a range of vision-centric tasks but also maintains general visual understanding and instruction-following capabilities. The code will be released at https://github.com/SxJyJay/Lumen.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the limited perception capabilities of large multimodal models (LMMs) on visual tasks. Current methods adapt the outputs of visual tasks to the format of the language model, but this approach ignores the inherent characteristics of different visual tasks and hinders the learning of perception capabilities. The paper therefore introduces a new LMM architecture called Lumen, which divides the learning of perception capabilities into two stages: task-agnostic and task-specific. First, Lumen promotes fine-grained vision-language concept alignment, which serves as the foundation for various visual tasks and yields a representation shared by all of them. Then, it flexibly routes this shared representation to lightweight task decoders for task-specific decoding with minimal training effort. Experimental results on vision-centric and VQA benchmarks indicate that Lumen matches or surpasses existing LMM-based methods on a range of vision-centric tasks while maintaining general visual understanding and instruction-following capabilities.
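The two-stage idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that the shared representation is a 2D map of similarity scores between image patch features and a text query embedding, and that two toy decoders (a point decoder and a box decoder) read from it. All names (`shared_representation`, `decode_point`, `decode_box`, `DECODERS`, `run`) and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[float]]


def shared_representation(patches: List[List[List[float]]], query: List[float]) -> Grid:
    """Task-agnostic stage (assumed form): score every image patch against the query."""
    return [[sum(x * y for x, y in zip(p, query)) for p in row] for row in patches]


def decode_point(heat: Grid) -> Tuple[int, int]:
    """Toy task decoder: return the (row, col) of the highest-scoring patch."""
    cells = [(r, c) for r, row in enumerate(heat) for c in range(len(row))]
    return max(cells, key=lambda rc: heat[rc[0]][rc[1]])


def decode_box(heat: Grid, threshold: float = 0.5) -> Optional[Tuple[int, int, int, int]]:
    """Toy task decoder: bounding box (top, left, bottom, right) of patches >= threshold."""
    cells = [(r, c) for r, row in enumerate(heat) for c, v in enumerate(row) if v >= threshold]
    if not cells:
        return None
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (min(rows), min(cols), max(rows), max(cols))


# Routing table: every task reads the same shared map through its own light decoder.
DECODERS: Dict[str, Callable[[Grid], object]] = {"point": decode_point, "box": decode_box}


def run(task: str, patches: List[List[List[float]]], query: List[float]) -> object:
    heat = shared_representation(patches, query)  # computed once, independent of the task
    return DECODERS[task](heat)                   # task-specific decoding


# Toy 3x3 grid of 2-dimensional patch features and a query embedding (made-up values).
patches = [
    [[0.1, 0.9], [0.2, 0.8], [0.0, 1.0]],
    [[0.1, 0.9], [0.9, 0.1], [0.7, 0.3]],
    [[0.0, 1.0], [0.6, 0.4], [0.3, 0.7]],
]
query = [1.0, 0.0]

print(run("point", patches, query))  # (1, 1)
print(run("box", patches, query))    # (1, 1, 2, 2)
```

The point of the sketch is the separation of concerns: supporting a further task would mean registering another small decoder in `DECODERS`, while the task-agnostic stage stays unchanged.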

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

InfMLLM: A Unified Framework for Visual-Language Tasks

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Are We on the Right Way for Evaluating Large Vision-Language Models?

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Multi-modal Auto-regressive Modeling via Visual Words

Enhancing Perception Capabilities of Multimodal LLMs with Training-free Fusion

Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

LightLLM: A Versatile Large Language Model for Predictive Light Sensing

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark