Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
2024-05-28
Abstract: The Large Multimodal Model (LMM) is a hot research topic in computer vision and has also demonstrated remarkable potential across multiple disciplines. A recent trend is to further extend and enhance the perception capabilities of LMMs. Current methods follow the paradigm of adapting visual task outputs to the format of the language model, which is the main component of an LMM. This adaptation makes such LMMs convenient to develop with minimal modifications; however, it overlooks the intrinsic characteristics of diverse visual tasks and hinders the learning of perception capabilities. To address this issue, we propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement. We decouple the LMM's learning of perception capabilities into task-agnostic and task-specific stages. Lumen first promotes fine-grained vision-language concept alignment, which is the fundamental capability for various visual tasks. The output of the task-agnostic stage is thus a shared representation for all the tasks we address in this paper. Task-specific decoding is then carried out by flexibly routing the shared representation to lightweight task decoders with negligible training effort. Comprehensive experimental results on a series of vision-centric and VQA benchmarks indicate that our Lumen model matches or surpasses the performance of existing LMM-based approaches on a range of vision-centric tasks while maintaining general visual understanding and instruction-following capabilities. The code will be released at https://github.com/SxJyJay/Lumen.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limited perception capabilities of Large Multimodal Models (LMMs) on visual tasks. Current methods adapt the outputs of visual tasks to the format of the language model, but this approach ignores the inherent characteristics of different visual tasks and hinders the learning of perception capabilities. The paper therefore introduces a new LMM architecture called Lumen, which divides the learning of perception capabilities into two stages: task-agnostic and task-specific. First, Lumen promotes fine-grained vision-language concept alignment, which serves as the foundation for various visual tasks and yields a representation shared by all of them. Then, it flexibly routes this shared representation to lightweight task decoders for task-specific decoding with minimal training effort. Experimental results on vision-centric and VQA benchmarks show that Lumen matches or surpasses existing LMM-based methods on a range of vision-centric tasks while maintaining general visual understanding and instruction-following capabilities.
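The two-stage decoupling described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the shared representation is a grid of alignment scores (a heatmap) between an instruction embedding and image patch embeddings, and the names `align`, `decode_point`, `decode_box`, and `route` are hypothetical. The decoders here are simple parameter-free functions standing in for the lightweight, trainable task decoders the paper refers to.

```python
from typing import Callable, Dict, List, Optional, Tuple

Heatmap = List[List[float]]


def align(patch_embeddings: List[List[List[float]]],
          query_embedding: List[float]) -> Heatmap:
    """Task-agnostic stage (assumed form): score every image patch against
    the query by dot product, giving one shared heatmap for all tasks."""
    return [
        [sum(p * q for p, q in zip(patch, query_embedding)) for patch in row]
        for row in patch_embeddings
    ]


def decode_point(heatmap: Heatmap) -> Tuple[int, int]:
    """Hypothetical point decoder: (row, col) of the highest-scoring cell."""
    cells = [(r, c) for r, row in enumerate(heatmap) for c in range(len(row))]
    return max(cells, key=lambda rc: heatmap[rc[0]][rc[1]])


def decode_box(heatmap: Heatmap,
               threshold: float = 0.5) -> Optional[Tuple[int, int, int, int]]:
    """Hypothetical box decoder: tightest (top, left, bottom, right) box
    around all cells scoring at least `threshold`, or None if there are none."""
    hits = [(r, c) for r, row in enumerate(heatmap)
            for c, score in enumerate(row) if score >= threshold]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))


# Task-specific stage: route the same shared heatmap to a per-task decoder.
DECODERS: Dict[str, Callable[[Heatmap], object]] = {
    "point": decode_point,
    "box": decode_box,
}


def route(task: str, heatmap: Heatmap) -> object:
    """Dispatch the shared representation to the decoder registered for `task`."""
    if task not in DECODERS:
        raise ValueError(f"no decoder registered for task {task!r}")
    return DECODERS[task](heatmap)
```

The point of the design is that `align` runs once per instruction and its output is reused by every decoder, so supporting a new task in this sketch only means registering another small function in `DECODERS`.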