DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models

Yuhang Cao, Pan Zhang, Xiaoyi Dong, Dahua Lin, Jiaqi Wang
2024-02-23
Abstract: We present DualFocus, a novel framework for integrating macro and micro perspectives within multi-modal large language models (MLLMs) to enhance vision-language task performance. Current MLLMs typically process inputs at a single predefined resolution, which leaves them weak on detailed questions involving local regions. We introduce a DualFocus mechanism in which the model first views the image from a macro perspective, responds to the question, and identifies suitable sub-regions to zoom in on for subsequent micro-perspective analysis. By integrating the answers from both the macro and micro perspectives, the model can address tasks that involve global, detailed, and combined considerations. To endow MLLMs with the DualFocus mechanism, we curated a tailored dataset derived from Visual Genome (VG) and adapted it to align with the training regimen of DualFocus. Through comparative studies across different model sizes and benchmarks, we demonstrate DualFocus's superiority in balancing detailed examination with holistic insight, significantly reducing hallucination instances in MLLMs and improving their performance in various vision-language tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how multi-modal large language models (MLLMs) can integrate macro and micro perspectives when handling vision-language tasks, so that they understand both local details and global context. Current MLLMs usually process inputs only at a predefined resolution, which makes them weak at answering detailed questions about local regions. The paper proposes a framework named DualFocus, in which the model first analyzes the entire image from a macro perspective and then identifies and magnifies sub-regions of interest for micro-level analysis. The method aims to balance detailed local examination with overall insight, reduce hallucinations, and improve performance across vision-language tasks.
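The macro-then-micro flow described above can be illustrated with a small sketch. This is not the authors' implementation: the names `Answer`, `crop`, `dual_focus`, `toy_answer`, and `toy_locate` are hypothetical, the image is a toy nested list rather than a real tensor, and the rule for combining the two answers (keep the one with the higher score) is an assumption, since the abstract only says the answers are integrated.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


@dataclass
class Answer:
    text: str
    score: float  # hypothetical confidence value; higher is treated as better


def crop(image: Image, box: Box) -> Image:
    """Return the sub-region of `image` covered by `box`, clamped to the image bounds."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = box
    left, top = max(0, left), max(0, top)
    right, bottom = min(width, right), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("box does not overlap the image")
    return [row[left:right] for row in image[top:bottom]]


def dual_focus(
    image: Image,
    question: str,
    answer_fn: Callable[[Image, str], Answer],
    locate_fn: Callable[[Image, str], Box],
) -> Answer:
    """Two-pass inference: answer on the full image, zoom in, answer again, pick one."""
    macro = answer_fn(image, question)             # macro pass on the whole image
    box = locate_fn(image, question)               # sub-region relevant to the question
    micro = answer_fn(crop(image, box), question)  # micro pass on the zoomed-in region
    # Assumed integration rule: keep whichever answer has the higher score.
    return micro if micro.score > macro.score else macro


# Stand-ins for a real MLLM, used only to make the sketch runnable.
def toy_answer(image: Image, question: str) -> Answer:
    pixels = len(image) * len(image[0])
    return Answer(text=f"seen {pixels} pixels", score=1.0 / pixels)


def toy_locate(image: Image, question: str) -> Box:
    return (1, 1, 3, 3)


image = [[r * 4 + c for c in range(4)] for r in range(4)]
result = dual_focus(image, "What is in the centre?", toy_answer, toy_locate)
print(result.text)  # prints "seen 4 pixels"
```

In a real system, `answer_fn` and `locate_fn` would both be calls to the same MLLM, and the score could be any confidence measure the model exposes. The sketch only shows the control flow of the two passes.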