Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao
2024-12-06
Abstract: We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile and easier to adapt to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrate Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multimodal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, chart understanding, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced at https://github.com/JiuhaiChen/Florence-VL
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the limitations of the visual encoders used in existing vision-language models (VLMs). The widely used Transformer-based visual encoders, such as CLIP or SigLIP, are effective, but they usually provide only image-level semantic representations and ignore pixel-level or region-level details and low-level features, which are crucial for many downstream tasks. To exploit the distinct representations of different visual encoders, some studies combine multiple visual encoders, but this increases the computational cost of training and deployment. The paper therefore proposes Florence-VL, a new family of multimodal large language models (MLLMs) that uses the generative vision foundation model Florence-2 as its visual encoder. Florence-2 can produce diverse visual features under different task prompts (such as image captioning, OCR, and localization), and so can adapt to the requirements of various downstream tasks. To integrate these visual features more effectively, the paper proposes a novel feature-fusion architecture, Depth-Breadth Fusion (DBFusion). This method extracts visual features from different depths and under multiple prompts and combines them with pretrained language models (such as Phi 3.5 and Llama 3). With this method, Florence-VL achieves significant performance improvements on multiple multimodal and vision-centric benchmarks, covering tasks such as general-purpose visual question answering (VQA), perception, hallucination, OCR, and chart understanding. The paper also provides detailed experimental analysis and visualizations indicating that the visual representation of Florence-VL is superior to popular visual encoders such as CLIP and SigLIP in terms of vision-language alignment.
To promote future research, the authors have made the model and the complete training scheme publicly available.
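The fusion idea described above can be illustrated with a small sketch. It assumes, purely for illustration, that features from one extra depth and from several task prompts share the same token count, are concatenated along the channel axis, and are then mapped to the LLM width by a single linear projection. The summary above does not state which fusion operator or projector the authors use, so the function names, prompt names, sizes, and random values here are all hypothetical.

```python
import random


def concat_channels(feature_maps):
    """Concatenate feature maps of shape [tokens][channels] along the channel axis."""
    num_tokens = len(feature_maps[0])
    if any(len(fm) != num_tokens for fm in feature_maps):
        raise ValueError("all feature maps must have the same number of tokens")
    return [
        [value for fm in feature_maps for value in fm[t]]
        for t in range(num_tokens)
    ]


def linear_project(tokens, weight):
    """Apply a bias-free linear map; weight has shape [in_channels][out_channels]."""
    out_dim = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]


def make_features(num_tokens, channels, seed):
    """Stand-in for encoder output: random values, NOT real Florence-2 features."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(num_tokens)]


# Hypothetical sizes chosen only to keep the example small.
NUM_TOKENS, CHANNELS, LLM_DIM = 4, 3, 5

# "Breadth": one feature map per task prompt (prompt names are illustrative).
breadth = {
    prompt: make_features(NUM_TOKENS, CHANNELS, seed)
    for seed, prompt in enumerate(["caption", "ocr", "grounding"])
}
# "Depth": a feature map taken from a different (e.g. lower) encoder layer.
depth = make_features(NUM_TOKENS, CHANNELS, seed=99)

# Assumed fusion: channel-wise concatenation of the depth and breadth features.
fused = concat_channels([depth] + list(breadth.values()))

# Assumed projector: one linear layer from the fused width to the LLM width.
rng = random.Random(0)
weight = [
    [rng.uniform(-0.1, 0.1) for _ in range(LLM_DIM)]
    for _ in range(len(fused[0]))
]
visual_tokens = linear_project(fused, weight)

print(len(visual_tokens), len(visual_tokens[0]))  # prints "4 5"
```

In this sketch the number of visual tokens passed to the language model stays fixed as more prompts or depths are added; only the channel width before projection grows. That is a property of the concatenation assumed here, not a claim about the paper's implementation.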