Abstract: We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile and can be adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrate Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multimodal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, chart understanding, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced at https://github.com/JiuhaiChen/Florence-VL

What problem does this paper attempt to address?

This paper addresses the limitations of the visual encoders used in existing vision-language models (VLMs). The widely used Transformer-based visual encoders, such as CLIP and SigLIP, are effective, but they usually provide only image-level semantic representations. They ignore pixel-level or region-level details and low-level features, even though these details are crucial for many downstream tasks. To exploit the distinct representations of different visual encoders, some studies combine multiple encoders, but this hybrid approach increases the computational cost of model training and deployment.

To address this, the paper proposes Florence-VL, a new family of multimodal large language models (MLLMs) that uses the generative vision foundation model Florence-2 as its visual encoder. Florence-2 can generate diverse visual features through different task prompts (such as image captioning, OCR, and localization), and can therefore adapt to the requirements of various downstream tasks. To integrate these visual features more effectively, the paper proposes a novel feature-fusion architecture, Depth-Breadth Fusion (DBFusion). This method extracts visual features from different depths and under multiple prompts and combines them with pretrained language models (such as Phi 3.5 and Llama 3).

With this method, Florence-VL achieves significant performance improvements on multiple multimodal and vision-centric benchmarks, covering tasks such as general visual question answering (VQA), perception, hallucination, OCR, and chart understanding. The paper also provides detailed experimental analysis and visualizations showing that Florence-VL's visual representation is superior to popular visual encoders such as CLIP and SigLIP in terms of vision-language alignment. To promote future research, the authors have made the models and the complete training recipe publicly available.
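The fusion idea described above can be illustrated with a small sketch. It assumes one simple fusion strategy: feature maps from a shallow layer ("depth") and from several task prompts ("breadth") are concatenated along the channel axis and then linearly projected to the language model's embedding width. This summary does not specify the exact fusion operator, so treat the concatenation, the bias-free projection, and every name below (`concat_channels`, `linear_project`, `make_dummy_features`, the toy dimensions) as illustrative assumptions rather than the authors' implementation. Random numbers stand in for real Florence-2 outputs.

```python
import random


def concat_channels(feature_maps):
    """Concatenate per-token vectors from several feature maps along the channel axis.

    Each feature map is a list of tokens; each token is a list of floats.
    """
    num_tokens = len(feature_maps[0])
    if any(len(fm) != num_tokens for fm in feature_maps):
        raise ValueError("all feature maps must have the same number of tokens")
    return [
        [value for fm in feature_maps for value in fm[t]]
        for t in range(num_tokens)
    ]


def linear_project(tokens, weight):
    """Apply a bias-free linear map; weight has shape (in_dim, out_dim)."""
    in_dim, out_dim = len(weight), len(weight[0])
    projected = []
    for tok in tokens:
        if len(tok) != in_dim:
            raise ValueError("token width does not match projection input width")
        projected.append(
            [sum(tok[i] * weight[i][j] for i in range(in_dim)) for j in range(out_dim)]
        )
    return projected


def make_dummy_features(rng, num_tokens, channels):
    """Hypothetical stand-in for one visual feature map produced by the encoder."""
    return [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(num_tokens)]


rng = random.Random(0)
NUM_TOKENS, CHANNELS, LLM_DIM = 4, 3, 8  # toy sizes chosen for illustration

# "Depth": a lower-level feature map. "Breadth": one map per task prompt.
low_level = make_dummy_features(rng, NUM_TOKENS, CHANNELS)
prompt_features = {
    prompt: make_dummy_features(rng, NUM_TOKENS, CHANNELS)
    for prompt in ("caption", "ocr", "grounding")
}

fused = concat_channels([low_level, *prompt_features.values()])

# Randomly initialised projection; in a real model this would be learned.
projection = [
    [rng.gauss(0.0, 0.02) for _ in range(LLM_DIM)] for _ in range(len(fused[0]))
]
visual_tokens = linear_project(fused, projection)
```

Under this assumption the number of visual tokens passed to the language model stays fixed while only the channel width grows, which is one reason a channel-wise fusion can be cheaper than appending every feature map as extra tokens.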

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
