Abstract: We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B that consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.

What problem does this paper attempt to address?

The paper introduces Florence-2, a vision foundation model that handles a variety of computer vision and vision-language tasks through a single, prompt-based representation. Existing large vision models perform well in transfer learning, but they struggle to carry out diverse tasks from simple instructions, because doing so requires handling different levels of spatial hierarchy and semantic granularity. Florence-2 is designed to take a text prompt as the task instruction and to produce its results as text, whether the task is image captioning, object detection, grounding, or segmentation.

This multi-task setup requires large-scale, high-quality annotated data. The authors therefore co-developed the FLD-5B dataset, which contains 5.4 billion visual annotations on 126 million images (about 43 annotations per image on average), produced through an iterative loop of automated image annotation and model refinement. Florence-2 is trained on this data with a sequence-to-sequence architecture, so a single model covers the different tasks without task-specific modifications.

Extensive evaluations across many tasks show Florence-2 to be a strong vision foundation model contender with, in the authors' words, unprecedented zero-shot and fine-tuning capabilities. It performs well on image captioning and visual grounding, among other tasks, in both the zero-shot and fine-tuning settings. Compared with its predecessor, Florence, Florence-2 is described as improving generalization and adaptability while reducing dependence on large task-specific datasets and adapters.
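To make the "text prompt in, text result out" idea more concrete, here is a minimal Python sketch of how an object detection annotation could be turned into a sequence-to-sequence training pair. It is an illustration only, not the authors' implementation: the `<loc_N>` token format, the choice of 1000 coordinate bins, the prompt string, and all function names are assumptions that do not appear in the text above.

```python
from typing import Dict, List, Tuple

# Assumption: coordinates are quantized into this many bins per axis.
NUM_BINS = 1000

# A box is (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, num_bins - 1]."""
    bin_index = int(value / size * num_bins)
    return min(max(bin_index, 0), num_bins - 1)


def box_to_tokens(box: Box, width: float, height: float) -> str:
    """Serialize one box as four hypothetical location tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    return "".join(f"<loc_{b}>" for b in bins)


def detections_to_text(
    detections: List[Tuple[str, Box]], width: float, height: float
) -> str:
    """Serialize (label, box) pairs as one target string."""
    return "".join(
        label + box_to_tokens(box, width, height) for label, box in detections
    )


def build_example(
    task_prompt: str,
    detections: List[Tuple[str, Box]],
    width: float,
    height: float,
) -> Dict[str, str]:
    """Pair a task prompt with its text target, as a seq2seq model would see it."""
    return {
        "input_text": task_prompt,
        "target_text": detections_to_text(detections, width, height),
    }


if __name__ == "__main__":
    example = build_example(
        "Locate the objects in the image.",  # hypothetical prompt wording
        [("cat", (0, 0, 320, 240))],
        width=640,
        height=480,
    )
    print(example["target_text"])
```

Because boxes become ordinary tokens in the target string, captioning, detection, and grounding can all share one decoder and one training objective, which is the design choice the summary attributes to the sequence-to-sequence setup.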

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Florence: A New Foundation Model for Computer Vision

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Universal Object Detection with Large Vision Model

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

EVLM: An Efficient Vision-Language Model for Visual Understanding

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

IVGF: The Fusion-Guided Infrared and Visible General Framework

InfMLLM: A Unified Framework for Visual-Language Tasks

A Unified Sequence Interface for Vision Tasks

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Aligning and Prompting Everything All at Once for Universal Visual Perception

12-in-1: Multi-Task Vision and Language Representation Learning