Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
2023-11-11
Abstract: We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B that consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper introduces Florence-2, a vision foundation model that handles a variety of computer vision and vision-language tasks through a single, prompt-based representation. Existing large vision models perform well in transfer learning, but they struggle to carry out diverse tasks from simple instructions, because doing so requires handling different levels of spatial hierarchy and semantic granularity. Florence-2 instead takes a text prompt as the task instruction and generates its result as text, whether the task is image captioning, object detection, grounding, or segmentation.
This multi-task setup demands large-scale, high-quality annotated data. To supply it, the authors co-developed FLD-5B, a dataset of 5.4 billion comprehensive visual annotations on 126 million images, built with an iterative strategy of automated image annotation and model refinement. Florence-2 is trained with a sequence-to-sequence architecture, so a single model covers the different vision tasks without task-specific modifications.
Extensive evaluations on numerous tasks show Florence-2 to be a strong vision foundation model contender, with what the authors describe as unprecedented zero-shot and fine-tuning capabilities, including on image captioning and visual grounding. Compared to its predecessor, Florence, Florence-2 improves generalization and adaptability while reducing dependence on large task-specific datasets and adapters.
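To make the "results in text form" idea concrete, here is a minimal Python sketch of how a detection result expressed as text could be decoded back into pixel boxes. The output syntax (`label<loc_x1><loc_y1><loc_x2><loc_y2>`), the bin count of 1000, and the function name `parse_detection_text` are assumptions made for illustration; the text above does not specify the output format, and this is not the authors' implementation.

```python
import re

# Assumed, illustrative format: each object is a label followed by four
# quantized location tokens, e.g. "car<loc_52><loc_334><loc_932><loc_774>".
# Neither the token syntax nor the bin count comes from the text above.
NUM_BINS = 1000
_OBJECT = re.compile(r"([^<>]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def parse_detection_text(text, width, height, num_bins=NUM_BINS):
    """Turn a text-form detection result into (label, box) pairs in pixels.

    Each box is (x1, y1, x2, y2); quantized coordinates are rescaled
    from the range [0, num_bins] to the image width and height.
    """
    results = []
    for match in _OBJECT.finditer(text):
        label = match.group(1).strip()
        x1, y1, x2, y2 = (int(g) for g in match.groups()[1:])
        box = (
            x1 * width / num_bins,
            y1 * height / num_bins,
            x2 * width / num_bins,
            y2 * height / num_bins,
        )
        results.append((label, box))
    return results
```

Under these assumptions, a caller would pass the generated string together with the original image size, for example `parse_detection_text(generated_text, image_width, image_height)`. The point of the sketch is that once boxes are serialized as tokens, detection and grounding can share the same text decoder as captioning, which is what lets one sequence-to-sequence model serve all of these tasks.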