Abstract: We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling the model to solve various object-centric downstream tasks simultaneously while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE is capable of being integrated into Large Language Models, serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io.

What problem does this paper attempt to address?

The problem this paper addresses is how to build a general object-level foundation model that handles multiple object perception tasks across large-scale image and video data. Specifically, the paper introduces a model named GLEE. Through a unified framework, the model performs object detection, segmentation, tracking, grounding, and identification in open-world scenarios. GLEE adopts a cohesive learning strategy to acquire knowledge from diverse data sources with different levels of supervision, forming general object representations that transfer zero-shot to new data and tasks.

### Main Contributions

1. **Unified Input-Output Paradigm**: GLEE defines a unified input-output paradigm that lets the model learn from a large amount of diverse data and predict general object representations, which in turn supports zero-shot generalization to new data and tasks.
2. **Multi-modal Processing Capability**: GLEE combines an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs while solving multiple object-centric tasks, such as detection, instance segmentation, referring expression comprehension, interactive segmentation, and tracking.
3. **Strong Zero-shot Transfer Ability**: Jointly trained on more than five million images, GLEE generalizes well and transfers zero-shot to new tasks, reaching state-of-the-art performance on multiple tasks without task-specific design or fine-tuning.
4. **Expansion of Training Data**: By introducing a large amount of automatically annotated data, such as SA1B and GRIT, GLEE scales up its training data at low cost, further improving zero-shot generalization.

### Technical Details

- **Model Architecture**:
  - **Image Encoder**: Extracts multi-scale image features.
  - **Text Encoder**: Processes any task-related description, including object categories, names, captions, and referring expressions.
  - **Visual Prompter**: Encodes user-provided points, bounding boxes, or scribbles into the corresponding visual representations of the target objects.
  - **Object Decoder**: Integrates the modules above and extracts objects from images according to the text and visual inputs.
- **Loss Function**:
  - **Semantic Loss**: Uses Focal Loss to align text concepts with object features.
  - **Box Loss**: Uses L1 loss and Generalized Intersection over Union loss (GIoU loss) for box prediction.
  - **Mask Loss**: Uses a combination of Dice loss and Focal Loss.
  - **Confidence Loss**: Used in visual-prompt segmentation tasks to predict a confidence score for each object query.
  - **Contrastive Tracking Loss**: Used in video tasks to pull embeddings of the same object instance closer together in the embedding space and push embeddings of different instances farther apart.
- **Data Expansion**:
  - Introduces a large amount of automatically annotated data, such as SA1B and GRIT, to scale up the training data and improve the model's generalization ability.

### Experimental Results

The paper evaluates GLEE on multiple object perception tasks through extensive experiments, including detection, instance segmentation, referring expression comprehension, and open-world detection. The results show that GLEE reaches state-of-the-art performance on these tasks and is particularly strong in zero-shot transfer.

### Conclusion

Through a unified framework and multi-modal processing, GLEE addresses the challenge of building a general object-level foundation model for large-scale image and video data, laying a foundation for future research on visual foundation models.
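The box and mask loss terms named under Technical Details can be illustrated with a small, dependency-free sketch. This is not GLEE's implementation: the function names, the `(x1, y1, x2, y2)` box format, the loss weights `w_l1` and `w_giou`, and the smoothing constant `eps` are all assumptions made for illustration. It only shows how an L1 + GIoU box loss and a Dice mask loss are commonly computed.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; empty or inverted boxes give 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou(a, b):
    """Generalized IoU of two boxes; assumes both have positive area."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = box_area((min(a[0], b[0]), min(a[1], b[1]),
                     max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (hull - union) / hull


def box_loss(pred, target, w_l1=1.0, w_giou=1.0):
    """Weighted sum of L1 and GIoU losses (weights are hypothetical)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return w_l1 * l1 + w_giou * (1.0 - giou(pred, target))


def dice_loss(pred_mask, target_mask, eps=1.0):
    """Dice loss over flattened masks with values in [0, 1]."""
    overlap = sum(p * t for p, t in zip(pred_mask, target_mask))
    total = sum(pred_mask) + sum(target_mask)
    return 1.0 - (2.0 * overlap + eps) / (total + eps)
```

A perfect box prediction gives `box_loss(...) == 0.0`, while two disjoint boxes give a negative GIoU, so the GIoU term still provides a useful signal when the plain IoU is zero. In practice these terms would be computed on batched tensors in a deep learning framework rather than on Python lists.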

General Object Foundation Model for Images and Videos at Scale
