Contextual Object Detection with Multimodal Large Language Models

Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, Chen Change Loy
2024-08-12
Abstract: Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks, such as image captioning and question answering, but lack the essential perception ability, i.e., object detection. In this work, we address this limitation by introducing a novel research problem of contextual object detection -- understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. Our ContextDET involves three key submodels: (i) a visual encoder for extracting visual representations, (ii) a pre-trained LLM for multimodal context decoding, and (iii) a visual decoder for predicting bounding boxes given contextual object words. The new generate-then-detect framework enables us to detect object words within human vocabulary. Extensive experiments show the advantages of ContextDET on our proposed CODE benchmark, open-vocabulary detection, and referring image segmentation. GitHub: https://github.com/yuhangzang/ContextDET
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper examines a limitation of Multimodal Large Language Models (MLLMs) in vision-language tasks and introduces a new research problem: contextual object detection. Existing MLLMs perform well on tasks such as image captioning and question answering but lack object detection ability. To address this shortcoming, the authors propose a new research direction: understanding visible objects within different human-AI interactive contexts.

The paper makes the following key contributions:

1. **Proposing a New Problem**: Contextual object detection, which involves understanding visible objects in different interaction contexts.
2. **Constructing a Dataset**: Introducing a new benchmark dataset named CODE, which includes a large number of unique object names to facilitate research in contextual object detection.
3. **Designing a Framework**: Proposing ContextDET, a new generate-then-detect framework designed specifically for contextual object detection.
4. **Experimental Validation**: Demonstrating the advantages of ContextDET on the CODE benchmark, and further validating it on open-vocabulary detection and referring image segmentation.

### Specific Goals of Contextual Object Detection

The paper mentions four main goals of contextual object detection:

1. **Processing Capability**: Handling object names drawn from the human language vocabulary.
2. **Descriptive Capability**: Describing the visual input provided by the user in natural language.
3. **Perceptual Capability**: Locating and associating visual objects based on language queries.
4. **Understanding Capability**: Supplying appropriate words in response to language prompts.

Through three representative tasks (the language cloze test, visual captioning, and question answering), the paper explores how to achieve these goals in multimodal large language models.
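
The generate-then-detect flow named in the abstract (visual encoder, then LLM context decoding, then a visual decoder that predicts boxes for the generated object words) can be illustrated with a toy sketch. This is not the ContextDET implementation: every function name, the `[MASK]` cloze convention, the box format, and the dictionary standing in for an image are assumptions made for illustration, and each stage is a trivial stand-in for a learned model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    word: str  # object word produced in the "generate" step
    box: Box   # box predicted for that word in the "detect" step


def visual_encoder(image: Dict[str, Box]) -> Dict[str, Box]:
    """Stand-in for submodel (i). A real encoder would return visual tokens;
    here the toy 'image' is already a map from object name to box."""
    return image


def llm_decode(visual_feats: Dict[str, Box], prompt: str) -> List[str]:
    """Stand-in for submodel (ii): propose one object word per [MASK] slot.
    A real LLM would decode words conditioned on visual tokens and text;
    this toy version just takes names in alphabetical order."""
    n_masks = prompt.count("[MASK]")
    return sorted(visual_feats)[:n_masks]


def visual_decoder(visual_feats: Dict[str, Box], words: List[str]) -> List[Detection]:
    """Stand-in for submodel (iii): predict a box for each generated word."""
    return [Detection(w, visual_feats[w]) for w in words if w in visual_feats]


def contextual_detect(image: Dict[str, Box], prompt: str) -> List[Detection]:
    """Generate object words first, then detect boxes for those words."""
    feats = visual_encoder(image)
    words = llm_decode(feats, prompt)    # generate
    return visual_decoder(feats, words)  # detect


if __name__ == "__main__":
    toy_image = {"dog": (10, 20, 110, 220), "frisbee": (150, 40, 200, 90)}
    for det in contextual_detect(toy_image, "A [MASK] catches a [MASK]."):
        print(det.word, det.box)
```

The point of the sketch is the ordering: the set of detectable names is not fixed in advance, because the words come from the language decoding step and the boxes are predicted only afterwards, conditioned on those words.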