Abstract: Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks, such as image captioning and question answering, but lack the essential perception ability, i.e., object detection. In this work, we address this limitation by introducing a novel research problem of contextual object detection -- understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. Our ContextDET involves three key submodels: (i) a visual encoder for extracting visual representations, (ii) a pre-trained LLM for multimodal context decoding, and (iii) a visual decoder for predicting bounding boxes given contextual object words. The new generate-then-detect framework enables us to detect object words within human vocabulary. Extensive experiments show the advantages of ContextDET on our proposed CODE benchmark, open-vocabulary detection, and referring image segmentation. GitHub: https://github.com/yuhangzang/ContextDET
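The data flow through the three submodels named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `generate_then_detect`, the `Detection` record, the `(x1, y1, x2, y2)` box convention, and the toy stand-ins for each submodel are all hypothetical, chosen only to show the generate-then-detect ordering (the language model first produces object words, and boxes are then predicted for those words).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed box convention for this sketch: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    word: str  # contextual object word produced by the language model
    box: Box   # box predicted for that word


def generate_then_detect(
    image,
    prompt: str,
    visual_encoder: Callable,
    llm_decoder: Callable,
    visual_decoder: Callable,
) -> Tuple[str, List[Detection]]:
    """Hypothetical three-stage pipeline: encode, generate, then detect."""
    # (i) Extract visual representations from the image.
    visual_tokens = visual_encoder(image)
    # (ii) Decode the multimodal context into text plus object words.
    text, object_words = llm_decoder(visual_tokens, prompt)
    # (iii) Predict one box per generated object word.
    boxes = visual_decoder(visual_tokens, object_words)
    return text, [Detection(w, b) for w, b in zip(object_words, boxes)]


# Toy stand-ins so the sketch runs; real submodels would be neural networks.
def toy_encoder(image):
    return [float(sum(row)) for row in image]


def toy_llm(visual_tokens, prompt):
    # Cloze-style example: fill the mask with a fixed word.
    return prompt.replace("[MASK]", "dog"), ["dog"]


def toy_visual_decoder(visual_tokens, object_words):
    return [(0.0, 0.0, 1.0, 1.0) for _ in object_words]


text, detections = generate_then_detect(
    [[0, 1], [1, 0]],
    "A [MASK] on the grass.",
    toy_encoder,
    toy_llm,
    toy_visual_decoder,
)
```

Because the object words come from the language model rather than from a fixed label set, a pipeline of this shape is not restricted to a predefined list of categories, which is the property the abstract attributes to the generate-then-detect framework.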

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper examines a limitation of Multimodal Large Language Models (MLLMs) in vision-language tasks and introduces a new research problem: contextual object detection. Existing MLLMs perform well on tasks such as image captioning and question answering but lack object detection ability. To address this shortcoming, the authors propose a new research direction: understanding visible objects in different human-AI interactive contexts.

The paper makes the following key contributions:

1. **Proposing a New Problem**: Contextual object detection, which involves understanding visible objects in different interaction contexts.
2. **Constructing a Dataset**: Introducing a new benchmark dataset named CODE, which includes a large number of unique object names to facilitate research in contextual object detection.
3. **Designing a Framework**: Proposing ContextDET, a new generate-then-detect framework designed specifically for contextual object detection.
4. **Experimental Validation**: Demonstrating the advantages of ContextDET on the CODE benchmark and validating it on open-vocabulary detection and referring image segmentation.

### Specific Goals of Contextual Object Detection

The paper mentions four main goals of contextual object detection:

1. **Processing Capability**: Handling object names in the human language vocabulary.
2. **Descriptive Capability**: Describing the visual input provided by the user in natural language.
3. **Perceptual Capability**: Locating and associating visual objects based on language queries.
4. **Understanding Capability**: Supplying appropriate words based on language prompts.

The paper explores how to achieve these goals in multimodal large language models through three representative tasks: the language cloze test, visual captioning, and question answering.

Contextual Object Detection with Multimodal Large Language Models

CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models

Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Improving Context Understanding in Multimodal Large Language Models Via Multimodal Composition Learning

Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning

Context-LGM: Leveraging Object-Context Relation for Context-Aware Object Recognition

Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models

OmDet: Large‐scale vision‐language multi‐dataset pre‐training with multimodal detection network

Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection

Multi-modal Queried Object Detection in the Wild

Empowering Corner Case Detection in Autonomous Vehicles with Multimodal Large Language Models

Chain of Visual Perception: Harnessing Multimodal Large Language Models for Zero-shot Camouflaged Object Detection

Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

ContextHOI: Spatial Context Learning for Human-Object Interaction Detection

Multimodal Instruction Tuning with Hybrid State Space Models

OCC-MLLM: Empowering Multimodal Large Language Model For the Understanding of Occluded Objects

Large Model Based Referring Camouflaged Object Detection

Leveraging Chat-Based Large Vision Language Models for Multimodal Out-Of-Context Detection

Exploring the Design Space of Visual Context Representation in Video MLLMs