Multimodal Inplace Prompt Tuning for Open-set Object Detection

Guilin Li, Mengdan Zhang, Xiawu Zheng, Peixian Chen, Zihan Wang, Yunhang Shen, Mingchen Zhuge, Chenglin Wu, Fei Chao, Ke Li, Xing Sun, Rongrong Ji
DOI: https://doi.org/10.1145/3664647.3681275
2024-01-01
Abstract: The integration of large language models into open-world detection frameworks significantly improves versatility in new environments. Prompt representations derived from these models help establish classification boundaries for both base and novel categories within open-world detectors. However, we find, for the first time to our knowledge, that directly fine-tuning language models in detection systems produces redundant attention patterns and leads to suboptimal prompt representations. To fully leverage the capabilities of large language models and strengthen prompt encoding for detection, this study introduces a redundancy assessment metric that identifies uniform attention patterns. In areas with high redundancy, we then apply multimodal inplace prompt tuning (MIPT) to enrich the text prompt with visual clues. Experimental results validate the efficacy of our MIPT framework, which achieves notable gains across benchmarks, e.g., raising GLIP-L from 22.6% to 25.0% on ODinW-35 and yielding a 9.0% improvement on LVIS.