Exploring Vision-Language Foundation Model for Novel Object Captioning

Jianjie Luo,Yehao Li,Yingwei Pan,Ting Yao,Jianlin Feng,Hongyang Chao,Tao Mei
DOI: https://doi.org/10.1109/tcsvt.2024.3452437
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract:It is always well believed that pre-trained vision-language foundation models (e.g., CLIP) would substantially facilitate vision-language tasks. Nevertheless, there has been less evidence in support of the idea on describing novel objects in images. In this paper, we propose the Novel Object Transformer with CLIP (NOTC), a Transformer-based model that innovatively exploits the powerful vision-language representation ability of CLIP to enhance novel object captioning model’s training and sentence decoding processes. Technically, given the primary bag-of-objects extracted by Faster R-CNN, NOTC first capitalize on an object distiller module to emphasize the most salient objects and infer the missing novel ones. The refined object words are additionally fed into the object-centric word predictor to generate sentence word-by-word. During training, we design a CLIP-based self-critical sequence training paradigm to select visually-grounded sampled sentence with higher CLIP score reward, which enables a joint training process of captioning model over out-domain training images with novel objects. Moreover, at inference, a new CLIP beam search algorithm is devised to enforce the existence of novel objects and encourage the partial word sequences with higher CLIP scores, thereby decoding both visually-grounded and comprehensive sentences. Extensive experiments are conducted on held-out COCO and nocaps datasets, and competitive performances are reported when compared to state-of-the-art approaches.
What problem does this paper attempt to address?