FineFormer: Fine-Grained Adaptive Object Transformer for Image Captioning

Bo Wang, Zhao Zhang, Jicong Fan, Mingbo Zhao, Choujun Zhan, Mingliang Xu
DOI: https://doi.org/10.1109/icdm54844.2022.00061
2022-01-01
Abstract: Image captioning, which aims to describe the contents of an image in words, remains a challenging task. Current image captioning methods usually assume that an object relation is important if the semantic and spatial geometric relationships between the objects are close and large, but relations that meet this assumption are not necessarily important for describing the contents of an image in a fine-grained way. That is, the importance of fine-grained object relations is not properly taken into account. In addition, current Transformer-based image captioning models fail to consider the importance of fine-grained objects: because they generate all the words of a sentence at once, they cannot determine which objects are more important and which are less so. In this paper, we propose a novel Fine-grained Adaptive Object Transformer (FineFormer) network, which jointly discovers the importance of fine-grained objects and object relations for image captioning. Specifically, a new concept of adaptive soft-foreground attention is proposed to highlight the fine-grained objects that dominate the descriptive contents. To characterize and calculate the important relations between fine-grained objects, we also propose an adaptive object relation attention that refines object relations within the relation generation process. As a result, FineFormer can describe the contents of an image more accurately by reducing the interference of unimportant objects in the background. Extensive experiments on the highly competitive MS-COCO dataset demonstrate the superiority of FineFormer.