Multi-grained Representation Aggregating Transformer with Gating Cycle for Change Captioning

Shengbin Yue, Yunbin Tu, Liang Li, Shengxiang Gao, Zhengtao Yu
DOI: https://doi.org/10.1145/3660346
2024-04-22
Abstract: Change captioning aims to describe the difference within an image pair in natural language, a task that combines visual comprehension and language generation. Although significant progress has been achieved, perceiving object changes from different perspectives remains a key challenge, especially in the severe case of drastic viewpoint change. In this paper, we propose a novel fully attentive network, the Multi-grained Representation Aggregating Transformer (MURAT), to distinguish actual change from viewpoint change. Specifically, the Pair Encoder first captures similar semantics between pairwise objects in a multi-level manner; these serve as semantic cues for identifying irrelevant change. Next, a novel Multi-grained Representation Aggregator (MRA) constructs a reliable difference representation by employing both coarse- and fine-grained semantic cues. Finally, the language decoder generates a description of the change based on the output of the MRA. In addition, a Gating Cycle Mechanism is introduced to promote semantic consistency between difference representation learning and language generation through a reverse manipulation, so as to bridge the semantic gap between change features and text features. Extensive experiments demonstrate that the proposed MURAT greatly improves the ability to describe the actual change in the presence of distracting irrelevant change, and that it achieves state-of-the-art performance on three benchmarks: CLEVR-Change, CLEVR-DC and Spot-the-Diff.
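The idea of using similar semantics between paired objects as a cue for irrelevant change can be illustrated with a toy example. The sketch below is not the MURAT implementation, whose layers are not specified in the abstract; it assumes a generic dot-product soft matching between "before" and "after" object features, and all function names (`soft_match`, `fine_difference`, `coarse_difference`) and the `scale` parameter are hypothetical. It shows that a per-object (fine-grained) residual stays near zero when objects are merely reordered, as a stand-in for viewpoint change, while a mean-pooled (coarse-grained) residual only signals that something changed overall.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_match(query, keys, scale=10.0):
    """Attention-weighted average of `keys`, weighted by similarity to `query`."""
    weights = softmax([scale * dot(query, k) for k in keys])
    return [sum(w * k[d] for w, k in zip(weights, keys))
            for d in range(len(query))]

def fine_difference(before, after, scale=10.0):
    """Per-object residual: each 'before' feature minus its soft match in 'after'."""
    return [[f - m for f, m in zip(feat, soft_match(feat, after, scale))]
            for feat in before]

def coarse_difference(before, after):
    """Global residual between mean-pooled 'before' and 'after' features."""
    def mean(feats):
        return [sum(col) / len(feats) for col in zip(*feats)]
    return [b - a for b, a in zip(mean(before), mean(after))]

# Toy object features (assumed one-hot for clarity, not real image features).
before = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
# Same objects in a different order: a stand-in for a viewpoint shift.
viewpoint_only = [before[2], before[0], before[1]]
# Reordered AND the third object replaced by a new one: an actual change.
real_change = [before[1], before[0], [0.0, 0.0, 0.0, 1.0]]

fine_view = [norm(r) for r in fine_difference(before, viewpoint_only)]
fine_real = [norm(r) for r in fine_difference(before, real_change)]
coarse_view = norm(coarse_difference(before, viewpoint_only))
coarse_real = norm(coarse_difference(before, real_change))
```

In this toy setting, every entry of `fine_view` is close to zero because each object finds its counterpart despite the reordering, whereas `fine_real` is large only for the replaced object. `coarse_real` is nonzero but does not say which object changed, which is one plausible motivation for combining both granularities.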
computer science, information systems, theory & methods, software engineering