RecFormer: Recurrent Multi-modal Transformer with History-Aware Contrastive Learning for Visual Dialog.

Liucun Lu, Jinghui Qin, Zequn Jie, Lin Ma, Liang Lin, Xiaodan Liang
DOI: https://doi.org/10.1007/978-981-99-8429-9_13
2024-01-01
Abstract: Recently, benefiting from the powerful representation ability learned from large-scale image-text pre-training, pre-trained vision-language models have shown significant improvements on the visual dialog task. However, these works face two main challenges: 1) how to incorporate the sequential nature of multi-turn dialog systems to better capture the temporal dependencies of visual dialog; 2) how to align the semantics of different modality-specific features for better multi-modal interaction and understanding. To address these issues, we propose a recurrent multi-modal transformer (named RecFormer) that captures temporal dependencies between utterances by encoding dialog utterances and interacting with visual information turn by turn. Specifically, we equip a pre-trained transformer with a recurrent function that maintains a cross-modal history encoding for the dialog agent. Thus, the dialog agent can make better predictions by considering temporal dependencies. In addition, we propose history-aware contrastive learning as an auxiliary task that aligns visual features and dialog history features to improve visual dialog understanding. The experimental results demonstrate that RecFormer achieves new state-of-the-art performance on both the VisDial v0.9 (72.52 MRR and 60.47 R@1 on the val split) and VisDial v1.0 (69.29 MRR and 55.90 R@1 on the test-std split) datasets.
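The abstract does not give the exact form of the history-aware contrastive objective. As a rough illustration of how visual features and dialog history features can be aligned, the sketch below uses a symmetric InfoNCE loss, which is a common choice for this kind of cross-modal alignment and is assumed here rather than taken from the paper. The function name `info_nce`, the `temperature` value, and the toy feature vectors are all hypothetical; row i of `visual` is treated as the positive pair of row i of `history`, and every other row in the batch serves as a negative.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (guarding against a zero norm)."""
    norm = math.sqrt(sum(x * x for x in vec))
    norm = max(norm, 1e-12)
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(visual, history, temperature=0.07):
    """Symmetric InfoNCE over a batch of (visual, history) feature pairs.

    Assumption: this is a generic contrastive loss, not the paper's exact
    formulation. `temperature` is an illustrative default.
    """
    v = [_normalize(x) for x in visual]
    h = [_normalize(x) for x in history]
    # Cosine similarity between every visual and every history vector.
    logits = [[_dot(vi, hj) / temperature for hj in h] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the visual-to-history and history-to-visual directions.
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Toy batch: matched pairs point in the same direction.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same batch with the history rows swapped, so the pairs are mismatched.
misaligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each visual vector matches its own history vector and much larger when the pairs are swapped, which is the behavior an auxiliary alignment objective of this kind relies on.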