Visual Dialog for Spotting the Differences Between Pairs of Similar Images

Duo Zheng, Fandong Meng, Qingyi Si, Hairun Fan, Zipeng Xu, Jie Zhou, Fangxiang Feng, Xiaojie Wang
DOI: https://doi.org/10.1145/3503161.3548170
2022-01-01
Abstract: Visual dialog has made great progress since various vision-oriented goals were introduced into the conversation. Much previous work focuses on tasks in which the two interlocutors can access only a single image, such as VisDial and GuessWhat. Situations where the two interlocutors access different images have received less attention. Such situations are common in the real world and pose challenges different from those of one-image tasks. The lack of dialog tasks of this type, and of corresponding large-scale datasets, makes it impossible to carry out in-depth research. This paper therefore first proposes a new visual dialog task named Dial-the-Diff, in which two interlocutors, each accessing one of two similar images, try to spot the difference between the images by conversing in natural language. The task raises new challenges for dialog strategy and for the ability to categorize objects. We then build a large-scale multi-modal dataset for the task, named DialDiff, which contains 87k Virtual Reality images and 78k dialogs. Details of the data are given and analyzed to highlight the challenges behind the task. Finally, we propose benchmark models for this task and conduct extensive experiments to evaluate their performance as well as the problems that remain.