Scene-text Oriented Visual Entailment: Task, Dataset and Solution
Nan Li, Pijian Li, Dongsheng Xu, Wenye Zhao, Yi Cai, Qingbao Huang
DOI: https://doi.org/10.1145/3581783.3612593
2023-01-01
Abstract: Visual Entailment (VE) is a fine-grained reasoning task that aims to predict whether an image semantically entails a hypothesis given in textual form. Existing VE studies focus only on basic visual attributes and largely overlook scene text, which usually carries rich semantic information and crucial clues (e.g., time, place, affiliation, and topic); this leads to superficially designed hypotheses or incorrect entailment predictions. To fill this gap, we propose a new task called scene-text oriented Visual Entailment (STOVE), which requires models to predict whether an image semantically entails a corresponding hypothesis designed around scene-text-centered visual information. The STOVE task challenges a model to deeply understand the interplay between language and images containing scene text, which requires aligning hypothesis tokens, scene text, and visual content. To support research on STOVE, we further collect a dataset termed TextVE, consisting of 23,864 images and 47,728 scene-text-related hypotheses, constructed with a strategy that minimizes biases. Additionally, we present a baseline named MMTVE, which applies a multimodal transformer to model the spatial, semantic, and visual reasoning relations among scene text tokens, hypotheses, and visual features. Experimental results show that our model is effective on STOVE and achieves outstanding performance. Our code is available at https://github.com/VISLANG-Lab/TextVE.