Cross-modal Observation Hypothesis Inference
Mengze Li, Kairong Han, Jiahe Xu, Yueying Li, Tao Wu, Zhou Zhao, Jiaxu Miao, Shengyu Zhang, Jingyuan Chen
DOI: https://doi.org/10.1145/3664647.3681591
2024-01-01
Abstract: Hypothesis inference, a sophisticated cognitive process that allows humans to construct plausible explanations for incomplete observations, is paramount to our ability to make sense of the world around us. Despite the universality of this skill, it remains under-explored in multi-modal AI, where it requires analyzing observations, recalling information in the mind, and generating explanations. In this work, we propose the Cross-modal Observation hypothesIs iNference task (COIN). Given a textual description of a partially observed event, COIN strives to recall the most probable event from the visual mind (video pool) and to infer the subsequent action flow connecting the visual mind event and the observed textual event. To advance the development of this field, we propose a large-scale text-video dataset, Tex-COIN, that contains 39,796 meticulously annotated hypothesis inference examples and auxiliary commonsense knowledge (appearance, clothing, action, etc.) for key video characters. Based on the proposed Tex-COIN dataset, we design a strong baseline, COINNet, which features two perspectives: 1) aligning temporally displaced textual observations with target videos via transformer-based multi-task learning, and 2) inferring the action flow with non-parametric graph-based inference grounded in graph theory. Extensive experiments on the Tex-COIN dataset validate the effectiveness of COINNet, which significantly outperforms state-of-the-art methods.