Object-Centric Cross-Modal Knowledge Reasoning for Future Event Prediction in Videos

Chenghang Lai, Haibo Wang, Weifeng Ge, Xiangyang Xue
DOI: https://doi.org/10.1109/tcsvt.2024.3444895
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Although multi-modal large language models possess impressive cross-modal reasoning and prediction capabilities, they lack a unified and rigorous evaluation standard. In this paper, we introduce a future event prediction task to assess the cross-modal temporal prediction capabilities of these models. This task requires the model to generate descriptions of events that may occur in the future based on an input video. To tackle this new task, we propose an object-centric cross-modal knowledge reasoning framework, which combines a basic information encoder, an adaptive multi-segment filter, a spatial-temporal relation encoder, a vision-text interaction module, and a pre-trained large language model decoder. The adaptive multi-segment filter selectively captures critical visual information in videos, enhancing the model’s focus on relevant features. The spatial-temporal relation encoder decomposes and associates the objects and scene information in the video. Additionally, the vision-text interaction module enhances the connection between visual sequences and their corresponding textual narratives, ensuring semantic coherence and consistency. To evaluate our framework, we constructed a dataset containing descriptions, dialogues of future events, and object-centric event reasoning chains. Experimental results indicate that the proposed framework outperforms all previous methods for future event prediction. Ablation studies further demonstrate the effectiveness of the designed modules.