Focus and Align: Learning Tube Tokens for Video-Language Pre-training

Yongqing Zhu, Xiangyang Li, Mao Zheng, Jiahao Yang, Zihan Wang, Xiaoqian Guo, Zifeng Chai, Yuchen Yuan, Shuqiang Jiang
DOI: https://doi.org/10.1109/tmm.2022.3231108
IF: 7.3
2022-01-01
IEEE Transactions on Multimedia
Abstract: Video-language pre-training (VLP) has attracted increasing attention for cross-modality understanding tasks. To enhance visual representations, recent works attempt to adopt transformer-based architectures as video encoders. These works usually focus on the visual representations of the sampled frames. Compared with frame representations, frame patches incorporate more fine-grained spatio-temporal information, which could lead to a better understanding of video contents. However, how to exploit the spatio-temporal information within frame patches for VLP has been less investigated. In this work, we propose a method to learn tube tokens to model the key spatio-temporal information from frame patches. To this end, multiple semantic centers are introduced to focus on the underlying patterns of frame patches. Based on each semantic center, the spatio-temporal information within frame patches is integrated into a unique tube token. Complementary to frame representations, tube tokens provide detailed clues of video contents. Furthermore, to better align the generated tube tokens and the contents of descriptions, a local alignment mechanism is introduced. The experiments based on a variety of downstream tasks demonstrate the effectiveness of the proposed method.
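The abstract does not give the formulation, so the following is only a rough sketch of one way the two ideas could be realized, not the authors' implementation. It assumes each semantic center pools frame patches from all sampled frames with softmax attention weights to form one tube token, and that local alignment scores a description by matching each word to its most similar tube token. The function names (`tube_tokens`, `local_alignment`), the scaled dot-product weighting, and the max-then-mean cosine score are all illustrative assumptions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tube_tokens(patches, centers):
    """Pool patch embeddings into one token per semantic center.

    patches: list of D-dim vectors, flattened over all frames and positions.
    centers: list of K D-dim vectors (assumed here to be learned parameters).
    Returns K vectors of dimension D (hypothetical "tube tokens").
    """
    dim = len(patches[0])
    scale = math.sqrt(dim)
    tokens = []
    for center in centers:
        # Assumption: each center attends over every patch in every frame.
        weights = softmax([dot(center, p) / scale for p in patches])
        token = [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]
        tokens.append(token)
    return tokens


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / (norm + 1e-8)


def local_alignment(tokens, words):
    """Assumed alignment score: each word picks its best-matching tube token,
    and the per-word maxima are averaged over the description."""
    return sum(max(cosine(t, w) for t in tokens) for w in words) / len(words)


# Toy usage: 2 frames x 2 patches of dimension 2, and 2 semantic centers.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
centers = [[5.0, 0.0], [0.0, 5.0]]
tokens = tube_tokens(patches, centers)
score = local_alignment(tokens, [[1.0, 0.0]])
```

In this toy setup each center concentrates its weight on the patches it resembles across both frames, which is the sense in which a token spans a "tube" through time. In a real model the centers and the patch and word encoders would be trained jointly.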
computer science, information systems, telecommunications, software engineering