Multimodal Cross-lingual Summarization for Videos: A Revisit in Knowledge Distillation Induced Triple-stage Training Method
Nayu Liu, Kaiwen Wei, Yong Yang, Jianhua Tao, Xian Sun, Fanglong Yao, Hongfeng Yu, Li Jin, Zhao Lv, Cunhang Fan
DOI: https://doi.org/10.1109/TPAMI.2024.3447778
2024-08-22
Abstract: Multimodal summarization (MS) for videos aims to generate summaries from multi-source information (e.g., video and text transcripts), and the technique has made promising progress recently. However, existing work is limited to monolingual video scenarios and overlooks the practical need of non-native viewers to understand videos in another language. This motivates us to introduce multimodal cross-lingual summarization for videos (MCLS), which aims to generate cross-lingual summaries from the multimodal input of videos. Given the high annotation cost and resource constraints in MCLS, we propose a knowledge distillation (KD) induced triple-stage training method that assists MCLS by transferring knowledge from abundant monolingual MS data to data of insufficient volume. Within this method, a video-guided dual fusion network (VDF) serves as the backbone and integrates multimodal and cross-lingual information through different fusion strategies in the encoder and decoder. In addition, we propose two cross-lingual knowledge distillation strategies: adaptive pooling distillation and language-adaptive warping distillation (LAWD). These strategies are tailored to the distillation objects (i.e., encoder-level and vocab-level KD) to facilitate effective knowledge transfer between MS and MCLS models across cross-lingual sequences of varying lengths. Specifically, to tackle the unequal lengths of parallel cross-language sequences in KD, the proposed LAWD conducts cross-language distillation directly while keeping the shape of the language features unchanged, which reduces potential information loss. We meticulously annotated the How2-MCLS dataset, based on the How2 dataset, to simulate the MCLS scenario. The experimental results show that the proposed method achieves competitive performance compared with strong baselines and brings substantial performance improvements to MCLS models by transferring knowledge from the MS model.
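
To make the length-mismatch problem in encoder-level KD concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "adaptive pooling" means average-pooling the teacher (MS) encoder states into as many bins as the student (MCLS) encoder has positions, and that the two are then compared with a mean squared error; the abstract does not specify either choice. The names `adaptive_avg_pool`, `mse`, and `encoder_kd_loss` are hypothetical.

```python
from typing import List

Vector = List[float]


def adaptive_avg_pool(seq: List[Vector], out_len: int) -> List[Vector]:
    """Average-pool a sequence of vectors down (or up) to exactly out_len vectors.

    Bin i covers positions floor(i*n/out_len) .. ceil((i+1)*n/out_len) - 1,
    so every bin is non-empty for any positive out_len.
    """
    n = len(seq)
    if n == 0 or out_len <= 0:
        raise ValueError("need a non-empty sequence and a positive out_len")
    dim = len(seq[0])
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = -(-((i + 1) * n) // out_len)  # ceiling division
        window = seq[start:end]
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


def mse(a: List[Vector], b: List[Vector]) -> float:
    """Mean squared error over two equally shaped sequences of vectors."""
    total, count = 0.0, 0
    for va, vb in zip(a, b):
        for x, y in zip(va, vb):
            total += (x - y) ** 2
            count += 1
    return total / count


def encoder_kd_loss(teacher_states: List[Vector], student_states: List[Vector]) -> float:
    """Assumed encoder-level KD loss: pool the teacher to the student's length, then MSE."""
    pooled_teacher = adaptive_avg_pool(teacher_states, len(student_states))
    return mse(pooled_teacher, student_states)


# Toy example: a 4-position teacher sequence and a 2-position student sequence.
teacher = [[0.0], [2.0], [4.0], [6.0]]
student = [[0.0], [5.0]]
loss = encoder_kd_loss(teacher, student)
```

Pooling changes the shape of the teacher features, which is the kind of information loss the abstract says LAWD is designed to avoid; this sketch illustrates only the pooling variant.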