Abstract: Multimodal summarization (MS) for videos aims to generate summaries from multi-source information (e.g., video and text transcript), and it has made promising progress recently. However, existing work is limited to monolingual video scenarios and overlooks the needs of non-native viewers who must understand cross-lingual videos in practical applications. This motivates us to introduce multimodal cross-lingual summarization for videos (MCLS), which aims to generate cross-lingual summaries from the multimodal input of videos. Given the high annotation cost and resource constraints in MCLS, we propose a knowledge distillation (KD) induced triple-stage training method that assists MCLS by transferring knowledge from abundant monolingual MS data to data of insufficient volume. Within the triple-stage training method, a video-guided dual fusion network (VDF) is designed as the backbone network to integrate multimodal and cross-lingual information through different fusion strategies in the encoder and decoder. Moreover, we propose two cross-lingual knowledge distillation strategies: adaptive pooling distillation and language-adaptive warping distillation (LAWD). These strategies are tailored to the distillation objects (i.e., encoder-level and vocab-level KD) to facilitate effective knowledge transfer between MS and MCLS models across cross-lingual sequences of varying lengths. Specifically, to tackle the unequal lengths of parallel cross-language sequences in KD, LAWD conducts cross-language distillation directly while keeping the shape of the language features unchanged, which reduces potential information loss. We meticulously annotated the How2-MCLS dataset, based on the How2 dataset, to simulate the MCLS scenario. The experimental results show that the proposed method achieves competitive performance compared with strong baselines and brings substantial performance improvements to MCLS models by transferring knowledge from the MS model.
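
The abstract does not spell out how adaptive pooling distillation is computed. As a rough illustration only, the sketch below shows one way pooling could align teacher and student feature sequences of unequal length before a distillation loss is applied. The function names (`adaptive_avg_pool`, `pooled_mse`), the choice of average pooling, the shared target length `out_len`, and the mean-squared-error loss are all assumptions made for this example, not the authors' formulation.

```python
def adaptive_avg_pool(seq, out_len):
    """Average-pool a list of feature vectors down (or up) to out_len vectors.

    Hypothetical helper: window i covers positions
    floor(i * n / out_len) .. ceil((i + 1) * n / out_len) - 1.
    """
    n = len(seq)
    dim = len(seq[0])
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = ((i + 1) * n + out_len - 1) // out_len  # integer ceiling
        window = seq[start:end]
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


def pooled_mse(teacher, student, out_len):
    """Assumed loss: MSE between teacher and student features after both
    sequences are pooled to the same length."""
    t = adaptive_avg_pool(teacher, out_len)
    s = adaptive_avg_pool(student, out_len)
    total = sum(
        (a - b) ** 2 for tv, sv in zip(t, s) for a, b in zip(tv, sv)
    )
    return total / (out_len * len(t[0]))


# Toy features: a 4-step teacher sequence and a 3-step student sequence.
teacher = [[1.0], [3.0], [5.0], [7.0]]
student = [[2.0], [4.0], [6.0]]
loss = pooled_mse(teacher, student, out_len=2)
```

Pooling both sequences to a common length makes them directly comparable but discards positional detail. That trade-off is consistent with the abstract's stated motivation for LAWD, which keeps the feature shape unchanged.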

Enhance Language Identification using Dual-mode Model with Knowledge Distillation

Accelerating Multiple Intent Detection and Slot Filling Via Targeted Knowledge Distillation

I$^2$KD-SLU: An Intra-Inter Knowledge Distillation Framework for Zero-Shot Cross-Lingual Spoken Language Understanding

A dual task learning approach to fine-tune a multilingual semantic speech encoder for Spoken Language Understanding

Aligner²: Enhancing Joint Multiple Intent Detection and Slot Filling Via Adjustive and Forced Cross-Task Alignment

PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification

Integrated Multi-Level Knowledge Distillation for Enhanced Speaker Verification

Using Iterative Adaptation and Dynamic Mask for Child Speech Extraction under Real-World Multilingual Conditions

Modality Blur and Batch Alignment Learning for Twin Noisy Labels-based Visible–infrared Person Re-identification

Multimodal Cross-lingual Summarization for Videos: A Revisit in Knowledge Distillation Induced Triple-stage Training Method

High-resolution Acoustic Modeling and Compact Language Modeling of Language-Universal Speech Attributes for Spoken Language Identification

Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment

Spoken Language Identification System for English-Mandarin Code-Switching Child-Directed Speech

Language-aware PLDA for multilingual speaker recognition

Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip Reading

Simple yet Effective Code-Switching Language Identification with Multitask Pre-Training and Transfer Learning

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Generative linguistic representation for spoken language identification

Exploiting Spectral Augmentation for Code-Switched Spoken Language Identification

Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking