Abstract: Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion. However, conventional systems are typically built on the assumption that all modalities are available, and missing modalities always lead to poor inference performance. Furthermore, extracting pretrained embeddings for all modalities is computationally inefficient at inference time. In this work, to achieve multimodal transfer learning that is both efficient and high-performing, we propose VideoAdviser, a video knowledge distillation method that transfers multimodal knowledge of video-enhanced prompts from a multimodal fundamental model (teacher) to a specific modal fundamental model (student). With the intuition that the best learning performance comes from professional advisers and smart students, we use a CLIP-based teacher model to provide expressive multimodal knowledge supervision signals to a RoBERTa-based student model by optimizing a step-distillation objective loss. In the first step, the teacher distills multimodal knowledge of video-enhanced prompts from classification logits to a regression logit; in the second step, the multimodal knowledge is distilled from the regression logit of the teacher to the student. We evaluate our method on two challenging multimodal tasks: video-level sentiment analysis (MOSI and MOSEI datasets) and audio-visual retrieval (VEGAS dataset). The student, which requires only the text modality as input, achieves an MAE improvement of up to 12.3% on MOSI and MOSEI. Our method also improves on the state-of-the-art method by 3.4% mAP on VEGAS without additional computation at inference. These results suggest the strengths of our method for achieving efficient, high-performing multimodal transfer learning.
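
The two-step distillation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so everything here is an assumption made for illustration: the squared-error terms, the unweighted sum, and the hypothetical names `expected_score`, `step_distillation_loss`, and `bin_centers` (a score assigned to each prompt class). It should be read as one plausible reading of "classification logits to regression logit, then teacher to student", not as the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(class_logits, bin_centers):
    """Score implied by classification logits: the probability-weighted
    average of the per-class scores (assumed, hypothetical bin centers)."""
    probs = softmax(class_logits)
    return sum(p * c for p, c in zip(probs, bin_centers))


def step_distillation_loss(teacher_class_logits, teacher_reg, student_reg, bin_centers):
    """Assumed two-step objective (illustrative only).

    Step 1: align the teacher's regression output with the score implied
            by its own classification logits over the prompt classes.
    Step 2: align the student's regression output with the teacher's.
    """
    step1 = (teacher_reg - expected_score(teacher_class_logits, bin_centers)) ** 2
    step2 = (student_reg - teacher_reg) ** 2
    return step1 + step2


# Toy example: three prompt classes mapped to scores -1, 0, +1.
loss = step_distillation_loss(
    teacher_class_logits=[0.0, 0.0, 0.0],
    teacher_reg=0.5,
    student_reg=1.0,
    bin_centers=[-1.0, 0.0, 1.0],
)
```

In a real training loop the two terms would presumably be computed over batches and combined with the task losses; only the student's text-only forward pass would be needed at inference, which is the efficiency argument the abstract makes.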

Improving Multi-Modal Learning with Uni-Modal Teachers

On Uni-Modal Feature Learning in Supervised Multi-Modal Learning

On Uni-modal Feature Learning in Multi-modal Learning

A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning

Cross-modality Online Distillation for Multi-View Action Recognition

Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models

Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation

What Makes Multi-modal Learning Better than Single (Provably)

Learn to Combine Modalities in Multimodal Deep Learning

Improving Cross-Modal Image-Text Retrieval With Teacher-Student Learning

Improving Unimodal Inference with Multimodal Transformers

Uni-to-Multi Modal Knowledge Distillation for Bidirectional LiDAR-Camera Semantic Segmentation

What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?

Dense Multimodal Fusion for Hierarchically Joint Representation

Multi-Modal Fusion-Based Multi-Task Semantic Communication System

One-stage Modality Distillation for Incomplete Multimodal Learning

Robust Navigation with Cross-Modal Fusion and Knowledge Transfer

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

Interpretation on Multi-modal Visual Fusion

VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning