Abstract: Multimodal land use/cover classification using optical and synthetic aperture radar (SAR) images has attracted significant attention because the unique radiation and geometric characteristics of these images provide complementary information about land properties. However, the significant differences between these modalities create a large semantic gap, which makes effective feature fusion in multimodal learning difficult. Moreover, modalities are often missing in practical applications because of weather constraints or sensor malfunctions, which makes it hard to achieve high performance in cross-modal learning. In this study, we propose a multimodal online knowledge distillation (MMOKD) framework for land use/cover classification of optical and SAR images using either full or missing modalities. The framework trains one modality-fusion network alongside two modality-specific networks in an end-to-end manner, facilitating both multimodal and cross-modal learning. More specifically, we develop a multimodal feature fusion (MFF) module for integrating heterogeneous features and a single-modal feature generation (SFG) module for encapsulating cross-modal complementary information. In addition, we propose the joint distillation with multitype fusion knowledge (JD-MFK) method, which guides the modality-specific student networks to learn comprehensively from the modality-fusion teacher network. Notably, we adopt an online distillation strategy that provides real-time feedback and synchronous updates of both the modality-fusion and modality-specific networks. Finally, we conduct extensive experiments on two multimodal land use/cover classification datasets, comparing against advanced multimodal fusion, cross-modal distillation, and specific baseline networks. The results demonstrate the effectiveness of the proposed MMOKD, which not only outperforms the other networks in both full- and missing-modality scenarios but also significantly improves model training efficiency.
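
The joint training objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' JD-MFK method: the abstract does not specify the knowledge types, loss weights, or temperature, so the example assumes a single generic knowledge type (temperature-softened class probabilities matched with a KL divergence) and hypothetical names (`online_distillation_loss`, `alpha`, `temperature`). It only shows how one fusion teacher and two modality-specific students could contribute to a single loss that is computed in the same step, which is what makes the distillation online.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Standard cross-entropy against an integer class label."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def online_distillation_loss(fusion_logits, optical_logits, sar_logits,
                             label, temperature=2.0, alpha=0.5):
    """Joint loss for one sample (assumed form, not the paper's).

    The fusion (teacher) network is supervised by the label; each
    modality-specific (student) network is supervised by the label
    and by the teacher's softened prediction from the same step.
    """
    teacher_soft = softmax(fusion_logits, temperature)
    loss = cross_entropy(fusion_logits, label)
    for student_logits in (optical_logits, sar_logits):
        student_soft = softmax(student_logits, temperature)
        # temperature**2 is the usual rescaling for softened targets
        distill = temperature ** 2 * kl_divergence(teacher_soft, student_soft)
        loss += (1 - alpha) * cross_entropy(student_logits, label) + alpha * distill
    return loss

# Example: a SAR-only student that disagrees with the fusion teacher
# incurs a larger loss than one that agrees with it.
agree = online_distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0], [2.0, 0.5, -1.0], 0)
disagree = online_distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0], [-1.0, 0.5, 2.0], 0)
print(agree < disagree)
```

In a real framework the three sets of logits would come from networks updated together by one optimizer step. Whether gradients from the distillation term also flow into the teacher is a design choice the abstract does not state.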

Module-wise Adaptive Distillation for Multimodality Foundation Models

Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning

One-stage Modality Distillation for Incomplete Multimodal Learning

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

Cross-modality Online Distillation for Multi-View Action Recognition

LLAVADI: What Matters For Multimodal Large Language Models Distillation

Bi-Level Orthogonal Multi-Teacher Distillation

Self-Improving Teacher Cultivates Better Student: Distillation Calibration for Multimodal Large Language Models

MMANet: Margin-aware Distillation and Modality-aware Regularization for Incomplete Multimodal Learning

MOD: A Deep Mixture Model with Online Knowledge Distillation for Large Scale Video Temporal Concept Localization

VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning

Multimodal Online Knowledge Distillation Framework for Land Use/Cover Classification Using Full or Missing Modalities

A Generalization Theory of Cross-Modality Distillation with Contrastive Learning

Online Knowledge Distillation via Multi-branch Diversity Enhancement

MSD: Saliency-aware Knowledge Distillation for Multimodal Understanding

On-the-fly Modulation for Balanced Multimodal Learning

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation

Multimodal Transformer Distillation for Audio-Visual Synchronization