Abstract: Multimodal pre-training for audio-and-text has recently proven effective and has significantly improved performance on many downstream speech understanding tasks. However, these state-of-the-art audio-text pre-training models work well only when provided with large amounts of parallel audio-and-text data, which poses a challenge for the many languages that are rich in unimodal corpora but scarce in parallel cross-modal corpora. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which reconstructs the input text (audio) representations from a noisy version of themselves. (2) Cross-modal Denoising Auto-Encoding (CDAE), which is pre-trained to reconstruct the input text (audio), given both a noisy version of the input text (audio) and the corresponding translated noisy audio features (text embeddings). (3) Iterative Denoising Process (IDP), which iteratively translates raw audio (text), together with the corresponding text embeddings (audio features) translated in the previous iteration, into new, less noisy text embeddings (audio features). We adapt a dual cross-modal Transformer as our backbone model, which consists of two unimodal encoders for IDAE and two cross-modal encoders for CDAE and IDP. Our method achieves performance on multiple downstream speech understanding tasks comparable to that of a model pre-trained on fully parallel data, demonstrating the great potential of the proposed method. Our code is available at: https://github.com/KarlYuKang/Low-Resource-Multimodal-Pre-training
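
To make the two denoising objectives concrete, the following is a minimal sketch of how training pairs for IDAE and CDAE might be assembled. It is not taken from the authors' code. The abstract does not specify the noise function, so random masking is an assumption here, and the names `MASK`, `corrupt`, `make_idae_pair`, and `make_cdae_pair` are hypothetical. Sequences stand in for either text tokens or audio frames.

```python
import random

MASK = "<mask>"  # hypothetical placeholder symbol, not taken from the paper


def corrupt(sequence, mask_prob, rng):
    """Return a noisy copy of `sequence` with some items replaced by MASK.

    Random masking is an assumed noise function; the abstract only says
    that the input is a "noisy version" of the original.
    """
    return [MASK if rng.random() < mask_prob else item for item in sequence]


def make_idae_pair(sequence, mask_prob=0.3, seed=0):
    """Intra-modal pair: (noisy input, clean target) within one modality."""
    rng = random.Random(seed)
    return corrupt(sequence, mask_prob, rng), list(sequence)


def make_cdae_pair(sequence, translated, mask_prob=0.3, seed=0):
    """Cross-modal pair: the input is a noisy version of `sequence` plus a
    noisy version of its translation into the other modality; the target
    is still the clean `sequence`."""
    rng = random.Random(seed)
    noisy_self = corrupt(sequence, mask_prob, rng)
    noisy_other = corrupt(translated, mask_prob, rng)
    return (noisy_self, noisy_other), list(sequence)


if __name__ == "__main__":
    text = ["turn", "on", "the", "lights"]
    # Stand-in for audio features translated from the text (hypothetical).
    pseudo_audio = ["f0", "f1", "f2", "f3", "f4", "f5"]
    print(make_idae_pair(text))
    print(make_cdae_pair(text, pseudo_audio))
```

In both cases the reconstruction target is the clean sequence; only the conditioning differs. A model trained on such pairs would replace the list-based placeholders with embeddings and feature vectors.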

CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

CrossMAE: Cross Modality Masked Autoencoders for Region-Aware Audio-Visual Pretraining

Learning Speech Representation From Contrastive Token-Acoustic Pretraining

Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training

Unified Video-Language Pre-training with Synchronized Audio

VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

Pre-training for Speech Translation: CTC Meets Optimal Transport

Cross-modal Alignment with Optimal Transport for CTC-based ASR

Cascaded Cross-Modal Transformer for Audio-Textual Classification

CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations

Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling

Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data