Abstract: Self-supervised learning (SSL) using masked prediction has made great strides in general-purpose audio representation. This study proposes Masked Modeling Duo (M2D), an improved masked prediction SSL method that learns by predicting representations of masked input signals, which serve as training signals. Unlike conventional methods, M2D obtains a training signal by encoding only the masked part, encouraging the two networks in M2D to model the input. While M2D improves general-purpose audio representations, a specialized representation is essential for real-world applications, such as those in the industrial and medical domains. The data in such domains is often confidential and proprietary, typically limited in size, and distributed differently from pre-training datasets. Therefore, we propose M2D for X (M2D-X), which extends M2D to enable the pre-training of specialized representations for an application X. M2D-X learns from M2D and an additional task, and takes background noise as an additional input. We make the additional task configurable to serve diverse applications, while the background noise helps learning on small data and forms a denoising task that makes the representation robust. With these design choices, M2D-X should learn a representation specialized to serve various application needs. Our experiments confirmed that the representations for general-purpose audio, those specialized for the highly competitive AudioSet and speech domains, and one for a small-data medical task achieve top-level performance, demonstrating the potential of using our models as a universal audio pre-training framework. Our code is available online for future studies at
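
The masked-prediction objective summarized in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-layer "encoders", the mean-pooled predictor, the positional embeddings, the normalized squared-error loss, and the linear noise blend are all simplifying assumptions, and every function and parameter name below is hypothetical. The sketch only shows the one idea the abstract states, that the training signal comes from encoding the masked patches alone while the prediction is made from the visible patches alone.

```python
import numpy as np


def split_patches(patches, mask_ratio, rng):
    """Randomly partition patch indices into visible and masked sets."""
    n = patches.shape[0]
    n_masked = int(round(n * mask_ratio))
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_masked])
    visible_idx = np.sort(perm[n_masked:])
    return visible_idx, masked_idx


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def mix_background_noise(patches, noise, noise_ratio):
    """Hypothetical linear blend of input and background noise (assumption)."""
    return (1.0 - noise_ratio) * patches + noise_ratio * noise


def m2d_style_loss(patches, w_online, w_target, w_pred, pos, mask_ratio, rng):
    """Toy masked-prediction loss: predict masked-patch representations."""
    visible_idx, masked_idx = split_patches(patches, mask_ratio, rng)

    # Online network sees ONLY the visible patches.
    z_visible = np.tanh(patches[visible_idx] @ w_online)

    # Target network sees ONLY the masked patches; its output is the
    # training signal.
    z_target = np.tanh(patches[masked_idx] @ w_target)

    # Toy predictor (assumption): pooled visible context plus the positional
    # embedding of each masked patch, followed by a linear map.
    context = z_visible.mean(axis=0)
    z_pred = (context + pos[masked_idx]) @ w_pred

    # Squared error between unit-normalized prediction and target,
    # averaged over masked patches; the value lies in [0, 4].
    diff = l2_normalize(z_pred) - l2_normalize(z_target)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


rng = np.random.default_rng(0)
n_patches, d_in, d_rep = 20, 16, 8
patches = rng.normal(size=(n_patches, d_in))       # stand-in for spectrogram patches
noise = rng.normal(size=(n_patches, d_in))         # stand-in for background noise
w_online = rng.normal(size=(d_in, d_rep)) * 0.1
w_target = w_online.copy()                         # target update rule not modeled here
w_pred = rng.normal(size=(d_rep, d_rep)) * 0.1
pos = rng.normal(size=(n_patches, d_rep)) * 0.1

noisy = mix_background_noise(patches, noise, noise_ratio=0.2)
loss = m2d_style_loss(noisy, w_online, w_target, w_pred, pos, mask_ratio=0.6, rng=rng)
print(f"toy masked-prediction loss: {loss:.4f}")
```

In this sketch the M2D-X extension appears only as the noise blend applied before the loss; the configurable additional task mentioned in the abstract would contribute a further loss term, which is left out because the abstract does not specify its form.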

MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation

MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation

MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer

QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

Scaling up Masked Diffusion Models on Text

Taming Data and Transformers for Audio Generation

Masked Modeling Duo: Towards a Universal Audio Pre-training Framework

Fast Training of Diffusion Models with Masked Transformers

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech

Bag of Design Choices for Inference of High-Resolution Masked Generative Transformer

MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

DiffiT: Diffusion Vision Transformers for Image Generation

VDT: General-purpose Video Diffusion Transformers via Mask Modeling

Simple and Effective Masked Diffusion Language Models

DiMSUM: Diffusion Mamba -- A Scalable and Unified Spatial-Frequency Method for Image Generation