Abstract: Visual and audio are two symbiotic modalities underlying videos, and they carry both common and complementary information. If this information can be mined and fused sufficiently, the performance of related video tasks can be significantly enhanced. However, due to environmental interference or sensor faults, sometimes only one modality is available while the other is discarded or missing. Recovering the missing modality from the existing one, based on the common information shared between them and the prior information of the specific modality, would greatly benefit various vision tasks. In this paper, we propose a Cross-Modal Cycle Generative Adversarial Network (CMCGAN) to handle cross-modal visual-audio mutual generation. Specifically, CMCGAN is composed of four kinds of subnetworks: audio-to-visual, visual-to-audio, audio-to-audio, and visual-to-visual, which are organized in a cycle architecture. CMCGAN has several notable advantages. First, it unifies visual-audio mutual generation into a common framework through a joint corresponding adversarial loss. Second, by introducing a latent vector with a Gaussian distribution, it can effectively handle the dimension and structure asymmetry between the visual and audio modalities. Third, it can be trained end-to-end for greater convenience. Building on CMCGAN, we develop a dynamic multimodal classification network to handle the missing-modality problem. Extensive experiments validate that CMCGAN obtains state-of-the-art cross-modal visual-audio generation results. Furthermore, the generated modality achieves effects comparable to those of the original modality, which demonstrates the effectiveness and advantages of the proposed method.
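The data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class `ToyCycle`, the helper `make_linear`, the toy dimensions, and the choice to concatenate the Gaussian latent vector to the input are all assumptions made for illustration. Fixed random linear maps stand in for trained subnetworks, the audio-to-audio and visual-to-visual subnetworks are omitted, and no adversarial loss is computed. The sketch only shows how a cross-modal cycle and a missing-modality fallback might be wired together.

```python
import random

# Toy sizes chosen only for illustration; the abstract does not specify dimensions.
AUDIO_DIM, VISUAL_DIM, LATENT_DIM = 4, 6, 2


def make_linear(in_dim, out_dim, rng):
    """Return a fixed random linear map standing in for a trained subnetwork."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x):
        assert len(x) == in_dim, "input has the wrong dimension"
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


class ToyCycle:
    """Hypothetical stand-in for the cross-modal subnetworks (not the authors' code)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        # Assumption: each cross-modal map sees the source modality plus a latent z.
        self.a2v = make_linear(AUDIO_DIM + LATENT_DIM, VISUAL_DIM, self.rng)
        self.v2a = make_linear(VISUAL_DIM + LATENT_DIM, AUDIO_DIM, self.rng)

    def sample_latent(self):
        """Draw z from a standard Gaussian."""
        return [self.rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]

    def audio_to_visual(self, audio):
        return self.a2v(list(audio) + self.sample_latent())

    def visual_to_audio(self, visual):
        return self.v2a(list(visual) + self.sample_latent())

    def cycle_from_audio(self, audio):
        """Audio -> generated visual -> regenerated audio."""
        visual = self.audio_to_visual(audio)
        return visual, self.visual_to_audio(visual)

    def fill_missing(self, audio=None, visual=None):
        """Return (audio, visual), generating whichever modality is absent."""
        if audio is None and visual is None:
            raise ValueError("at least one modality is required")
        if visual is None:
            visual = self.audio_to_visual(audio)
        if audio is None:
            audio = self.visual_to_audio(visual)
        return audio, visual


model = ToyCycle(seed=0)
audio = [0.1, 0.2, 0.3, 0.4]
visual, audio_rec = model.cycle_from_audio(audio)
print(len(visual), len(audio_rec))  # 6 4
```

Because the latent vector absorbs the difference in size between the two modalities, the two cross-modal maps can have different input and output dimensions while still composing into a cycle. In a trained model, a downstream classifier could call something like `fill_missing` so that it always receives both modalities.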

Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

Multimodal Image-to-Image Translation via Mutual Information Estimation and Maximization

Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation

Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization

Semantic Image Synthesis with Semantically Coupled VQ-Model

Multimodal Disentanglement Variational AutoEncoders for Zero-Shot Cross-Modal Retrieval

Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

Shared Predictive Cross-Modal Deep Quantization

Joint Multimodal Learning with Deep Generative Models

Gaussian Mixture Vector Quantization with Aggregated Categorical Posterior

Vector Quantized Time Series Generation with a Bidirectional Prior Model

CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation

Cross-Modal Generation and Pair Correlation Alignment Hashing

Autoregressive Image Generation without Vector Quantization

JetFormer: An Autoregressive Generative Model of Raw Images and Text

Vector Quantized Diffusion Model for Text-to-Image Synthesis

Improving Bi-directional Generation between Different Modalities with Variational Autoencoders

Deep Cross-Modal Audio-Visual Generation

Cross-Modal Quantization for Co-Speech Gesture Generation

MAGVLT: Masked Generative Vision-and-Language Transformer