Abstract: Visual and audio are two symbiotic modalities underlying videos, and they contain both common and complementary information. If this information can be mined and fused sufficiently, the performance of related video tasks can be significantly improved. However, due to environmental interference or sensor faults, sometimes only one modality is available while the other is discarded or missing. Recovering the missing modality from the existing one, based on the common information shared between them and prior information about the specific modality, would greatly benefit various vision tasks. In this paper, we propose a Cross-Modal Cycle Generative Adversarial Network (CMCGAN) to handle cross-modal visual-audio mutual generation. Specifically, CMCGAN is composed of four kinds of subnetworks: audio-to-visual, visual-to-audio, audio-to-audio, and visual-to-visual, which are organized in a cycle architecture. CMCGAN has several notable advantages. First, it unifies visual-audio mutual generation into a common framework through a joint corresponding adversarial loss. Second, by introducing a latent vector with a Gaussian distribution, it can effectively handle the dimension and structure asymmetry between the visual and audio modalities. Third, it can be trained end-to-end, which is more convenient. Building on CMCGAN, we develop a dynamic multimodal classification network to handle the missing-modality problem. Extensive experiments validate that CMCGAN obtains state-of-the-art cross-modal visual-audio generation results. Furthermore, the generated modality achieves effects comparable to those of the original modality, which demonstrates the effectiveness and advantages of our proposed method.
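The cycle arrangement described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the four subnetworks are chained or how the latent vector is injected, so the sketch assumes that each path chains two subnetworks and that a Gaussian latent vector is concatenated to each input. The names `make_subnetwork`, `sample_latent`, and `run_path`, the random linear maps, and the dimensions are all hypothetical.

```python
import random

# Toy sizes chosen only for illustration; the two modalities differ in size.
VISUAL_DIM, AUDIO_DIM, LATENT_DIM = 6, 4, 2

def make_subnetwork(in_dim, out_dim, rng):
    """Return a random linear map standing in for a learned subnetwork (hypothetical)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim + LATENT_DIM)]
               for _ in range(out_dim)]

    def apply(sample, latent):
        # Assumption: the Gaussian latent vector is concatenated to the input.
        joined = list(sample) + list(latent)
        return [sum(w * v for w, v in zip(row, joined)) for row in weights]

    return apply

def sample_latent(rng):
    """Draw a latent vector from a standard Gaussian."""
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]

def run_path(sample, first, second, rng):
    """Chain two subnetworks, e.g. visual -> audio -> visual."""
    middle = first(sample, sample_latent(rng))
    return middle, second(middle, sample_latent(rng))

rng = random.Random(0)
# The four kinds of subnetworks named in the abstract.
subnets = {
    "audio_to_visual": make_subnetwork(AUDIO_DIM, VISUAL_DIM, rng),
    "visual_to_audio": make_subnetwork(VISUAL_DIM, AUDIO_DIM, rng),
    "audio_to_audio": make_subnetwork(AUDIO_DIM, AUDIO_DIM, rng),
    "visual_to_visual": make_subnetwork(VISUAL_DIM, VISUAL_DIM, rng),
}

# One assumed path: a visual sample generates audio, which regenerates visual.
visual = [rng.random() for _ in range(VISUAL_DIM)]
fake_audio, recon_visual = run_path(
    visual, subnets["visual_to_audio"], subnets["audio_to_visual"], rng)
print(len(fake_audio), len(recon_visual))
```

In a trained model, the reconstruction from such a path would be compared with the original sample, and the generated modality would be scored by a discriminator; both steps are omitted here because the abstract does not specify them.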

Multimodal Dialogue Generation Based on Transformer and Collaborative Attention

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Multimodal Dialogue Response Generation Based on Selective Attention and Gating Mechanisms

Some Can Be Better Than All: Multimodal Star Transformer for Visual Dialog

Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

Video Dialog Via Multi-Grained Convolutional Self-Attention Context Multi-Modal Networks

Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog

Efficient Attention Mechanism for Visual Dialog that can Handle All the Interactions between Multiple Inputs

Multimodal Graph Transformer for Multimodal Question Answering

Video Dialog Via Progressive Inference and Cross-Transformer

User Attention-guided Multimodal Dialog Systems

Modality-Balanced Models for Visual Dialogue

Multi-View Attention Network for Visual Dialog

Video Dialog Via Multi-Grained Convolutional Self-Attention Context Networks

Recurrent Attention Network with Reinforced Generator for Visual Dialog

Entropy-Enhanced Multimodal Attention Model for Scene-Aware Dialogue Generation

Multimodal-enhanced hierarchical attention network for video captioning

Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements

DAM: Deliberation, Abandon and Memory Networks for Generating Detailed and Non-repetitive Responses in Visual Dialogue

CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation

A non-hierarchical attention network with modality dropout for textual response generation in multimodal dialogue systems