Abstract: Visual and audio are two symbiotic modalities underlying videos, and they carry both common and complementary information. If this information is mined and fused sufficiently, the performance of related video tasks can be significantly enhanced. However, because of environmental interference or sensor faults, sometimes only one modality is available while the other is discarded or missing. Recovering the missing modality from the existing one, based on the common information shared between them and on prior information about the specific modality, would substantially benefit various vision tasks. In this paper, we propose a Cross-Modal Cycle Generative Adversarial Network (CMCGAN) to handle cross-modal visual-audio mutual generation. Specifically, CMCGAN is composed of four kinds of subnetworks, namely audio-to-visual, visual-to-audio, audio-to-audio and visual-to-visual subnetworks, which are organized in a cycle architecture. CMCGAN has several notable advantages. First, it unifies visual-audio mutual generation into a common framework through a joint corresponding adversarial loss. Second, by introducing a latent vector with a Gaussian distribution, it can effectively handle the dimension and structure asymmetry between the visual and audio modalities. Third, it can be trained end-to-end, which makes it more convenient to use. Building on CMCGAN, we develop a dynamic multimodal classification network to handle the missing-modality problem. Extensive experiments validate that CMCGAN obtains state-of-the-art cross-modal visual-audio generation results. Furthermore, the generated modality achieves effects comparable to those of the original modality, which demonstrates the effectiveness and advantages of the proposed method.
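
The four subnetworks and the Gaussian latent vector described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the class name `ToyCrossModalCycle`, the shared per-modality encoders and decoders, and the random linear maps standing in for trained networks are all assumptions made for illustration, and the adversarial loss and training loop are omitted entirely. The sketch only shows how a source modality could be mapped to a code, combined with Gaussian noise, and decoded into a target modality of a different size, and how two such steps can be chained into a cycle.

```python
import random

# Hypothetical toy sizes; the abstract does not give real dimensions.
VISUAL_DIM, AUDIO_DIM, CODE_DIM, NOISE_DIM = 12, 8, 4, 2
DIMS = {"visual": VISUAL_DIM, "audio": AUDIO_DIM}


def make_linear(in_dim, out_dim, rng):
    """Random weight rows; a stand-in for a trained network (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, x):
    """Multiply the weight matrix by the input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class ToyCrossModalCycle:
    """Toy wiring of audio-to-visual, visual-to-audio, audio-to-audio and
    visual-to-visual paths. Sharing one encoder and one decoder per
    modality is an illustrative choice, not taken from the paper."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.encoders = {m: make_linear(d, CODE_DIM, self.rng) for m, d in DIMS.items()}
        self.decoders = {
            m: make_linear(CODE_DIM + NOISE_DIM, d, self.rng) for m, d in DIMS.items()
        }

    def translate(self, x, src, dst):
        """Encode `x` from modality `src`, append Gaussian noise, decode to `dst`."""
        code = apply_linear(self.encoders[src], x)
        z = [self.rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]  # Gaussian latent vector
        return apply_linear(self.decoders[dst], code + z)

    def cycle(self, x, src, mid):
        """Go src -> mid -> src, e.g. visual -> audio -> visual."""
        return self.translate(self.translate(x, src, mid), mid, src)


model = ToyCrossModalCycle(seed=0)
visual = [0.5] * VISUAL_DIM
audio = [0.5] * AUDIO_DIM

fake_audio = model.translate(visual, "visual", "audio")   # visual-to-audio
fake_visual = model.translate(audio, "audio", "visual")   # audio-to-visual
recon_visual = model.cycle(visual, "visual", "audio")     # visual -> audio -> visual
recon_audio = model.cycle(audio, "audio", "audio")        # audio -> audio -> audio
print(len(fake_audio), len(fake_visual), len(recon_visual), len(recon_audio))
```

In this sketch the noise vector is what lets the smaller code expand into a target of a different size, which is one simple way to read the abstract's remark about dimension and structure asymmetry. In a trained model, each cycle output would be compared with its input and the translated samples would be judged by a discriminator.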

Research on Visual‐tactile Cross‐modality Based on Generative Adversarial Network

Learning Cross-Modal Visual-Tactile Representation Using Ensembled Generative Adversarial Networks

Toward Image-to-Tactile Cross-Modal Perception for Visually Impaired People

X-Gacmn: An X-Shaped Generative Adversarial Cross-Modal Network With Hypersphere Embedding

“Touching to See” and “Seeing to Feel”: Robotic Cross-modal Sensory Data Generation for Visual-Tactile Perception

Vibrotactile Signal Generation from Texture Images or Attributes using Generative Adversarial Network

TexSenseGAN: A User-Guided System for Optimizing Texture-Related Vibrotactile Feedback Using Generative Adversarial Network

A Wearable Vision-To-Audio Sensory Substitution Device for Blind Assistance and the Correlated Neural Substrates

Bidirectional visual-tactile cross-modal generation using latent feature space flow model

Deep Cross-Modal Audio-Visual Generation

CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation

A Step towards Automated and Generalizable Tactile Map Generation using Generative Adversarial Networks

TextToucher: Fine-Grained Text-to-Touch Generation

Controllable Visual-Tactile Synthesis

TAVT: Towards Transferable Audio-Visual Text Generation

Event-Based Multimodal Spiking Neural Network with Attention Mechanism

A Vision-Based Tactile Sensing System for Multimodal Contact Information Perception via Neural Network

Listen to the Image

Multi-modal Transformer-based Tactile Signal Generation for Haptic Texture Simulation of Materials in Virtual and Augmented Reality

Vision-based Tactile Image Generation via Contact Condition-guided Diffusion Model