Abstract: The audio and visual tracks of a video carry rich semantic information that is useful in many applications and research areas. Because audio and visual data follow inconsistent distributions and have heterogeneous representations, a heterogeneity gap separates the two modalities and prevents them from being compared directly. A frequently adopted way to bridge this gap is to project audio and visual data simultaneously into a common subspace that captures both the commonalities and the characteristics of the modalities, so that they can be measured against each other; previous research has studied this extensively as the problems of modality-common and modality-specific feature learning. However, existing methods struggle to balance the two: for example, when the modality-common feature is learned from the latent commonalities of audio-visual data, or from correlated features treated as aligned projections, the modality-specific feature can be lost. To resolve this tradeoff, we propose a novel end-to-end architecture that synchronously projects audio-visual data into dual common subspaces, one explicit and one implicit. The explicit subspace learns modality-common features and reduces the modality gap between explicitly paired audio-visual data; representation-specific details are discarded there to retain the common underlying structure of the data. The implicit subspace learns modality-specific features: each modality separately pushes apart the features of different categories to preserve category-based distinctions, by minimizing the distance between audio-visual features and their corresponding labels. Comprehensive experiments on two audio-visual datasets, VEGAS and AVE, show that using two different common subspaces for audio-visual cross-modal learning is effective: our model significantly outperforms state-of-the-art cross-modal models that learn features from a single common subspace, by 4.30% and 2.30% in average MAP on VEGAS and AVE, respectively.
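
The two objectives described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the loss functions, so the sketch assumes linear projections in place of learned encoders, a squared Euclidean distance between paired embeddings for the explicit subspace, and a squared Euclidean distance to one-hot labels for the implicit subspace (which forces the implicit subspace to have one dimension per category). All function names, the `weight` parameter, and the feature dimensions (128 and 1024) are hypothetical.

```python
import numpy as np


def project(x, w):
    """Linear projection into a subspace (hypothetical stand-in for a learned encoder)."""
    return x @ w


def explicit_pair_loss(audio_z, visual_z):
    """Mean squared Euclidean distance between embeddings of paired clips."""
    return float(np.mean(np.sum((audio_z - visual_z) ** 2, axis=1)))


def implicit_label_loss(audio_z, visual_z, labels_onehot):
    """Mean squared distance from each modality's embedding to its one-hot label."""
    audio_term = np.sum((audio_z - labels_onehot) ** 2, axis=1)
    visual_term = np.sum((visual_z - labels_onehot) ** 2, axis=1)
    return float(np.mean(audio_term + visual_term))


def dual_subspace_loss(audio, visual, labels_onehot, params, weight=1.0):
    """Sum of the two terms; `weight` (an assumption) balances the implicit term."""
    explicit = explicit_pair_loss(
        project(audio, params["w_a_exp"]), project(visual, params["w_v_exp"])
    )
    implicit = implicit_label_loss(
        project(audio, params["w_a_imp"]),
        project(visual, params["w_v_imp"]),
        labels_onehot,
    )
    return explicit + weight * implicit


# Toy data: 4 paired clips, 3 categories, assumed feature sizes 128 and 1024.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 128))
visual = rng.normal(size=(4, 1024))
labels_onehot = np.eye(3)[[0, 1, 2, 0]]
params = {
    "w_a_exp": 0.01 * rng.normal(size=(128, 16)),
    "w_v_exp": 0.01 * rng.normal(size=(1024, 16)),
    "w_a_imp": 0.01 * rng.normal(size=(128, 3)),
    "w_v_imp": 0.01 * rng.normal(size=(1024, 3)),
}
loss = dual_subspace_loss(audio, visual, labels_onehot, params)
print(f"combined loss on toy data: {loss:.4f}")
```

In this sketch the explicit term vanishes only when paired audio and visual embeddings coincide, while the implicit term is computed per modality against the labels, so each modality can separate categories without being tied to the other modality's embedding.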

An evaluation of bone induction delivery materials in conjunction with root-form implant placement.

“Touching to See” and “Seeing to Feel”: Robotic Cross-modal Sensory Data Generation for Visual-Tactile Perception

Research on Visual‐tactile Cross‐modality Based on Generative Adversarial Network

X-Gacmn: An X-Shaped Generative Adversarial Cross-Modal Network With Hypersphere Embedding

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Learning Cross-Modal Visual-Tactile Representation Using Ensembled Generative Adversarial Networks.

Toward Image-to-Tactile Cross-Modal Perception for Visually Impaired People

A Human-Like Siamese-Based Visual-Tactile Fusion Model for Object Recognition

Seeing By Touching: Cross-Modal Matching For Tactile And Vision Measurements

A Vision-Based Tactile Sensing System for Multimodal Contact Information Perception via Neural Network

Deep Cross-Modal Audio-Visual Generation

Active Visual-Tactile Cross-Modal Matching.

Learning Audio-Visual Correlations from Variational Cross-Modal Generation

Using a Vertical-Stream Variational Auto-Encoder to Generate Segment-Based Images and Its Biological Plausibility for Modelling the Visual Pathways.

Investigation of turbomixers in continuous flow analysis.

Learning Explicit and Implicit Dual Common Subspaces for Audio-visual Cross-modal Retrieval

Controllable Visual-Tactile Synthesis

Cross-Modal Deep Variational Hand Pose Estimation

TextToucher: Fine-Grained Text-to-Touch Generation

Category Decoding of Visual Stimuli From Human Brain Activity Using a Bidirectional Recurrent Neural Network to Simulate Bidirectional Information Flows in Human Visual Cortices