Abstract: Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Because the two tasks have different characteristics, each is typically handled by an independently designed architecture. As a result, the learned representations may be task-specific, and the features obtained from multi-modal modeling inevitably generalize poorly. More recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Motivated by the need to bridge multi-modal associations in audio-visual tasks, this study proposes a unified framework that achieves target speaker detection and speech enhancement through joint learning of audio-visual modeling. Using the audio and visual channels of videos from challenging real-world scenarios, the proposed method exploits the inherent correlations in both audio and visual signals, and uses them to anticipate and model temporal audio-visual relationships across spatial-temporal space via a cross-modal conformer. In addition, a plug-and-play multi-modal layer normalization is introduced to alleviate the distribution misalignment of multi-modal features. Based on cross-modal circulant fusion, the proposed model is capable of learning all audio-visual representations in a holistic process. Extensive experiments demonstrate that the optimized model learns the correlations between different modalities and the associations among diverse tasks more effectively. Compared with other state-of-the-art methods, the proposed approach achieves superior performance on active speaker detection and audio-visual speech enhancement on three benchmark datasets, and generalizes favorably across diverse challenges.
Code is available at: https://github.com/Overcautious/ADENet.

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

AudioVSR: Enhancing Video Speech Recognition with Audio Data

CrossMAE: Cross Modality Masked Autoencoders for Region-Aware Audio-Visual Pretraining

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

Cross-Modal Adapter for Text-Video Retrieval

Deep Cross-Modal Audio-Visual Generation

CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Multi-Modal Adapter for Vision-Language Models

Understanding the Multi-modal Prompts of the Pre-trained Vision-Language Model

Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval

Fine-Grained Visual Prompt Learning of Vision-Language Models for Image Recognition

CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation

Modality-invariant and Specific Prompting for Multimodal Human Perception Understanding

A Study on Joint Modeling and Data Augmentation of Multi-Modalities for Audio-Visual Scene Classification