An audio-quality-based multi-strategy approach for target speaker extraction in the MISP 2023 Challenge

Runduo Han, Xiaopeng Yan, Weiming Xu, Pengcheng Guo, Jiayao Sun, He Wang, Quan Lu, Ning Jiang, Lei Xie

2024-03-07

Abstract: This paper describes our audio-quality-based multi-strategy approach for the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. Specifically, our approach adopts different extraction strategies based on the audio quality, striking a balance between interference removal and speech preservation, which benefits the back-end automatic speech recognition (ASR) systems. Experiments show that our approach achieves a character error rate (CER) of 24.2% and 33.2% on the Dev and Eval set, respectively, obtaining the second place in the challenge.

Sound, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses the Audio-Visual Target Speaker Extraction (AVTSE) task of the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. The goal is to extract the speech of a specific speaker in complex acoustic environments that contain background noise and interfering speakers. Traditional methods rely on pre-recorded audio of the target speaker, which limits their practicality, so the MISP 2023 Challenge provides lip motion video as prior information instead. The authors propose a multi-strategy approach driven by audio quality. They first divide the audio into three quality groups (high, medium, and low) according to the DNSMOS OVRL score, then apply a different extraction technique to each group. High-quality audio is processed with Guided Source Separation (GSS) alone. For medium-quality audio, the GSS output is combined with the lip motion video through a multi-channel fusion method for further extraction. Low-quality audio is denoised with the DRC-NET network. The method achieves a Character Error Rate (CER) of 24.2% on the development set and 33.2% on the evaluation set, placing second in the challenge. The results also indicate that matching the extraction strategy to the audio quality matters for the back-end Automatic Speech Recognition (ASR) system, because it balances interference removal against speech preservation.
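The routing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' implementation: the cut-off values, the function names, and the strategy labels are all assumptions introduced here, since the summary does not state the actual DNSMOS OVRL thresholds. The DNSMOS score itself is assumed to have been computed elsewhere and passed in as a float.

```python
from typing import Dict

# Hypothetical cut-offs on the DNSMOS OVRL score (roughly a 1-5 scale).
# The summary does not give the thresholds the authors actually used.
LOW_CUTOFF = 2.0
HIGH_CUTOFF = 3.0

# Illustrative strategy labels, not identifiers from the authors' code.
STRATEGY_BY_GROUP: Dict[str, str] = {
    "high": "gss",                             # GSS output used directly
    "medium": "gss_plus_audio_visual_fusion",  # GSS output fused with lip video
    "low": "drc_net_denoising",                # DRC-NET noise reduction
}


def quality_group(ovrl: float,
                  low: float = LOW_CUTOFF,
                  high: float = HIGH_CUTOFF) -> str:
    """Map a DNSMOS OVRL score to one of three quality groups."""
    if low >= high:
        raise ValueError("low cut-off must be below high cut-off")
    if ovrl >= high:
        return "high"
    if ovrl >= low:
        return "medium"
    return "low"


def select_strategy(ovrl: float,
                    low: float = LOW_CUTOFF,
                    high: float = HIGH_CUTOFF) -> str:
    """Pick the extraction strategy label for an utterance's OVRL score."""
    return STRATEGY_BY_GROUP[quality_group(ovrl, low, high)]
```

With these assumed cut-offs, an utterance scored 3.4 would be routed to GSS alone, one scored 2.5 to the audio-visual fusion path, and one scored 1.6 to denoising. In practice the cut-offs would need to be tuned against the back-end recognizer's error rate on a development set.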

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge

The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Target Speech Extraction Based on Blind Source Separation and X-vector-based Speaker Selection Trained with Data Augmentation

Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training

Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Robust Audio-Visual Speech Enhancement: Correcting Misassignments in Complex Environments with Advanced Post-Processing

An Investigation into Audio–Visual Speech Recognition under a Realistic Home–TV Scenario

The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis

MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

SIR-Progressive Audio-Visual TF-Gridnet with ASR-Aware Selector for Target Speaker Extraction in MISP 2023 Challenge