Abstract: Overlapped speech detection (OSD) is critical for speech applications in multi-party conversation scenarios. Despite considerable research effort and progress, OSD remains an open challenge, and its overall performance is far less satisfactory than that of speech activity detection (VAD). Most prior work formulates OSD as a standard classification problem, assigning each frame either a binary label (OSD) or a three-class label (joint VAD and OSD). In contrast to this mainstream approach, this study investigates the joint VAD and OSD task from a new perspective. In particular, we propose to extend the traditional classification network with a multi-exit architecture. Such an architecture gives our system the ability to identify the class using either low-level features from early exits or high-level features from the last exit. In addition, two training schemes, knowledge distillation and dense connection, are adopted to further boost system performance. Experimental results on benchmark datasets (AMI and DIHARD-III) validate the effectiveness and generality of the proposed system. Our ablations further reveal the complementary contributions of the proposed schemes. With an $F_1$ score of 0.792 on AMI and 0.625 on DIHARD-III, the proposed system not only outperforms several top-performing models on these datasets, but also surpasses the current state of the art by large margins on both. Beyond the performance benefit, the proposed system also offers the potential for quality-complexity trade-offs, which is highly desirable for efficient OSD deployment.

Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings

Joint Speech Activity and Overlap Detection with Multi-Exit Architecture

Large-Scale Learning on Overlapped Speech Detection: New Benchmark and New General System

A Real-time Speaker Diarization System Based on Spatial Spectrum

Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis

ASoBO: Attentive Beamformer Selection for Distant Speaker Diarization in Meetings

The xmuspeech system for multi-channel multi-party meeting transcription challenge

Channel-Combination Algorithms for Robust Distant Voice Activity and Overlapped Speech Detection

Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features

Simultaneous Speech Extraction for Multiple Target Speakers under the Meeting Scenarios

Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains

On Sparse Bayesian Spreading Function Estimation Based Iterative Detection in Multiple-Input Multiple-Output Underwater Acoustic Communications

A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings

MFCCA: Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario

Sound Event Localization and Detection Based on Multiple DOA Beamforming and Multi-Task Learning

Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment

Royalflush Speaker Diarization System for ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition