Abstract: Speaker diarization, the process of segmenting an audio stream or transcribed speech content into homogeneous partitions based on speaker identity, plays a crucial role in the interpretation and analysis of human speech. Most existing speaker diarization systems rely exclusively on unimodal acoustic information, making the task particularly challenging due to the innate ambiguities of audio signals. Recent studies have made tremendous efforts towards audio-visual or audio-semantic modeling to enhance performance. However, even the incorporation of up to two modalities often falls short in addressing the complexities of spontaneous and unstructured conversations. To exploit more meaningful dialogue patterns, we propose a novel multimodal approach that jointly utilizes audio, visual, and semantic cues to enhance speaker diarization. Our method elegantly formulates the multimodal modeling as a constrained optimization problem. First, we build insights into the visual connections among active speakers and the semantic interactions within spoken content, thereby establishing abundant pairwise constraints. Then, we introduce a joint pairwise constraint propagation algorithm to cluster speakers based on these visual and semantic constraints. This integration effectively leverages the complementary strengths of different modalities, refining the affinity estimation between individual speaker embeddings. Extensive experiments conducted on multiple multimodal datasets demonstrate that our approach consistently outperforms state-of-the-art speaker diarization methods.
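
The idea of refining an affinity matrix with pairwise constraints can be illustrated with a small sketch. The code below is not the algorithm proposed in the paper; it is a generic closed-form constraint propagation scheme, written under several assumptions: the function name `propagate_constraints`, the parameter `alpha`, the toy affinity values, and the mapping of visual cues to must-link pairs and semantic cues to cannot-link pairs are all hypothetical and chosen only for illustration. It also assumes the acoustic affinities are already scaled to the range [0, 1].

```python
import numpy as np


def propagate_constraints(affinity, must_link, cannot_link, alpha=0.5):
    """Refine an affinity matrix with pairwise constraints (illustrative sketch).

    affinity    : (n, n) array of similarities in [0, 1] between speaker embeddings
    must_link   : list of (i, j) pairs assumed to share a speaker
    cannot_link : list of (i, j) pairs assumed to have different speakers
    alpha       : propagation strength in (0, 1); hypothetical default
    """
    n = affinity.shape[0]
    W = (affinity + affinity.T) / 2.0  # enforce symmetry

    # Sparse initial constraints: +1 must-link, -1 cannot-link, 0 unknown.
    Z = np.zeros((n, n))
    for i, j in must_link:
        Z[i, j] = Z[j, i] = 1.0
    for i, j in cannot_link:
        Z[i, j] = Z[j, i] = -1.0

    # Symmetrically normalised affinity S = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Spread the constraints to neighbouring pairs along rows and columns.
    P = (1.0 - alpha) * np.linalg.inv(np.eye(n) - alpha * S)
    F = np.clip(P @ Z @ P.T, -1.0, 1.0)

    # Raise affinities where F > 0, lower them where F < 0.
    return np.where(F >= 0, 1.0 - (1.0 - F) * (1.0 - W), (1.0 + F) * W)


# Toy acoustic affinities for four speech segments (made-up values).
W = np.array([
    [1.0, 0.5, 0.5, 0.2],
    [0.5, 1.0, 0.2, 0.2],
    [0.5, 0.2, 1.0, 0.6],
    [0.2, 0.2, 0.6, 1.0],
])

# Hypothetical cues: a visual cue ties segments 0 and 1 to the same speaker,
# and a semantic cue marks a speaker change between segments 0 and 2.
refined = propagate_constraints(W, must_link=[(0, 1)], cannot_link=[(0, 2)])
print(np.round(refined, 3))
```

In this sketch the refined matrix would then be passed to any affinity-based clustering step, such as spectral clustering, to produce speaker labels. The propagation also adjusts pairs that carry no explicit constraint, which is the point of spreading a small set of visual and semantic constraints across all embeddings.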

Improving Speaker Segmentation Via Speaker Identification and Text Segmentation

Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization

Using Phoneme Recognition and Text-Dependent Speaker Verification to Improve Speaker Segmentation for Chinese Speech

Improving Speaker Diarization by Cross EM Refinement

Speaker Segmentation and Clustering in Meetings

An Integrated Top-Down/Bottom-Up Approach To Speaker Diarization

Speaker Diarization Using EHMM and CLR

Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment

Multi-speaker Segmentation and Clustering of Telephone Speech

Improving Separation-Based Speaker Diarization Via Iterative Model Refinement And Speaker Embedding Based Post-Processing

A Real-time Speaker Diarization System Based on Spatial Spectrum

Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation

VB-HMM Speaker Diarization with Enhanced and Refined Segment Representation

Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization

Real-time Speaker Detection in Conversational Speech

An Improved Speaker Based Speech Segmentation Algorithm

A Quick and Effective Speaker Diarization System

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

End-to-end speaker segmentation for overlap-aware resegmentation

QDM-SSD: Quality-Aware Dynamic Masking for Separation-Based Speaker Diarization

An Improved Speaker Diarization System for Multiple Distance Microphone Meetings