3D-Speaker-Toolkit: An Open-Source Toolkit for Multimodal Speaker Verification and Diarization

Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Tinglong Zhu, Rongjie Huang, Chong Deng, Qian Chen, Shiliang Zhang, Wen Wang, Xihao Li

2024-09-17

Abstract: We introduce 3D-Speaker-Toolkit, an open-source toolkit for multimodal speaker verification and diarization, designed for meeting the needs of academic researchers and industrial practitioners. The 3D-Speaker-Toolkit adeptly leverages the combined strengths of acoustic, semantic, and visual data, seamlessly fusing these modalities to offer robust speaker recognition capabilities. The acoustic module extracts speaker embeddings from acoustic features, employing both fully-supervised and self-supervised learning approaches. The semantic module leverages advanced language models to comprehend the substance and context of spoken language, thereby augmenting the system's proficiency in distinguishing speakers through linguistic patterns. The visual module applies image processing technologies to scrutinize facial features, which bolsters the precision of speaker diarization in multi-speaker environments. Collectively, these modules empower the 3D-Speaker-Toolkit to achieve substantially improved accuracy and reliability in speaker-related tasks. With 3D-Speaker-Toolkit, we establish a new benchmark for multimodal speaker analysis. The toolkit also includes a handful of open-source state-of-the-art models and a large-scale dataset containing over 10,000 speakers. The toolkit is publicly available at https://github.com/modelscope/3D-Speaker.

Audio and Speech Processing, Signal Processing

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper introduces an open-source toolkit named 3D-Speaker-Toolkit, which aims to enhance the performance of speaker verification and diarization tasks by integrating multiple modalities (acoustic, semantic, and visual information). Specifically:

1. **Multimodal Fusion**:
   - Traditional speaker recognition systems primarily rely on acoustic information and perform poorly in adverse acoustic environments. To overcome this limitation, 3D-Speaker-Toolkit combines acoustic, semantic, and visual information, improving the system's robustness and accuracy.
2. **Advanced Model Support**:
   - The toolkit supports both fully supervised learning (such as ECAPA-TDNN, ResNet34, Res2Net, etc.) and self-supervised learning (such as DINO, RDINO, SDPN, etc.) methods, and provides a large number of pre-trained models for users to use directly.
3. **Deployment and Production Environment Compatibility**:
   - It offers model export functionality, allowing trained models to be converted to ONNX format for easy use in deployment environments. Additionally, the toolkit provides ready-to-use models that users can load by simply calling the pre-trained speaker embedding extractors with a few lines of code.
4. **Large-Scale Dataset**:
   - A large-scale dataset named 3D-Speaker, containing over 10,000 speakers, is released. It covers various recording devices, different distances, and multiple dialects to address diverse application scenarios.
5. **Lightweight Design**:
   - The code is written based on the PyTorch ecosystem, simplifying the installation and usage process, and providing lightweight solutions.

Through these improvements, 3D-Speaker-Toolkit aims to provide a powerful and flexible platform for academic researchers and industry practitioners to develop, train, and deploy state-of-the-art speaker-related models.
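To make the role of speaker embeddings in verification concrete, here is a minimal sketch of how two fixed-length embeddings could be compared once an extractor has produced them. It is not taken from the toolkit: cosine scoring is a common choice for this step but the summary above does not say which scoring method 3D-Speaker-Toolkit uses, and the function names `cosine_similarity` and `same_speaker`, the toy vectors, and the `0.5` threshold are all hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(
    enroll: Sequence[float],
    test: Sequence[float],
    threshold: float = 0.5,  # hypothetical value; tune on held-out trials
) -> bool:
    """Accept the trial if the similarity score reaches the threshold."""
    return cosine_similarity(enroll, test) >= threshold


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real speaker embeddings,
    # which typically have a few hundred dimensions.
    enrollment = [0.9, 0.1, 0.3]
    trial = [0.8, 0.2, 0.4]
    print(cosine_similarity(enrollment, trial))
    print(same_speaker(enrollment, trial))
```

In practice the threshold would be chosen on a development set, for example at the operating point that balances false accepts and false rejects.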

3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement

A Toolkit for Joint Speaker Diarization and Identification with Application to Speaker-Attributed ASR

Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization

WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model

Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization

Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

DyViSE: Dynamic Vision-Guided Speaker Embedding for Audio-Visual Speaker Diarization

Pushing the limits of self-supervised speaker verification using regularized distillation framework

USED: Universal Speaker Extraction and Diarization

Uncertainty-Guided End-to-End Audio-Visual Speaker Diarization for Far-Field Recordings

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

Multi-Stage Face-Voice Association Learning with Keynote Speaker Diarization

DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

AISHELL-4 - An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words