Abstract: This paper proposes a fully explainable approach to speaker verification (SV), a task that fundamentally relies on individual speaker characteristics. The opaque use of speaker attributes in current SV systems raises concerns of trust. Addressing this, we propose an attribute-based explainable SV system that identifies speakers by comparing personal attributes such as gender, nationality, and age extracted automatically from voice recordings. We believe this approach better aligns with human reasoning, making it more understandable than traditional methods. Evaluated on the VoxCeleb1 test set, the best performance of our system is comparable with the ground truth established when using all correct attributes, proving its efficacy. Whilst our approach sacrifices some performance compared to non-explainable methods, we believe that it moves us closer to the goal of transparent, interpretable AI and lays the groundwork for future enhancements through attribute expansion.

What problem does this paper attempt to address?

This paper proposes an interpretable, attribute-based approach to speaker verification (SV) to address the trust issue caused by the opaque use of speaker characteristics in current SV systems. Traditional SV systems rely on individual speaker characteristics, but how they use them is often opaque, which makes the decision process difficult to understand and explain. The proposed system instead identifies speakers by comparing personal attributes, such as gender, nationality, and age, that are extracted automatically from speech recordings. The authors consider this approach closer to human reasoning and therefore easier to understand than traditional methods.

Evaluation on the VoxCeleb1 test set shows that the best performance of the system is comparable to the reference result obtained when all attributes are correct (the ground-truth attributes), which the authors take as evidence of its effectiveness. Although the approach sacrifices some performance compared to non-interpretable methods, the authors believe that it moves towards the goal of transparent and interpretable artificial intelligence and lays the foundation for future improvements through additional attributes.

The experiments compare three stage 1 attribute classifiers (Xvector, ECAPA, and AC). In certain cases, the AC classifier, which is trained directly on MFCC input, outperforms the Xvector and ECAPA classifiers, which are trained on pre-trained embeddings. The paper also compares softmax labels with hard labels for the similarity computation in the second stage, and finds that softmax labels reduce the Equal Error Rate (EER), that is, they improve verification performance. The research also finds that occupation and nationality are key attributes for differentiating speakers, while age has a smaller influence.

Although the current system's performance is slightly lower than that of non-interpretable methods, the authors plan to improve it by incorporating more relevant attributes while maintaining the advantage of interpretability.
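To make the second-stage comparison more concrete, the following is a minimal sketch of how two attribute profiles could be scored with softmax labels versus hard labels, and how an EER could be estimated from the resulting scores. It is not the authors' implementation: the summary does not state the similarity measure or how attributes are fused, so the cosine similarity, the unweighted average over attributes, the function names (`one_hot`, `attribute_score`, `equal_error_rate`), and the toy posteriors are all assumptions made for illustration.

```python
import math

def one_hot(probs):
    """Hard label: 1.0 at the arg-max class, 0.0 elsewhere."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attribute_score(profile_a, profile_b, hard=False):
    """Mean per-attribute cosine similarity between two speaker profiles.

    Each profile maps an attribute name to a softmax posterior over classes.
    Assumption: cosine similarity and an unweighted mean; the paper may differ.
    """
    sims = []
    for name in profile_a:
        pa, pb = profile_a[name], profile_b[name]
        if hard:
            pa, pb = one_hot(pa), one_hot(pb)
        sims.append(cosine(pa, pb))
    return sum(sims) / len(sims)

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: scan thresholds at the observed scores and return the
    mean of the false-acceptance and false-rejection rates where they are closest."""
    best_gap, eer = float("inf"), None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy posteriors (hypothetical values, not taken from the paper).
enrol = {"gender": [0.9, 0.1], "nationality": [0.6, 0.3, 0.1]}
test = {"gender": [0.8, 0.2], "nationality": [0.4, 0.5, 0.1]}

soft_score = attribute_score(enrol, test)             # keeps classifier uncertainty
hard_score = attribute_score(enrol, test, hard=True)  # nationality arg-max differs
print(f"soft={soft_score:.3f} hard={hard_score:.3f}")
```

In this toy case the two nationality posteriors are close but have different arg-max classes, so the hard-label score treats the attribute as a complete mismatch while the soft-label score still credits the overlap. That is one plausible intuition for why softmax labels might lower the EER, though the summary does not give the authors' own explanation.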

Explainable Attribute-Based Speaker Verification
