Explainable Attribute-Based Speaker Verification

Xiaoliang Wu, Chau Luu, Peter Bell, Ajitha Rajan
2024-05-30
Abstract: This paper proposes a fully explainable approach to speaker verification (SV), a task that fundamentally relies on individual speaker characteristics. The opaque use of speaker attributes in current SV systems raises concerns of trust. Addressing this, we propose an attribute-based explainable SV system that identifies speakers by comparing personal attributes such as gender, nationality, and age extracted automatically from voice recordings. We believe this approach better aligns with human reasoning, making it more understandable than traditional methods. Evaluated on the VoxCeleb1 test set, the best performance of our system is comparable with the ground truth established when using all correct attributes, proving its efficacy. Whilst our approach sacrifices some performance compared to non-explainable methods, we believe that it moves us closer to the goal of transparent, interpretable AI and lays the groundwork for future enhancements through attribute expansion.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper proposes an interpretable, attribute-based approach to speaker verification (SV) to address the trust concerns raised by the opaque use of speaker characteristics in current SV systems. Traditional SV systems rely on individual speaker characteristics, but how those characteristics drive a decision is hidden, which makes the decision hard to understand and explain. The proposed system instead verifies speakers by comparing personal attributes such as gender, nationality, and age, extracted automatically from speech recordings. The authors argue that this is closer to human reasoning and therefore easier to understand than traditional methods. On the VoxCeleb1 test set, the system's best performance is comparable to the reference result obtained when all attributes are correct, which the authors take as evidence of its effectiveness.
The experiments compare three first-stage attribute classifiers (Xvector, ECAPA, and AC) and find that, in certain cases, the AC classifier trained directly on MFCC input outperforms the Xvector and ECAPA classifiers trained on pre-trained embeddings. The paper also compares softmax labels with hard labels for the second-stage similarity computation and finds that softmax labels reduce the Equal Error Rate (EER). Occupation and nationality emerge as the key attributes for distinguishing speakers, while age has a smaller influence.
The approach sacrifices some performance relative to non-interpretable methods. The authors nonetheless see it as a step towards transparent, interpretable AI, and plan to close the gap by incorporating more relevant attributes while keeping the system interpretable.
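The difference between softmax labels and hard labels in the second stage, and how EER summarises the outcome, can be illustrated with a small sketch. This is not the authors' implementation: the summary does not say how attribute similarity is computed, so the mean per-attribute cosine similarity used here, the function names, and the toy posteriors are all assumptions made for illustration.

```python
import numpy as np

def to_hard(probs):
    """Turn a softmax vector into a one-hot vector at its argmax."""
    one_hot = np.zeros_like(probs)
    one_hot[np.argmax(probs)] = 1.0
    return one_hot

def attribute_similarity(attrs_a, attrs_b, soft=True):
    """Mean per-attribute cosine similarity between two utterances.

    attrs_a / attrs_b map an attribute name to its class posteriors.
    Cosine similarity is an assumed scoring function, not the paper's.
    """
    scores = []
    for name in attrs_a:
        a = np.asarray(attrs_a[name], dtype=float)
        b = np.asarray(attrs_b[name], dtype=float)
        if not soft:
            a, b = to_hard(a), to_hard(b)
        scores.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    return sum(scores) / len(scores)

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)  # True = same-speaker trial
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[~labels])   # impostor trials accepted
        frr = np.mean(~accept[labels])   # target trials rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy posteriors (made up): the two utterances agree on gender but
# their most likely nationality differs by a narrow margin.
utt_a = {"gender": [0.9, 0.1], "nationality": [0.6, 0.3, 0.1]}
utt_b = {"gender": [0.8, 0.2], "nationality": [0.3, 0.6, 0.1]}

hard_score = attribute_similarity(utt_a, utt_b, soft=False)
soft_score = attribute_similarity(utt_a, utt_b, soft=True)
```

With hard labels, the near-tie on nationality counts as a complete mismatch, whereas the softmax vectors still overlap substantially. Keeping that uncertainty in the score is one plausible reason soft labels could lower the EER, though the summary does not state the cause.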