Oğuzhan Kurnaz, Jagabandhu Mishra, Tomi H. Kinnunen, Cemal Hanilçi
Abstract: Automatic speaker verification (ASV) systems are vulnerable to spoofing attacks. We propose a spoofing-robust ASV system optimized directly for the recently introduced architecture-agnostic detection cost function (a-DCF), which allows targeting a desired trade-off between the contradicting aims of user convenience and robustness to spoofing. We combine a-DCF and binary cross-entropy (BCE) with a novel straightforward threshold optimization technique. Our results with an embedding fusion system on ASVspoof2019 data demonstrate relative improvement of $13\%$ over a system trained using BCE only (from minimum a-DCF of $0.1445$ to $0.1254$). Using an alternative non-linear score fusion approach provides relative improvement of $43\%$ (from minimum a-DCF of $0.0508$ to $0.0289$).
What problem does this paper attempt to address?
This paper addresses the vulnerability of automatic speaker verification (ASV) systems to spoofing attacks. It proposes a spoofing-robust ASV system that is optimized directly for the architecture-agnostic detection cost function (a-DCF). By combining the a-DCF and binary cross-entropy (BCE) losses with a novel threshold optimization technique, the work aims to make speaker verification more robust to spoofing while letting the designer target a desired trade-off between user convenience and security.
### Main Problems and Solutions
1. **Problem Description**:
- **Vulnerability to Spoofing Attacks**: Existing ASV systems are easily exploited by spoofing attacks (such as replay attacks, text-to-speech synthesis, etc.).
- **Limitations of Evaluation Metrics**: The traditional t-DCF (tandem Detection Cost Function) is only applicable to tandem architectures and cannot be widely applied to other types of systems.
2. **Solutions**:
- **Introduction of a-DCF**: a-DCF is a new detection cost function that can evaluate spoofing-robust speaker verification systems of different architectures with only one set of detection scores and one detection threshold.
- **Softening a-DCF**: Since a-DCF is based on hard error counting and is not differentiable, the paper proposes to "soften" a-DCF into a differentiable form so that it can be optimized using gradient descent.
- **Joint Optimization of Model Parameters and Thresholds**: Through Algorithm 1, the neural network weights and the detection threshold are simultaneously optimized to minimize the a-DCF loss.
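The idea of softening the cost and updating model parameters and threshold together can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's Algorithm 1: the one-parameter scorer `g(x) = w * x`, the toy scores, the cost-times-prior weights, the learning rate, and all function names are assumptions made for the example, and the BCE term that the paper combines with a-DCF is left out.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_a_dcf(w, tau, data, weights):
    """Soft a-DCF for the toy scorer g(x) = w * x (hypothetical model)."""
    tar, non, spf = data
    a_tar, a_non, a_spf = weights  # each weight is cost * prior
    miss = sum(sigmoid(tau - w * x) for x in tar) / len(tar)
    fa_non = sum(sigmoid(w * x - tau) for x in non) / len(non)
    fa_spf = sum(sigmoid(w * x - tau) for x in spf) / len(spf)
    return a_tar * miss + a_non * fa_non + a_spf * fa_spf

def gradients(w, tau, data, weights):
    """Analytic gradients of soft_a_dcf with respect to w and tau."""
    grad_w = grad_tau = 0.0
    # Miss term uses sigmoid(tau - g); false-alarm terms use sigmoid(g - tau).
    signs = (-1.0, 1.0, 1.0)
    for xs, a, sign in zip(data, weights, signs):
        for x in xs:
            s = sigmoid(sign * (w * x - tau))
            ds = s * (1.0 - s) * a / len(xs)  # weighted sigmoid derivative
            grad_w += ds * sign * x
            grad_tau += ds * (-sign)
    return grad_w, grad_tau

def train(data, weights, w=1.0, tau=0.0, lr=0.1, steps=200):
    """Plain gradient descent that updates w and tau in the same step."""
    for _ in range(steps):
        g_w, g_tau = gradients(w, tau, data, weights)
        w, tau = w - lr * g_w, tau - lr * g_tau
    return w, tau

# Toy one-dimensional inputs (illustrative values, not from the paper).
data = (
    [2.0, 1.5, 0.2, 3.0],     # target trials
    [-1.0, 0.5, -2.0, 0.1],   # non-target trials
    [-0.5, 1.0, -1.5, -3.0],  # spoofed trials
)
weights = (1.0 * 0.9, 10.0 * 0.05, 10.0 * 0.05)  # placeholder costs and priors

initial_loss = soft_a_dcf(1.0, 0.0, data, weights)
w_opt, tau_opt = train(data, weights)
final_loss = soft_a_dcf(w_opt, tau_opt, data, weights)
print(f"soft a-DCF: {initial_loss:.4f} -> {final_loss:.4f}, tau = {tau_opt:.3f}")
```

Treating the threshold as just another trainable parameter is what makes the joint update possible: the sigmoid gives the loss a non-zero gradient with respect to both `w` and `tau`, whereas hard error counts are piecewise constant and give no gradient at all.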
### Experimental Results
The paper evaluates the optimized system on the ASVspoof2019 dataset. With the embedding fusion system, combining the a-DCF and BCE losses lowers the minimum a-DCF from $0.1445$ (BCE only) to $0.1254$, a relative improvement of $13\%$. With the alternative non-linear score fusion approach, the minimum a-DCF drops from $0.0508$ to $0.0289$, a relative improvement of $43\%$. Optimizing the detection threshold is reported to improve performance further.
### Key Formulas
- **a-DCF Formula**:
\[
a\text{-}DCF(\tau_{sasv}) = C_{tar}^{miss} \cdot \pi_{tar} \cdot P_{tar}^{miss}(\tau_{sasv}) + C_{non}^{fa} \cdot \pi_{non} \cdot P_{non}^{fa}(\tau_{sasv}) + C_{spf}^{fa} \cdot \pi_{spf} \cdot P_{spf}^{fa}(\tau_{sasv})
\]
where:
- \( C_{tar}^{miss} \), \( C_{non}^{fa} \), and \( C_{spf}^{fa} \) are the costs of a target miss, a non-target false alarm, and a spoofing false alarm respectively;
- \( \pi_{tar} \), \( \pi_{non} \), and \( \pi_{spf} \) are the prior probabilities of target, non-target, and spoofing attack respectively;
- \( P_{tar}^{miss} \), \( P_{non}^{fa} \), and \( P_{spf}^{fa} \) are the target miss rate, non - target false alarm rate, and spoofing false alarm rate respectively;
- \( \tau_{sasv} \) is the detection threshold.
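
As a rough illustration of the formula above, the following Python sketch computes the hard-count a-DCF at a given threshold and finds its minimum by sweeping candidate thresholds. The decision rule (accept when the score is at least the threshold), the cost and prior values, the toy scores, and the function names are assumptions chosen for the example; any normalization applied in the paper is omitted.

```python
def a_dcf(tar, non, spf, tau,
          c_miss=1.0, c_fa_non=10.0, c_fa_spf=10.0,
          pi_tar=0.9, pi_non=0.05, pi_spf=0.05):
    """Unnormalized a-DCF at threshold tau (placeholder costs and priors)."""
    # Assumed decision rule: a trial is accepted when its score >= tau.
    p_miss = sum(s < tau for s in tar) / len(tar)      # rejected targets
    p_fa_non = sum(s >= tau for s in non) / len(non)   # accepted non-targets
    p_fa_spf = sum(s >= tau for s in spf) / len(spf)   # accepted spoofs
    return (c_miss * pi_tar * p_miss
            + c_fa_non * pi_non * p_fa_non
            + c_fa_spf * pi_spf * p_fa_spf)

def min_a_dcf(tar, non, spf, **kwargs):
    """Minimum a-DCF over all thresholds that change at least one decision."""
    # Error counts only change at observed scores; infinity rejects everything.
    candidates = sorted(set(tar) | set(non) | set(spf)) + [float("inf")]
    return min((a_dcf(tar, non, spf, t, **kwargs), t) for t in candidates)

# Toy detection scores (illustrative values, not from the paper).
tar_scores = [2.0, 1.5, 0.2, 3.0]
non_scores = [-1.0, 0.5, -2.0, 0.1]
spf_scores = [-0.5, 1.0, -1.5, -3.0]

cost_at_tau = a_dcf(tar_scores, non_scores, spf_scores, tau=0.4)
best_cost, best_tau = min_a_dcf(tar_scores, non_scores, spf_scores)
print(f"a-DCF at tau=0.4: {cost_at_tau:.3f}")
print(f"min a-DCF: {best_cost:.3f} at tau={best_tau}")
```

Because the three error rates are step functions of the threshold, the cost is piecewise constant, which is why the minimum can be found by checking a finite set of thresholds but cannot be reached by gradient descent without softening.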
- **Softened Error Rate**:
\[
\hat{P}_{tar}^{miss}(\tau_{sasv}) = \frac{1}{N_{tar}} \sum_{x \in tar} \sigma(\tau_{sasv} - g(x))
\]
\[
\hat{P}_{non}^{fa}(\tau_{sasv}) = \frac{1}{N_{non}} \sum_{x \in non} \sigma(g(x) - \