Speaker Detection by the Individual Listener and the Crowd: Parametric Models Applicable to Bonafide and Deepfake Speech

Tomi H. Kinnunen, Rosa Gonzalez Hautamäki, Xin Wang, Junichi Yamagishi
DOI: https://doi.org/10.21437/interspeech.2024-1704
2024-01-01
Abstract: Subjective speaker detection, whether for bonafide (real) or spoofed (fake) speech, is often implemented through crowdsourcing to facilitate comparison of systems, with less attention paid to the source of the ratings: the listener. We characterize speaker detection at the level of both the individual listener and the crowd. Each listener possesses a certain sensitivity and bias for observing speaker differences. By combining a detection model with random between-listener effects, we obtain a generalized linear mixed effects (GLME) model, demonstrated here for two different tasks. The first involves bonafide data from VoxCeleb1 under a biased set-up containing varied role-play instructions; the second, focused on spoofing, presents a re-analysis of the ASVspoof 2019 subjective data. Our GLME enables sampling listeners and obtaining parametric detection error trade-off (DET) profiles and equal error rates (EERs).
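To make the idea of per-listener sensitivity and bias concrete, the following is a minimal sketch, not the paper's actual GLME. It assumes an equal-variance Gaussian signal detection model, in which each listener is described by a sensitivity d' and a decision criterion c, and it assumes that both vary across listeners as independent Gaussians. The function names and all numeric values are hypothetical and chosen for illustration only; the abstract does not specify the parameterization.

```python
import random
from statistics import NormalDist

# Standard normal distribution, used for the probit link.
STD_NORMAL = NormalDist()


def error_rates(d_prime, criterion):
    """Miss and false-alarm rates of one listener.

    Assumption (not from the paper): target trials give evidence
    ~ N(+d'/2, 1), non-target trials ~ N(-d'/2, 1), and the listener
    answers "same speaker" when the evidence exceeds the criterion.
    """
    p_miss = STD_NORMAL.cdf(criterion - d_prime / 2)
    p_fa = STD_NORMAL.cdf(-criterion - d_prime / 2)
    return p_miss, p_fa


def eer(d_prime):
    """Equal error rate under the same assumption (unbiased criterion c = 0)."""
    return STD_NORMAL.cdf(-d_prime / 2)


def sample_listeners(n, mean_d, sd_d, mean_c, sd_c, seed=0):
    """Draw n hypothetical listeners as (d', c) pairs.

    The Gaussian between-listener effects stand in for the random
    effects of a mixed model; the values are illustrative only.
    """
    rng = random.Random(seed)
    return [(rng.gauss(mean_d, sd_d), rng.gauss(mean_c, sd_c)) for _ in range(n)]


if __name__ == "__main__":
    # Hypothetical crowd: average sensitivity 2.0, no average bias.
    for d_prime, criterion in sample_listeners(5, 2.0, 0.5, 0.0, 0.3):
        p_miss, p_fa = error_rates(d_prime, criterion)
        print(f"d'={d_prime:.2f} c={criterion:+.2f} "
              f"miss={p_miss:.3f} fa={p_fa:.3f} eer={eer(d_prime):.3f}")
```

Under this assumed model, sweeping the criterion for a fixed d' traces one listener's parametric DET profile, and the bias only moves the operating point along that profile without changing the EER.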