Speaker Detection in the Wild: Lessons Learned from JSALT 2019

Leibny Paola García-Perera, Jesús Villalba, Hervé Bredin, Jun Du, Diego Castán, Alejandrina Cristià, Latané Bullock, Ling Guo, Koji Okabe, Phani Sankar Nidadavolu, Saurabh Kataria, Sizhu Chen, Léo Galmant, Marvin Lavechin, Lei Sun, Marie-Philippe Gill, Bar Ben-Yair, Sajjad Abdoli, Xin Wang, Wassim Bouaziz, Hadrien Titeux, Emmanuel Dupoux, Kong Aik Lee, Najim Dehak
DOI: https://doi.org/10.21437/odyssey.2020-59
2020-01-01
Abstract: This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions, from meetings to speech in the wild. We describe the research threads we explored and a set of modules that proved successful for these scenarios. The ultimate goal was to explore speaker detection, but our first finding was that effective diarization improves detection, and omitting the diarization stage degrades performance. All the configurations we studied agree on this fact and follow a main backbone that includes diarization as a preceding stage. With this backbone, we analyzed the following problems: voice activity detection, handling noisy signals, domain mismatch, improving the clustering, and the overall impact of earlier stages on the final speaker detection. In this paper, we show partial results for speaker diarization to give a better understanding of the problem, and we present the final results for speaker detection.