Effect of utterance duration and phonetic content on speaker identification using second-order statistical methods

Ivan Magrin-Chagnolleau, Jean François Bonastre, Frédéric Bimbot
2024-02-26
Abstract: Second-order statistical methods show very good results for automatic speaker identification in controlled recording conditions. These approaches are generally used on the entire speech material available. In this paper, we take a more analytical approach and study the influence of the content of the test speech material on the performance of such methods. The goal is to investigate what kind of information these methods use, and where it is located in the speech signal. Liquids and glides together, vowels, and more particularly nasal vowels and nasal consonants, are found to be particularly speaker specific: test utterances of 1 second, composed mostly of acoustic material from one of these classes, provide better speaker identification results than phonetically balanced test utterances, even though the training is done, in both cases, with 15 seconds of phonetically balanced speech. Nevertheless, results with other phoneme classes are never dramatically poor. These results tend to show that the speaker-dependent information captured by long-term second-order statistics is consistently common to all phonetic classes, and that the homogeneity of the test material may improve the quality of the estimates.
Information Retrieval, Signal Processing
What problem does this paper attempt to address?
This paper investigates how utterance duration and phonetic content affect automatic speaker identification based on second-order statistical methods. Instead of testing on all available speech, the study uses phonetically unbalanced test utterances, each composed mostly of acoustic material from a single phoneme class, and compares identification performance across classes. Liquids and glides taken together, vowels, and especially nasal vowels and nasal consonants turn out to be particularly speaker specific: 1-second test utterances drawn mainly from one of these classes give better identification results than phonetically balanced test utterances, even though training in both cases uses 15 seconds of phonetically balanced speech. Results for the other phoneme classes are nevertheless never dramatically poor. The authors take this as evidence that the speaker-dependent information captured by long-term second-order statistics is largely common to all phonetic classes, and that homogeneous test material may improve the quality of the estimates. Overall, the phonetic content of the test speech has a clear effect on identification performance, yet speaker-specific information is still captured when the phonetic content of the test material differs from that of the balanced training material.
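To make "second-order statistical methods" concrete, here is a minimal sketch of covariance-based speaker identification. The abstract does not say which second-order measure the authors use, so this sketch assumes the arithmetic-harmonic sphericity measure, one such measure from the speaker recognition literature. Synthetic Gaussian frames stand in for real acoustic features, the rate of 100 frames per second is an assumption, and all function and speaker names are illustrative.

```python
import numpy as np


def covariance(frames):
    """Sample covariance of a (n_frames, n_coeffs) array of feature vectors."""
    return np.cov(frames, rowvar=False)


def sphericity(cov_x, cov_y):
    """Arithmetic-harmonic sphericity between two covariance matrices.

    This is log(arithmetic mean / harmonic mean) of the eigenvalues of
    inv(cov_x) @ cov_y: zero when the two matrices are proportional,
    positive otherwise.
    """
    p = cov_x.shape[0]
    t_xy = np.trace(np.linalg.solve(cov_x, cov_y))  # tr(inv(X) Y)
    t_yx = np.trace(np.linalg.solve(cov_y, cov_x))  # tr(inv(Y) X)
    return float(np.log(t_xy * t_yx) - 2.0 * np.log(p))


def identify(test_frames, models):
    """Return the name of the reference speaker closest to the test utterance."""
    test_cov = covariance(test_frames)
    return min(models, key=lambda name: sphericity(models[name], test_cov))


# Synthetic stand-in for two speakers (assumption: no real speech data here).
rng = np.random.default_rng(0)
scales = {
    "spk_a": np.array([1.0, 2.0, 3.0, 4.0]),
    "spk_b": np.array([4.0, 3.0, 2.0, 1.0]),
}

# "Training": 1500 frames per speaker, about 15 s at the assumed frame rate.
models = {
    name: covariance(rng.normal(size=(1500, 4)) * s)
    for name, s in scales.items()
}

# "Test": 100 frames from spk_a, about 1 s at the assumed frame rate.
test_frames = rng.normal(size=(100, 4)) * scales["spk_a"]
print(identify(test_frames, models))
```

The measure is unchanged when either covariance matrix is multiplied by a scalar, so in this sketch a change in overall signal gain does not affect the decision. With only about 1 second of test speech the covariance estimate is noisy, which is one way to read the abstract's remark that homogeneous test material may improve the quality of the estimates.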