Abstract: Biometric authorization using fingerprint, voiceprint, and facial features has recently attracted considerable public attention, driven by advances in recognition techniques and the popularization of the smartphone. Among these biometrics, the voiceprint identifies a person as distinctively as a fingerprint and, like face recognition, works in a noncontact mode. Speech signal processing is one of the keys to accuracy in voice recognition. Most voice-identification systems still employ the mel-scale frequency cepstrum coefficient (MFCC) as the key vocal feature. The quality and accuracy of the MFCC depend on the prepared phrase, which ties it to text-dependent speaker identification. In contrast, several newer features, such as the d-vector, learn vocal features through a black-box process. To address these limitations, this study proposes a novel data-driven approach to vocal feature extraction based on a decision-support system (DSS). Using this DSS, each speech signal can be transformed into a vector representing its vocal features. The establishment of this DSS involves three steps: (i) voice data preprocessing, (ii) hierarchical cluster analysis for the inverse discrete cosine transform cepstrum coefficient, and (iii) learning the E-vector through minimization of the Euclidean metric. We conduct comparative experiments to evaluate the E-vectors extracted by this DSS against other vocal feature measures, applying them to both text-dependent and text-independent datasets. In the experiments with one utterance per speaker, the average accuracy of the E-vector is approximately 1.5% higher than that of the MFCC. In the experiments with multiple utterances per speaker, the average micro-F1 score of the E-vector is approximately 2.1% higher than that of the MFCC. The E-vector shows remarkable advantages on both the Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus and the LibriSpeech corpus. These improvements strengthen the E-vector's capability for speaker identification and enhance its usability in more real-world identification tasks.
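
The micro-F1 score reported for the multiple-utterance experiments can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `micro_f1`, the speaker labels, and the predictions below are hypothetical and serve only to show how micro-averaging pools counts over all speakers before computing F1.

```python
from collections import Counter


def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1: pool TP, FP, and FN over all speakers, then compute F1."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1   # utterance attributed to the correct speaker
        else:
            fp[pred] += 1    # wrongly attributed to speaker `pred`
            fn[truth] += 1   # missed for the actual speaker `truth`
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    denominator = 2 * total_tp + total_fp + total_fn
    return 2 * total_tp / denominator if denominator else 0.0


# Hypothetical data: three speakers with two test utterances each.
truth = ["spk_a", "spk_a", "spk_b", "spk_b", "spk_c", "spk_c"]
pred = ["spk_a", "spk_b", "spk_b", "spk_b", "spk_c", "spk_a"]
score = micro_f1(truth, pred)
print(f"micro-F1 = {score:.4f}")
```

When every utterance receives exactly one predicted speaker, as in this sketch, each error counts once as a false positive and once as a false negative, so the micro-F1 score equals plain accuracy.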

Eigenspace Estimation With Missing Values And Its Application To Eigenvoice Adaptation For Speech Recognition

Maximum Likelihood I-Vector Space Using PCA for Speaker Verification.

Latent Correlation Analysis of HMM Parameters for Speech Recognition

A Novel and Efficient Voice Activity Detector Using Shape Features of Speech Wave.

Eigenvoice-based MAP Adaptation Within Correlation Subspace

A New Subspace Based Speaker Adaptation Method

Discriminative Speaker Adaptation with Eigenvoices

Eigenvoice-based MAP Fast Adaptation in Correlation Subspaces

Rapid discriminative acoustic model based on eigenspace mapping for fast speaker adaptation

Improving Online Incremental Speaker Adaptation with Eigen Feature Space MLLR.

Combining Eigenvoice Speaker Modeling And Vts-Based Environment Compensation For Robust Speech Recognition

Experimental evaluation of a new speaker identification framework using PCA.

A Novel I-Vector Framework Using Multiple Features and PCA for Speaker Recognition in Short Speech Condition

Model Adaptation Using the Projection to Latent Structure Algorithm

Speaker-independent oriented research on features of speech emotion recognition

An eigenvalue filtering based subspace approach for speech enhancement

Voice Conversion towards Arbitrary Speakers With Limited Data.

Spatial Correlated Maximum A Posteriori Adaptation Algorithm

Phoneme Dependent Speaker Embedding And Model Factorization For Multi-Speaker Speech Synthesis And Adaptation

Data-Driven Decision-Support System for Speaker Identification Using E-Vector System

Self-supervised Predictive Coding Models Encode Speaker and Phonetic Information in Orthogonal Subspaces