Abstract: Biometric authorization using fingerprints, voiceprints, and facial features has recently attracted considerable public attention, driven by advances in recognition techniques and the popularization of smartphones. Among these biometrics, the voiceprint identifies a person as distinctively as a fingerprint and, like face recognition, works in a noncontact mode. Speech signal processing is one of the keys to accuracy in voice recognition. Most voice-identification systems still employ the mel-scale frequency cepstral coefficient (MFCC) as the key vocal feature. However, the quality and accuracy of the MFCC depend on the prepared phrase, which ties it to text-dependent speaker identification. In contrast, several newer features, such as the d-vector, learn vocal features through a black-box process. To address these limitations, this study proposes a novel data-driven approach to vocal feature extraction based on a decision-support system (DSS). The DSS transforms each speech signal into a vector representing its vocal features. Building the DSS involves three steps: (i) voice data preprocessing, (ii) hierarchical cluster analysis of the inverse discrete cosine transform cepstrum coefficients, and (iii) learning the E-vector by minimizing the Euclidean metric. We conduct comparative experiments to evaluate the E-vectors extracted by this DSS against other vocal feature measures on both text-dependent and text-independent datasets. In the experiments with one utterance per speaker, the average accuracy of the E-vector improves by approximately 1.5% over the MFCC. In the experiments with multiple utterances per speaker, the average micro-F1 score of the E-vector improves by approximately 2.1% over the MFCC. The E-vector shows remarkable advantages on both the Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus and the LibriSpeech corpus. These improvements strengthen speaker identification and make the E-vector more usable for real-world identification tasks.
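
The multi-utterance results above are reported as a micro-F1 score. As a minimal sketch of how that metric can be computed, assuming each utterance is assigned exactly one speaker label, the example below pools true positives, false positives, and false negatives over all speakers before computing F1. The function name `micro_f1` and the speaker labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all speakers, then compute F1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1   # correct identification
        else:
            fp[pred_label] += 1   # predicted speaker gains a false positive
            fn[true_label] += 1   # true speaker gains a false negative

    tp_sum = sum(tp.values())
    fp_sum = sum(fp.values())
    fn_sum = sum(fn.values())

    # F1 = 2*TP / (2*TP + FP + FN), using the pooled counts
    denom = 2 * tp_sum + fp_sum + fn_sum
    return 2 * tp_sum / denom if denom else 0.0


# Hypothetical labels: three speakers, two utterances each
y_true = ["spk1", "spk1", "spk2", "spk2", "spk3", "spk3"]
y_pred = ["spk1", "spk2", "spk2", "spk2", "spk3", "spk1"]

score = micro_f1(y_true, y_pred)
print(f"micro-F1 = {score:.4f}")
```

In this single-label setting every misclassification adds one false positive and one false negative, so micro-F1 coincides with plain accuracy; the two metrics diverge only when an utterance can carry several labels or none.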

Text-Dependent Speaker Recognition with Long-Term Features Based on Functional Data Analysis

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

Time–Frequency Cepstral Features and Heteroscedastic Linear Discriminant Analysis for Language Recognition

Variant Time-Frequency Cepstral Features for Speaker Recognition

A text-dependent speaker verification application framework based on Chinese numerical string corpus

Short Utterance Speaker Recognition Based on Speech High Frequency Information Compensation and Dynamic Feature Enhancement Methods

Multi-feature Combination for Speaker Recognition

Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings

Face Recognition Using Novel LDA-Based Algorithms

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Multi-resolution Time Frequency Feature and Complementary Combination for Short Utterance Speaker Recognition

Improving Short-Duration Speaker Recognition by Joint Bark-Wavelet Acoustic Feature Coupling and Triplet Dual-Attention Mechanism Network

Data-Driven Decision-Support System for Speaker Identification Using E-Vector System

Robust Feature Extraction Using Temporal Context Averaging for Speaker Identification in Diverse Acoustic Environments

DS-TDNN: Dual-stream Time-delay Neural Network with Global-aware Filter for Speaker Verification

Time-frequency Network for Robust Speaker Recognition

Advanced Long-Content Speech Recognition With Factorized Neural Transducer

Speaker Recognition Using DMFCC over Telephone Channels

Speaker Change Detection for Transformer Transducer ASR

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

Improving Speaker Verification Performance Against Long-Term Speaker Variability