Abstract: Speaker identification in challenging acoustic environments, where noise, reverberation, and emotional fluctuations degrade the signal, requires improved feature extraction techniques. Although existing methods effectively extract distinctive acoustic features, they show limitations in these adverse settings. To overcome these limitations, we propose the Temporal Context-Enhanced Features (TCEF) approach, which provides a consistent audio representation for better performance under various acoustic conditions. TCEF uses a context window to average features across adjacent frames, reducing the short-term variations caused by noise, reverberation, and emotional speech, as well as those present in neutral recordings. This averaging strengthens the distinctive features of a speaker's voice and improves speaker identification in both challenging and neutral acoustic environments. To evaluate TCEF against conventional features, we used a One-Dimensional Convolutional Neural Network (1D-CNN) for detailed frame-level analysis and a Long Short-Term Memory (LSTM) network for comprehensive sequence-level analysis. We used four datasets to assess the effectiveness of the TCEF approach. The GRID and RAVDESS datasets represent neutral and emotional speech, respectively. To test the robustness of our system under adverse acoustic conditions, we created two additional datasets, GRID-NR and RAVDESS-NR, which are modified versions of the original GRID and RAVDESS with added noise and reverberation. The results showed that TCEF significantly outperformed existing feature extraction methods in identifying speakers across diverse acoustic environments.
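
The context-window averaging described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `temporal_context_average`, the half-width parameter `context`, the symmetric window, and the edge-frame padding are all assumptions made for illustration, since the abstract does not specify the window size or how boundary frames are handled. The input is assumed to be a frame-level feature matrix (for example, cepstral coefficients) of shape (frames, coefficients).

```python
import numpy as np


def temporal_context_average(features: np.ndarray, context: int = 2) -> np.ndarray:
    """Average each frame with `context` neighbouring frames on each side.

    features: array of shape (num_frames, num_coeffs).
    context:  hypothetical half-width of the window; context=2 gives a
              5-frame window. The abstract does not state the real value.
    """
    if features.ndim != 2:
        raise ValueError("expected an array of shape (num_frames, num_coeffs)")
    if context < 0:
        raise ValueError("context must be non-negative")

    window = 2 * context + 1
    # Assumption: repeat the first and last frames so that the output
    # keeps the same number of frames as the input.
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")

    # A running sum along the time axis lets each window sum be computed
    # as the difference of two cumulative sums.
    csum = np.cumsum(padded, axis=0, dtype=float)
    csum = np.vstack([np.zeros((1, features.shape[1])), csum])
    return (csum[window:] - csum[:-window]) / window
```

Each output frame is the mean of the frames inside its window, so frame-to-frame jitter is smoothed while slower speaker-dependent structure is largely preserved. A larger `context` smooths more aggressively at the cost of temporal detail.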

Variant Time-Frequency Cepstral Features for Speaker Recognition

Time–Frequency Cepstral Features and Heteroscedastic Linear Discriminant Analysis for Language Recognition

Time-Frequency Cepstral Features and Combining Discriminative Training for Phonotactic Language Recognition

Multi-feature Combination for Speaker Recognition

Multi-resolution Time Frequency Feature and Complementary Combination for Short Utterance Speaker Recognition

Time-frequency Network for Robust Speaker Recognition

Speaker Recognition Using DMFCC over Telephone Channels

On the Importance of Components of the MFCC in Speech and Speaker Recognition

Short Utterance Speaker Recognition Based on Speech High Frequency Information Compensation and Dynamic Feature Enhancement Methods

Text-Dependent Speaker Recognition with Long-Term Features Based on Functional Data Analysis

Improving Speaker Verification Performance Against Long-Term Speaker Variability

Auditory model-based speech feature extraction and its application to speaker identification

Robust Feature Extraction Using Temporal Context Averaging for Speaker Identification in Diverse Acoustic Environments

A Novel I-Vector Framework Using Multiple Features and PCA for Speaker Recognition in Short Speech Condition

The predictive differential amplitude spectrum for robust speaker recognition in stationary noises

A novel hybrid feature method based on Caelen auditory model and gammatone filterbank for robust speaker recognition under noisy environment and speech coding distortion

Fractional Fourier Transform Based Auditory Feature for Language Identification

Speaker Verification Using Simple Temporal Features and Pitch Synchronous Cepstral Coefficients

Entropy of Energy Operator As Feature for Large Vocabulary Mandarin Speaker Independent Speech Recognition

Design and implementation of speech recognition algorithm based on frequency range

TRSD: A Time-Varying and Region-Changed Speech Database for Speaker Recognition