Abstract: Speaker identification is the task of identifying an individual from a set of speakers, and text-independent speaker identification systems allow speakers to utter any phrase without constraints. This study focuses on raw audio analysis because phase, fine-grained frequency patterns, timing cues, and other minute characteristics are preserved when raw waveforms are processed, in contrast to handcrafted features such as Mel-Frequency Cepstral Coefficients (MFCC) and visual representations of audio such as spectrograms. This depth of information, which includes variations in speech rhythm, pitch, and vocal tract shape, is beneficial for identifying speakers. The deep learning architecture known as SincNet has gained popularity in speaker identification because its parametric sinc functions allow it to operate directly on the raw audio input. In this paper, we take SincNet as the baseline model for speaker identification and analyse the effects of proper speech boundary detection, the inclusion of high-level features, and effective optimizer selection. Precise identification of the start and end points of the signal is important for eliminating redundant non-speech regions, so we include an endpoint detection module as a pre-processing step. Proper feature extraction and selection are crucial to the model's success; to extract more abstract features from the data, we add more convolution layers to the original SincNet model. Further, we investigate the sensitivity of the hyperparameter tuning protocol to the optimizer and select a suitable optimizer for raw audio analysis. With all the modifications to the system architecture, we achieve improvements of 12.76%, 13.33%, and 13.39% for training, validation, and testing, respectively, over the original SincNet model. In terms of validation loss, our proposed approach attains 0.35, compared with 1.02 for the original SincNet. This improvement comes with a marginal increase of 20 minutes in total training time for our proposed model. We conducted our investigation on the LibriSpeech dataset to assess the effectiveness of our proposed system in comparison with the other model.
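
To make the two signal-level ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. The first function builds a band-pass kernel from only two cutoff frequencies, in the spirit of SincNet's parametric sinc filters; in SincNet these cutoffs are learned, whereas here they are fixed inputs. The second function crops leading and trailing non-speech regions. The abstract does not say how endpoint detection is performed, so the short-time energy threshold used here is an assumption, and all names and default values (`sinc_bandpass`, `trim_endpoints`, `kernel_size`, `rel_threshold`, the 16 kHz sample rate) are hypothetical.

```python
import numpy as np


def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=251, sample_rate=16000):
    """Windowed band-pass FIR kernel defined only by its two cutoffs.

    Illustrative only: SincNet learns the cutoffs during training,
    here they are passed in as fixed values.
    """
    # Symmetric time axis in samples, centred on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cutoffs as normalised frequencies (cycles per sample).
    f1 = f_low_hz / sample_rate
    f2 = f_high_hz / sample_rate
    # Band-pass = difference of two ideal low-pass filters.
    # np.sinc(x) is sin(pi*x)/(pi*x), so 2*f*np.sinc(2*f*n) is a low-pass
    # impulse response with cutoff f.
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A Hamming window reduces the ripple caused by truncating the sinc.
    return kernel * np.hamming(kernel_size)


def trim_endpoints(signal, frame_len=400, hop=160, rel_threshold=0.05):
    """Crop low-energy regions before the first and after the last active frame.

    Assumed method: a frame is 'active' when its short-time energy exceeds
    rel_threshold times the maximum frame energy.
    """
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    energy = np.array([
        np.sum(signal[i * hop: i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    active = np.flatnonzero(energy > rel_threshold * energy.max())
    if active.size == 0:
        # Nothing above threshold (e.g. pure silence): return unchanged.
        return signal
    start = active[0] * hop
    end = min(len(signal), active[-1] * hop + frame_len)
    return signal[start:end]


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 1000 * t)  # 1 s stand-in for speech
    silence = np.zeros(sr // 2)          # 0.5 s of non-speech
    audio = np.concatenate([silence, tone, silence])

    speech = trim_endpoints(audio)
    kernel = sinc_bandpass(300.0, 3400.0, sample_rate=sr)
    filtered = np.convolve(speech, kernel, mode="same")
    print(len(audio), len(speech), filtered.shape)
```

In a SincNet-style first layer, a bank of such kernels with different cutoff pairs would be convolved with the trimmed waveform, and the cutoffs would be updated by gradient descent along with the rest of the network.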

Modified layer deep convolution neural network for text-independent speaker recognition

Text-independent voiceprint recognition via compact embedding of dilated deep convolutional neural networks

A stacked convolutional neural network framework with multi-scale attention mechanism for text-independent voiceprint recognition

A focus module-based lightweight end-to-end CNN framework for voiceprint recognition

Self-attention Based Speaker Recognition Using Cluster-Range Loss

An optimized attention based hybrid deep learning framework for automatic speaker identification from speech signals

End-to-End Feature Learning for Text-Independent Speaker Verification

Voice Presentation Attack Detection Using Convolutional Neural Networks

Text-independent speaker identification using modified SincNet with robust features from suitable acoustic region and appropriate optimizer for raw audio analysis

Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition

An efficient speaker identification framework based on Mask R-CNN classifier parameter optimized using hosted cuckoo optimization (HCO)

Speaker recognition using Improved Butterfly Optimization Algorithm with hybrid Long Short Term Memory network

A deep learning approach for text-independent speaker recognition with short utterances

Self-Attention Networks for Text-Independent Speaker Verification

Identification and Recognition of Speaker Voice Using a Neural Network-Based Algorithm

Speaker Verification using Convolutional Neural Networks

Deep Speaker Feature Learning for Text-independent Speaker Verification

Towards Speaker Identification with Minimal Dataset and Constrained Resources using 1D-Convolution Neural Network

RSKNet-MTSP: Effective and Portable Deep Architecture for Speaker Verification

Speech Recognition using Convolution Deep Neural Networks

CACRN-Net: A 3D log Mel spectrogram based channel attention convolutional recurrent neural network for few-shot speaker identification