Abstract: To process speech, most state-of-the-art experimental methods employ convolutional neural networks (CNNs), which operate on a continuous, 1-dimensional (1-D) time stream. For an audio signal, the mel-spectrogram represents the attributes of the utterances in the frequency domain (which corresponds to the speech spectrum). Moreover, for a time-series speaker signal, CNNs are superior to conventional machine learning or transfer learning models at capturing characteristics from long-form speech. This paper introduces a jump-connected 1-D CNN that employs a combined loss function for speaker recognition. The proposed model uses 1-D convolutional layers combined with jump connections to extract speaker-specific characteristics; this reduces time-based and frequency-based variability and speeds up computation. A loss function combining softmax loss, a stable L2-norm loss, and a smooth L1-norm loss guides the proposed compact convolutional neural network (CCNN) to identify the correct speaker with improved efficacy. We evaluated the proposed framework on various standard and real-time audio datasets. The experimental findings demonstrate that the proposed CCNN outperforms existing approaches, reducing the equal error rate by 9.02 %. The proposed voiceprint identification model also achieves an average speaker recognition rate of 98.76 %. In addition, the reliability of the 1-D CCNN is tested under various conditions. Other fields of study, such as language modelling, could employ this approach after some fine-tuning.

Relevance of the work: Speaker recognition is an area of interest in which machine learning (ML) and deep learning (DL) schemes, when combined, have the potential to substantially advance forensic science, automation, and authentication. A compact CNN can improve the identification and verification process by mitigating issues such as false positives and background noise. Extending this approach could also support raga identification and disease treatment therapies.
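
The abstract names three loss terms but does not say how they are combined. The following is only a minimal sketch of one plausible reading, assuming a weighted sum in which the L2 and smooth L1 terms compare the softmax probabilities with a one-hot speaker label. The function names, the weights `w_ce`, `w_l2`, `w_sl1`, the `beta` threshold, and the small `eps` used for numerical stability are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_l1(diff, beta=1.0):
    """Smooth L1 (Huber-style) penalty: quadratic near zero, linear beyond beta."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def combined_loss(logits, target, w_ce=1.0, w_l2=0.5, w_sl1=0.5, eps=1e-12):
    """Weighted sum of softmax cross-entropy, L2-norm, and smooth L1 terms.

    Assumption: the weighting scheme and the choice to compare probabilities
    against a one-hot label are illustrative, not the authors' formulation.
    """
    probs = softmax(logits)
    one_hot = [1.0 if i == target else 0.0 for i in range(len(logits))]
    # Cross-entropy on the true speaker class (clamped to avoid log(0)).
    ce = -math.log(max(probs[target], eps))
    # L2 norm of the error vector; eps keeps the square root well-behaved at 0.
    l2 = math.sqrt(sum((p - t) ** 2 for p, t in zip(probs, one_hot)) + eps)
    # Mean smooth L1 penalty over all speaker classes.
    sl1 = sum(smooth_l1(p - t) for p, t in zip(probs, one_hot)) / len(probs)
    return w_ce * ce + w_l2 * l2 + w_sl1 * sl1


# Scores for three hypothetical speakers; the true speaker is index 0.
confident_correct = combined_loss([5.0, 0.0, 0.0], target=0)
confident_wrong = combined_loss([0.0, 5.0, 0.0], target=0)
```

Under these assumptions, a confident correct prediction yields a smaller loss than a confident wrong one, which is the behaviour any such combination would need in order to steer the network toward the correct speaker.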
