CACRN-Net: A 3D log Mel spectrogram based channel attention convolutional recurrent neural network for few-shot speaker identification
Banala Saritha, Mohammad Azharuddin Laskar, Anish Monsley K, Rabul Hussain Laskar, Madhuchhanda Choudhury
DOI: https://doi.org/10.1016/j.compeleceng.2024.109100
IF: 4.152
2024-02-02
Computers & Electrical Engineering
Abstract: Advancements in deep learning for speaker identification are constrained by the limited availability of data, especially in law enforcement applications. This has led to the emergence of few-shot speaker identification, a technique that classifies unseen test samples with the help of a few support samples. Despite several attempts to advance few-shot speaker identification, significant challenges persist, including the extraction of robust speaker embeddings, overfitting, and prototype shift error. This paper proposes a few-shot speaker identification system that employs a novel architecture, the Channel Attention-based Convolutional Recurrent Neural Network (CACRN-Net), with three-dimensional (3D) log Mel spectrogram inputs to mitigate overfitting and improve the accuracy of speaker embeddings. Furthermore, a self-attention mechanism alleviates prototype shift errors caused by noisy data. The proposed framework is compared with existing methods on the VCTK and VoxCeleb1 speech corpora through 5-way, 5-shot learning experiments. To assess the performance of the framework under speech variability conditions, the IIT Guwahati (IITG) multi-variability (MV) speech database is used. The proposed approach outperforms state-of-the-art techniques, improving speaker identification accuracy by 2.73% on the VCTK database and by 2.3% on VoxCeleb1.
engineering, electrical & electronic, computer science, interdisciplinary applications, hardware & architecture
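
The 5-way, 5-shot setting and the notion of a class prototype mentioned in the abstract can be illustrated with a minimal sketch. It assumes a prototypical-network-style classifier, in which each speaker's prototype is the mean of that speaker's support embeddings and a query is assigned to the nearest prototype; the abstract does not state that CACRN-Net classifies this way. The function names, the embedding dimension, and the synthetic embeddings are hypothetical stand-ins for real network outputs, and the self-attention step the paper uses against prototype shift is not modelled.

```python
import numpy as np

def class_prototypes(support_emb, support_labels, n_way):
    """Return one prototype per class: the mean of that class's support embeddings."""
    return np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in range(n_way)]
    )

def classify_queries(query_emb, prototypes):
    """Assign each query to the prototype with the smallest squared Euclidean distance."""
    diffs = query_emb[:, None, :] - prototypes[None, :, :]  # (n_query, n_way, dim)
    dists = (diffs ** 2).sum(axis=-1)                       # (n_query, n_way)
    return dists.argmin(axis=1)

# Synthetic 5-way, 5-shot episode (assumption: embeddings would normally
# come from a trained speaker-embedding network, not from random noise).
rng = np.random.default_rng(0)
n_way, k_shot, dim = 5, 5, 16
centers = 10.0 * np.eye(n_way, dim)  # hypothetical, well-separated speaker centers

support_labels = np.repeat(np.arange(n_way), k_shot)
support_emb = centers[support_labels] + 0.1 * rng.standard_normal((n_way * k_shot, dim))

query_labels = np.arange(n_way)      # one query utterance per speaker
query_emb = centers[query_labels] + 0.1 * rng.standard_normal((n_way, dim))

prototypes = class_prototypes(support_emb, support_labels, n_way)
predictions = classify_queries(query_emb, prototypes)
```

With only five support samples per speaker, a single noisy embedding can pull the mean away from the true class center, which is one way to picture the prototype shift error that the abstract refers to.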