Abstract: Short-utterance speaker identification is a difficult problem in natural language processing (NLP). Most state-of-the-art approaches to speech processing use convolutional neural networks (CNNs) and deep neural networks, and analyse data as a unidirectional stream in time. Earlier CNN-based approaches to speaker identification often relied on very dense or very wide layers, leading to a large number of parameters and significant computational cost. In this article, we propose a novel multi-scale attention-focused 1-dimensional convolutional neural network (MSA-CNN) for speaker recognition that combines L1 and L2 norms. The multi-scale convolutional architecture was developed to automatically extract multi-scale features from raw audio data using a variety of filter banks. A novel attention mechanism was built so that the multi-scale system focuses on the important speaker features in varying settings. Finally, the two components were combined and applied within the proposed multi-layered convolutional neural network framework to identify the speakers' labels. The proposed network model was tested on a number of standard voice databases and on a corpus recorded in real time. The experimental results show that our method outperformed a baseline CNN scheme (without an attention mechanism) as well as conventional speaker identification techniques involving feature engineering, achieving an accuracy of 97.94% across numerous databases and distortion conditions.
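
The idea of multi-scale filtering followed by attention can be illustrated with a small sketch. It is not the MSA-CNN itself: the kernel values, the per-branch scoring rule (`score_weights`) and all function names are assumptions made for illustration, and the combination of L1 and L2 norms is left out because the abstract does not say how it is applied. Each branch filters the raw signal with a kernel of a different width, the responses are pooled, and a softmax over the branches yields the attention weights.

```python
import math


def conv1d_same(signal, kernel):
    """Zero-padded 1-D cross-correlation; output length equals input length."""
    k = len(kernel)
    left = (k - 1) // 2
    padded = [0.0] * left + list(signal) + [0.0] * (k - 1 - left)
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal))
    ]


def relu(xs):
    """Element-wise rectifier."""
    return [max(0.0, x) for x in xs]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_attention(signal, kernels, score_weights):
    """One branch per kernel width, then softmax attention over the branches."""
    pooled = []
    for kernel in kernels:
        fmap = relu(conv1d_same(signal, kernel))
        pooled.append(sum(fmap) / len(fmap))  # global average pooling
    # Assumed scoring rule: one scalar per branch times its pooled response.
    scores = [w * p for w, p in zip(score_weights, pooled)]
    weights = softmax(scores)
    fused = sum(a * p for a, p in zip(weights, pooled))
    return pooled, weights, fused


# Toy waveform and three hand-picked kernels of widths 2, 3 and 5.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
kernels = [
    [1.0, -1.0],
    [0.25, 0.5, 0.25],
    [0.2, 0.2, 0.2, 0.2, 0.2],
]
pooled, weights, fused = multi_scale_attention(signal, kernels, [1.0, 1.0, 1.0])
```

In a trained network the kernels and scoring weights would be learned, and the fused representation would feed further convolutional layers and a classifier over speaker labels.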
