Abstract: Short-utterance speaker identification is a difficult problem in natural language processing (NLP). Most state-of-the-art approaches to speech processing use convolutional neural networks (CNNs) and other deep neural networks, and analyse data as a unidirectional stream in time. Earlier CNN-based approaches to speaker identification often relied on very dense or very wide layers, leading to a large number of parameters and significant computational cost. In this article, we propose a novel multi-scale attention-based 1-dimensional convolutional neural network (MSA-CNN) for speaker recognition that combines L1 and L2 norms. The multi-scale convolutional architecture is designed to automatically extract multi-scale features from raw audio data using a variety of filter banks. A novel attention mechanism was built so that the multi-scale network focuses on the important speaker features under varying conditions. Finally, the two components were combined in the proposed multi-layer convolutional neural network framework to predict the speakers' labels. The proposed model was tested on several standard voice databases and on a corpus recorded in real time. The experimental results show that our method outperformed both a baseline CNN (without an attention mechanism) and conventional speaker identification techniques based on feature engineering, achieving an accuracy of 97.94% across multiple databases and distortion conditions.
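The general idea of multi-scale 1-D convolution followed by attention over scales can be illustrated with a minimal, standard-library Python sketch. This is not the MSA-CNN implementation: the kernel sizes, the random (untrained) filters, the global average pooling, and the softmax-based scale weighting are all assumptions chosen for illustration, and the L1/L2 norm terms and the speaker classifier are omitted.

```python
import math
import random


def conv1d(signal, kernel):
    """Valid-mode 1-D cross-correlation of a signal with a single kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(xs):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in xs]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_features(signal, kernels):
    """One pooled feature per kernel: conv -> ReLU -> global average pooling."""
    feats = []
    for kernel in kernels:
        activ = relu(conv1d(signal, kernel))
        feats.append(sum(activ) / len(activ))
    return feats


def attend(features, attn_weights):
    """Score each scale, normalise the scores with softmax, reweight features."""
    scores = [w * f for w, f in zip(attn_weights, features)]
    alphas = softmax(scores)
    return alphas, [a * f for a, f in zip(alphas, features)]


random.seed(0)
# Toy "raw audio": a noisy sine wave standing in for a short utterance.
signal = [math.sin(0.05 * t) + 0.1 * random.gauss(0, 1) for t in range(400)]

kernel_sizes = [3, 9, 27]  # assumed scales, not taken from the paper
kernels = [[random.uniform(-1, 1) for _ in range(k)] for k in kernel_sizes]
attn_weights = [1.0, 1.0, 1.0]  # hypothetical attention parameters (untrained)

features = multi_scale_features(signal, kernels)
alphas, attended = attend(features, attn_weights)
print("attention weights per scale:", alphas)
```

In a trained network the filters and attention parameters would be learned, and the attended features would feed further convolutional layers and a classifier over speaker labels.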

Multi-Stride Self-Attention for Speech Recognition

Multi-layer Attention Mechanism for Speech Keyword Recognition

Self-Attention Networks for Text-Independent Speaker Verification

Self-Attention Transducers for End-to-End Speech Recognition

Progressive Multi-scale Self-supervised Learning for Speech Recognition

Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement

Pyramid Multi-branch Fusion DCNN with Multi-Head Self-Attention for Mandarin Speech Recognition

Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention

A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks

Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Improved BLSTM RNN Based Accent Speech Recognition Using Multi-task Learning and Accent Embeddings

Speech Emotion Recognition Using Multi-hop Attention Mechanism

CSMA-CNER: Multi-modal Chinese NER Task with Cross- and Self-Modality Attention

MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances

End-to-End Speech Recognition Model Based on Dilated Sparse Aware Network

Bidirectional Attention For Text-Dependent Speaker Verification

U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention

A stacked convolutional neural network framework with multi-scale attention mechanism for text-independent voiceprint recognition

Prosodic Structure Prediction Using Deep Self-attention Neural Network