Abstract: Biometric authorization using fingerprint, voiceprint, and facial features has recently attracted considerable public attention, driven by advances in recognition techniques and the popularization of the smartphone. Among these biometrics, voiceprint identifies an individual as distinctively as fingerprint and, like facial recognition, works in a noncontact mode. Speech signal processing is one of the keys to accuracy in voice recognition. Most voice-identification systems still employ the mel-scale frequency cepstrum coefficient (MFCC) as the key vocal feature. The quality and accuracy of the MFCC depend on the prepared phrase, which ties it to text-dependent speaker identification. In contrast, several newer features, such as the d-vector, treat vocal feature learning as a black-box process. To address these aspects, this study proposes a novel data-driven approach for vocal feature extraction based on a decision-support system (DSS). Using this DSS, each speech signal can be transformed into a vector representing its vocal features. Establishing the DSS involves three steps: (i) voice data preprocessing, (ii) hierarchical cluster analysis for the inverse discrete cosine transform cepstrum coefficient, and (iii) learning the E-vector through minimization of the Euclidean metric. We conduct comparative experiments to verify the E-vectors extracted by this DSS against other vocal feature measures, applying them to both text-dependent and text-independent datasets. In the experiments containing one utterance per speaker, the average accuracy of the E-vector improves by approximately 1.5% over the MFCC. In the experiments containing multiple utterances per speaker, the average micro-F1 score of the E-vector improves by approximately 2.1% over the MFCC. The E-vector shows remarkable advantages when applied to both the Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus and the LibriSpeech corpus. These improvements contribute to the capabilities of speaker identification and enhance its usability for more real-world identification tasks.
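The multi-utterance experiments are scored with the micro-F1 measure. As a minimal illustration of that metric only (not of the authors' evaluation code), the sketch below pools true positives, false positives, and false negatives over all speakers before computing a single F1 value. The function name `micro_f1` and the speaker labels are hypothetical, and the sketch assumes each utterance receives exactly one predicted speaker.

```python
from collections import Counter


def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1: pool TP/FP/FN over all speakers, then compute F1 once.

    Assumption for this sketch: every utterance has exactly one true speaker
    and exactly one predicted speaker.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1   # correct identification of this speaker
        else:
            fp[pred] += 1    # predicted speaker was wrongly credited
            fn[truth] += 1   # true speaker was missed

    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    denominator = 2 * total_tp + total_fp + total_fn
    return 2 * total_tp / denominator if denominator else 0.0


# Hypothetical labels for five utterances from three speakers.
true_speakers = ["spk1", "spk1", "spk2", "spk2", "spk3"]
pred_speakers = ["spk1", "spk2", "spk2", "spk2", "spk1"]
score = micro_f1(true_speakers, pred_speakers)
print(f"micro-F1 = {score:.2f}")  # 3 correct, 2 wrong: 6 / (6 + 2 + 2) = 0.60
```

Under the one-prediction-per-utterance assumption, every error counts once as a false positive and once as a false negative, so the pooled score equals plain accuracy. The two measures diverge only when an utterance can receive several labels or none.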

Data-Driven Decision-Support System for Speaker Identification Using E-Vector System
