Abstract: Automatic inference of paralinguistic information from speech, such as age, is an important area of research with many technological applications. Speaker age estimation can help with age-appropriate curation of information content and personalized interactive experiences. However, automatic speaker age estimation in children is challenging due to the paucity of speech data representing the developmental spectrum and to large signal variability, even within a given age group. Most prior approaches to child speaker age estimation adopt methods drawn directly from research on adult speech. In this paper, we propose a novel technique that exploits the temporal variability present in children's speech to estimate children's age. We focus on phone durations as a biomarker of children's age. Phone duration distributions are derived by forced-aligning children's speech with transcripts. Regression models are trained to predict speaker age among children from kindergarten through grade 10. Experiments on two children's speech datasets demonstrate the robustness and portability of the proposed features across multiple domains with varying signal conditions. The phonemes that contribute most to child speaker age estimation are analyzed and presented. Experimental results suggest that phone durations contain important development-related information about children. The proposed features are also suited to low-data scenarios.
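
The pipeline outlined in the abstract (forced alignment, per-phone duration statistics, regression to age) can be illustrated with a minimal sketch. This is not the authors' implementation: the phone inventory `PHONES`, the `(phone, start_sec, end_sec)` alignment format, the choice of mean and standard deviation as distribution summaries, and the use of ridge regression are all assumptions made for illustration, and the helper names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

import numpy as np

# Hypothetical phone inventory; a real system would use the aligner's phone set.
PHONES = ["AA", "IY", "S", "T"]


def duration_features(alignment, phones=PHONES):
    """Summarize one speaker's forced alignment as [mean, std] duration per phone.

    `alignment` is a list of (phone, start_sec, end_sec) tuples (assumed format).
    Phones the speaker never produced contribute zeros.
    """
    durations = defaultdict(list)
    for phone, start, end in alignment:
        durations[phone].append(end - start)
    feats = []
    for phone in phones:
        d = durations.get(phone, [])
        feats.extend([mean(d), pstdev(d)] if d else [0.0, 0.0])
    return np.array(feats)


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized bias term."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0  # do not shrink the bias
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def predict_age(weights, X):
    """Apply the fitted weights (last entry is the bias) to feature rows."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ weights
```

In use, one would stack `duration_features(...)` for each speaker into a matrix `X`, pair it with known ages or grade levels `y`, and call `fit_ridge(X, y)`. Summarizing each phone with two numbers keeps the feature vector small, which is one plausible reason such features might suit low-data settings.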

Investigating Efficient Feature Representation Methods and Training Objective for BLSTM-Based Phone Duration Prediction

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

Improve Speech Enhancement Using Perception-High-Related Time-Frequency Loss

An initial research: Towards accurate pitch extraction for speech synthesis based on BLSTM

Duration optimization of speaker adaptation in Mandarin TTS

Acoustic BPE for Speech Generation with Discrete Tokens

Improving Prosodic Boundaries Prediction For Mandarin Speech Synthesis By Using Enhanced Embedding Feature And Model Fusion Approach

Phoneme Dependent Speaker Embedding And Model Factorization For Multi-Speaker Speech Synthesis And Adaptation

BLSTM-CRF Based End-To-End Prosodic Boundary Prediction With Context Sensitive Embeddings In A Text-To-Speech Front-End

Learning Cross-Lingual Information with Multilingual BLSTM for Speech Synthesis of Low-Resource Languages

Combining Extreme Learning Machine And Decision Tree For Duration Prediction In HMM Based Speech Synthesis

Mandarin-English bilingual phone modeling and combining MPE based Discriminative training for cross-language speech recognition

Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration

Time-Frequency Cepstral Features and Combining Discriminative Training for Phonotactic Language Recognition

Improving Cross-Lingual Speech Synthesis with Triplet Training Scheme

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System

Total-Duration-Aware Duration Modeling for Text-to-Speech Systems

Phone modeling and combining discriminative training for Mandarin-English bilingual speech recognition

Phone duration modeling for speaker age estimation in children

Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

Expressive, Variable, and Controllable Duration Modelling in TTS