Abstract: Based on audio recordings made once a month during the first 12 months of a child's life, we propose a new method for clustering this set of vocalizations. We use a topologically augmented representation of the vocalizations, employing two persistence diagrams for each vocalization: one computed on the surface of its spectrogram and one on the Takens' embeddings of the vocalization. A synthetic persistent variable is derived for each diagram and added to the MFCCs (Mel-frequency cepstral coefficients). Using this representation, we fit a non-parametric Bayesian mixture model with a Dirichlet process prior to model the number of components. This procedure leads to a novel data-driven categorization of vocal productions. Our findings reveal the presence of 8 clusters of vocalizations, allowing us to compare their temporal distribution and acoustic profiles in the first 12 months of life.
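
To make the topological part of the representation more concrete, here is a minimal sketch of a Takens delay embedding followed by a 0-dimensional persistence computation on the resulting point cloud. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the `dim` and `delay` values, the restriction to connected components (H0) of a Vietoris-Rips filtration, and the use of total persistence as a one-number summary are all hypothetical choices. The abstract does not say how the synthetic persistent variable is actually derived.

```python
import math


def takens_embedding(signal, dim=3, delay=2):
    """Map a 1-D signal to points (x[t], x[t+delay], ..., x[t+(dim-1)*delay])."""
    n_points = len(signal) - (dim - 1) * delay
    if n_points <= 0:
        raise ValueError("signal too short for this dim/delay")
    return [
        tuple(signal[t + k * delay] for k in range(dim))
        for t in range(n_points)
    ]


def h0_persistence(points):
    """Death scales of connected components in a Vietoris-Rips filtration.

    Every point is born at scale 0. A component dies at the distance
    threshold where it first merges with another one (Kruskal's algorithm
    with union-find). The single component that never dies is omitted.
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    deaths = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(dist)
    return deaths


def total_persistence(deaths):
    """Hypothetical one-number summary: sum of lifetimes (all births are 0)."""
    return sum(deaths)


# Toy stand-in for a vocalization: three periods of a sine wave.
signal = [math.sin(2 * math.pi * t / 20) for t in range(60)]
cloud = takens_embedding(signal, dim=3, delay=5)
deaths = h0_persistence(cloud)
feature = total_persistence(deaths)
```

In a setup like the one the abstract describes, a scalar such as `feature` would be appended to the MFCC vector of the same vocalization. A full treatment would also track higher-dimensional features such as loops, which this sketch leaves out.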

What problem does this paper attempt to address?

This paper addresses how to cluster infant vocalizations without fixing the number of categories in advance, using a Dirichlet process mixture model fitted to a topologically augmented signal representation. The authors used monthly audio recordings of an infant during the first year of life and extracted a set of acoustic features from each vocalization. Each vocalization is described by two persistence diagrams, one computed on the surface of its spectrogram and one on its Takens embedding. A synthetic persistence variable derived from each diagram is combined with the mel-frequency cepstral coefficients (MFCCs) to form a low-dimensional representation.

The goal is a clustering method in which the number of categories is estimated from the data rather than predefined. A non-parametric Bayesian model, specifically a Dirichlet process mixture model, fitted to this representation yields eight distinct categories of vocalizations, which the authors then compare by temporal distribution and acoustic characteristics over the twelve months.

The data analysis shows how the production of each category varies over time and how the categories differ acoustically. For example, some categories appear early and gradually decrease while others appear later, which is consistent with known stages of infant language development. The topologically augmented representation also helps separate clusters with different acoustic features, revealing how the infant's control over the larynx and the articulatory organs develops within the year.

The study also notes its limitations: it neglects time dependence, and information may be lost, especially in constructing the synthetic persistence variables. Future work may need data from more children to establish a more comprehensive classification system, and an improved model that better captures time-series information and topological representations.
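
The idea that the number of categories is estimated rather than fixed can be illustrated with the Chinese restaurant process, a standard way of describing the partition prior induced by a Dirichlet process. The sketch below only samples cluster assignments from that prior. It does not fit a mixture model to data and is not the authors' inference procedure; the function name `sample_crp` and the concentration value `alpha=1.0` are assumptions made for illustration.

```python
import random


def sample_crp(n_items, alpha, rng):
    """Draw cluster labels from a Chinese restaurant process.

    Each new item joins existing cluster k with probability proportional
    to its current size, or opens a new cluster with probability
    proportional to alpha (alpha must be positive).
    """
    labels = []
    counts = []  # counts[k] = number of items currently in cluster k
    for _ in range(n_items):
        weights = counts + [alpha]  # last entry is the "new cluster" option
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts


rng = random.Random(0)
labels, counts = sample_crp(n_items=200, alpha=1.0, rng=rng)
n_clusters = len(counts)
```

Here `n_clusters` is an outcome of the sampling rather than an input. In a Dirichlet process mixture model, a prior of this kind is combined with a likelihood over the feature vectors, so the data decide how many components are actually occupied.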

Dirichlet process mixture model based on topologically augmented signal representation for clustering infant vocalizations
