Dirichlet process mixture model based on topologically augmented signal representation for clustering infant vocalizations

Guillem Bonafos, Clara Bourot, Pierre Pudlo, Jean-Marc Freyermuth, Laurence Reboul, Samuel Tronçon, Arnaud Rey
2024-07-08
Abstract: Based on audio recordings made once a month during the first 12 months of a child's life, we propose a new method for clustering this set of vocalizations. We use a topologically augmented representation of the vocalizations, employing two persistence diagrams for each vocalization: one computed on the surface of its spectrogram and one on the Takens' embeddings of the vocalization. A synthetic persistent variable is derived for each diagram and added to the MFCCs (Mel-frequency cepstral coefficients). Using this representation, we fit a non-parametric Bayesian mixture model with a Dirichlet process prior to model the number of components. This procedure leads to a novel data-driven categorization of vocal productions. Our findings reveal the presence of 8 clusters of vocalizations, allowing us to compare their temporal distribution and acoustic profiles in the first 12 months of life.
Applications, Sound, Audio and Speech Processing, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of clustering infant vocalizations with a Dirichlet process mixture model fitted on a topologically augmented signal representation. The authors use audio recordings of an infant made once a month during the first year of life. Each vocalization is described by two persistence diagrams, one computed on the surface of its spectrogram and one on its Takens embedding. A synthetic persistence variable is derived from each diagram and added to the Mel-frequency cepstral coefficients (MFCCs), giving a low-dimensional representation.

The goal is a clustering method that does not require the number of categories to be fixed in advance but estimates it from the data. To that end, the authors fit a non-parametric Bayesian mixture model with a Dirichlet process prior on the augmented representation. The fitted model yields eight distinct clusters of vocalizations, which allows the authors to compare how these clusters are distributed over the twelve months and how their acoustic profiles differ.

The analysis shows how the production of each category changes over time. Some categories appear early and gradually decrease, while others appear later, which is consistent with known stages of infant language development. The topologically augmented representation also helps separate clusters with different acoustic features, reflecting the infant's developing control over the larynx and the articulatory organs during the first year.

The study also notes its limitations: the model ignores time dependence, and information may be lost, particularly when the persistence diagrams are reduced to synthetic persistence variables. Future work may require data from more children to establish a more comprehensive classification, along with improvements to the model so that it better captures time-series information and the topological representation.
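Two ideas in the summary above can be illustrated with a minimal Python sketch: the Takens (delay) embedding of a one-dimensional signal, and the way a Dirichlet process prior leaves the number of clusters open. The sketch does not reproduce the authors' pipeline. The function names, the embedding dimension and delay, and the concentration parameter `alpha` are all assumptions chosen for illustration, and the second function only draws a partition from the Chinese restaurant process (the prior over partitions induced by a Dirichlet process); it performs no inference on data.

```python
import random


def takens_embedding(signal, dim=3, delay=2):
    """Map a 1-D signal to delay vectors (x[t], x[t+delay], ..., x[t+(dim-1)*delay]).

    dim and delay are illustrative defaults, not values taken from the paper.
    """
    n_points = len(signal) - (dim - 1) * delay
    if n_points <= 0:
        raise ValueError("signal too short for this dim and delay")
    return [
        tuple(signal[t + k * delay] for k in range(dim))
        for t in range(n_points)
    ]


def crp_assignments(n_items, alpha, rng):
    """Draw cluster labels from a Chinese restaurant process.

    alpha is the concentration parameter (hypothetical value in the usage below).
    """
    labels = []
    counts = []  # counts[k] = number of items already assigned to cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels


# Toy signal: 7 samples give 7 - (3 - 1) * 2 = 3 embedded points.
signal = [0, 1, 2, 3, 4, 5, 6]
points = takens_embedding(signal, dim=3, delay=2)

# The number of clusters is an outcome of the draw, not an input.
rng = random.Random(0)
labels = crp_assignments(200, alpha=1.0, rng=rng)
n_clusters = len(set(labels))
```

In the paper, a persistence diagram would be computed on the embedded point cloud, and the mixture model would be fitted to the augmented features; both steps require dedicated libraries and are left out of this sketch.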