Abstract: Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Although classifying or tagging sound events in domestic environments is an active research area, previous studies have focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or young children in the environment, or rely on data collected from only a single family, where noise from a fixed sound source can be moderate at the infant's position or vice versa. Thus, despite the recent success of large pre-trained models for noise event detection, the performance of these models on infant-centric noise soundscapes in the home has yet to be explored. To bridge this gap, we collected and labeled noises in home soundscapes from 22 families in an unobtrusive manner, using an infant-worn recording device. In this paper, we explore the performance of a large pre-trained model (Audio Spectrogram Transformer [AST]) on our noise-conditioned infant-centric environmental data as well as on publicly available home environmental datasets. We evaluate the model on sparse and imbalanced infant-centric data under several training strategies: resampling, training on public datasets, mixing public and infant-centric training sets, and data augmentation using noise and masking. Our results show that fine-tuning the model on our collected dataset combined with public datasets yields an F1-score of 0.84, up from 0.11 with public datasets alone and 0.76 with the collected dataset alone, and a Cohen's Kappa of 0.83, up from 0.013 and 0.77, respectively.
