Spectro-temporal acoustical markers differentiate speech from song across cultures

Philippe Albouy, Samuel A. Mehr, Roxane S. Hoyer, Jeremie Ginzburg, Yi Du, Robert J. Zatorre
DOI: https://doi.org/10.1101/2023.01.29.526133
2024-05-13
Abstract: Humans produce two forms of cognitively complex vocalizations: speech and song. It is debated whether these differ based primarily on culturally specific, learned features, or if acoustical features can reliably distinguish them. We study the spectro-temporal modulation patterns of vocalizations produced by 369 people living in 21 urban, rural, and small-scale societies across six continents. Specific ranges of spectral and temporal modulations, overlapping within categories and across societies, significantly differentiate speech from song. Machine-learning classification shows that this effect is cross-culturally robust: vocalizations are reliably classified solely from their spectro-temporal features across all 21 societies. Listeners unfamiliar with the cultures classify these vocalizations using spectro-temporal cues similar to those used by the machine-learning algorithm. Finally, spectro-temporal features are better able to discriminate song from speech than a broad range of other acoustical variables, suggesting that spectro-temporal modulation, a key feature of auditory neuronal tuning, accounts for a fundamental difference between these categories.
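To make "spectro-temporal modulation" concrete, the sketch below computes a modulation power spectrum as the squared magnitude of the 2D Fourier transform of a log-amplitude spectrogram. This is one common way to obtain such a representation and is not necessarily the authors' pipeline: the function name `modulation_power_spectrum`, the frame length, the hop size, and the linear frequency axis (which gives spectral modulation in cycles per Hz) are all assumptions made for illustration.

```python
import numpy as np

def modulation_power_spectrum(signal, sample_rate, frame_len=512, hop=128):
    """Illustrative modulation power spectrum (hypothetical helper).

    Returns (mps, temporal_mod, spectral_mod), where mps has shape
    (len(spectral_mod), len(temporal_mod)), temporal_mod is in Hz and
    spectral_mod is in cycles per Hz.
    """
    # Short-time Fourier transform with a Hann window.
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Log-amplitude spectrogram, shape (n_freq_bins, n_frames).
    spec = np.log(np.abs(np.fft.rfft(frames, axis=1)).T + 1e-10)
    spec -= spec.mean()  # remove the overall offset before the 2D transform

    # The 2D Fourier transform of the spectrogram gives joint
    # spectral (rows) and temporal (columns) modulations.
    mps = np.abs(np.fft.fftshift(np.fft.fft2(spec))) ** 2

    freq_step = sample_rate / frame_len  # Hz between spectrogram rows
    frame_step = hop / sample_rate       # seconds between spectrogram columns
    spectral_mod = np.fft.fftshift(np.fft.fftfreq(spec.shape[0], d=freq_step))
    temporal_mod = np.fft.fftshift(np.fft.fftfreq(spec.shape[1], d=frame_step))
    return mps, temporal_mod, spectral_mod
```

As a rough sanity check, noise whose amplitude is modulated at 4 Hz should concentrate energy near a temporal modulation of ±4 Hz along the zero spectral-modulation row. Features for a classifier could then be taken from selected regions of `mps`, though which regions matter for speech versus song is an empirical question the abstract addresses, not this sketch.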
Neuroscience