Can a Machine Distinguish High and Low Amount of Social Creak in Speech?

Anne-Maria Laukkanen, Sudarsana Reddy Kadiri, Shrikanth Narayanan, Paavo Alku

2024-10-22

Abstract: Objectives: Increased prevalence of social creak, particularly among female speakers, has been reported in several studies. The study of social creak has previously been conducted by combining perceptual evaluation of speech with conventional acoustical parameters such as the harmonic-to-noise ratio and cepstral peak prominence. In the current study, machine learning (ML) was used to automatically distinguish speech with a low amount of social creak from speech with a high amount of social creak. Methods: The amount of creak in continuous speech samples produced in Finnish by 90 female speakers was first perceptually assessed by two voice specialists. Based on their assessments, the speech samples were divided into two categories (low vs. high amount of creak). Using the speech signals and their creak labels, seven different ML models were trained. Three spectral representations were used as features for each model. Results: The best performance (accuracy of 71.1%) was obtained by the following two systems: an Adaboost classifier using the mel-spectrogram feature and a decision tree classifier using the mel-frequency cepstral coefficient feature. Conclusions: The study of social creak is becoming increasingly popular in sociolinguistic and vocological research. The conventional human perceptual assessment of the amount of creak is laborious, and ML technology could therefore be used to assist researchers studying social creak. The classification systems reported in this study could be considered as baselines in future ML-based studies on social creak.

Audio and Speech Processing, Artificial Intelligence, Computation and Language, Machine Learning, Sound

What problem does this paper attempt to address?

The problem that this paper attempts to solve is to **automatically distinguish high and low levels of social creak in speech**. The research background is that social creak (a rough, low-pitched voice quality) has been reported to be increasingly common, particularly among female speakers. Previous studies relied mainly on perceptual evaluation by human experts to quantify the amount of creak, but this method is time-consuming and easily influenced by personal bias. The paper therefore uses machine learning (ML) techniques to automatically distinguish speech with a low amount of social creak from speech with a high amount.

### Main research questions:

1. **Feasibility of automatic classification**: Can machine learning methods automatically distinguish between speech with low and high amounts of social creak?
2. **Best model selection**: Compare multiple existing machine learning classifiers and find the model that performs best in this classification task.
3. **Feature selection**: Explore the impact of different acoustic features (spectrogram, mel-spectrogram, and mel-frequency cepstral coefficients) on classification performance.

### Research methods:

- **Dataset**: Continuous speech samples produced in Finnish by 90 female speakers. Two voice specialists perceptually assessed the amount of creak, and the samples were divided into two categories (low vs. high amount of social creak) based on their assessments.
- **Feature extraction**: Three common spectral representations were used: spectrogram, mel-spectrogram, and mel-frequency cepstral coefficients (MFCCs).
- **Classifier**: Seven different machine learning models were trained and tested, including support vector machine (SVM), random forest (RF), multi-layer perceptron (MLP), logistic regression (LR), decision tree (DT), and Adaboost.
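
As a rough illustration of the mel-spectrogram feature named in the methods, the sketch below builds a triangular mel filterbank and applies it to one power-spectrum frame. It is not the authors' implementation: the mel formula (the common `2595 * log10(1 + f/700)` convention), the function names, and all parameter values (`n_mels=8`, `n_fft=64`, `sample_rate=16000`) are assumptions chosen for illustration, since the summary does not state the analysis settings.

```python
import math


def hz_to_mel(f_hz):
    # One common mel-scale convention (assumed here, not taken from the paper).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters over the n_fft // 2 + 1 FFT bins."""
    n_bins = n_fft // 2 + 1
    # n_mels + 2 edge frequencies, equally spaced on the mel scale
    # between 0 Hz and the Nyquist frequency.
    mel_max = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    # Centre frequency of every FFT bin in Hz.
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]

    filters = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:        # rising slope of the triangle
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:       # falling slope of the triangle
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        filters.append(row)
    return filters


def log_mel_frame(power_spectrum, filters, floor=1e-10):
    """Apply the filterbank to one power-spectrum frame and take the log."""
    energies = [sum(w * p for w, p in zip(row, power_spectrum)) for row in filters]
    return [math.log(max(e, floor)) for e in energies]


# Illustrative usage with a flat (all-ones) power spectrum of 33 bins.
filters = mel_filterbank(n_mels=8, n_fft=64, sample_rate=16000)
frame = log_mel_frame([1.0] * 33, filters)
```

Stacking such frames over time gives a log-mel-spectrogram; MFCCs are conventionally obtained by applying a discrete cosine transform to each log-mel frame.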
### Main results:

- **Best classification performance**: Two systems performed best, each with an accuracy of 71.1%: the Adaboost classifier with mel-spectrogram features and the decision tree classifier with MFCC features.
- **Feature comparison**: The mel-spectrogram feature generally provided higher classification accuracy than the spectrogram and MFCC features.

### Conclusions:

The research shows that machine learning algorithms can automatically distinguish between speech with low and high amounts of social creak to a certain extent, with an accuracy of about 70%. This result can serve as a baseline for future related research and as a reference for more complex classification tasks (such as distinguishing social creak in natural conversations).

### Formula display (example):

The feature extraction process involves some standard formulas, such as the discrete Fourier transform (DFT), which is typically computed with the fast Fourier transform (FFT) algorithm when calculating the spectrogram:

\[ X[k] = \sum_{n=0}^{N-1} x[n] e^{-j2\pi kn/N} \]

where $X[k]$ is the $k$-th frequency component, $x[n]$ is the time-domain signal, and $N$ is the number of samples.

Through these methods, the research shows the potential of machine learning in automatically identifying social creak.
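
The transform above can be checked numerically with a direct, unoptimized evaluation of the sum. The following is a minimal sketch for illustration only; the function name `dft` and the 8-sample test tone are assumptions, and a real spectrogram computation would use an FFT routine on windowed frames instead.

```python
import cmath
import math


def dft(x):
    """Directly evaluate X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


# A cosine completing exactly 2 cycles in N = 8 samples.
N = 8
x = [math.cos(2 * math.pi * 2 * n / N) for n in range(N)]
X = dft(x)

# Magnitude spectrum: the energy sits in bin k = 2 and its mirror bin k = N - 2.
magnitudes = [abs(value) for value in X]
```

For this real-valued tone, `magnitudes` is N/2 = 4 in bins 2 and 6 and numerically zero elsewhere, which is the pattern one column of a spectrogram would show for a steady tone.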
