Assessment and Classification of Singing Quality Based on Audio-Visual Features

Marigona Bokshi, Fei Tao, Carlos Busso, John H. L. Hansen
DOI: https://doi.org/10.1109/vcip.2017.8305078
2017-01-01
Abstract: Speech production differs between speaking and singing because of changes in excitation, vocal tract articulatory positioning, and cognitive motor planning. Singing not only deviates from typical speech, but also varies across singing styles, owing to differences in musical genre, individual singing quality, language, and culture. Because of this variation, it is important to establish a baseline system for differentiating between certain aspects of singing. In this study, we establish a classification system that automatically estimates the singing quality of candidates from an American TV singing show based on their singing acoustics and their lip and eye movements. We employ three classifiers, logistic regression, naive Bayes, and k-nearest neighbor (k-NN), and compare their performance using unimodal and multimodal features. We also compare performance across modalities (speech, lip, and eye structure). The results show that audio content performs best, with modest gains when lip and eye content are fused. An interesting outcome is that lip and eye content achieve an 82% quality assessment while audio achieves 95%. The ability to assess singing quality from lip and eye content at this level is remarkable.
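The comparison described in the abstract (three classifiers, each evaluated on unimodal and fused features) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic random features, the dimensionalities and class separations are arbitrary, fusion is done by simple feature concatenation, and scikit-learn stands in for whatever tooling the authors used. The abstract does not specify the feature extraction, the fusion method, or the evaluation protocol, so this is not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_clips = 200
# Hypothetical binary labels: 0 = lower singing quality, 1 = higher singing quality.
labels = rng.integers(0, 2, size=n_clips)


def synthetic_features(n_dims, separation):
    """Placeholder features whose class means differ by `separation`."""
    return rng.normal(size=(n_clips, n_dims)) + separation * labels[:, None]


# Assumed dimensionalities and separations; real features would come from
# acoustic analysis and lip/eye tracking, which are not sketched here.
audio = synthetic_features(10, 1.0)
lip = synthetic_features(6, 0.5)
eye = synthetic_features(4, 0.3)

# Unimodal and fused feature sets; fusion here is plain concatenation (an assumption).
feature_sets = {
    "audio": audio,
    "lip+eye": np.hstack([lip, eye]),
    "audio+lip+eye": np.hstack([audio, lip, eye]),
}

# The three classifier families named in the abstract, with default-like settings.
classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for feat_name, X in feature_sets.items():
    for clf_name, clf in classifiers.items():
        # Standardize features inside each fold, then classify.
        model = make_pipeline(StandardScaler(), clf)
        scores = cross_val_score(model, X, labels, cv=5)
        results[(feat_name, clf_name)] = scores.mean()

for (feat_name, clf_name), acc in sorted(results.items()):
    print(f"{feat_name:>14} | {clf_name:<19} | mean CV accuracy = {acc:.2f}")
```

The accuracies printed by this sketch reflect only the synthetic data and have no bearing on the 82% and 95% figures reported in the abstract.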