New Warped LPC-Based Feature for Fast and Robust Speech/Music Discrimination
J. E. M. Expósito, S. G. Galán, N. Ruiz-Reyes, P. Vera-Candeas, F. Rivas-Peña
ABSTRACT

Automatic discrimination of speech and music is an important tool in many multimedia applications. The paper presents a low-complexity but effective approach for speech/music discrimination, which exploits only one simple feature, called Warped LPC-based Spectral Centroid (WLPC-SC). A three-component Gaussian Mixture Model (GMM) classifier is used because it showed slightly better performance than other Statistical Pattern Recognition (SPR) classifiers. A comparison between WLPC-SC and the timbral features proposed in [11] is performed, aiming to assess the discriminatory power of the proposed feature. Experimental results reveal that our speech/music discriminator is robust and fast, making it suitable for real-time multimedia applications.

1. INTRODUCTION

Automatic discrimination between speech and music has become a research topic of interest in the last few years. Several approaches have been described in the recent literature for different applications [1][2][3][4][5]. Each of these uses different features and pattern classification techniques and describes results on different material.

Saunders [1] proposed a real-time speech/music discriminator, which was used to automatically monitor the audio content of FM audio channels. Four statistical features on the zero-crossing rate and one energy-related feature were extracted and a multivariate Gaussian classifier was applied, which resulted in an accuracy of 98%.

In Automatic Speech Recognition (ASR) of broadcast news, it is desirable to disable the input to the speech recognizer during the non-speech portion of the audio stream. Scheirer and Slaney [2] developed a speech/music discrimination system for ASR of audio sound tracks.
Thirteen features characterizing distinct properties of speech and music, and three classification schemes (MAP Gaussian, GMM and k-NN classifiers), were exploited, resulting in an accuracy of over 90%.

Another application that can benefit from distinguishing speech from music is low bit-rate audio coding. Designing a universal coder that reproduces both speech and music well is the best approach. However, it is not a trivial problem. An alternative approach is to design a multi-mode coder that can accommodate different signals. The appropriate module is selected using the output of a speech/music classifier [6][7].

Automatic discrimination of speech and music is an important tool in many multimedia applications. Khaled El-Maleh et al. [3] combined the line spectral frequencies and zero-crossings-based features for frame-level narrowband speech/music discrimination. The classification system operates using only a frame delay of 20 ms, making it suitable for real-time multimedia applications. An emerging multimedia application is content-based indexing and retrieval of audiovisual data. Audio content analysis is an important task for such applications [8]. Minami et al. [9] proposed an audio-based approach to video indexing, where a speech/music detector is used to help users browse a video database.

A comparative view of the value of different types of features in speech/music discrimination is provided in [10], where four types of features (amplitudes, cepstra, pitch and zero-crossings) are compared for discriminating speech and music signals. Experimental results showed that cepstra and delta cepstra give the best performance. Mel Frequency Spectral or Cepstral Coefficients (MFSC or MFCC) are very often used features for audio classification tasks, providing quite good results. In [4], MFSC's first-order statistics are combined with neural networks to form a speech/music classifier that is able to generalize from a small amount of learning data.
MFCC are a compact representation of the spectrum of an audio signal taking into account the nonlinear human perception of pitch, as described by the mel scale. They are one of the most used features in speech recognition and have recently been proposed for musical genre classification of audio signals [11][12].

Unlike the previous works, speech/music discrimination approaches based on only one type of feature are presented in [13] and [5], which result in fast and robust classification systems. The approach in [13] takes psychoacoustic knowledge into account in that it uses the low-frequency modulation amplitudes over 20 critical bands to form a good discriminator for the task, while the approach in [5] exploits a new energy-related feature, called modified low energy ratio, that improves the results obtained with the classical low energy ratio.

In this paper, we present our contribution to the design of a robust speech/music discrimination system. The paper presents a low-complexity but effective approach, which also exploits only one simple feature, called Warped