A Mid-Level Scene Change Representation Via Audiovisual Alignment

Jinqiao Wang, Lingyu Duan, Hanqing Lu, Jesse S. Jin, Changsheng Xu
DOI: https://doi.org/10.1109/ICASSP.2006.1660366
2006-01-01
Abstract: A scene is a series of semantically correlated video shots. Effective scene detection depends, to some degree, on domain knowledge. Most existing approaches try to detect scene changes directly by applying clustering or supervised learning to low-level audiovisual features. However, robustly detecting diverse scene changes that arise from complex semantics remains a challenging problem. In this paper we focus on associating visual signal changes (e.g. cuts, fade-ins, fade-outs) with audio signal changes (e.g. speaker changes, background music changes) to propose a mid-level scene change representation. The representation is meant to locate candidate scene change points by characterizing the temporally uncorrelated properties of the audio and visual tracks when a scene change occurs. By incorporating domain knowledge, enhanced features can be extracted to complement this representation and bridge the semantic gap towards scene change detection. We use a camera motion estimation algorithm to detect visual signal changes, and the resulting visual change positions are selected as time-stamp points. An alignment is then performed to search for candidate audio signal change positions by computing the multi-scale Kullback-Leibler (K-L) distance. Both a metric-based K-L distance approach and a model-based HMM are applied to determine true audio signal changes. The associated visual and audio signal changes together form the mid-level scene change representation. This representation has been successfully applied to detect the boundaries of individual commercials in TV broadcast streams with an accuracy of around 95%. In particular, the systematic alignment approach can be utilized in video summarization.
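To make the metric-based step concrete, the following is a minimal sketch of a multi-scale K-L distance computed around a visual change time-stamp. It is not the authors' implementation: the abstract does not specify the audio features, the window sizes, or the distribution model, so the diagonal-Gaussian model, the symmetric form of the distance, the function names `symmetric_kl_diag_gauss` and `multiscale_kl`, and the `scales` values are all assumptions made for illustration.

```python
import numpy as np


def symmetric_kl_diag_gauss(x, y, eps=1e-6):
    """Symmetric K-L distance between diagonal Gaussians fitted to x and y.

    x and y are (frames, dims) arrays of audio features. The Gaussian
    model is an assumption; the abstract does not name one.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    var_x = x.var(axis=0) + eps  # eps guards against zero variance
    var_y = y.var(axis=0) + eps
    d2 = (mu_x - mu_y) ** 2
    # KL(p||q) + KL(q||p); the log-variance terms cancel in the sum.
    return 0.5 * np.sum(
        var_x / var_y + var_y / var_x - 2.0 + d2 * (1.0 / var_x + 1.0 / var_y)
    )


def multiscale_kl(features, t, scales=(20, 40, 80)):
    """Distance between the windows left and right of frame t, per scale.

    `scales` holds hypothetical window lengths in frames.
    """
    scores = {}
    for w in scales:
        left = features[max(0, t - w):t]
        right = features[t:t + w]
        if len(left) < 2 or len(right) < 2:
            continue  # too few frames to estimate a variance
        scores[w] = symmetric_kl_diag_gauss(left, right)
    return scores


# Synthetic example: the feature statistics shift at frame 100.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 4)),
    rng.normal(3.0, 1.0, size=(100, 4)),
])
at_change = multiscale_kl(features, 100)
away_from_change = multiscale_kl(features, 50)
```

In this sketch, a visual change position would be passed as `t`, and a large distance at one or more scales would mark a candidate audio change near that time-stamp. How the scores are thresholded or combined with the HMM stage is not described in the abstract and is left open here.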