Multimodal Saliency and Fusion for Movie Summarization Based on Aural, Visual, and Textual Attention
Georgios Evangelopoulos, Athanasia Zlatintsi, Alexandros Potamianos, Petros Maragos, Konstantinos Rapantzikos, Georgios Skoumas, Yannis Avrithis
DOI: https://doi.org/10.1109/tmm.2013.2267205
IF: 7.3
2013-11-01
IEEE Transactions on Multimedia
Abstract: Multimodal streams of sensory information are naturally parsed and integrated by humans using signal-level feature extraction and higher-level cognitive processes. Detection of attention-invoking audiovisual segments is formulated in this work on the basis of saliency models for the audio, visual, and textual information conveyed in a video stream. Aural or auditory saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color, and orientation. Textual or linguistic saliency is extracted from part-of-speech tagging on the subtitle information available with most movie distributions. The individual saliency streams, obtained from modality-dependent cues, are integrated in a multimodal saliency curve, modeling the time-varying perceptual importance of the composite video stream and signifying prevailing sensory events. The multimodal saliency representation forms the basis of a generic, bottom-up video summarization algorithm. Different fusion schemes are evaluated on a movie database of multimodal saliency annotations with comparative results provided across modalities. The produced summaries, based on low-level features and content-independent fusion and selection, are of subjectively high aesthetic and informative quality.
computer science, information systems, telecommunications, software engineering
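
The abstract describes integrating per-modality saliency streams into a single multimodal saliency curve and then selecting content for a summary. The following is a minimal sketch of that idea, not the authors' method: the abstract does not specify the fusion schemes, so a weighted linear combination of min-max normalized curves is assumed here, and the function names (`normalize`, `fuse_linear`, `select_frames`), the equal default weights, and the top-fraction selection rule are all hypothetical.

```python
def normalize(curve):
    """Min-max scale a saliency curve to [0, 1]; a constant curve maps to zeros."""
    lo, hi = min(curve), max(curve)
    if hi == lo:
        return [0.0 for _ in curve]
    return [(x - lo) / (hi - lo) for x in curve]


def fuse_linear(aural, visual, textual, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed fusion scheme: weighted sum of the normalized per-modality curves.

    All three curves must have one value per frame and equal length.
    """
    streams = [normalize(s) for s in (aural, visual, textual)]
    return [
        sum(w * s[t] for w, s in zip(weights, streams))
        for t in range(len(aural))
    ]


def select_frames(saliency, fraction):
    """Assumed selection rule: keep the top `fraction` of frames by fused saliency.

    Returns the kept frame indices in temporal order.
    """
    keep = max(1, round(fraction * len(saliency)))
    ranked = sorted(range(len(saliency)), key=lambda t: saliency[t], reverse=True)
    return sorted(ranked[:keep])


# Toy per-frame saliency values (made up for illustration only).
aural = [0, 1, 2, 3]
visual = [3, 2, 1, 0]
textual = [0, 0, 1, 0]

fused = fuse_linear(aural, visual, textual)
summary_frames = select_frames(fused, fraction=0.25)
```

In this toy input the aural and visual curves pull in opposite directions, so the frame where the textual stream also peaks ends up with the highest fused value. A practical summarizer would likely operate on segments rather than single frames and smooth the fused curve first; those steps are omitted here.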