Video Content Analysis and Its Applications for Multimedia Authoring of Presentations
Ting-Chuen Pong, Chong-Wah Ngo, Feng Wang
DOI: https://doi.org/10.14711/thesis-b938188
2006-01-01
Abstract: Video content analysis has attracted growing attention from researchers in recent years, owing to the increasing availability of digital video data. In this work, we address the problem of video content analysis by extracting three high-level features from videos, namely text, gesture and head posture, and employ them in several applications for multimedia authoring of presentations based on video understanding. For text analysis, we address the problem of text recognition in low-resolution videos. A novel algorithm for video text super-resolution is proposed, which reconstructs high-resolution textboxes by integrating multiple frames. Our experiments show that text recognition is significantly improved after super-resolution. For gesture detection and recognition, we propose algorithms for both offline and real-time applications. In the former, to deal with the lack of salient features in gesture detection, different cues including frame difference, skin color and gesture trajectory are combined to detect candidate gestures. HMM (Hidden Markov Model) based gesture recognition is then employed to refine the results of gesture detection and extract intentional gestures. For real-time applications, to meet efficiency requirements in addition to accuracy, the HMMs for complete gesture recognition are modified to recognize incomplete gestures, so that a gesture can be identified before the complete trajectory is observed. Speech is combined with visual cues to further improve the accuracy and responsiveness of gesture detection. For head posture, two different algorithms are proposed to estimate the face orientation. The first one, which employs visual cues and image processing techniques, is more appropriate for offline applications. In the second algorithm, besides visual cues, we focus more on effectively exploiting contextual information, i.e. the temporal smoothness of head movement, to refine the pose estimation.
This is especially useful for low-resolution images, where direct estimation from a single image is not reliable enough. We propose an adaptive online learning approach to deal with different presenting styles. The second algorithm is efficient enough for most real-time applications.
Based on the video content analysis, we employ the extracted features to develop several applications, including the synchronization of video and external documents based on text analysis, offline video enhancement and editing by integrating gesture, posture and text, and a simulated smartboard to show the effectiveness and efficiency of the proposed algorithms. Specifically for video editing, a novel gesture- and posture-driven editing approach is proposed to trace the flow of lecturing by attending to the focus of lecturing at any moment. Meanwhile, aesthetic elements, which outline the general and basic rules for selecting and adjoining various views of the focuses, are taken into account to generate an appropriate rhythm for showing the dynamic interactions between the presenter and the focuses. To improve the readability of projected and handwritten words in the edited video, two approaches are also proposed to enhance the visibility of text on the LCD-projected screen and the whiteboard, respectively.
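The abstract's idea of using temporal smoothness of head movement to refine per-frame pose estimates can be illustrated with a small sketch. This is not the thesis's algorithm, which the abstract does not specify. It is a plain exponential moving average used as a stand-in, and the function name `smooth_pose`, the parameter `alpha`, and the use of yaw angles in degrees are all assumptions made for illustration.

```python
def smooth_pose(angles, alpha=0.5):
    """Exponentially smooth per-frame head-yaw estimates (degrees).

    Hypothetical illustration only: each output blends the current noisy
    estimate with the previous smoothed value, so a single unreliable
    frame cannot move the estimate far. Assumes angles do not wrap
    around (e.g. yaw stays within [-90, 90]).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    prev = None
    for angle in angles:
        # The first frame has no history, so it is taken as-is.
        prev = angle if prev is None else alpha * angle + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


if __name__ == "__main__":
    # A spurious 40-degree reading in an otherwise steady sequence is damped.
    print(smooth_pose([10.0, 10.0, 40.0, 10.0, 10.0], alpha=0.3))
```

A smaller `alpha` trusts the history more, which suits low-resolution input where single-frame estimates are noisy, at the cost of a slower response to genuine head turns.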