Abstract: Video content analysis has attracted growing attention from researchers in recent years, owing to the increasing amount of digital video data available. In this work, we address the problem of video content analysis by extracting three high-level features from videos, namely text, gesture and head posture, and employing them in several applications for multimedia authoring of presentations based on video understanding. For text analysis, we address the problem of text recognition in low-resolution videos. A novel algorithm for video text super-resolution is proposed, which reconstructs high-resolution textboxes by integrating multiple frames. Our experiments show that text recognition is significantly improved after super-resolution. For gesture detection and recognition, we propose algorithms for both offline and real-time applications. In the offline case, to deal with the lack of salient features in gesture detection, different cues including frame difference, skin color and gesture trajectory are combined to detect candidate gestures. HMM (Hidden Markov Model) based gesture recognition is then employed to refine the results of gesture detection and extract intentional gestures. For real-time applications, where efficiency matters as well as accuracy, the HMM models for complete gesture recognition are modified to recognize incomplete gestures, so that a gesture can be identified before its complete trajectory is observed. Speech is combined with visual cues to further improve the accuracy and responsiveness of gesture detection. For head posture, two different algorithms are proposed to estimate face orientation. The first, which relies on visual cues and image processing techniques, is more appropriate for offline applications. The second, besides using visual cues, focuses on effectively exploiting contextual information, i.e. the temporal smoothness of head movement, to refine the pose estimation. This is especially useful for low-resolution images, where direct estimation from a single image is not reliable enough. We propose an adaptive online learning approach to deal with different presenting styles. The second algorithm is efficient enough for most real-time applications.

Based on the video content analysis, we employ the extracted features to develop several applications: the synchronization of video and external documents based on text analysis; offline video enhancement and editing by integrating gesture, posture and text; and a simulated smartboard that demonstrates the effectiveness and efficiency of the proposed algorithms. Specifically for video editing, a novel gesture- and posture-driven editing approach is proposed to trace the flow of lecturing by attending to the focus of lecturing at any moment. Meanwhile, aesthetic elements, which outline the general and basic rules for selecting and adjoining various views of the focuses, are taken into account to generate an appropriate rhythm for showing the dynamic interactions between the presenter and the focuses. To improve the readability of the projected and handwritten words in the edited video, two approaches are also proposed to enhance the visibility of text on the LCD-projected screen and on the whiteboard, respectively.
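The idea of identifying a gesture before its complete trajectory is observed can be illustrated with a small sketch. The abstract does not say how the HMMs are modified, so the code below is not the method of this work. It is a generic stand-in that scores each growing prefix of a trajectory with the standard forward algorithm and commits to a label once one model leads by a log-likelihood margin. The names `forward_loglik`, `early_recognize` and `MODELS`, the four quantized direction symbols, the toy probabilities and the `margin` and `min_len` parameters are all assumptions made for illustration.

```python
import math


def forward_loglik(obs, start, trans, emit):
    """Log-likelihood of an observation prefix under a discrete HMM,
    computed with the scaled forward algorithm."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transitions, then weight by emission.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return loglik


def early_recognize(obs, models, margin=2.0, min_len=3):
    """Return (label, frames_used) as soon as the best model beats the
    runner-up by `margin` in log-likelihood; (None, len(obs)) otherwise."""
    for t in range(min_len, len(obs) + 1):
        scores = sorted(
            ((forward_loglik(obs[:t], *params), name)
             for name, params in models.items()),
            reverse=True,
        )
        if scores[0][0] - scores[1][0] >= margin:
            return scores[0][1], t
    return None, len(obs)


# Hypothetical toy models over quantized motion directions:
# 0 = right, 1 = up, 2 = left, 3 = down.
START = [1.0, 0.0]
TRANS = [[0.7, 0.3], [0.0, 1.0]]  # two-state left-to-right topology
MODELS = {
    "swipe_right": (START, TRANS, [[0.85, 0.05, 0.05, 0.05]] * 2),
    "swipe_up": (START, TRANS, [[0.05, 0.85, 0.05, 0.05]] * 2),
}

label, frames_used = early_recognize([0, 0, 0, 0, 0, 0], MODELS)
print(label, frames_used)
```

In this toy setting a steady rightward motion is labeled after the minimum prefix length rather than after all six frames, while a leftward motion that neither model explains well never reaches the margin and is left unlabeled. A real system would trade responsiveness against false alarms by tuning the margin and the minimum prefix length.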

Content Extraction from Lecture Video via Speaker Action Classification Based on Pose Information

Lecture Video Enhancement and Editing by Integrating Posture, Gesture, and Text

Content Based Lecture Video Retrieval Using Speech and Video Text Information

Student Action Recognition Based on Multiple Features

Structuring Lecture Videos by Automatic Projection Screen Localization and Analysis

Realistic Speech-Driven Talking Video Generation with Personalized Pose

Exploiting Self-Adaptive Posture-Based Focus Estimation for Lecture Video Editing

Structuring Lecture Videos for Distance Learning Applications

Video Content Analysis and Its Applications for Multimedia Authoring of Presentations

Pose-aware video action segmentation

Synchronization of Lecture Videos and Electronic Slides by Video Text Analysis.

Learning realistic human actions from movies.

Online learnable keyframe extraction in videos and its application with semantic word vector in action recognition

Structuring Low-Quality Videotaped Lectures for Cross-Reference Browsing by Video Text Analysis

Accurate Key Frame Extraction Algorithm of Video Action for Aerobics Online Teaching

Understanding Action Sequences based on Video Captioning for Learning-from-Observation

Speech-Section Extraction Using Lip Movement and Voice Information in Japanese

Simulating a Smartboard by Real-Time Gesture Detection in Lecture Videos

Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses.

Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs

Automatic Generation of Labeled Data for Video-Based Human Pose Analysis via NLP applied to YouTube Subtitles