Abstract: Video content analysis has attracted growing attention from researchers in recent years, owing to the increasing amount of available digital video data. In this work, we address the problem of video content analysis by extracting three high-level features from videos, namely text, gesture and head posture, and employ them in several applications for multimedia authoring of presentations based on video understanding. For text analysis, we address the problem of text recognition in low-resolution videos. A novel algorithm for video text super-resolution is proposed, which reconstructs high-resolution textboxes by integrating multiple frames. Our experiments show that text recognition is significantly improved after super-resolution. For gesture detection and recognition, we propose algorithms for both offline and real-time applications. In the former, to deal with the lack of salient features in gesture detection, different cues including frame difference, skin color and gesture trajectory are combined to detect candidate gestures. HMM (Hidden Markov Model) based gesture recognition is then employed to refine the results of gesture detection and extract intentional gestures. For real-time applications, where efficiency is required in addition to accuracy, the HMM models for complete gesture recognition are modified to recognize incomplete gestures, so that a gesture can be identified before the complete trajectory is observed. Speech is combined with visual cues to further improve the accuracy and responsiveness of gesture detection. For head posture, two different algorithms are proposed to estimate the face orientation. The first is more appropriate for offline applications and relies on visual cues and image processing techniques. In the second algorithm, besides visual cues, we focus on effectively exploiting contextual information, i.e. the temporal smoothness of head movement, to refine the pose estimation. This is especially useful for low-resolution images, where direct estimation from a single image is not reliable enough. We propose an adaptive online learning approach to deal with different presenting styles. The second algorithm is efficient enough for most real-time applications.

Based on the video content analysis, we employ the extracted features to develop several applications, including the synchronization of video and external documents based on text analysis, offline video enhancement and editing by integrating gesture, posture and text, and a simulated smartboard that demonstrates the effectiveness and efficiency of the proposed algorithms. Specifically for video editing, a novel gesture- and posture-driven editing approach is proposed to trace the flow of lecturing by attending to the focus of the lecture at any moment. Meanwhile, aesthetic elements, which outline the general and basic rules for selecting and adjoining various views of the focuses, are taken into account to generate an appropriate rhythm for showing the dynamic interactions between the presenter and the focuses. To improve the readability of the projected and handwritten words in the edited video, two approaches are also proposed to enhance the visibility of text on the LCD-projected screen and on the whiteboard, respectively.
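The idea of identifying a gesture before its complete trajectory is observed can be illustrated with a small sketch. The abstract does not describe how the HMMs are modified, so the following is only one plausible reading, not the authors' method: each gesture model scores the observed prefix with the forward algorithm, summing over all states rather than requiring the final state, and a decision is made as soon as the best model leads the runner-up by a log-likelihood margin. The two toy models, the four-symbol direction alphabet (0 = right, 1 = up, 2 = left, 3 = down), the function names and the margin value are all hypothetical.

```python
import math

def prefix_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs prefix | model), summed over all states."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_like += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_like

# Hypothetical left-to-right models over quantized stroke directions.
RIGHT = [0.85, 0.05, 0.05, 0.05]
UP = [0.05, 0.85, 0.05, 0.05]
DOWN = [0.05, 0.05, 0.05, 0.85]
LEFT_TO_RIGHT = [[0.7, 0.3], [0.0, 1.0]]

MODELS = {
    "right_then_up": {"start": [1.0, 0.0], "trans": LEFT_TO_RIGHT, "emit": [RIGHT, UP]},
    "right_then_down": {"start": [1.0, 0.0], "trans": LEFT_TO_RIGHT, "emit": [RIGHT, DOWN]},
}

def early_decision(stream, models, margin=1.0):
    """Return (label, frames_used) at the first prefix where the best model
    beats the runner-up by `margin` in log-likelihood, else (None, len(stream))."""
    for t in range(1, len(stream) + 1):
        scores = sorted(
            ((prefix_log_likelihood(stream[:t], **m), name) for name, m in models.items()),
            reverse=True,
        )
        if scores[0][0] - scores[1][0] >= margin:
            return scores[0][1], t
    return None, len(stream)

if __name__ == "__main__":
    print(early_decision([0, 0, 1, 1, 1], MODELS))
```

With these toy parameters, the two models are indistinguishable while the hand only moves right, and the decision is taken on the third frame, when the first upward movement appears, before the remaining frames of the stroke are seen. A larger margin trades responsiveness for fewer premature decisions.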

Lecture Video Enhancement and Editing by Integrating Posture, Gesture, and Text

Video Content Analysis and Its Applications for Multimedia Authoring of Presentations

Gesture Tracking and Recognition for Lecture Video Editing

Exploiting Self-Adaptive Posture-Based Focus Estimation for Lecture Video Editing

Synchronization of Lecture Videos and Electronic Slides by Video Text Analysis

Simulating a Smartboard by Real-Time Gesture Detection in Lecture Videos

Structuring Lecture Videos for Distance Learning Applications

Structuring Low-Quality Videotaped Lectures for Cross-Reference Browsing by Video Text Analysis

Structuring Lecture Videos by Automatic Projection Screen Localization and Analysis

Prediction-Based Gesture Detection in Lecture Videos by Combining Visual, Speech and Electronic Slides

GestureLens: Visual Analysis of Gestures in Presentation Videos

A multi-purpose automatic editing system based on lecture semantics for remote education

Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor

Content Extraction from Lecture Video via Speaker Action Classification Based on Pose Information

Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity Summaries

Automating camera management for lecture room environments

ExpressEdit: Video Editing with Natural Language and Sketching

DeepFaceVideoEditing

Developing a Lecture Video Recording System Using Augmented Reality

Videography for Telepresentations

A New Visual Interface for Searching and Navigating Slide-Based Lecture Videos