Video sketch: A middle-level representation for action recognition

Xing-Yuan Zhang, Ya-Ping Huang, Yang Mi, Yan-Ting Pei, Qi Zou, Song Wang
DOI: https://doi.org/10.1007/s10489-020-01905-y
IF: 5.3
2020-11-06
Applied Intelligence
Abstract: Different modalities extracted from videos, such as RGB and optical flow, may provide complementary cues for improving video action recognition. In this paper, we introduce a new modality named video sketch, which captures human shape information, as a complementary modality for video action representation. We show that video action recognition can be enhanced by using the proposed video sketch. More specifically, we first generate video sketches with class-distinctive action areas and then employ a two-stream network to combine the shape information extracted from image-based and point-based sketches, fusing the classification scores of the two streams to generate a shape representation for videos. Finally, we use the shape representation as a complement to the traditional appearance (RGB) and motion (optical flow) representations for the final video classification. We conduct extensive experiments on the human action recognition datasets KTH, HMDB51, UCF101, Something-Something, and UTI. The experimental results show that the proposed method outperforms the existing state-of-the-art action recognition methods.
computer science, artificial intelligence
What problem does this paper attempt to address?
The paper addresses how to use a new modality, the video sketch, to improve video action recognition. Specifically:

- **Introducing a new modality**: The paper introduces video sketches as a modality that complements the traditional appearance (RGB) and motion (optical flow) modalities, with the aim of improving recognition accuracy.
- **Generating attention-guided sketches**: The method generates video sketches that contain class-discriminative action regions, uses a two-stream network to extract shape information from image-based and point-based sketches, and fuses the classification scores of the two streams into a shape representation of the video.
- **Combining multiple modalities**: The shape representation is combined with the traditional appearance and motion representations for the final video classification.
- **Addressing two key issues**:
  - How to ignore irrelevant background information while generating action sketches, retaining only action-related information.
  - How to learn distinctive and powerful features from video sketches, given that they consist of highly abstract lines.

The main contribution of the paper is a new architecture called the Video Sketch Action Network (VSA-Net), which adaptively learns sketch-based, action-related video representations and achieves state-of-the-art performance on several challenging action recognition benchmark datasets.
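The score-level fusion mentioned in the summary can be illustrated with a small sketch. The summary does not say how the scores are combined, so the weighted averaging below is an assumption, not the authors' method. The function names (`softmax`, `fuse_scores`, `predict`), the toy logits, and the equal weights are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into class probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(stream_scores, weights=None):
    """Weighted average of per-class score vectors, one vector per stream.

    Assumption: fusion is a weighted mean; the source only states that
    classification scores are fused.
    """
    if not stream_scores:
        raise ValueError("need at least one stream")
    num_classes = len(stream_scores[0])
    if any(len(s) != num_classes for s in stream_scores):
        raise ValueError("all streams must score the same classes")
    if weights is None:
        weights = [1.0] * len(stream_scores)
    if len(weights) != len(stream_scores):
        raise ValueError("one weight per stream is required")
    total_weight = sum(weights)
    return [
        sum(w * s[c] for w, s in zip(weights, stream_scores)) / total_weight
        for c in range(num_classes)
    ]


def predict(scores):
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy logits for three action classes (made-up numbers, not paper results).
image_sketch = softmax([2.0, 0.5, 0.1])  # image-based sketch stream
point_sketch = softmax([1.0, 1.5, 0.2])  # point-based sketch stream

# Step 1: fuse the two sketch streams into a shape representation.
shape = fuse_scores([image_sketch, point_sketch])

# Step 2: combine shape with appearance (RGB) and motion (optical flow).
rgb = softmax([0.3, 2.2, 0.1])
flow = softmax([1.8, 0.4, 0.2])
final = fuse_scores([shape, rgb, flow], weights=[1.0, 1.0, 1.0])

label = predict(final)
```

In a setup like this, the per-stream weights would typically be tuned on a validation set; equal weights are used here only to keep the example simple.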