Learning Temporal Information and Object Relation for Zero-Shot Action Recognition

Qiuping Qi, Hanli Wang, Taiyi Su, Xianhui Liu
DOI: https://doi.org/10.1016/j.displa.2022.102177
IF: 3.074
2022-01-01
Displays
Abstract: Zero-shot action recognition remains challenging. Current approaches use the names or classification scores of detected objects to model object relations in images, so recognition performance depends heavily on the accuracy of object classification. Humans, by contrast, can infer unseen action categories from visual knowledge such as motion patterns and object relations. This work proposes a novel model for zero-shot action recognition that jointly captures object relations within a single static frame and models temporal motion patterns across adjacent frames. Specifically, an object detector first detects objects and extracts their features. Graph convolutions are then applied to these features to exploit the relations among objects. Meanwhile, three-dimensional convolutional neural networks model temporal information. Finally, the two outputs are fed separately into visual-to-semantic modules that project the visual features into the semantic space. In addition, a prior knowledge learning method is devised to introduce visual commonsense knowledge with the help of an external dataset. Extensive experiments on three benchmark datasets (Olympic Sports, HMDB51, and UCF101) demonstrate the superiority of the proposed model over state-of-the-art methods.
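The object-relation branch and the visual-to-semantic step described in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the symmetric adjacency normalization (a common graph convolution formulation), the fully connected object graph, mean pooling over objects, the linear projection, cosine-similarity matching, and all names and dimensions are hypothetical. The three-dimensional convolutional branch and the prior knowledge learning method are omitted.

```python
import numpy as np

def normalized_adjacency(adj):
    """Add self-loops, then apply symmetric normalization D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(features, adj, weight):
    """One graph convolution layer with ReLU over per-object features."""
    return np.maximum(normalized_adjacency(adj) @ features @ weight, 0.0)

def zero_shot_predict(visual, projection, class_embeddings):
    """Project a visual feature into the semantic space and return the
    index of the most similar class embedding (cosine similarity)."""
    semantic = visual @ projection
    semantic = semantic / (np.linalg.norm(semantic) + 1e-12)
    norms = np.linalg.norm(class_embeddings, axis=1, keepdims=True) + 1e-12
    scores = (class_embeddings / norms) @ semantic
    return int(np.argmax(scores)), scores

# Hypothetical sizes and random stand-ins for learned weights (assumptions).
rng = np.random.default_rng(0)
num_objects, feat_dim, hidden_dim, sem_dim, num_classes = 4, 8, 6, 5, 3

# Stand-in for features extracted by an object detector from one frame.
object_features = rng.normal(size=(num_objects, feat_dim))
# Assumed fully connected object graph without self-loops.
adjacency = np.ones((num_objects, num_objects)) - np.eye(num_objects)

gcn_weight = rng.normal(size=(feat_dim, hidden_dim))
relation_features = graph_conv(object_features, adjacency, gcn_weight)
frame_feature = relation_features.mean(axis=0)  # assumed pooling over objects

projection = rng.normal(size=(hidden_dim, sem_dim))
unseen_class_embeddings = rng.normal(size=(num_classes, sem_dim))
predicted, scores = zero_shot_predict(frame_feature, projection, unseen_class_embeddings)
```

In a trained model the projection would be learned on seen classes, and the class embeddings would come from a semantic representation of the unseen action names; here both are random placeholders, so the predicted index carries no meaning beyond showing the data flow.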