Joint Embedding with Multi-Task Learning for Multi-Label Zero-Shot Action Recognition

Rongqiao An, Zhenjiang Miao, Qingyu Li
DOI: https://doi.org/10.1109/icsp.2018.8652415
2018-01-01
Abstract: Action recognition, one of the most significant fields of computer vision, has achieved great success over the past few years, mainly owing to the popularity of deep, data-driven architectures. However, in real-world scenarios that include unseen actions, previous methods, which are supervised directly by labels, face two main problems: domain adaptation and laborious annotation. Therefore, we propose a novel joint space model of visual data and semantic information to address zero-shot action recognition. Furthermore, to better exploit the semantic relationships between seen and unseen classes captured by word vectors, we introduce an auto-encoder into our framework to narrow the semantic gap and improve recognition accuracy. Moreover, we use an ε-greedy based max-pooling technique to select the most relevant visual segment of an instance for a given label. Finally, we employ multi-task learning to jointly optimize the classification and ranking tasks in the joint latent space. We evaluate our work on Charades, a weakly annotated human action dataset. The experimental results demonstrate that the proposed method significantly outperforms state-of-the-art methods in both accuracy and efficiency.
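The ε-greedy based max-pooling step named in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how segments are scored, what value ε takes, or how exploration is performed, so the function name `epsilon_greedy_max_pool`, the per-segment relevance scores, and the uniform random exploration are all assumptions made for illustration and not the authors' implementation.

```python
import random
from typing import Optional, Sequence


def epsilon_greedy_max_pool(
    scores: Sequence[float],
    epsilon: float = 0.1,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick one segment index for a label (hypothetical helper).

    scores  -- assumed relevance of each visual segment to the label
    epsilon -- assumed probability of exploring a random segment
    """
    if not scores:
        raise ValueError("scores must contain at least one segment")
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    if rng is None:
        rng = random.Random()

    # Explore: with probability epsilon, pick a segment uniformly at random.
    if rng.random() < epsilon:
        return rng.randrange(len(scores))

    # Exploit: otherwise behave like ordinary max-pooling over segments.
    return max(range(len(scores)), key=lambda i: scores[i])


if __name__ == "__main__":
    segment_scores = [0.2, 0.9, 0.4]  # made-up scores for three segments
    chosen = epsilon_greedy_max_pool(segment_scores, epsilon=0.1)
    print("selected segment:", chosen)
```

With `epsilon=0.0` the function reduces to plain max-pooling and always returns the highest-scoring segment. A larger `epsilon` occasionally selects other segments, which under weak annotation may keep training from locking onto an early, possibly wrong, segment choice.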