Weakly-supervised Disentanglement Network for Video Fingerspelling Detection
Ziqi Jiang, Shengyu Zhang, Siyuan Yao, Wenqiao Zhang, Sihan Zhang, Juncheng Li, Zhou Zhao, Fei Wu
DOI: https://doi.org/10.1145/3503161.3548213
2022-01-01
Abstract: Fingerspelling detection, which aims to localize and recognize fingerspelling gestures in raw, untrimmed videos, is a nascent but important research area that could help bridge the communication gap between deaf people and others. Many existing works exploit additional knowledge, such as pose annotations, and newly collected datasets to improve performance. However, in real-world applications, additional data collection and annotation require tremendous human effort that is not always affordable. In this paper, we propose the Weakly-supervised Disentanglement Network (WED), which requires no additional knowledge and better exploits the video-sentence weak supervision. Specifically, WED incorporates two critical components: 1) the Masked Disentanglement Module, which employs a Variational Autoencoder (VAE) to disentangle signed letters. Each latent factor in the VAE corresponds to a particular signed letter, and during decoding we mask the latent factors corresponding to letters that do not appear in the video. Compared to the vanilla VAE, the masked reconstruction leverages the video-sentence weak supervision, leading to a better sign-language-oriented disentanglement; and 2) the Dynamic Memory Network module, which uses the disentangled sign knowledge as prior knowledge and reference for sign-related frame identification and gesture recognition through a carefully designed memory reading component. We conduct extensive experiments on the benchmark ChicagoFSWild and ChicagoFSWild+ datasets. Empirical studies validate that the WED network achieves effective sign gesture disentanglement, contributing to state-of-the-art performance for fingerspelling detection and recognition.
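
The masking step described for the Masked Disentanglement Module can be illustrated with a minimal sketch. It is not the authors' implementation: the 26-letter alphabet, the per-letter factor dimension, and the names `letter_mask` and `mask_latents` are all assumptions made for illustration, and the VAE encoder and decoder are omitted.

```python
import numpy as np

# Assumption: one latent factor per lowercase letter; the real label set may differ.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def letter_mask(label: str) -> np.ndarray:
    """Return a 0/1 vector with 1 for every letter that occurs in the label."""
    present = set(label.lower())
    return np.array([1.0 if ch in present else 0.0 for ch in ALPHABET])


def mask_latents(z: np.ndarray, label: str) -> np.ndarray:
    """Zero the latent factors of letters absent from the video-level label.

    z has shape (num_letters, factor_dim): one factor vector per letter
    (hypothetical layout). The masked result is what a decoder would see.
    """
    return z * letter_mask(label)[:, None]


# Toy latent code standing in for an encoder output.
rng = np.random.default_rng(0)
z = rng.normal(size=(len(ALPHABET), 4))
z_masked = mask_latents(z, "cab")  # only the factors for a, b, c survive
```

Because the mask is built from the video-level letter sequence alone, this step needs no frame-level annotation, which is the sense in which the supervision is weak.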