Disentangled Human Action Video Generation via Decoupled Learning

Lingbo Yang, Zhenghui Zhao, Shiqi Wang, Shanshe Wang, Siwei Ma, Wen Gao
DOI: https://doi.org/10.1109/icmew.2019.00091
2019-01-01
Abstract: Recently, there has been remarkable progress in synthesizing realistic human action videos by directly learning to translate pose heatmaps/stick figures into video frames in an end-to-end fashion. However, such models are not suitable for fashion-related applications, which typically require flexible manipulation of visual attributes such as the color of clothes. In this paper, we propose a disentangled human video generation framework conditioned on both the pose sequence and encoded color attributes. We aim to learn an encoder that captures the manifold structure of the latent color space and a generator that fully utilizes the encoded color attributes to produce diversely colored human action videos. To this end, we design a two-stage decoupled learning approach that uses a pre-trained color-aware encoder to guide the disentangled learning of the generator. Furthermore, a color augmentation approach is applied to raw video clips to better shape the distribution of samples in the latent color space. Comprehensive experimental results demonstrate the efficacy of our proposed methods.
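The abstract does not say how the color augmentation of raw video clips is performed. As a hypothetical illustration only, the sketch below assumes one simple clip-level scheme: a single random hue rotation applied identically to every frame, so colors change between clips but stay consistent over time within a clip. The hue is rotated about the RGB gray axis (1, 1, 1) using Rodrigues' rotation formula, which leaves gray pixels unchanged. The function names `hue_rotation_matrix`, `augment_clip_color`, and `random_color_augment`, the (T, H, W, 3) float layout with values in [0, 1], and the choice of a hue rotation are all assumptions, not the authors' implementation.

```python
import numpy as np


def hue_rotation_matrix(theta: float) -> np.ndarray:
    """3x3 matrix rotating RGB vectors by `theta` radians about the gray axis (1, 1, 1).

    Built with Rodrigues' formula: R = cos(t) I + sin(t) [u]_x + (1 - cos(t)) u u^T.
    Gray pixels (r == g == b) lie on the axis and are left unchanged.
    """
    u = np.ones(3) / np.sqrt(3.0)             # unit vector along the gray axis
    ux = np.array([[0.0, -u[2], u[1]],         # cross-product matrix [u]_x
                   [u[2], 0.0, -u[0]],
                   [-u[1], u[0], 0.0]])
    c, s = np.cos(theta), np.sin(theta)
    return c * np.eye(3) + s * ux + (1.0 - c) * np.outer(u, u)


def augment_clip_color(clip: np.ndarray, theta: float) -> np.ndarray:
    """Apply one hue rotation to every frame of a clip (hypothetical helper).

    clip: float array of shape (T, H, W, 3) with RGB values in [0, 1].
    The same matrix is used for all T frames, so colors stay temporally consistent.
    """
    rot = hue_rotation_matrix(theta)
    out = clip @ rot.T                         # each pixel vector v becomes R v
    return np.clip(out, 0.0, 1.0)              # keep values in the valid range


def random_color_augment(clip: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Sample one hue angle per clip (not per frame) and apply it (hypothetical helper)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    return augment_clip_color(clip, theta)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.random((8, 64, 64, 3))          # 8 random RGB frames
    print(random_color_augment(clip, rng).shape)  # (8, 64, 64, 3)
```

Sampling the angle once per clip rather than once per frame is what keeps colors from flickering across frames. Note that a global hue rotation recolors the whole frame, background included, rather than the clothing alone; a method targeting clothing color specifically might need a mask, which this sketch does not assume.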