Abstract: Pedestrian segmentation and pose tracking infer human silhouettes and skeletons, respectively. Although the two tasks are complementary, few works have combined them so that each improves the other, and some related methods are limited to still images. In this paper, we propose an approach that solves them jointly in monocular videos within a unified framework. The framework is built on EM-based maximum likelihood estimation, in which pose tracking is performed by Bayesian filtering with the body silhouette as an observation cue, and pedestrian segmentation is inferred by guided filtering constrained by the body skeleton. The two sets of parameters are updated alternately along the video. To initialize the framework, we use a hierarchical shape matching scheme to obtain the silhouette and skeleton in the first frame. Experiments on challenging pedestrian datasets verify the approach's effectiveness under cluttered backgrounds, moving cameras, and various articulated bodies, and show that solving the two tasks together improves performance significantly.

A UNIFIED FRAMEWORK FOR JOINT VIDEO PEDESTRIAN SEGMENTATION AND POSE TRACKING