Abstract: We propose a new architecture, based on the Faster R-CNN framework, for people detection. Our model extracts features from the first, third, and fifth stages of the VGG16 network and combines them into a robust feature map that carries both semantic and localization information. In addition, we replace the fc6 and fc7 layers of the original structure with two convolutional layers, since the fully connected layers are time-consuming. We fine-tune our network on the Brainwash dataset, partially initializing it from a model trained on the ImageNet dataset. In our experiments the model reaches an AP of 92.4% and a recall of 93.5%, substantially exceeding our baseline methods.
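
To make the hierarchical-feature idea concrete, the following is a minimal shape-bookkeeping sketch, not the implementation used in this work. It relies on the standard VGG16 layout (conv1_2, conv3_3, and conv5_3 have 64, 256, and 512 channels at strides 1, 4, and 16). Everything else is an assumption made for illustration: that features are taken from the last convolution of each stage, that all three maps are resampled to the stage-3 resolution and concatenated along the channel axis, and that the input is 480 x 640. The helper names `stage_shape`, `fused_shape`, and `fc_params` are hypothetical. The last helper counts the parameters of the standard fc6 and fc7 layers, which gives a sense of why they are costly.

```python
# Standard VGG16 facts for the three stages named in the abstract:
# stage -> (last conv layer, output channels, cumulative stride at that layer).
VGG16_STAGES = {
    1: ("conv1_2", 64, 1),
    3: ("conv3_3", 256, 4),
    5: ("conv5_3", 512, 16),
}


def stage_shape(stage, height, width):
    """Return (channels, H, W) of a stage's last conv output for an H x W input."""
    _, channels, stride = VGG16_STAGES[stage]
    return channels, height // stride, width // stride


def fused_shape(height, width, target_stage=3):
    """Shape of the combined feature map.

    ASSUMPTION (not stated in the abstract): every stage is resampled to the
    resolution of `target_stage` and the maps are concatenated channel-wise,
    without any channel reduction.
    """
    _, target_h, target_w = stage_shape(target_stage, height, width)
    channels = sum(stage_shape(s, height, width)[0] for s in VGG16_STAGES)
    return channels, target_h, target_w


def fc_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


if __name__ == "__main__":
    h, w = 480, 640  # illustrative input size (assumption)
    for stage in VGG16_STAGES:
        print(VGG16_STAGES[stage][0], stage_shape(stage, h, w))
    print("fused:", fused_shape(h, w))
    # Standard VGG16 head: fc6 takes the 512 x 7 x 7 pooled map, fc7 is 4096 -> 4096.
    print("fc6 params:", fc_params(512 * 7 * 7, 4096))
    print("fc7 params:", fc_params(4096, 4096))
```

Under these assumptions the stage-1 map would have to be downsampled by a factor of 4 and the stage-5 map upsampled by a factor of 4 before concatenation. The actual resampling operators and the shapes of the two replacement convolutional layers are not specified in the abstract, so they are left out here.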

People detection in crowded scenes using hierarchical features