Motional foreground attention-based video crowd counting

Miaogen Ling, Tianhang Pan, Yi Ren, Ke Wang, Xin Geng
DOI: https://doi.org/10.1016/j.patcog.2023.109891
IF: 8
2023-08-24
Pattern Recognition
Abstract: In this paper, we tackle the problem of video crowd counting. Compared with single-image crowd counting, video provides gradual spatial and temporal variation information that helps to strengthen the robustness of crowd counting. Therefore, it is critical to make full use of neighboring frames, both in feature extraction and in the final prediction, when estimating the count for the current frame. Based on these observations, we propose a motional foreground attention-based video crowd counting method. Specifically, we first leverage a foreground estimation module based on ConvNeXt to extract motional features from bidirectional frame differences and output a foreground estimation map. The motional features, combined with the static features of the current frame, are then sent into a feature fusion network, where the foreground estimation map is transformed into attention weights for crowd number estimation. Three new indoor video datasets are manually annotated. The proposed method achieves state-of-the-art performance on all indoor and outdoor video datasets.
computer science, artificial intelligence; engineering, electrical & electronic
What problem does this paper attempt to address?
This paper addresses video crowd counting. Compared with crowd counting from a single image, video provides gradual spatial and temporal change information, which helps make crowd counting more robust. Making full use of adjacent frames, in both feature extraction and the final prediction, is therefore crucial for estimating the count in the current frame.

Based on these observations, the authors propose a video crowd counting method based on motional foreground attention. The method first uses a ConvNeXt-based foreground estimation module to extract motion features from bidirectional frame differences and to output a foreground estimation map. The motion features are then combined with the static features of the current frame and sent to a feature fusion network, where the foreground estimation map is converted into attention weights for estimating the crowd number. In addition, the authors manually label three new indoor video datasets to better evaluate performance in indoor scenes.

The main contributions of the paper are:
1. Using bidirectional frame differences to model the spatial and temporal correlations between adjacent frames. These differences are applied in both feature extraction and the final prediction, which improves the robustness of the model.
2. Adding an up-sampling block to ConvNeXt for the video crowd counting task. To the authors' knowledge, this is the first time ConvNeXt has been applied to video crowd counting.
3. Manually labeling three new video crowd counting datasets with point annotations.

The experimental results show that the proposed model outperforms existing state-of-the-art crowd counting methods on all indoor and outdoor video datasets.
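
The two ideas described above, bidirectional frame differences and a foreground map used as attention weights, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the learned ConvNeXt foreground estimation module and the feature fusion network are replaced here by a hand-written sigmoid over the mean motion magnitude, and every function name and parameter (`frame_diff`, `bidirectional_diffs`, `foreground_attention`, `attended_count`, `scale`, `bias`) is a hypothetical choice made only for illustration.

```python
import math


def frame_diff(a, b):
    """Absolute per-pixel difference between two equally sized grayscale frames."""
    return [[abs(x - y) for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def bidirectional_diffs(prev_frame, cur_frame, next_frame):
    """Differences from the current frame to its previous and next neighbors."""
    return frame_diff(cur_frame, prev_frame), frame_diff(next_frame, cur_frame)


def foreground_attention(d_back, d_fwd, scale=10.0, bias=0.1):
    """Assumed stand-in for a learned foreground estimator.

    Squashes the mean of the two difference maps into (0, 1) with a sigmoid,
    so moving pixels get weights near 1 and static pixels get low weights.
    `scale` and `bias` are illustrative constants, not values from the paper.
    """
    attention = []
    for row_b, row_f in zip(d_back, d_fwd):
        attention.append([
            1.0 / (1.0 + math.exp(-scale * ((b + f) / 2.0 - bias)))
            for b, f in zip(row_b, row_f)
        ])
    return attention


def attended_count(density, attention):
    """Sum of a density map after per-pixel weighting by the attention map."""
    return sum(
        d * a
        for row_d, row_a in zip(density, attention)
        for d, a in zip(row_d, row_a)
    )


# Toy 2x2 example: one pixel changes between frames, the rest stay static.
prev_frame = [[0.0, 0.0], [0.0, 0.0]]
cur_frame = [[1.0, 0.0], [0.0, 0.0]]
next_frame = [[0.0, 0.0], [0.0, 0.0]]
density = [[1.0, 0.0], [0.0, 0.0]]  # hypothetical static density estimate

d_back, d_fwd = bidirectional_diffs(prev_frame, cur_frame, next_frame)
attention = foreground_attention(d_back, d_fwd)
print(attended_count(density, attention))
```

In this toy setting the moving pixel receives an attention weight close to 1 while static pixels are suppressed, which mirrors the intuition that motion cues help separate the crowd from the background. In the paper the foreground map is produced by a trained network rather than a fixed formula.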