Abstract: Mainstream crowd counting methods usually use a convolutional neural network (CNN) to regress a density map, which requires point-level annotations. However, annotating each person with a point is an expensive and laborious process. Moreover, counting accuracy is evaluated at test time without using the point-level annotations, which makes them redundant for evaluation. Hence, it is desirable to develop weakly-supervised counting methods that rely only on count-level annotations, a more economical way of labeling. Current weakly-supervised counting methods use a CNN to regress the total count of a crowd in an image-to-count paradigm. However, these CNN-based methods have an intrinsic limitation: their limited receptive fields restrict context modeling. As a result, they cannot achieve satisfactory performance, which limits their real-world applicability. The transformer is a popular sequence-to-sequence prediction model in natural language processing (NLP) that has a global receptive field. In this paper, we propose TransCrowd, which reformulates the weakly-supervised crowd counting problem from a sequence-to-count perspective based on transformers. We observe that the proposed TransCrowd can effectively extract semantic crowd information by using the self-attention mechanism of the transformer. To the best of our knowledge, this is the first work to adopt a pure transformer for crowd counting research. Experiments on five benchmark datasets demonstrate that the proposed TransCrowd achieves superior performance compared with all the weakly-supervised CNN-based counting methods and highly competitive counting performance compared with some popular fully-supervised counting methods.
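To make the sequence-to-count idea concrete, the following is a minimal sketch in plain Python and not the TransCrowd implementation. It assumes a tiny grayscale image, a hypothetical patch size of 2, a single self-attention layer with random untrained weights, mean pooling, and a linear regression head; the abstract specifies none of these, and all function and parameter names are illustrative. It shows the two properties the abstract relies on: every token attends to every other token (a global receptive field), and supervision needs only the total count.

```python
import math
import random


def image_to_patches(image, patch):
    """Split an H x W image (list of rows) into a sequence of flattened patches."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    seq = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            seq.append([image[top + i][left + j]
                        for i in range(patch) for j in range(patch)])
    return seq


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the whole sequence."""
    q = [matvec(wq, t) for t in tokens]
    k = [matvec(wk, t) for t in tokens]
    v = [matvec(wv, t) for t in tokens]
    scale = math.sqrt(len(q[0]))
    # Each row holds one token's attention over ALL tokens (global context).
    weights = [softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
               for qi in q]
    dim = len(v[0])
    out = [[sum(wij * vj[d] for wij, vj in zip(wi, v)) for d in range(dim)]
           for wi in weights]
    return out, weights


def predict_count(image, params, patch=2):
    """Sequence-to-count: patches -> self-attention -> mean pool -> scalar count."""
    tokens = image_to_patches(image, patch)
    attended, _ = self_attention(tokens, params["wq"], params["wk"], params["wv"])
    dim = len(attended[0])
    pooled = [sum(t[d] for t in attended) / len(attended) for d in range(dim)]
    return sum(w * x for w, x in zip(params["head"], pooled)) + params["bias"]


def count_l1_loss(predicted, target):
    """Count-level supervision: only the total count is needed, no point labels."""
    return abs(predicted - target)


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 2x2 patches give 4-dim tokens, projected to 3 dims.
params = {
    "wq": rand_matrix(3, 4),
    "wk": rand_matrix(3, 4),
    "wv": rand_matrix(3, 4),
    "head": [rng.uniform(-0.5, 0.5) for _ in range(3)],
    "bias": 0.0,
}
image = [[rng.random() for _ in range(4)] for _ in range(4)]
pred = predict_count(image, params)
loss = count_l1_loss(pred, 7.0)  # 7.0 is a made-up count-level label
print(pred, loss)
```

Because the weights are random, the printed prediction is meaningless; in practice the parameters would be trained by minimizing the count-level loss over a dataset, and a real model would stack several multi-head attention layers on learned patch embeddings.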

CrowdTrans: Learning top-down visual perception for crowd counting by transformer

Relevant Region Prediction for Crowd Counting

Scale Pyramid Network For Crowd Counting

CCTrans: Simplifying and Improving Crowd Counting with Transformer

Multi-branch Progressive Embedding Network for Crowd Counting

TransCrowd: weakly-supervised crowd counting with transformers

Crowd Transformer Network

Semantic-refined Spatial Pyramid Network for Crowd Counting

Counting Varying Density Crowds Through Density Guided Adaptive Selection CNN and Transformer Estimation

An interactive network based on transformer for multimodal crowd counting

RGB-T Multi-Modal Crowd Counting Based on Transformer

Gramformer: Learning Crowd Counting via Graph-Modulated Transformer

Rethinking Global Context in Crowd Counting

CLDE-Net: crowd localization and density estimation based on CNN and transformer network

Crowd Counting Based on Multiresolution Density Map and Parallel Dilated Convolution

Transformer-CNN Hybrid Network for Crowd Counting

DTCC: Multi-level dilated convolution with transformer for weakly-supervised crowd counting

Concise Convolutional Neural Network for Crowd Counting

Audio-Visual Transformer Based Crowd Counting

CC-DETR: DETR with Hybrid Context and Multi-Scale Coordinate Convolution for Crowd Counting

Density-Aware Multi-Task Learning for Crowd Counting