Abstract: We show that crowd counting can be viewed as a decomposable point querying process. This formulation enables arbitrary points as input and jointly reasons whether the points are crowd and where they are located. The querying process, however, raises an underlying problem on the number of necessary querying points. Too few imply underestimation; too many increase computational overhead. To address this dilemma, we introduce a decomposable structure, i.e., the point-query quadtree, and propose a new counting model, termed Point quEry Transformer (PET). PET implements decomposable point querying via data-dependent quadtree splitting, where each querying point could split into four new points when necessary, thus enabling dynamic processing of sparse and dense regions. Such a querying process yields an intuitive, universal modeling of crowd as both the input and output are interpretable and steerable. We demonstrate the applications of PET on a number of crowd-related tasks, including fully-supervised crowd counting and localization, partial annotation learning, and point annotation refinement, and also report state-of-the-art performance. For the first time, we show that a single counting model can address multiple crowd-related tasks across different learning paradigms. Code is available at https://github.com/cxliu0/PET.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address several key issues in crowd counting:

1. **Intuitiveness and Generality**: Existing methods typically estimate crowd numbers by predicting density maps, but this approach cannot provide instance-level information, i.e., it cannot pinpoint the exact location of each individual. This limits the ability to perform advanced analysis of the crowd. The paper proposes a new point-query method that directly outputs the location of each individual, thereby providing a more intuitive understanding of the crowd.
2. **Dynamic Handling of Sparse and Dense Areas**: In crowd counting, the input image may contain any number of people, so choosing the number of predefined query points is difficult. Too few query points lead to underestimation, while too many increase computational overhead. The paper introduces a decomposable structure, the point-query quadtree, that dynamically splits query points as needed, efficiently handling both sparse and dense areas.
3. **Multi-task Processing Capability**: Existing methods are often custom-designed for specific counting tasks or learning paradigms, limiting their use in different applications. The proposed method can be applied to multiple crowd-related tasks, including fully supervised crowd counting and localization, partial annotation learning, and point annotation refinement, demonstrating its versatility and flexibility.

### Main Contributions

1. **Proposed a Decomposable Point-Query Method**: Treats crowd counting as a decomposable point-query process, allowing the model to receive arbitrary points as input and determine whether these points belong to the crowd and where they are located. This design provides an intuitive and general crowd modeling approach.
2. **Introduced the Point-Query Quadtree**: Through a data-dependent splitting mechanism, dynamically generates query points to adapt to sparse and dense areas. This not only improves the model's efficiency but also ensures robustness in different scenarios.
3. **Implemented the Point-Query Transformer (PET)**: Based on the point-query quadtree and a progressive rectangular window attention mechanism, constructed an efficient transformer model that achieves state-of-the-art performance in multiple crowd-related tasks.

### Experimental Results

1. **Fully Supervised Crowd Counting and Localization**: PET achieved state-of-the-art performance on multiple benchmark datasets, particularly on the ShanghaiTech PartA dataset, with a mean absolute error (MAE) of 49.34.
2. **Partial Annotation Learning**: PET also performed well in partial annotation learning, inferring information from annotated areas to unannotated areas and significantly outperforming existing methods.
3. **Point Annotation Refinement**: PET can identify and correct "noisy" annotation points, moving them to the head center position and improving annotation accuracy.

In summary, this paper proposes a new point-query method and a point-query quadtree structure, not only addressing key issues in crowd counting but also demonstrating superior performance and broad applicability in multiple related tasks.
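The data-dependent splitting described above can be illustrated with a small sketch. This is not the PET implementation: `QueryPoint`, `split`, `build_queries`, and the `is_dense` predicate are hypothetical names introduced here, and `is_dense` merely stands in for whatever learned signal decides that a region needs more query points. The sketch only shows the quadtree idea, where sparse queries sit on a coarse grid and each query in a dense region is replaced by four children on a grid of half the stride.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QueryPoint:
    x: float
    y: float
    stride: float  # side length of the square cell this query covers


def split(p: QueryPoint) -> List[QueryPoint]:
    """Replace one query with four children centred in its four quadrants."""
    q = p.stride / 4
    return [QueryPoint(p.x + dx, p.y + dy, p.stride / 2)
            for dy in (-q, q) for dx in (-q, q)]


def build_queries(width: int, height: int, stride: int,
                  is_dense: Callable[[QueryPoint], bool],
                  max_depth: int = 1) -> List[QueryPoint]:
    """Place coarse queries on a regular grid, splitting those flagged dense.

    `is_dense` is an assumed placeholder for a learned splitting decision.
    """
    out: List[QueryPoint] = []

    def visit(p: QueryPoint, depth: int) -> None:
        if depth < max_depth and is_dense(p):
            for child in split(p):
                visit(child, depth + 1)
        else:
            out.append(p)

    for row in range(height // stride):
        for col in range(width // stride):
            centre = QueryPoint((col + 0.5) * stride,
                                (row + 0.5) * stride,
                                float(stride))
            visit(centre, 0)
    return out


# Toy example: a 16x16 image where only the top-left quadrant is "dense".
queries = build_queries(16, 16, 8, is_dense=lambda p: p.x < 8 and p.y < 8)
print(len(queries))  # 3 unsplit coarse queries + 4 children = 7
```

In this toy setup the query budget grows only where the predicate fires, which is the trade-off the summary describes: a fixed coarse grid would risk underestimating the dense quadrant, while a uniformly fine grid would spend 16 queries on an image that needs 7.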

Point-Query Quadtree for Crowd Counting, Localization, and More

Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework

Uniformity in Heterogeneity: Diving Deep into Count Interval Partition for Crowd Counting

Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance

A Self-Training Approach for Point-Supervised Object Detection and Counting in Crowds

Point in, Box out: Beyond Counting Persons in Crowds

Decoupled Two-Stage Crowd Counting and Beyond

Deep Rank-Consistent Pyramid Model for Enhanced Crowd Counting

Enhancing and Dissecting Crowd Counting By Synthetic Data

Learning to Count via Unbalanced Optimal Transport

Counting moving people in crowds using motion statistics of feature-points

PANet: Perspective-Aware Network with Dynamic Receptive Fields and Self-Distilling Supervision for Crowd Counting

PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds.

Efficient Pig Counting in Crowds with Keypoints Tracking and Spatial-aware Temporal Response Filtering

Perspective-Guided Convolution Networks for Crowd Counting

CC-DETR: DETR with Hybrid Context and Multi-Scale Coordinate Convolution for Crowd Counting

CCTrans: Simplifying and Improving Crowd Counting with Transformer

Point Transformer V3: Simpler, Faster, Stronger

Crowd Counting via Perspective-Guided Fractional-Dilation Convolution

Compare and Focus: Multi-Scale View Aggregation for Crowd Counting
