Abstract: We show that crowd counting can be viewed as a decomposable point querying process. This formulation enables arbitrary points as input and jointly reasons whether the points are crowd and where they locate. The querying process, however, raises an underlying problem on the number of necessary querying points. Too few imply underestimation; too many increase computational overhead. To address this dilemma, we introduce a decomposable structure, i.e., the point-query quadtree, and propose a new counting model, termed Point quEry Transformer (PET). PET implements decomposable point querying via data-dependent quadtree splitting, where each querying point could split into four new points when necessary, thus enabling dynamic processing of sparse and dense regions. Such a querying process yields an intuitive, universal modeling of crowd as both the input and output are interpretable and steerable. We demonstrate the applications of PET on a number of crowd-related tasks, including fully-supervised crowd counting and localization, partial annotation learning, and point annotation refinement, and also report state-of-the-art performance. For the first time, we show that a single counting model can address multiple crowd-related tasks across different learning paradigms. Code is available at https://github.com/cxliu0/PET.
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve
This paper aims to address several key issues in crowd counting:
1. **Intuitiveness and Generality**: Existing methods typically estimate crowd numbers by predicting density maps, but this approach cannot provide instance-level information, i.e., it cannot pinpoint the exact location of each individual. This limits the ability to perform advanced analysis of the crowd. The paper proposes a new point-query method that can directly output the location of each individual, thereby providing a more intuitive understanding of the crowd.
2. **Dynamic Handling of Sparse and Dense Areas**: An input image may contain any number of people, so it is hard to fix the number of query points in advance. Too few query points lead to underestimation, while too many increase computational overhead. The paper introduces a decomposable structure, the point-query quadtree, in which a query point can split into four new points when needed, so that sparse and dense areas are both handled efficiently.
3. **Multi-task Processing Capability**: Existing methods are often custom-designed for specific counting tasks or learning paradigms, limiting their use in different applications. The proposed method can be applied to multiple crowd-related tasks, including fully supervised crowd counting and localization, partially annotated learning, and point annotation refinement, demonstrating its versatility and flexibility.
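
The splitting idea in item 2 can be illustrated with a minimal sketch. It assumes a single level of splitting on a regular grid, and the `is_dense` callback is a hypothetical stand-in for whatever learned, data-dependent decision the model makes; none of the function names below come from the paper or its code.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def sparse_grid(width: int, height: int, stride: int) -> List[Point]:
    """Place one query point at the center of each stride x stride cell."""
    return [
        (x + stride / 2, y + stride / 2)
        for y in range(0, height, stride)
        for x in range(0, width, stride)
    ]


def split_point(point: Point, stride: float) -> List[Point]:
    """Replace one query point with four children, one per quadrant of its cell."""
    x, y = point
    q = stride / 4  # distance from the cell center to each quadrant center
    return [(x - q, y - q), (x + q, y - q), (x - q, y + q), (x + q, y + q)]


def quadtree_queries(
    points: List[Point], stride: float, is_dense: Callable[[Point], bool]
) -> List[Point]:
    """Keep points in sparse regions; split points in dense regions into four."""
    result: List[Point] = []
    for p in points:
        if is_dense(p):  # hypothetical stand-in for a learned split decision
            result.extend(split_point(p, stride))
        else:
            result.append(p)
    return result


# A 16 x 16 image with stride 8 starts with 4 query points.
coarse = sparse_grid(16, 16, 8)
# Assumption for illustration only: the left half of the image is crowded.
queries = quadtree_queries(coarse, 8, is_dense=lambda p: p[0] < 8)
print(len(coarse), len(queries))
```

In this toy setup the two points in the left half are each replaced by four children, while the two points in the right half are kept, so the number of queries grows only where the region is dense.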
### Main Contributions
1. **Proposed a Decomposable Point-Query Method**: Treats crowd counting as a decomposable point-query process. The model accepts arbitrary points as input and determines, for each point, whether it belongs to a person and where that person is located. This design provides an intuitive and general way to model crowds.
2. **Introduced the Point-Query Quadtree**: A data-dependent splitting mechanism generates query points dynamically, so the number of queries adapts to sparse and dense areas. This improves efficiency and helps the model stay robust across scenes of different density.
3. **Implemented the Point quEry Transformer (PET)**: Built on the point-query quadtree and a progressive rectangular window attention mechanism, PET is an efficient transformer model that achieves state-of-the-art performance on multiple crowd-related tasks.
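
To make the point-query formulation in item 1 concrete, here is a minimal sketch of how per-query outputs could be turned into a count and a set of locations. It assumes each query yields a person score and a 2D offset; the `decode_queries` name, the 0.5 threshold, and the hand-written scores and offsets are all illustrative assumptions, not the paper's actual interface or outputs.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def decode_queries(
    queries: List[Point],
    scores: List[float],
    offsets: List[Point],
    threshold: float = 0.5,  # assumed cutoff, chosen only for illustration
) -> List[Point]:
    """Keep queries scored as 'person' and shift each by its predicted offset."""
    people: List[Point] = []
    for (qx, qy), score, (dx, dy) in zip(queries, scores, offsets):
        if score > threshold:
            people.append((qx + dx, qy + dy))
    return people


# Hand-made stand-ins for what a model might predict for three query points.
queries = [(4.0, 4.0), (12.0, 4.0), (4.0, 12.0)]
scores = [0.9, 0.2, 0.7]
offsets = [(1.0, -0.5), (0.0, 0.0), (-2.0, 1.5)]

people = decode_queries(queries, scores, offsets)
count = len(people)  # the crowd count is the number of accepted queries
print(count, people)
```

Because the output is a list of points rather than a density map, the count and the localization come from the same prediction.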
### Experimental Results
1. **Fully Supervised Crowd Counting and Localization**: PET achieved state-of-the-art performance on multiple benchmark datasets, including a mean absolute error (MAE) of 49.34 on ShanghaiTech PartA.
2. **Partially Annotated Learning**: PET also performed well when only part of each image is annotated. It transfers what it learns from annotated areas to unannotated areas and significantly outperforms existing methods.
3. **Point Annotation Refinement**: PET can identify "noisy" annotation points and move them toward the head center, improving annotation accuracy.
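
For reference, MAE in crowd counting is the average absolute difference between predicted and ground-truth counts over the test images. The sketch below shows the computation; the counts are made-up numbers for illustration and are not results from the paper.

```python
from typing import Sequence


def mean_absolute_error(
    pred_counts: Sequence[float], gt_counts: Sequence[float]
) -> float:
    """Average of |predicted count - ground-truth count| over all images."""
    if not pred_counts or len(pred_counts) != len(gt_counts):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - g) for p, g in zip(pred_counts, gt_counts))
    return total / len(pred_counts)


# Made-up per-image counts, for illustration only.
predicted = [100, 250, 48]
ground_truth = [110, 240, 50]
mae = mean_absolute_error(predicted, ground_truth)
print(round(mae, 2))
```

A lower MAE means the predicted counts are, on average, closer to the annotated counts.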
In summary, the paper proposes a point-query formulation of crowd counting and a point-query quadtree structure. Together they address the key issues listed above and deliver strong performance across multiple crowd-related tasks.