Abstract: Dense object detection is widely used in autonomous driving, video surveillance, and other fields, and it is the challenging task this paper focuses on. Current detection methods that rely on greedy post-processing, such as non-maximum suppression (NMS), often produce duplicate predictions or missed detections in dense scenes; this is a problem common to NMS-based algorithms. End-to-end DETR (DEtection TRansformer) is a type of detector that can build the de-duplication capability of post-processing steps such as NMS into the network itself. Studying it, we found that homogeneous queries in query-based detectors reduce both the de-duplication capability of the network and the learning efficiency of the encoder, resulting in duplicate predictions and missed detections. To solve this problem, we propose a learnable differentiated encoding that de-homogenizes the queries. The queries can also communicate with each other through the differentiated encoding information, which replaces the previous self-attention among the queries. In addition, we apply a joint loss to the output of the encoder that considers both location and confidence prediction, giving the queries a higher-quality initialization. Without cumbersome decoder stacking, and while maintaining accuracy, our proposed end-to-end detection framework is more concise and reduces the number of parameters by about 8% compared to deformable DETR. Our method achieves 93.6% average precision (AP), 39.2% MR⁻², and 84.3% JI on the challenging CrowdHuman dataset, outperforming previous state-of-the-art (SOTA) methods such as Iter-E2EDet (Progressive End-to-End Object Detection) and MIP (One proposal, Multiple predictions). Our method is also more robust across scenes of different densities.
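
The NMS failure mode described in the abstract can be illustrated with a minimal sketch of greedy NMS. This is not the paper's method; it is the standard greedy procedure, and the box coordinates, scores, and thresholds below are made-up values chosen only to show the trade-off. Boxes 0 and 1 stand for two distinct, heavily overlapping people, and box 2 stands for a loosely localized duplicate of the first person.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_nms(boxes, scores, iou_threshold):
    """Visit boxes by descending score; keep a box only if its IoU with
    every already-kept box is at most iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections (illustrative values, not from the paper).
boxes = [
    (0, 0, 10, 20),   # person A
    (3, 0, 13, 20),   # person B, heavily overlapping A (IoU with A ~ 0.54)
    (-2, 0, 8, 20),   # duplicate of person A (IoU with A ~ 0.67)
]
scores = [0.9, 0.8, 0.7]

# A strict threshold removes the duplicate but also suppresses person B.
kept_strict = greedy_nms(boxes, scores, iou_threshold=0.5)

# A loose threshold keeps person B but lets the duplicate through as well.
kept_loose = greedy_nms(boxes, scores, iou_threshold=0.7)
```

In this toy case no single threshold both keeps person B and removes the duplicate, because the duplicate overlaps person A less than it would need to for a loose threshold to reject it. This is the kind of dilemma that motivates moving de-duplication into the network.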

Inferred box harmonization and aggregation for degraded face detection in crowds

Towards Accurate Dense Pedestrian Detection Via Occlusion-Prediction Aware Label Assignment and Hierarchical-Nms.

Boosting Detection in Crowd Analysis via Underutilized Output Features

Feature Agglomeration Networks for Single Stage Face Detection

ASFD: Automatic and Scalable Face Detector

Context feature fusion and enhanced non-maximum suppression for pedestrian detection in crowded scenes

HAMBox: Delving into Online High-quality Anchors Mining for Detecting Outer Faces

Hybrid attention network and center-guided non-maximum suppression for occluded face detection

Adaptive Scale and Spatial Aggregation for Real-Time Object Detection

Double Anchor R-CNN for Human Detection in a Crowd

Aggregation Connection Network for Tiny Face Detection

Dense pedestrian face detection in complex environments

FHEDN: A based on context modeling Feature Hierarchy Encoder-Decoder Network for face detection

SFA: Small Faces Attention Face Detector

Accurate Face Detection for High Performance

Dense Object Detection Based on De-Homogenized Queries

Robust Face Detection via Learning Small Faces on Hard Images

Mask Focal Loss: A unifying framework for dense crowd counting with canonical object detection networks

4AC-YOLOv5: an improved algorithm for small target face detection

Composite Backbone Small Object Detection Based on Context and Multi-Scale Information with Attention Mechanism

Dynamic Feature and Context Enhancement Network for Faster Detection of Small Objects