Abstract: Most detection models employ many detection heads that output their predictions independently. However, the locality of convolutional neural networks (CNNs) makes the features extracted by adjacent convolution kernels very similar, which leads to duplicate predictions. The hand-designed non-maximum suppression (NMS) procedure is commonly used to remove these duplicates. However, NMS is ill-suited to certain scenarios, such as crowded scenes, and requires careful tuning of hyper-parameters. End-to-end training is therefore needed to improve detection across more scenarios. To this end, we propose a model that enables the network to adaptively identify duplicate objects and output non-repetitive results, effectively replacing the hand-designed NMS procedure. By adding differentiated priors to image features and using Multi-Head Attention to enhance global communication between features, our model detects objects in an end-to-end manner. It can be easily applied to traditional one-stage detectors, e.g., FCOS and RetinaNet. It achieves fast convergence and a high recall rate, and its accuracy is significantly better than the baseline and outperforms many one-stage and two-stage methods. It also achieves performance comparable to traditional detectors on the dense-scene dataset CrowdHuman. Evaluation results demonstrate that our model with ResNet-50 achieves 40.5% $\mathrm{AP}$ on the COCO dataset and 89.2% $\mathrm{AP}_{50}$ on the CrowdHuman dataset.
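
For readers unfamiliar with the post-processing step the abstract aims to replace, the following is a minimal sketch of standard greedy NMS in pure Python. It is not the proposed model. The function names `iou` and `greedy_nms`, the `(x1, y1, x2, y2)` box format, and the example boxes and thresholds are assumptions chosen for illustration only.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is kept only if its IoU with every already-kept box
    does not exceed iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two boxes overlap heavily (IoU ~ 0.68).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]

kept_strict = greedy_nms(boxes, scores, iou_threshold=0.5)
kept_loose = greedy_nms(boxes, scores, iou_threshold=0.7)
print(kept_strict, kept_loose)
```

With the 0.5 threshold the second box is suppressed, while with 0.7 it survives. If the two overlapping boxes were distinct people standing close together, the stricter setting would drop a true detection, which illustrates the crowded-scene limitation and the hyper-parameter sensitivity that the abstract describes.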

HeadNet: an End-to-End Adaptive Relational Network for Head Detection

FDN: Feature Decoupling Network for Head Pose Estimation

Head Tracking by Means of Probabilistic Neural Networks

SADRNet: Self-Aligned Dual Face Regression Networks for Robust 3D Dense Face Alignment and Reconstruction

End-to-End Spatial Transform Face Detection and Recognition

Real-Time Facial Landmark Detection by Attention-driven Lightweight Network

Spatial Attention Network for Head Detection

A Novel Convolutional Neural Network for Head Detection and Pose Estimation in Complex Environments from Single-Depth Images

ADNet: Leveraging Error-Bias Towards Normal Direction in Face Alignment

Head Detection Based on Convolutional Neural Network with Multi-Stage Weighted Feature

An effective head detection framework via convolutional neural networks

CephaNN: A Multi-Head Attention Network for Cephalometric Landmark Detection

UniHead: Unifying Multi-Perception for Detection Heads

RESC: REfine the SCore with Adaptive Transformer Head for End-to-end Object Detection

Self-Attention Mechanism-Based Head Pose Estimation Network with Fusion of Point Cloud and Image Features

An Effective Deep Network for Head Pose Estimation without Keypoints

Attention-Guided Huber Loss for Head Pose Estimation Based on Improved Capsule Network

Human Head Pose Estimation Through Temporal Enhanced and Accurate Self-Supervised Depth Prediction

HCNET: A Point Cloud Object Detection Network Based on Height and Channel Attention

Hierarchical Reasoning Network for Human-Object Interaction Detection

Joint Human Detection and Head Pose Estimation Via Multistream Networks for RGB-D Videos