Objects as Points

Xingyi Zhou, Dequan Wang, Philipp Krähenbühl
DOI: https://doi.org/10.48550/arXiv.1904.07850
2019-04-26
Abstract: Detection identifies objects as axis-aligned boxes in an image. Most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each. This is wasteful, inefficient, and requires additional post-processing. In this paper, we take a different approach. We model an object as a single point: the center point of its bounding box. Our detector uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, and even pose. Our center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors. CenterNet achieves the best speed-accuracy trade-off on the MS COCO dataset, with 28.1% AP at 142 FPS, 37.4% AP at 52 FPS, and 45.1% AP with multi-scale testing at 1.4 FPS. We use the same approach to estimate 3D bounding box in the KITTI benchmark and human pose on the COCO keypoint dataset. Our method performs competitively with sophisticated multi-stage methods and runs in real-time.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problems This Paper Attempts to Solve

This paper aims to address the following main issues:

1. **Simplifying Object Detection**:
   - Current object detection methods rely on enumerating a large number of potential object locations and classifying each one, which is both inefficient and resource-intensive.
   - The paper proposes a new method that represents objects as a point at the center of their bounding box, thereby simplifying the object detection process.
2. **Improving Detection Speed and Accuracy**:
   - By using keypoint estimation to find the center point of objects and regressing to other attributes (such as size, 3D position, orientation, etc.), the entire detection process becomes more efficient and accurate.
   - The proposed method is called CenterNet, which is an end-to-end differentiable model that is faster and more accurate than existing bounding box-based detectors.
3. **Avoiding Non-Maximum Suppression (NMS)**:
   - Most current detectors require additional post-processing steps (such as NMS), which makes the model difficult to train end-to-end.
   - CenterNet avoids NMS by directly extracting local peaks from the keypoint heatmap, thereby simplifying the entire process.
4. **Extending to Other Tasks**:
   - This method is not only applicable to 2D object detection but can also be extended to 3D object detection and multi-person pose estimation tasks.

Overall, this paper attempts to simplify the object detection process and improve its speed and accuracy through a new object representation method (i.e., the center point).
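To make the "local peaks instead of NMS" idea concrete, here is a minimal pure-Python sketch of how center points might be read off a single-class heatmap and turned into boxes. It is an illustration under stated assumptions, not the authors' implementation: the function names `local_peaks` and `decode_boxes`, the `threshold` value, the 3x3 neighbourhood, the `stride` of 4, and the layout of the `sizes` map are all hypothetical choices made for this example.

```python
from typing import List, Tuple

Heatmap = List[List[float]]               # one class channel, indexed [y][x]
SizeMap = List[List[Tuple[float, float]]]  # regressed (width, height) per cell
Peak = Tuple[int, int, float]             # (x, y, score)


def local_peaks(heat: Heatmap, threshold: float = 0.3) -> List[Peak]:
    """Return cells that are the maximum of their 3x3 neighbourhood.

    This stands in for NMS: a cell survives only if no neighbour scores
    higher and its own score reaches `threshold` (illustrative value).
    """
    h, w = len(heat), len(heat[0])
    peaks: List[Peak] = []
    for y in range(h):
        for x in range(w):
            score = heat[y][x]
            if score < threshold:
                continue
            # Collect the up-to-8 neighbours, clipping at the borders.
            neighbours = [
                heat[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            ]
            if all(score >= n for n in neighbours):
                peaks.append((x, y, score))
    # Highest-scoring centers first.
    return sorted(peaks, key=lambda p: p[2], reverse=True)


def decode_boxes(peaks: List[Peak], sizes: SizeMap, stride: int = 4):
    """Turn center peaks into (x1, y1, x2, y2, score) boxes.

    Assumes `sizes` is expressed in heatmap cells and that the heatmap is
    `stride` times smaller than the input image (both are assumptions).
    """
    boxes = []
    for x, y, score in peaks:
        bw, bh = sizes[y][x]
        boxes.append((
            (x - bw / 2) * stride,
            (y - bh / 2) * stride,
            (x + bw / 2) * stride,
            (y + bh / 2) * stride,
            score,
        ))
    return boxes


if __name__ == "__main__":
    heat = [
        [0.1, 0.2, 0.1, 0.0],
        [0.2, 0.9, 0.3, 0.0],
        [0.1, 0.3, 0.2, 0.6],
        [0.0, 0.0, 0.1, 0.2],
    ]
    sizes = [[(2.0, 1.0)] * 4 for _ in range(4)]
    for box in decode_boxes(local_peaks(heat), sizes):
        print(box)
```

Because each surviving cell is already the strongest response in its neighbourhood, no pairwise box-overlap comparison is needed afterwards. A real detector would typically run this per class on a tensor library and keep only the top-scoring peaks; those details are left out here.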