Abstract: Recent deep learning-based multi-view people detection (MVD) methods have shown promising results on existing datasets. However, current methods are mainly trained and evaluated on small, single scenes with a limited number of multi-view frames and fixed camera views. As a result, these methods may not be practical for detecting people in larger, more complex scenes with severe occlusions and camera calibration errors. This paper focuses on improving multi-view people detection by developing a supervised view-wise contribution weighting approach that better fuses multi-camera information under large scenes. Besides, a large synthetic dataset is adopted to enhance the model's generalization ability and enable more practical evaluation and comparison. The model's performance on new testing scenes is further improved with a simple domain adaptation technique. Experimental results demonstrate the effectiveness of our approach in achieving promising cross-scene multi-view people detection performance. See code here: https://vcc.tech/research/2024/MVD.

What problem does this paper attempt to address?

This paper addresses the following problem: current deep learning-based multi-view people detection (MVD) methods are trained and evaluated on datasets with small scenes, a limited number of frames, and fixed camera viewpoints, which can lead to poor performance in larger and more complex real-world scenarios. Specifically, existing methods have three main limitations:

1. **Limited scene scale**: Existing MVD methods are mainly evaluated in small scenes of about 20 meters by 20 meters, while scenes in practical applications may be much larger, with more severe occlusions and camera calibration errors.
2. **Limited data volume and camera viewpoints**: Existing datasets contain a relatively small number of frames (for example, the Wildtrack dataset has only a few hundred frames), and the camera viewpoints are fixed (for example, Wildtrack has 7 viewpoints and MultiviewX has 6 viewpoints). This restricts thorough verification and comparison of different methods.
3. **Poor generalization ability**: Existing methods are trained on a single scene and tend to overfit to specific camera layouts, so they generalize poorly to new, unseen scenes and different camera layouts.

To solve these problems, the paper proposes a supervised view-wise contribution weighting method that better fuses multi-camera information, especially in large-scale scenes. In addition, the authors use a large-scale synthetic dataset to enhance the model's generalization ability, and further improve its performance on new test scenes with a simple domain adaptation technique.

### Specific problem descriptions

- **Multi-view people detection in large-scale scenes**: How to achieve accurate people detection in larger and more complex scenes, especially in the presence of severe occlusions and camera calibration errors.
- **Improving the generalization ability of the model**: How to make the model adapt to new, unseen scenes and different camera layouts, rather than being limited to the single scene used during training.
- **Limitations of datasets**: How to overcome the limitations of existing datasets in terms of scene scale, number of frames, and camera viewpoints, so that different MVD methods can be evaluated and compared more comprehensively.

By solving these problems, the paper aims to extend multi-view people detection to more challenging and practical application scenarios.
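To make the idea of view-wise contribution weighting more concrete, the following is a minimal NumPy sketch of how per-view features might be fused on a shared ground plane. It is an illustration under stated assumptions, not the paper's implementation: the function name `fuse_views`, the softmax normalization across views, and the visibility mask are all hypothetical choices made here, and the network that predicts the contribution scores (and the supervision applied to it) is omitted entirely.

```python
import numpy as np


def fuse_views(view_feats, view_scores, valid_mask):
    """Fuse per-view ground-plane features with per-cell view weights.

    Hypothetical sketch; names and normalization are assumptions.

    view_feats:  (V, C, H, W) features of each camera, already projected
                 onto a common ground-plane grid.
    view_scores: (V, H, W) unnormalized contribution score of each view
                 at each ground-plane cell.
    valid_mask:  (V, H, W) nonzero where the camera actually sees the cell.
    """
    # Cells a camera cannot see must not contribute to the fusion.
    scores = np.where(valid_mask > 0, view_scores, -np.inf)

    # Numerically stable softmax across the view axis.
    max_s = np.max(scores, axis=0, keepdims=True)
    max_s = np.where(np.isfinite(max_s), max_s, 0.0)
    exp = np.exp(scores - max_s)  # masked entries become exactly 0
    denom = exp.sum(axis=0, keepdims=True)

    # Cells seen by no camera keep all-zero weights instead of NaN.
    weights = np.divide(exp, denom, out=np.zeros_like(exp), where=denom > 0)

    # Weighted sum over views gives one fused feature map.
    fused = (weights[:, None] * view_feats).sum(axis=0)
    return fused, weights
```

With equal scores, a cell seen by several cameras receives the average of their features, while a cell seen by only one camera takes that camera's features unchanged. In a large scene, where each camera covers only part of the ground plane, this kind of per-cell weighting keeps distant or non-overlapping views from diluting the fused map.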

Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting
