Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)

Yunzhong Hou, Liang Zheng

DOI: https://doi.org/10.48550/arXiv.2108.05888

2021-08-13

Abstract: Multiview detection incorporates multiple camera views to deal with occlusions, and its central problem is multiview aggregation. Given feature map projections from multiple views onto a common ground plane, the state-of-the-art method addresses this problem via convolution, which applies the same calculation regardless of object locations. However, such translation-invariant behaviors might not be the best choice, as object features undergo various projection distortions according to their positions and cameras. In this paper, we propose a novel multiview detector, MVDeTr, that adopts a newly introduced shadow transformer to aggregate multiview information. Unlike convolutions, shadow transformer attends differently at different positions and cameras to deal with various shadow-like distortions. We propose an effective training scheme that includes a new view-coherent data augmentation method, which applies random augmentations while maintaining multiview consistency. On two multiview detection benchmarks, we report new state-of-the-art accuracy with the proposed system. Code is available at https://github.com/hou-yz/MVDeTr.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the multiview aggregation problem in multiview detection. The state-of-the-art method uses convolution to process feature maps projected from multiple views onto a common ground plane. Convolution is translation-invariant: it applies the same calculation at every location, regardless of position or camera. However, because object features undergo different projection distortions depending on their positions and cameras, this translation-invariant behavior may not be the best choice. The paper therefore proposes a new multiview detector, MVDeTr, which uses a newly introduced shadow transformer to aggregate multiview information. The shadow transformer attends differently at different positions and cameras to deal with the various shadow-like distortions. The paper also proposes an effective training scheme that includes a new view-coherent data augmentation method, which applies random augmentations while maintaining multiview consistency. With these components, the paper reports new state-of-the-art accuracy on two multiview detection benchmarks.
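
The idea of applying random augmentations while maintaining multiview consistency can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes each camera view comes with a 3x3 image-to-ground homography, and that an augmentation is a random scale and shift expressed as a 3x3 affine matrix. The function names, the example homography `H`, and the sample pixel coordinates are all hypothetical. Under those assumptions, composing the homography with the inverse of the augmentation keeps the ground-plane projection unchanged.

```python
import numpy as np

def random_affine(rng, max_shift=20.0, scale_range=(0.8, 1.2)):
    """Sample a random scale-and-shift augmentation as a 3x3 matrix in pixel coordinates."""
    s = rng.uniform(*scale_range)
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    return np.array([[s, 0.0, tx],
                     [0.0, s, ty],
                     [0.0, 0.0, 1.0]])

def apply_homography(H, pts):
    """Map an (N, 2) array of points through a 3x3 homography."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coordinates
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]                   # back to Cartesian coordinates

def augmented_homography(H_img2ground, A):
    """Image-to-ground homography that is valid for the augmented image.

    A pixel p in the original image moves to A @ p, so an augmented pixel
    is first mapped back with inv(A) and then projected to the ground plane.
    """
    return H_img2ground @ np.linalg.inv(A)

rng = np.random.default_rng(0)

# Hypothetical image-to-ground homography for one camera (assumption, not from the paper).
H = np.array([[0.9, 0.1, 5.0],
              [-0.05, 1.1, 2.0],
              [1e-4, 2e-4, 1.0]])

# Hypothetical foot-point pixel locations of two pedestrians in this view.
foot_px = np.array([[320.0, 400.0], [100.0, 250.0]])

A = random_affine(rng)                      # augmentation applied to the image
foot_px_aug = apply_homography(A, foot_px)  # annotations move with the image
H_aug = augmented_homography(H, A)          # projection is adjusted to match

ground_before = apply_homography(H, foot_px)
ground_after = apply_homography(H_aug, foot_px_aug)
consistent = np.allclose(ground_before, ground_after)
```

Because each view can draw its own `A` while its homography is corrected accordingly, the views still agree on the ground plane after augmentation, which is the property the summary refers to as multiview consistency.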

Multi-View Domain Adaptive Object Detection on Camera Networks

Multiview Detection with Feature Perspective Transformation

Multiview Transformers for Video Recognition

Query-Based Multiview Detection for Multiple Visual Sensor Networks

Voxelized 3D Feature Aggregation for Multiview Detection

DVPE: Divided View Position Embedding for Multi-View 3D Object Detection

Multi-view Aggregation for Real-Time Accurate Object Detection of a Moving Camera

Learning to Learn Multiview Detection by Camera-Aware Attention

MV-DETR: Multi-modality indoor object detection by Multi-View DEtecton TRansformers

UVaT: Uncertainty Incorporated View-Aware Transformer for Robust Multi-View Classification

DVANet: Disentangling View and Action Features for Multi-View Action Recognition

Efficient Multi-View Fusion and Flexible Adaptation to View Missing in Cardiovascular System Signals

M&M3D: Multi-Dataset Training and Efficient Network for Multi-view 3D Object Detection

A Multiview Approach to Robust Detection in the Presence of Cast Shadows

3M3D: Multi-view, Multi-path, Multi-representation for 3D Object Detection

M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers

MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

Multi-View Transformer for 3D Visual Grounding

MVAFG: Multiview Fusion and Advanced Feature Guidance Change Detection Network for Remote Sensing Images

Multi-view and multi-augmentation for self-supervised visual representation learning