Abstract: While motion has garnered attention in various tasks, its potential as a modality for weakly-supervised object detection (WSOD) in static images remains unexplored. Our study introduces an approach to enhance WSOD methods by integrating motion information. This method involves leveraging hallucinated motion from static images to improve WSOD on image datasets, utilizing a Siamese network for enhanced representation learning with motion, addressing camera motion through motion normalization, and selectively training on images based on object motion. Experimental validation on the COCO and YouTube-BB datasets demonstrates improvements over a state-of-the-art method.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to enhance weakly-supervised object detection (WSOD) in static images by introducing motion information. Existing WSOD methods rely mainly on the appearance information in RGB images and do not exploit motion. The authors argue that motion provides temporal cues that complement appearance and can therefore help localize objects more accurately. However, real motion cannot be observed directly in a static image, so the authors propose to enhance WSOD with "hallucinated" motion.

### Main problems and solutions

1. **Introducing motion information**:
   - **Hallucinated motion**: Generate simulated motion from static images to compensate for the absence of real motion data.
   - **Siamese network**: Use a Siamese network for contrastive learning, pairing RGB images with their hallucinated motion to improve representation learning.
   - **Motion normalization**: Reduce the interference caused by camera motion so that the extracted motion better reflects object motion.
2. **Selecting training images with significant motion**:
   - To further improve performance, the authors train only on images that contain significant object motion, reducing the noise introduced by images with low-quality motion or no motion.
3. **Experimental verification**:
   - Experiments on the COCO and YouTube-BB datasets show that adding motion information improves detection performance over a state-of-the-art WSOD method.
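The summary does not say how motion normalization or the motion-based selection is implemented. The sketch below is a minimal illustration under stated assumptions, not the paper's method: it treats the per-image median flow vector as camera motion and subtracts it, then keeps images whose mean residual flow magnitude exceeds a threshold. The flow representation, the names `normalize_motion`, `motion_score` and `select_images`, and the threshold value are all hypothetical.

```python
import math
from statistics import median

# Hypothetical representation: a (hallucinated) flow field is a flat list of
# (dx, dy) vectors, one per pixel.
Flow = list[tuple[float, float]]


def normalize_motion(flow: Flow) -> Flow:
    """Assumed camera-motion compensation: subtract the per-image median flow
    vector, treating it as a global (camera) translation."""
    mx = median(dx for dx, _ in flow)
    my = median(dy for _, dy in flow)
    return [(dx - mx, dy - my) for dx, dy in flow]


def motion_score(flow: Flow) -> float:
    """Hypothetical object-motion score: mean magnitude of the normalized flow."""
    residual = normalize_motion(flow)
    return sum(math.hypot(dx, dy) for dx, dy in residual) / len(residual)


def select_images(flows: dict[str, Flow], threshold: float) -> list[str]:
    """Hypothetical selection rule: keep image ids whose score exceeds `threshold`."""
    return [img for img, f in flows.items() if motion_score(f) > threshold]


# Usage: a pure camera pan (every pixel moves by (2, 0)) versus a small
# object moving by (3, 4) over a static background.
pan = [(2.0, 0.0)] * 8
obj = [(0.0, 0.0)] * 6 + [(3.0, 4.0)] * 2
print(select_images({"pan": pan, "obj": obj}, threshold=0.5))
```

The median is used because it is robust when the background occupies most of the image. If an object covers more than half of the pixels, its own motion would be subtracted as "camera motion", so a practical system would likely need something more careful, such as fitting a parametric camera-motion model.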
### Formula summary

- **Detection and classification scores** for region i and class c:
  \[ v_{\text{det}}^{i,c} = (w_{\text{det}}^{c})^{\top} \phi(v_i) + b_{\text{det}}^{c} \]
  \[ v_{\text{cls}}^{i,c} = (w_{\text{cls}}^{c})^{\top} \phi(v_i) + b_{\text{cls}}^{c} \]
- **Probability conversion** (detection: softmax over the R regions; classification: softmax over the C classes):
  \[ p_{\text{det}}^{i,c} = \frac{\exp(v_{\text{det}}^{i,c})}{\sum_{k=1}^{R} \exp(v_{\text{det}}^{k,c})} \]
  \[ p_{\text{cls}}^{i,c} = \frac{\exp(v_{\text{cls}}^{i,c})}{\sum_{k=1}^{C} \exp(v_{\text{cls}}^{i,k})} \]
- **Image-level prediction**:
  \[ \hat{p}_c = \sigma\left(\sum_{i=1}^{R} p_{\text{det}}^{i,c}\, p_{\text{cls}}^{i,c}\right) \]
- **Multi-instance learning loss**:
  \[ L_{\text{mil}} = -\sum_{c=1}^{C}\left[y_c \log\hat{p}_c + (1 - y_c)\log(1 - \hat{p}_c)\right] \]
- **Cosine similarity**:
  \[ S(I, M) = \frac{\langle \psi_{\text{proj}}(I), \psi_{\text{proj}}(M) \rangle}{\rho} \]
- **NCE loss**:
  \[ L_{M\rightarrow I} = -\frac{1}{|B|}\sum_{(I, M)\in B}\log\frac{\exp(\cdots)}{\cdots} \]
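The MIL head and the similarity score listed in the formula summary can be sketched in plain Python. This is an illustration of those formulas, not the authors' code, and it rests on several assumptions: σ is taken to be the logistic sigmoid, the "cosine" similarity is implemented by explicitly L2-normalizing the projected embeddings, and ρ defaults to a hypothetical 0.1. All function names are hypothetical, and the truncated NCE loss is not implemented.

```python
import math

Matrix = list[list[float]]


def linear_scores(feats: Matrix, W: Matrix, b: list[float]) -> Matrix:
    """v^{i,c} = (w^c)^T phi(v_i) + b^c for every region i and class c.
    feats: R x D region features phi(v_i); W: C x D weights; b: C biases."""
    return [[sum(w * x for w, x in zip(W[c], f)) + b[c] for c in range(len(W))]
            for f in feats]


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mil_forward(v_det: Matrix, v_cls: Matrix, y: list[int]):
    """Return image-level scores p_hat (length C) and the MIL loss."""
    R, C = len(v_det), len(v_det[0])
    # p_det: softmax over the R regions, computed separately for each class c.
    p_det = [softmax([v_det[i][c] for i in range(R)]) for c in range(C)]  # [c][i]
    # p_cls: softmax over the C classes, computed separately for each region i.
    p_cls = [softmax(row) for row in v_cls]  # [i][c]
    # Image-level score: sum of region products, then sigma (assumed logistic).
    p_hat = [sigmoid(sum(p_det[c][i] * p_cls[i][c] for i in range(R)))
             for c in range(C)]
    # Binary cross-entropy summed over classes, as in L_mil.
    loss = -sum(y[c] * math.log(p_hat[c]) + (1 - y[c]) * math.log(1 - p_hat[c])
                for c in range(C))
    return p_hat, loss


def similarity(z_img: list[float], z_mot: list[float], rho: float = 0.1) -> float:
    """S(I, M): cosine similarity of the projected embeddings divided by rho.
    Explicit L2 normalization and the default rho are assumptions."""
    dot = sum(a * b for a, b in zip(z_img, z_mot))
    n_img = math.sqrt(sum(a * a for a in z_img))
    n_mot = math.sqrt(sum(b * b for b in z_mot))
    return dot / (n_img * n_mot * rho)


# Usage: two regions with 2-D features, two classes, identity weights.
# The same scores feed both streams here only to keep the example short.
feats = [[1.0, 0.0], [0.0, 1.0]]
scores = linear_scores(feats, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
p_hat, loss = mil_forward(scores, scores, y=[1, 0])
print(p_hat, loss)
```

Because each p_det column sums to 1 and every p_cls value is at most 1, the inner sum lies in [0, 1]; with a logistic σ, p̂_c is then confined to [0.5, σ(1)] and can never signal a confident negative. The paper's σ may therefore denote a different squashing or clipping function, which is worth checking against the original.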

Enhancing Weakly-Supervised Object Detection on Static Images through (Hallucinated) Motion