Enhancing Weakly-Supervised Object Detection on Static Images through (Hallucinated) Motion

Cagri Gungor, Adriana Kovashka
2024-09-15
Abstract: While motion has garnered attention in various tasks, its potential as a modality for weakly-supervised object detection (WSOD) in static images remains unexplored. Our study introduces an approach to enhance WSOD methods by integrating motion information. This method involves leveraging hallucinated motion from static images to improve WSOD on image datasets, utilizing a Siamese network for enhanced representation learning with motion, addressing camera motion through motion normalization, and selectively training images based on object motion. Experimental validation on the COCO and YouTube-BB datasets demonstrates improvements over a state-of-the-art method.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is how to enhance weakly-supervised object detection (WSOD) in static images by introducing motion information. Existing WSOD methods rely mainly on the appearance information in RGB images, which limits them on dynamic scenes. The authors argue that motion provides temporal cues complementary to appearance and can therefore help localize objects more accurately. Because real motion cannot be obtained directly from a static image, they propose improving WSOD with "hallucinated motion".

### Main problems and solutions

1. **Introducing motion information**:
   - **Hallucinated motion**: Generate simulated motion information from static images to make up for the lack of real motion data.
   - **Siamese network**: Use a Siamese network for contrastive learning that pairs RGB images with their hallucinated motion images to enhance representation learning.
   - **Motion normalization**: Reduce the interference caused by camera motion so that the extracted motion better reflects object motion.
2. **Selecting training images with significant motion**:
   - The authors propose a selection strategy based on object motion: only images containing significant object motion are used for training, which reduces the noise introduced by images with low-quality motion or no motion.
3. **Experimental verification**:
   - Experiments on the COCO and YouTube-BB datasets show that introducing motion information improves detection performance over a state-of-the-art method.
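This summary does not say how motion normalization or motion-based selection are computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the hallucinated motion is a dense flow field of shape `(H, W, 2)`, approximates camera motion by the per-image median flow vector, and treats an image as having "significant motion" when the mean residual flow magnitude exceeds a threshold. The function names and the `threshold` value are hypothetical.

```python
import numpy as np


def normalize_motion(flow: np.ndarray) -> np.ndarray:
    """Remove an estimate of camera motion from a flow field of shape (H, W, 2).

    Assumption: the per-image median flow vector stands in for camera motion.
    """
    camera = np.median(flow.reshape(-1, 2), axis=0)
    return flow - camera


def motion_score(flow: np.ndarray) -> float:
    """Mean magnitude of the flow that remains after normalization."""
    residual = normalize_motion(flow)
    return float(np.linalg.norm(residual, axis=-1).mean())


def select_images(flows, threshold: float = 0.5):
    """Return indices of images whose residual motion exceeds the threshold.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return [i for i, flow in enumerate(flows) if motion_score(flow) > threshold]
```

Under these assumptions, a uniform camera pan scores zero after normalization, while a frame with a moving object on a uniformly moving background keeps a non-zero score and is retained for training.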
### Formula summary

- **Detection and classification score calculation**:
  \[ v_{\text{det}}^{i,c} = w_{\text{det}}^{c\top} \phi(v_i) + b_{\text{det}}^c \]
  \[ v_{\text{cls}}^{i,c} = w_{\text{cls}}^{c\top} \phi(v_i) + b_{\text{cls}}^c \]
- **Probability conversion** (detection scores are normalized over the \(R\) region proposals, classification scores over the \(C\) classes):
  \[ p_{\text{det}}^{i,c} = \frac{\exp(v_{\text{det}}^{i,c})}{\sum_{k = 1}^R \exp(v_{\text{det}}^{k,c})} \]
  \[ p_{\text{cls}}^{i,c} = \frac{\exp(v_{\text{cls}}^{i,c})}{\sum_{k = 1}^C \exp(v_{\text{cls}}^{i,k})} \]
- **Image-level prediction**:
  \[ \hat{p}_c = \sigma\left(\sum_{i = 1}^R p_{\text{det}}^{i,c}p_{\text{cls}}^{i,c}\right) \]
- **Multi-instance learning loss**:
  \[ L_{\text{mil}} = -\sum_{c = 1}^C\left[y_c\log\hat{p}_c+(1 - y_c)\log(1 - \hat{p}_c)\right] \]
- **Cosine similarity calculation**:
  \[ S(I, M) = \frac{\langle\psi_{\text{proj}}(I),\psi_{\text{proj}}(M)\rangle}{\rho} \]
- **NCE loss**:
  \[ L_{M\rightarrow I}=-\frac{1}{|B|}\sum_{(I, M)\in B}\log\frac{\ex