STF: Spatio-Temporal Fusion Module for Improving Video Object Detection

Noreen Anwar, Guillaume-Alexandre Bilodeau, Wassim Bouachir
2024-02-16
Abstract: Consecutive frames in a video contain redundancy, but they may also contain relevant complementary information for the detection task. The objective of our work is to leverage this complementary information to improve detection. Therefore, we propose a spatio-temporal fusion framework (STF). We first introduce multi-frame and single-frame attention modules that allow a neural network to share feature maps between nearby frames to obtain more robust object representations. Second, we introduce a dual-frame fusion module that merges feature maps in a learnable manner to improve them. Our evaluation is conducted on three different benchmarks including video sequences of moving road users. The performed experiments demonstrate that the proposed spatio-temporal fusion module leads to improved detection performance compared to baseline object detectors. Code is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses object detection in videos, particularly in situations where single-frame detectors perform poorly because of occlusion, motion blur, or small object sizes. To address these issues, the authors propose a Spatio-Temporal Fusion framework (STF) that improves detection accuracy by leveraging complementary information across multiple frames. Specifically, the paper presents the following contributions:

1. **Multi-Frame Attention Module (MFA)**: Using temporal convolution, it applies a spatio-temporal attention mechanism to the extracted feature maps, assigning adaptive temporal weights to each frame to improve the detection of occluded or blurred objects.
2. **Single-Frame Attention Module (SFA)**: It weights the feature maps of the current frame along both the channel and spatial dimensions, which reduces the likelihood of false detections.
3. **Dual-Frame Fusion Module**: It fuses single-frame and multi-frame feature maps at different scales, improving detection accuracy under challenging conditions such as occlusion or motion blur.

Experimental results on three datasets (KITTI MOT, Cityscapes, and UAVDT) show that the method improves detection performance for objects at various scales compared with existing single-frame and multi-frame detectors.
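The three modules described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the fixed temporal kernel, the pooling-based gates, and the scalar `alpha` are all assumptions chosen for illustration. In the paper these components are learned inside a detector and applied at several feature scales, whereas the sketch works on a single scale with hand-set parameters.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def multi_frame_attention(frames, temporal_kernel):
    """Weight T frames adaptively and merge them (hypothetical MFA stand-in).

    frames: array of shape (T, C, H, W); temporal_kernel: shape (K,), K odd.
    """
    num_frames = frames.shape[0]
    k = len(temporal_kernel)
    pad = k // 2
    padded = np.pad(frames, ((pad, pad), (0, 0), (0, 0), (0, 0)), mode="edge")
    # 1-D temporal convolution: mix each frame with its temporal neighbours.
    mixed = np.stack([
        np.tensordot(temporal_kernel, padded[t:t + k], axes=([0], [0]))
        for t in range(num_frames)
    ])
    # One score per frame (global average pooling), normalised over time.
    scores = mixed.mean(axis=(1, 2, 3))
    weights = softmax(scores, axis=0)
    # Weighted sum of the original frames gives one (C, H, W) feature map.
    merged = np.tensordot(weights, frames, axes=([0], [0]))
    return merged, weights


def single_frame_attention(feat):
    """Gate one (C, H, W) feature map per channel, then per location."""
    channel_gate = sigmoid(feat.mean(axis=(1, 2)))[:, None, None]
    gated = feat * channel_gate
    spatial_gate = sigmoid(gated.mean(axis=0))[None, :, :]
    return gated * spatial_gate


def dual_frame_fusion(single, multi, alpha):
    """Blend the two maps; alpha stands in for a learnable parameter."""
    gate = sigmoid(alpha)
    return gate * single + (1.0 - gate) * multi


# Toy input: 3 consecutive frames, 4 channels, 8x8 feature maps.
rng = np.random.default_rng(0)
frames = rng.standard_normal((3, 4, 8, 8))
multi, weights = multi_frame_attention(frames, np.array([0.25, 0.5, 0.25]))
single = single_frame_attention(frames[-1])  # last frame = current frame
fused = dual_frame_fusion(single, multi, alpha=0.0)
```

With `alpha=0.0` the gate is 0.5, so the fused map is the plain average of the single-frame and multi-frame maps. A trained model would instead learn how much to trust each branch.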