Temporal Aggregation with Context Focusing for Few-Shot Video Object Detection

Wentao Han, Jie Lei, Fahong Wang, Zunlei Feng, Ronghua Liang
DOI: https://doi.org/10.1109/smc53992.2023.10394197
2023-01-01
Abstract: Few-shot video object detection aims to find all objects in a given query video that belong to the same class, given only a few support images of a target object from an unseen class. Because objects in video frames are often blurred or occluded, applying single-frame object detection directly greatly limits accuracy. The problem is significantly worse in few-shot settings, where both support and time-domain information are insufficient. In this paper, we propose a temporal aggregation with context focusing framework (TACF) for few-shot video object detection, which aims to make full use of the information shared between support images and adjacent video frames. The context focusing module encodes the target object in adjacent frames according to the support images. The temporal aggregation module then implicitly extracts the most similar ROI features from these adjacent frames to obtain the target proposals. Finally, the matching network determines the category and bounding box by computing the distance to the support images. Extensive experimental evaluations on the FSVOD and FSYTV datasets show that our method achieves more competitive results than image-based methods, naive video-based extensions, and the state-of-the-art few-shot video object detection method.
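The two ideas named in the abstract, aggregating similar ROI features across adjacent frames and classifying by distance to support images, can be illustrated with a minimal NumPy sketch. This is not the TACF implementation: the function names, the explicit top-k selection (the abstract describes the extraction as implicit), the mean-pooled support prototypes, and the cosine distance are all assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def aggregate_rois(key_roi, adjacent_rois, top_k=2):
    """Hypothetical aggregation: average the key-frame ROI feature with
    its top_k most cosine-similar ROI features from adjacent frames."""
    sims = l2_normalize(adjacent_rois) @ l2_normalize(key_roi)
    top = np.argsort(sims)[::-1][:top_k]  # indices of the most similar ROIs
    stacked = np.vstack([key_roi[None, :], adjacent_rois[top]])
    return stacked.mean(axis=0)

def match_to_support(proposal, support_feats):
    """Hypothetical matching: return the index of the class whose
    mean support feature is closest to the proposal in cosine distance."""
    prototypes = np.stack([f.mean(axis=0) for f in support_feats])
    dists = 1.0 - l2_normalize(prototypes) @ l2_normalize(proposal)
    return int(np.argmin(dists)), dists

# Toy 2-D features: one key-frame ROI and three adjacent-frame ROIs.
key_roi = np.array([1.0, 0.0])
adjacent_rois = np.array([[0.9, 0.1], [0.0, 1.0], [1.0, 0.05]])
aggregated = aggregate_rois(key_roi, adjacent_rois, top_k=2)

# Two classes, each with two support features.
support_feats = [
    np.array([[1.0, 0.0], [0.8, 0.2]]),
    np.array([[0.0, 1.0], [0.1, 0.9]]),
]
class_idx, dists = match_to_support(aggregated, support_feats)
```

In this toy setup the dissimilar adjacent ROI `[0.0, 1.0]` is excluded from the aggregate, so the proposal feature stays close to the key-frame object before it is compared with the support prototypes.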