A dual-branch hybrid network of CNN and transformer with adaptive keyframe scheduling for video semantic segmentation

Zhixue Liang, Wenyong Dong, Bo Zhang
DOI: https://doi.org/10.1007/s00530-024-01262-7
IF: 3.9
2024-02-22
Multimedia Systems
Abstract: Video semantic segmentation (VSS) plays a crucial role in real-world applications such as unmanned vehicles, autonomous robots, and augmented reality. Despite significant progress in this field, balancing accuracy and efficiency remains a challenge. This paper presents a novel dual-branch hybrid network of CNN and Transformer with adaptive keyframe scheduling (DHN–AKS) to achieve higher accuracy and faster inference for VSS. One branch uses a hierarchical Transformer to extract high-level features on keyframes, which benefits segmentation accuracy because Transformers are effective at modeling global semantic information. The other branch uses a lightweight feature network (ResNet-18) to extract low-level features on non-keyframes, which benefits segmentation efficiency. Moreover, we present a dynamically updated memory matrix that stores the significant semantic information of historical video frames, enabling the temporal relevance of the current frame to be explored through cross attention. Experiments on two benchmark data sets, Cityscapes and CamVid, demonstrate that the proposed framework achieves competitive accuracy and inference time compared with some previous state-of-the-art methods.
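The memory-based cross attention mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cross_attention`, `update_memory`), the absence of learned query/key/value projections, and the first-in-first-out memory update are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, memory):
    """Scaled dot-product cross attention.

    Each query row (a current-frame feature) attends over the memory rows,
    which act as both keys and values. Learned projections are omitted
    (an assumption made to keep the sketch short).
    """
    d = len(memory[0])
    scale = math.sqrt(d)
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in memory]
        weights = softmax(scores)
        # Weighted sum of memory rows, one output feature per query.
        out.append([sum(w * v[j] for w, v in zip(weights, memory)) for j in range(d)])
    return out

def update_memory(memory, new_rows, max_slots):
    """Hypothetical FIFO update: append new rows, keep only the newest max_slots."""
    return (memory + new_rows)[-max_slots:]

# Usage: two memory slots, one query from the current frame.
memory = [[1.0, 0.0], [0.0, 1.0]]
attended = cross_attention([[10.0, 0.0]], memory)
memory = update_memory(memory, [[2.0, 2.0]], max_slots=2)
```

In this toy setting the query aligns with the first memory slot, so the attended feature is dominated by that slot. A real system would operate on feature maps produced by the two branches and would likely use a learned update rule rather than a FIFO queue.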
computer science, information systems, theory & methods