A Multi-Layer Memory Sharing Network for Video Captioning

Tian-Zi Niu, Shan-Shan Dong, Zhen-Duo Chen, Xin Luo, Zi Huang, Shanqing Guo, Xin-Shun Xu
DOI: https://doi.org/10.1016/j.patcog.2022.109202
IF: 8
2023-01-01
Pattern Recognition
Abstract: Over the past several years, video captioning has received much attention in the computer vision and machine learning communities. Many models use an RNN-based decoder to generate sentences describing the content of a video. These models have made considerable progress; however, few methods adopt a decoder with more than three layers, because an RNN-based model with more layers may become hard to train, become time-consuming, or even deteriorate at a certain depth. To address this limitation, we propose a Multi-layer memory sharing Network, MesNet for short, which allows more layers to be stacked without compromising performance. In MesNet, we construct a novel memory sharing structure to strengthen the connections between layers and make the model easier to train. More specifically, we design an Enhanced Gated Recurrent Unit (En-GRU) and stack it to construct a deeper network. Unlike traditional RNN-based multi-layer networks, the memory states of all layers in MesNet are cross-used at each iteration to mimic the brain's complex connections. Extensive experiments on MSVD and MSR-VTT demonstrate that our method performs well and significantly outperforms some state-of-the-art methods. Our code is available at https://github.com/nbbb/MesNet. (c) 2022 Elsevier Ltd. All rights reserved.
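The abstract does not give the En-GRU equations or the exact rule by which memory states are cross-used, so the following is only a rough sketch of the general idea, not the authors' implementation. It stacks standard GRU cells (standing in for En-GRU) and, as an assumption made purely for illustration, feeds every layer the mean of all layers' hidden states from the previous time step alongside the output of the layer below. The names GRUCell, SharedMemoryStack, and the mean-based sharing rule are hypothetical and do not come from the paper or its repository.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """A standard GRU cell, used here as a stand-in for the paper's En-GRU."""

    def __init__(self, input_size, hidden_size, rng):
        scale = 1.0 / np.sqrt(hidden_size)
        shape = (hidden_size, input_size + hidden_size)
        # One weight matrix per gate, acting on [input, previous hidden state].
        self.W_z = rng.uniform(-scale, scale, shape)  # update gate
        self.W_r = rng.uniform(-scale, scale, shape)  # reset gate
        self.W_h = rng.uniform(-scale, scale, shape)  # candidate state
        self.b_z = np.zeros(hidden_size)
        self.b_r = np.zeros(hidden_size)
        self.b_h = np.zeros(hidden_size)

    def step(self, x, h_prev):
        xh = np.concatenate([x, h_prev])
        z = sigmoid(self.W_z @ xh + self.b_z)
        r = sigmoid(self.W_r @ xh + self.b_r)
        h_tilde = np.tanh(self.W_h @ np.concatenate([x, r * h_prev]) + self.b_h)
        return (1.0 - z) * h_prev + z * h_tilde


class SharedMemoryStack:
    """Stacked GRU layers that all read a summary of every layer's previous state."""

    def __init__(self, input_size, hidden_size, num_layers, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.cells = []
        for layer in range(num_layers):
            below = input_size if layer == 0 else hidden_size
            # Each layer's input is [signal from below, shared memory summary].
            self.cells.append(GRUCell(below + hidden_size, hidden_size, rng))

    def step(self, x, states):
        # ASSUMPTION: the shared memory is the mean of all layers' states
        # from the previous time step; the paper may use a different rule.
        shared = np.mean(states, axis=0)
        new_states = []
        below = x
        for cell, h_prev in zip(self.cells, states):
            h = cell.step(np.concatenate([below, shared]), h_prev)
            new_states.append(h)
            below = h
        return new_states

    def run(self, inputs):
        states = [np.zeros(self.hidden_size) for _ in range(self.num_layers)]
        for x in inputs:
            states = self.step(x, states)
        return states
```

In this sketch, calling SharedMemoryStack(input_size=4, hidden_size=3, num_layers=5).run(inputs) on an array of shape (T, 4) returns one hidden state per layer after T steps. The shared summary gives every layer a direct path to every other layer's memory, which is one plausible way to strengthen inter-layer connections in a deep recurrent stack.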