Abstract: In video captioning, many pioneering approaches generate higher-quality captions by exploring and adding new video feature modalities. However, as the number of modalities increases, negative interactions between them gradually erode the gain in caption quality. To address this problem, we propose a three-layer hierarchical attention network, built on a transformer with a bidirectional decoder, that enhances multimodal features. In the first layer, we apply a different encoder to each modality, chosen according to that modality's characteristics, to strengthen its vector representation. In the second layer, we select keyframes from all sampled frames of each modality by computing attention values between the words generated so far and each frame. In the third layer, we assign weights to the modalities to reduce redundancy between them before generating the current word. Additionally, the bidirectional decoder takes the context of the ground-truth caption into account when generating captions. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of the proposed model: it achieves state-of-the-art performance on major metrics, and the generated sentences are more in line with human language habits. Code is available at https://github.com/nickchen121/MHAN.
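
The second and third attention layers described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it assumes plain scaled dot-product attention, a single query vector standing in for the generated words, and modality features that have already been encoded to the same dimension as the query. The names `attend` and `hierarchical_attention`, and the toy modalities `appearance` and `motion`, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, vectors):
    """Scaled dot-product attention of one query over a list of vectors.

    Returns the weighted sum of the vectors and the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in vectors])
    dim = len(vectors[0])
    context = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return context, weights


def hierarchical_attention(query, modalities):
    """Frame-level attention inside each modality, then modality-level attention.

    `modalities` maps a modality name to its list of frame vectors.
    Assumption: every frame vector has the same dimension as `query`.
    """
    names = list(modalities)
    frame_weights = {}
    contexts = []
    for name in names:
        # Second layer (sketch): weight the frames of one modality by the query.
        context, weights = attend(query, modalities[name])
        frame_weights[name] = weights
        contexts.append(context)
    # Third layer (sketch): weight the per-modality summaries against each other.
    fused, weights = attend(query, contexts)
    modality_weights = dict(zip(names, weights))
    return fused, frame_weights, modality_weights


# Toy input: two modalities with two 2-dimensional frame vectors each.
query = [1.0, 0.0]
modalities = {
    "appearance": [[1.0, 0.0], [0.0, 1.0]],
    "motion": [[0.5, 0.5], [0.5, 0.5]],
}
fused, frame_weights, modality_weights = hierarchical_attention(query, modalities)
```

In this toy input the first `appearance` frame aligns with the query and so receives the larger frame weight, while the two identical `motion` frames are weighted equally. The per-modality encoders of the first layer and the bidirectional decoder are deliberately left out of the sketch.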

HMNet: a Hierarchical Multi-Modal Network for Educational Video Concept Prediction

Following the Lecturer: Hierarchical Knowledge Concepts Prediction for Educational Videos

Multimodal-enhanced hierarchical attention network for video captioning

HCMS: Hierarchical and Conditional Modality Selection for Efficient Video Recognition

Adaptive Hierarchical Motion-Focused Model for Video Prediction

HmcNet: A General Approach for Hierarchical Multi-Label Classification

Online video visual relation detection with hierarchical multi-modal fusion

Predictive Coding Based Multiscale Network with Encoder-Decoder LSTM for Video Prediction

Hierarchical Gate Network for Fine-Grained Visual Recognition

Semantic Guided Level-Category Hybrid Prediction Network for Hierarchical Image Classification

Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning

Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering

HiNet: Novel Multi-Scenario & Multi-Task Learning with Hierarchical Information Extraction

DHFNet: Decoupled Hierarchical Fusion Network for RGB-T dense prediction tasks

MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization

HSNet: an Intelligent Hierarchical Semantic-Aware Network System for Real-Time Semantic Segmentation

Consistency-aware Multi-modal Network for Hierarchical Multi-label Classification in Online Education System

Multiple Hypergraph Ranking for Video Concept Detection

CTHFNet: contrastive translation and hierarchical fusion network for text–video–audio sentiment analysis

Video Frame Prediction by Deep Multi-Branch Mask Network