Recurrent convolutional video captioning with global and local attention.

Tao Jin,Yingming Li,Zhongfei Zhang

DOI: https://doi.org/10.1016/j.neucom.2019.08.042

IF: 6

2019-01-01

Neurocomputing

Abstract:
• We propose a novel video captioning model with global-local attention.
• We combine LSTM and 1D CNN in the decoder.
• Our model outperforms the state-of-the-art on MSVD and MSR-VTT.

Learning Multimodal Attention LSTM Networks for Video Captioning.

Jun Xu,Ting Yao,Yongdong Zhang,Tao Mei

DOI: https://doi.org/10.1145/3123266.3123448

2017-01-01

Abstract:Automatic generation of video captions is a challenging task, as video is an information-intensive medium with complex variations. Most existing methods, either based on language templates or sequence learning, have treated video as a flat data sequence while ignoring its intrinsic multimodal nature. Observing that different modalities (e.g., frame, motion, and audio streams), as well as the elements within each modality, contribute differently to the sentence generation, we present a novel deep framework to boost video captioning by learning Multimodal Attention Long-Short Term Memory networks (MA-LSTM). Our proposed MA-LSTM fully exploits both multimodal streams and temporal attention to selectively focus on specific elements during the sentence generation. Moreover, we design a novel child-sum fusion unit in the MA-LSTM to effectively combine different encoded modalities into the initial decoding states. Different from existing approaches that employ the same LSTM structure for different modalities, we train modality-specific LSTMs to capture the intrinsic representations of individual modalities. The experiments on two benchmark datasets (MSVD and MSR-VTT) show that our MA-LSTM significantly outperforms the state-of-the-art methods, with 52.3 BLEU@4 and 70.4 CIDEr-D on the MSVD dataset, respectively.
Image Caption with Global-Local Attention

Linghui Li,Sheng Tang,Lixi Deng,Yongdong Zhang,Qi Tian

DOI: https://doi.org/10.1609/aaai.v31i1.11236

2017-01-01

Proceedings of the AAAI Conference on Artificial Intelligence

Abstract:Image captioning is becoming important in the field of artificial intelligence. Most existing methods based on the CNN-RNN framework suffer from the problems of object missing and misprediction due to the mere use of global representation at image-level. To address these problems, in this paper, we propose a global-local attention (GLA) method by integrating local representation at object-level with global representation at image-level through an attention mechanism. Thus, our proposed method can pay more attention to how to predict the salient objects more precisely with high recall while keeping context information at image-level concurrently. Therefore, our proposed GLA method can generate more relevant sentences, and achieve the state-of-the-art performance on the well-known Microsoft COCO caption dataset with several popular metrics.
Video Captioning Using Global-Local Representation

Liqi Yan,Siqi Ma,Qifan Wang,Yingjie Chen,Xiangyu Zhang,Andreas Savakis,Dongfang Liu

DOI: https://doi.org/10.1109/tcsvt.2022.3177320

IF: 5.859

2022-10-08

IEEE Transactions on Circuits and Systems for Video Technology

Abstract:Video captioning is a challenging task as it needs to accurately transform visual understanding into natural language description. To date, state-of-the-art methods inadequately model global-local vision representation for sentence generation, leaving plenty of room for improvement. In this work, we approach the video captioning task from a new perspective and propose a GLR framework, namely a global-local representation granularity. Our GLR demonstrates three advantages over the prior efforts. First, we propose a simple solution, which exploits extensive vision representations from different video ranges to improve linguistic expression. Second, we devise a novel global-local encoder, which encodes different video representations including long-range, short-range and local-keyframe, to produce rich semantic vocabulary for obtaining a descriptive granularity of video contents across frames. Finally, we introduce the progressive training strategy which can effectively organize feature learning to incur optimal captioning behavior. Evaluated on the MSR-VTT and MSVD dataset, we outperform recent state-of-the-art methods including a well-tuned SA-LSTM baseline by a significant margin, with shorter training schedules. Because of its simplicity and efficacy, we hope that our GLR could serve as a strong baseline for many video understanding tasks besides video captioning. Code will be available.

engineering, electrical & electronic
Divided Caption Model with Global Attention

Yamin Chen,Hancong Dua,Zitian Zhao,Zhi Wang

DOI: https://doi.org/10.1145/3461353.3461386

2021-01-01

Abstract:Dense video captioning is a newly emerging task that aims at both locating and describing all events in a video. We identify and tackle two challenges on this task, namely, 1) the limitation of attending only to local features; 2) the severely degraded description and increased training complexity caused by redundant information. In this paper, we propose a new divided caption model, where two different attention mechanisms are introduced to rectify the captioning process in a unified framework. First, we employ a global attention mechanism to encode video features in the proposal module, which can obtain a better temporal boundary. Second, we design a bidirectional long short-term memory (LSTM) with a common-attention mechanism to counterpoise 3D convolutional neural network (C3D) features and global attention video content effectively in the caption module to generate coherent natural language descriptions. Besides, we divide forward and backward video features in an event into segments to relieve the stress of degraded description and increased complexity. Extensive experiments demonstrate the competitive performance of the proposed Divided Caption Model with Global Attention (DCM-GA) over state-of-the-art methods on the ActivityNet Captions dataset.
Video Captioning With Attention-Based LSTM and Semantic Consistency

Lianli Gao,Zhao Guo,Hanwang Zhang,Xing Xu,Heng Tao Shen

DOI: https://doi.org/10.1109/TMM.2017.2729019

IF: 7.3

2022-03-15

IEEE Transactions on Multimedia

Abstract:Recent progress in using long short-term memory (LSTM) for image captioning has motivated the exploration of its applications for video captioning. By taking a video as a sequence of features, an LSTM model is trained on video-sentence pairs and learns to associate a video with a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without considering an attention mechanism which allows for selecting salient features. Furthermore, existing approaches usually model the translating error, but ignore the correlations between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, to transfer videos to natural sentences. This framework integrates an attention mechanism with LSTM to capture salient structures of video, and explores the correlation between multimodal representations (i.e., words and visual content) for generating sentences with rich semantic content. Specifically, we first propose an attention mechanism that uses the dynamic weighted sum of local two-dimensional convolutional neural network representations. Then, an LSTM decoder takes these visual features at time t and the word-embedding feature at time t-1 to generate important words. Finally, we use multimodal embedding to map the visual and sentence features into a joint space to guarantee the semantic consistency of the sentence description and the video visual content. Experiments on the benchmark datasets demonstrate that our method using a single feature can achieve competitive or even better results than the state-of-the-art baselines for video captioning in both BLEU and METEOR.

computer science, information systems,telecommunications, software engineering
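
As a rough illustration of the attention step the aLSTMs abstract above describes (a dynamic weighted sum of local CNN features), the following minimal Python sketch computes softmax weights over per-frame feature vectors and returns their weighted sum. The dot-product scoring function and the names `softmax` and `attend` are assumptions made for illustration; the abstract does not specify how the weights are computed.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    frame_features: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Soft attention over frames.

    frame_features: one feature vector per frame (all the same length).
    query: the decoder state used to score frames (hypothetical choice:
           a plain dot product; real models often use a learned scorer).
    Returns (weights, context), where context is the weighted sum of frames.
    """
    # Score each frame against the query.
    scores = [sum(f_d * q_d for f_d, q_d in zip(f, query)) for f in frame_features]
    # Normalise the scores so the weights sum to 1.
    weights = softmax(scores)
    # Weighted sum of the frame features, dimension by dimension.
    dim = len(frame_features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, frame_features)) for d in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]
    weights, context = attend(frames, query=[5.0, 0.0])
    print(weights, context)
```

Because the weights are recomputed from the current decoder state at every step, the visual context can change from word to word, which is what distinguishes this from compressing the video into one static vector.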
Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention

Kuncheng Fang,Lian Zhou,Cheng Jin,Yuejie Zhang,Kangnian Weng,Tao Zhang,Weiguo Fan

DOI: https://doi.org/10.1609/aaai.v33i01.33018271

2019-01-01

Abstract:Automatically generating natural language descriptions for video is an extremely complicated and challenging task. To tackle the obstacles of the traditional LSTM-based model for video captioning, we propose a novel architecture to generate the optimal descriptions for videos, which focuses on constructing a new network structure that can generate sentences superior to the basic model with LSTM, and establishing special attention mechanisms that can provide more useful visual information for caption generation. This scheme discards the traditional LSTM, and exploits the fully convolutional network with coarse-to-fine and inherited attention designed according to the characteristics of the fully convolutional structure. Our model can not only outperform the basic LSTM-based model, but also achieve performance comparable to that of state-of-the-art methods.
Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

Hao Liu,Yang Yang,Fumin Shen,Lixin Duan,Heng Tao Shen

DOI: https://doi.org/10.48550/arXiv.1612.04949

2016-12-15

Computer Vision and Pattern Recognition

Abstract:Along with the prosperity of recurrent neural networks in modelling sequential data and the power of the attention mechanism in automatically identifying salient information, image captioning, a.k.a. image description, has been remarkably advanced in recent years. Nonetheless, most existing paradigms may suffer from two deficiencies: a lack of invariance to images with different scaling, rotation, etc.; and the absence of an effective integration of standalone attention to form a holistic end-to-end system. In this paper, we propose a novel image captioning architecture, termed Recurrent Image Captioner (RIC), which allows the visual encoder and language decoder to coherently cooperate in a recurrent manner. Specifically, we first equip the CNN-based visual encoder with a differentiable layer to enable spatially invariant transformation of visual signals. Moreover, we deploy an attention filter module (differentiable) between encoder and decoder to dynamically determine salient visual parts. We also employ a bidirectional LSTM to preprocess sentences for generating better textual representations. Besides, we propose to exploit variational inference to optimize the whole architecture. Extensive experimental results on three benchmark datasets (i.e., Flickr8k, Flickr30k and MS COCO) demonstrate the superiority of our proposed architecture as compared to most of the state-of-the-art methods.
Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

Jingkuan Song,Zhao Guo,Lianli Gao,Wu Liu,Dongxiang Zhang,Heng Tao Shen

DOI: https://doi.org/10.48550/arXiv.1706.01231

2017-06-05

Abstract:Recent progress has been made in using the attention-based encoder-decoder framework for video captioning. However, most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). These non-visual words can be easily predicted using a natural language model without considering visual signals or attention. Imposing the attention mechanism on non-visual words could mislead and decrease the overall performance of video captioning. To address this issue, we propose a hierarchical LSTM with adjusted temporal attention (hLSTMat) approach for video captioning. Specifically, the proposed framework utilizes the temporal attention for selecting specific frames to predict the related words, while the adjusted temporal attention is for deciding whether to depend on the visual information or the language context information. Also, a hierarchical LSTM is designed to simultaneously consider both low-level visual information and high-level language context information to support the video caption generation. To demonstrate the effectiveness of our proposed framework, we test our method on two prevalent datasets, MSVD and MSR-VTT, and experimental results show that our approach outperforms the state-of-the-art methods on both datasets.

Computer Vision and Pattern Recognition
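
The hLSTMat abstract above says the adjusted temporal attention decides whether to rely on visual information or on language context. One common way to express such a decision is a scalar gate that blends the two vectors; the sketch below shows that idea in Python. The sigmoid gate, the linear blend, and the names `sigmoid` and `adjusted_context` are assumptions for illustration and are not taken from the abstract.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adjusted_context(
    visual_context: Sequence[float],
    language_context: Sequence[float],
    gate_logit: float,
) -> List[float]:
    """Blend visual and language context with a scalar gate.

    gate_logit is a hypothetical pre-activation score; in a trained model
    it would be computed from the decoder state. A gate near 1 favours the
    visual context (e.g. for a word like "gun"), and a gate near 0 favours
    the language context (e.g. for a word like "the").
    """
    beta = sigmoid(gate_logit)
    return [
        beta * v + (1.0 - beta) * l
        for v, l in zip(visual_context, language_context)
    ]


if __name__ == "__main__":
    visual = [2.0, 4.0]
    language = [0.0, 0.0]
    print(adjusted_context(visual, language, gate_logit=0.0))
    print(adjusted_context(visual, language, gate_logit=10.0))
```

With a gate logit of zero the two sources are averaged; as the logit grows, the output approaches the visual context alone.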
Local-global Visual Interaction Attention for Image Captioning

Changzhi Wang,Xiaodong Gu

DOI: https://doi.org/10.1016/j.dsp.2022.103707

IF: 2.92

2022-01-01

Digital Signal Processing

Abstract:Image captioning is a typical cross-modal task, which aims to automatically describe the main content of an image with a complete and natural sentence. Existing attention-based approaches treat the local feature and global feature in the image individually, neglecting the intrinsic interaction between them that provides important guidance for generating captions. To alleviate the above issue, in this work we propose a novel Local-Global Visual Interaction Attention (LGVIA) structure that explores the intrinsic interactions between the local feature and global feature in the image. Specifically, we devise a new visual interaction graph network that mainly consists of a visual interaction encoding module and a visual interaction fusion module. The former implicitly encodes the visual relationships between the local feature and global feature to obtain an enhanced visual representation containing rich local-global feature relationships. The latter fuses the previously obtained multiple relationship features to further enrich different-level relationship attribute information. In addition, we introduce a new relationship-attention-based LSTM module to guide the word generation by dynamically focusing on the previously output fusion relationship information. Extensive qualitative and quantitative experimental results show the superiority of our LGVIA approach on the large-scale MSCOCO dataset. More remarkably, LGVIA outperforms the related state-of-the-art methods on the small-scale Flickr30k dataset.
CC-LSTM: Cross and Conditional Long-Short Time Memory for Video Captioning

Jiangbo Ai,Yang Yang,Xing Xu,Jie Zhou,Heng Tao Shen

DOI: https://doi.org/10.1007/978-3-030-68780-9_30

2020-01-01

Abstract:Automatically generating natural language descriptions for in-the-wild videos is a challenging task. Most recent progress in this field has been made through the combination of Convolutional Neural Networks (CNNs) and Encoder-Decoder Recurrent Neural Networks (RNNs). However, the existing Encoder-Decoder RNN framework has difficulty in capturing a large number of long-range dependencies as the number of LSTM units increases. This brings a vast information loss and leads to poor performance for our task. To explore this problem, in this paper, we propose a novel framework, namely Cross and Conditional Long Short-Term Memory (CC-LSTM). It is composed of a novel Cross Long Short-Term Memory (Cr-LSTM) for the encoding module and Conditional Long Short-Term Memory (Co-LSTM) for the decoding module. In the encoding module, the Cr-LSTM encodes the visual input into a richly informative representation by a cross-input method. In the decoding module, the Co-LSTM feeds the visual features, which are based on the generated sentence and contain the global information of the visual content, into the LSTM unit as an extra visual feature. For the task of video captioning, extensive experiments are conducted on two public datasets, i.e., MSVD and MSR-VTT. Along with visualizing the results and how our model works, these experiments quantitatively demonstrate the effectiveness of the proposed CC-LSTM on translating videos to sentences with rich semantics.
Experimentelle Untersuchungen am Ctenophorenei

A. Fischel

DOI: https://doi.org/10.1007/BF02156722

1897-12-01

Archiv für Entwicklungsmechanik der Organismen

Abstract:
Image Captioning with Local-Global Visual Interaction Network.

Changzhi Wang,Xiaodong Gu

DOI: https://doi.org/10.1007/978-981-99-1645-0_38

2022-01-01

Abstract:Existing attention-based image captioning approaches treat the local feature and global feature in the image individually, neglecting the intrinsic interaction between them that provides important guidance for generating captions. To alleviate the above issue, in this paper we propose a novel Local-Global Visual Interaction Network (LGVIN) that explores the interactions between the local feature and global feature. Specifically, we devise a new visual interaction graph network that mainly consists of a visual interaction encoding module and a visual interaction fusion module. The former implicitly encodes the visual relationships between the local feature and global feature to obtain an enhanced visual representation containing rich local-global feature relationships. The latter fuses the previously obtained multiple relationship features to further enrich different-level relationship attribute information. In addition, we introduce a new relationship-attention-based LSTM module to guide the word generation by dynamically focusing on the previously output fusion relationship information. Extensive experimental results show the superiority of our LGVIN approach, and our model clearly outperforms the current similar relationship-based image captioning methods.
Multimodal Semantic Attention Network for Video Captioning

Liang Sun,Bing Li,Chunfeng Yuan,Zhengjun Zha,Weiming Hu

DOI: https://doi.org/10.1109/icme.2019.00226

2019-01-01

Abstract:Inspired by the fact that different modalities in videos carry complementary information, we propose a Multimodal Semantic Attention Network (MSAN), which is a new encoder-decoder framework incorporating multimodal semantic attributes for video captioning. In the encoding phase, we detect and generate multimodal semantic attributes by formulating the task as a multi-label classification problem. Moreover, we add an auxiliary classification loss to our model that can obtain more effective visual features and high-level multimodal semantic attribute distributions for sufficient video encoding. In the decoding phase, we extend each weight matrix of the conventional LSTM to an ensemble of attribute-dependent weight matrices, and employ an attention mechanism to pay attention to different attributes at each time step of the captioning process. We evaluate our algorithm on two popular public benchmarks, MSVD and MSR-VTT, achieving competitive results with the current state-of-the-art across six evaluation metrics.
Dual-Stream Recurrent Neural Network for Video Captioning

Ning Xu,An-An Liu,Yongkang Wong,Yongdong Zhang,Weizhi Nie,Yuting Su,Mohan Kankanhalli

DOI: https://doi.org/10.1109/tcsvt.2018.2867286

IF: 5.859

2019-01-01

IEEE Transactions on Circuits and Systems for Video Technology

Abstract:Recent progress in using recurrent neural networks (RNNs) for video description has attracted increasing interest, due to their capability to encode a sequence of frames for caption generation. While existing methods have studied various features (e.g., CNN, 3D CNN, and semantic attributes) for visual encoding, the representation and fusion of heterogeneous information from multi-modal spaces have not been fully explored. Considering that different modalities are often asynchronous, frame-level multi-modal fusion (e.g., concatenation and linear fusion) will negatively influence each modality. In this paper, we propose a dual-stream RNN (DS-RNN) framework to jointly discover and integrate the hidden states of both visual and semantic streams for video caption generation. First, an encoding RNN is used for each stream to flexibly exploit the hidden states of the respective modality. Specifically, we propose an attentive multi-grained encoder module to enhance the local feature learning with the global semantic feature. Then, a dual-stream decoder is deployed to integrate the asynchronous yet complementary sequential hidden states from both streams for caption generation. Extensive experiments on three benchmark datasets, namely, MSVD, MSR-VTT, and MPII-MD, show that DS-RNN achieves competitive performance against the state-of-the-art. Additional ablation studies were conducted on various variants of the proposed DS-RNN.
SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Tao Jin,Siyu Huang,Ming Chen,Yingming Li,Zhongfei Zhang

DOI: https://doi.org/10.24963/ijcai.2020/88

2020-01-01

Abstract:In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer is proposed for uni-modal language generation task such as machine translation. However, video captioning is a multimodal learning problem, and the video features have much redundancy between different time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT employs boundary-aware pooling operation for scores from multihead attention and selects diverse features from different scenarios. Also, SBAT includes a local correlation scheme to compensate for the local information loss brought by sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods under most of the metrics.
Hierarchical LSTMs with Adaptive Attention for Visual Captioning

Jingkuan Song,Xiangpeng Li,Lianli Gao,Heng Tao Shen

DOI: https://doi.org/10.48550/arXiv.1812.11004

2018-12-26

Abstract:Recent progress has been made in using the attention-based encoder-decoder framework for image and video captioning. Most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). However, these non-visual words can be easily predicted using a natural language model without considering visual signals or attention. Imposing the attention mechanism on non-visual words could mislead and decrease the overall performance of visual captioning. Furthermore, the hierarchy of LSTMs enables more complex representation of visual data, capturing information at different scales. To address these issues, we propose a hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning. Specifically, the proposed framework utilizes the spatial or temporal attention for selecting specific regions or frames to predict the related words, while the adaptive attention is for deciding whether to depend on the visual information or the language context information. Also, a hierarchical LSTM is designed to simultaneously consider both low-level visual information and high-level language context information to support the caption generation. We initially design our hLSTMat for the video captioning task. Then, we further refine it and apply it to the image captioning task. To demonstrate the effectiveness of our proposed framework, we test our method on both video and image captioning tasks. Experimental results show that our approach achieves state-of-the-art performance for most of the evaluation metrics on both tasks. The effect of important components is also examined in the ablation study.

Computer Vision and Pattern Recognition
Fused GRU with Semantic-Temporal Attention for Video Captioning.

Lianli Gao,Xuanhan Wang,Jingkuan Song,Yang Liu

DOI: https://doi.org/10.1016/j.neucom.2018.06.096

IF: 6

2020-01-01

Neurocomputing

Abstract:The encoder-decoder framework has been widely used for video captioning to achieve promising results, and various attention mechanisms have been proposed to further improve the performance. While temporal attention determines where to look, semantic attention decides the context. However, the combination of semantic and temporal attention has never been exploited for video captioning. To tackle this issue, we propose an end-to-end pipeline named Fused GRU with Semantic-Temporal Attention (STA-FG), which can explicitly incorporate high-level visual concepts into the generation of semantic-temporal attention for video captioning. The encoder network aims to extract visual features from the videos and predict their semantic concepts, while the decoder network focuses on efficiently generating coherent sentences using both visual features and semantic concepts. Specifically, the decoder combines both visual and semantic representations, and incorporates a semantic and temporal attention mechanism in a fused GRU network to accurately learn the sentences for video captioning. We experimentally evaluate our approach on the two prevalent datasets MSVD and MSR-VTT, and the results show that our STA-FG achieves the currently best performance on both BLEU and METEOR.
Video Captioning With Temporal And Region Graph Convolution Network

Xinlong Xiao,Yuejie Zhang,Rui Feng,Tao Zhang,Shang Gao,Weiguo Fan

DOI: https://doi.org/10.1109/ICME46284.2020.9102967

2020-01-01

Abstract:Video captioning aims to generate a natural language description for a given video clip that includes not only spatial information but also temporal information. To better exploit such spatial-temporal information attached to videos, we propose a novel video captioning framework with Temporal Graph Network (TGN) and Region Graph Network (RGN). TGN mainly focuses on utilizing the sequential information of frames that most of existing methods ignore. RGN is designed to explore the relationships among salient objects. Different from previous work, we introduce Graph Convolution Network (GCN) to encode frames with their sequential information and build a region graph for utilizing object information. We also particularly adopt a stack GRU decoder with a coarse-to-fine structure for caption generation. Very promising experimental results on two benchmark datasets (MSVD and MSR-VTT) show the effectiveness of our model.
Spatio-Temporal Ranked-Attention Networks for Video Captioning

Anoop Cherian,Jue Wang,Chiori Hori,Tim K. Marks

2020-01-17

Abstract:Generating video descriptions automatically is a challenging task that involves a complex interplay between spatio-temporal visual features and language models. Given that videos consist of spatial (frame-level) features and their temporal evolutions, an effective captioning model should be able to attend to these different cues selectively. To this end, we propose a Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned on the language state, hierarchically combines spatial and temporal attention to videos in two different orders: (i) a spatio-temporal (ST) sub-model, which first attends to regions that have temporal evolution, then temporally pools the features from these regions; and (ii) a temporo-spatial (TS) sub-model, which first decides a single frame to attend to, then applies spatial attention within that frame. We propose a novel LSTM-based temporal ranking function, which we call ranked attention, for the ST model to capture action dynamics. Our entire framework is trained end-to-end. We provide experiments on two benchmark datasets: MSVD and MSR-VTT. Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.

Computer Science
Motion Guided Spatial Attention for Video Captioning.

Shaoxiang Chen,Yu-Gang Jiang

DOI: https://doi.org/10.1609/aaai.v33i01.33018191

2019-01-01

Proceedings of the AAAI Conference on Artificial Intelligence

Abstract:Sequence-to-sequence models incorporating an attention mechanism have shown promising improvements on video captioning. While there is rich information both inside and between frames, spatial attention is rarely explored and motion information is usually handled by 3D-CNNs as just another modality for fusion. On the other hand, research on human perception suggests that apparent motion can attract attention. Motivated by this, we aim to learn spatial attention on video frames under the guidance of motion information for caption generation. We present a novel video captioning framework by utilizing Motion Guided Spatial Attention (MGSA). The proposed MGSA exploits the motion between video frames by learning spatial attention from stacked optical flow images with a custom CNN. To further relate the spatial attention maps of video frames, we design a Gated Attention Recurrent Unit (GARU) to adaptively incorporate previous attention maps. The whole framework can be trained in an end-to-end manner. We evaluate our approach on two benchmark datasets, MSVD and MSR-VTT. The experiments show that our model can generate better video representations, and state-of-the-art results are obtained under popular evaluation metrics such as BLEU@4, CIDEr, and METEOR.
