Abstract: Video captioning requires a model to have both strong visual comprehension and strong text generation ability. The feature extractors in existing model encoders are usually pretrained on recognition and detection tasks, which do not match video captioning data. Moreover, these methods often employ multiple 2D/3D feature extractors, which slows inference. In addition, because captioning datasets are small, it is often difficult for the model to acquire sufficient linguistic ability. In this paper, we propose a new video captioning method. Unlike previous approaches, we use only one visual transformer as the feature extractor and build temporal relationships among the 2D frame-level features by re-encoding them. This simple encoder design greatly improves inference speed. At the same time, we use the generative pretrained language model GPT as our decoder and introduce the gra-former method, which effectively avoids the forgetting of pretrained knowledge caused by inserting cross-attention layers, so that rich language knowledge can be integrated into the captioning model. With this method, newly generated words are better able to draw cues from the preceding context. Experimental evaluation on two benchmarks, MSVD and MSR-VTT, shows that the proposed method achieves state-of-the-art performance. In addition, ablation studies and visualizations demonstrate the effectiveness of our approach.
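
The re-encoding step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the re-encoder is a single-head self-attention layer with a residual connection applied along the time axis of per-frame features; the abstract does not specify the actual layer design, so the function name `re_encode_frames`, the weight matrices, and the shapes below are hypothetical, and random vectors stand in for the visual transformer's frame-level outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def re_encode_frames(frame_feats, w_q, w_k, w_v):
    """Hypothetical single-head self-attention over the time axis.

    frame_feats: (T, D) array, one feature vector per sampled frame.
    w_q, w_k, w_v: (D, D) projection matrices (assumed, not from the paper).
    Returns the re-encoded (T, D) features and the (T, T) attention map.
    """
    q = frame_feats @ w_q
    k = frame_feats @ w_k
    v = frame_feats @ w_v
    # Each frame attends to every other frame, which relates frames in time.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)
    # Residual connection keeps the original per-frame information.
    return frame_feats + attn @ v, attn


# Stand-in for 2D frame-level features from a single visual transformer.
rng = np.random.default_rng(0)
T, D = 8, 16
frame_feats = rng.standard_normal((T, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

video_feats, attn = re_encode_frames(frame_feats, w_q, w_k, w_v)
```

In a full captioning model, the re-encoded features would presumably be passed to the GPT decoder; that wiring is omitted here because the abstract does not describe it.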

Figure labels: Decoder: Generate rewritten utterance; Encoder: Last Utterance Representation; Encoder: Context Representation

Unsupervised Context Rewriting for Open Domain Conversation

Using Context-to-Vector with Graph Retrofitting to Improve Word Embeddings

Learning to Decode for Future Success

RECAP: Retrieval-Enhanced Context-Aware Prefix Encoder for Personalized Dialogue Response Generation

DLCEncDec : A Fully Character-Level Encoder-Decoder Model for Neural Responding Conversation.

Decoupled Context Processing for Context Augmented Language Modeling

Context-aware Code Generation with Synchronous Bidirectional Decoder

RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models

Neural Contextual Conversation Learning with Labeled Question-Answering Pairs

Improving Retrieval Augmented Open-Domain Question-Answering with Vectorized Contexts

X-RECOSA: Multi-Scale Context Aggregation for Multi-Turn Dialogue Generation

ConRPG: Paraphrase Generation using Contexts as Regularizer

Learning Context-Specific Word/Character Embeddings.

MEMD: A Diversity-Promoting Learning Framework for Short-Text Conversation.

Self Attention Re-encoding and Linguistic Ability Preserving for Context-Aware Video Captioning

How to Represent Context Better? an Empirical Study on Context Modeling for Multi-turn Response Selection.

Long-Context Language Modeling with Parallel Context Encoding

How To Make Context More Useful? An Empirical Study On Context-Aware Neural Conversational Models

Query and Output: Generating Words by Querying Distributed Word Representations for Paraphrase Generation