Abstract: Automatic image captioning performs a cross-modal conversion from the visual content of an image to natural language text. Because it involves both computer vision (CV) and natural language processing (NLP), it has become one of the most challenging research problems in artificial intelligence. Built on deep neural networks, the neural image caption (NIC) model has achieved remarkable performance in image captioning, yet some essential challenges remain, such as the deviation between the sentences generated by the model and the intrinsic content of the image, the low accuracy of image scene descriptions, and the monotony of generated sentences. In addition, most current datasets and methods for image captioning target English. Given the syntactic and semantic differences between Chinese and English, specialized methods for Chinese image caption generation are needed. To address these problems, we design the NICVATP2L model based on visual attention and topic modeling, in which the visual attention mechanism reduces the deviation and the topic model improves the accuracy and diversity of generated sentences. Specifically, in the encoding phase, a convolutional neural network (CNN) and a topic model extract the visual and topic features of the input image, respectively. In the decoding phase, an attention mechanism is applied to the visual features to obtain visual region features. Finally, the topic features and the visual region features are combined to guide a two-layer long short-term memory (LSTM) network that generates Chinese image captions. To validate our model, we conducted experiments on the Chinese AIC-ICC image dataset. The experimental results show that our model automatically generates more informative, descriptive, and natural captions in Chinese, and that it outperforms the existing NIC image captioning model.
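The decoding step described above (attend over image regions, then combine the attended visual features with the topic features) can be illustrated with a minimal sketch. This is not the NICVATP2L implementation: the dot-product scoring function, the fusion by concatenation, the function names (`attend`, `decoder_input`), and all toy values are assumptions made for illustration only, since the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, hidden_state):
    """Soft attention over image regions.

    Each region is scored by its dot product with the decoder hidden
    state (an assumed scoring function), and the context vector is the
    attention-weighted sum of the region features.
    """
    scores = [sum(r * h for r, h in zip(region, hidden_state))
              for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [sum(w * region[d] for w, region in zip(weights, region_features))
               for d in range(dim)]
    return weights, context


def decoder_input(region_features, hidden_state, topic_features):
    """Fuse the attended visual context with the topic vector.

    Concatenation is an assumed fusion rule; the fused vector is what
    would be fed to the LSTM decoder at one time step.
    """
    _, context = attend(region_features, hidden_state)
    return context + list(topic_features)


# Hypothetical toy values: three image regions with 2-d CNN features,
# a 2-d decoder hidden state, and a 3-topic distribution.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]
topics = [0.7, 0.2, 0.1]

weights, context = attend(regions, hidden)
fused = decoder_input(regions, hidden, topics)
```

In this toy setup, the regions aligned with the hidden state receive larger attention weights, and the fused vector has the visual context in its first two entries followed by the topic distribution.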

Chinese Image Caption Generation via Visual Attention and Topic Modeling
