Abstract: Artificial Intelligence Generated Content (AIGC) has shown promising prospects in both the computer vision and natural language processing communities. As an essential aspect of AIGC, image captioning has received growing attention. Recent vision-language research is moving away from bulky region-based visual representations, which depend on object detectors, toward more convenient and flexible grid-based ones. However, this line of research typically concentrates on image understanding tasks such as image classification, with less attention paid to content generation tasks. In this paper, we explore how to capitalize on the expressive features embedded in grid visual representations for better image captioning. To this end, we present a Transformer-based image captioning model, dubbed FeiM, with two straightforward yet effective designs. First, we design feature queries, a limited set of learnable vectors that act as local signals to capture specific visual information from the global grid features. Second, taking the augmented global grid features and the local feature queries as inputs, we develop a feature interaction module that queries relevant visual concepts from the grid features and enables interaction between the local signals and the overall context. Finally, the refined grid visual representations and the linguistic features pass through a Transformer architecture for multi-modal fusion. With these two simple designs, FeiM can fully leverage meaningful visual knowledge to improve image captioning performance. Extensive experiments on the competitive MSCOCO benchmark confirm the effectiveness of the proposed approach, and the results show that FeiM outperforms existing advanced captioning models.
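The idea of a small set of query vectors gathering information from grid features can be illustrated with plain scaled dot-product cross-attention. The sketch below is only an illustration under stated assumptions, not the FeiM feature interaction module: it uses a single head with no learned projections, random vectors stand in for both the learnable queries and the grid features, and the names `cross_attend`, `feature_queries`, `grid_features`, and the sizes chosen are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, grid):
    """Single-head scaled dot-product cross-attention (no projections).

    Each query attends over all grid cells and returns a weighted
    average of the grid features, i.e. one refined vector per query.
    """
    dim = len(grid[0])
    refined = []
    for q in queries:
        # Similarity between this query and every grid cell.
        scores = [
            sum(qi * gi for qi, gi in zip(q, cell)) / math.sqrt(dim)
            for cell in grid
        ]
        weights = softmax(scores)
        # Weighted average of the grid features, one value per channel.
        refined.append(
            [sum(w * cell[j] for w, cell in zip(weights, grid)) for j in range(dim)]
        )
    return refined


# Hypothetical sizes: a 7x7 grid (49 cells), 4 queries, 8 channels.
rng = random.Random(0)
num_cells, num_queries, dim = 49, 4, 8
grid_features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_cells)]
# In a trained model these would be learnable parameters; here they are random.
feature_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

refined_queries = cross_attend(feature_queries, grid_features)
print(len(refined_queries), len(refined_queries[0]))
```

Because each output is a convex combination of grid cells, every query ends up summarizing the part of the image it scores most highly, which is the sense in which a query can act as a local signal over global features.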

Exploring better image captioning with grid features
