Unified Referring Expression Generation for Bounding Boxes and Segmentations

Zongtao Liu, Tianyang Xu, Xiaoning Song, Xiao-Jun Wu
DOI: https://doi.org/10.1109/lsp.2024.3363647
2024-03-02
IEEE Signal Processing Letters
Abstract: Referring expression generation (REG) is a challenging task at the intersection of computer vision and natural language processing. It aims to generate natural language descriptions that uniquely refer to a specific object within an image. Existing REG approaches use only bounding boxes, in a rather primitive manner, to specify target objects, and employ classical convolutional neural networks (CNNs) for image encoding, followed by recurrent layers for text generation. In this letter, we propose a novel end-to-end REG model that highlights the target using bounding boxes and segmentations in a unified fashion. Specifically, we propose two settings for utilizing these signals: employing them as inputs to the model and as supervision signals for pre-training tasks. Additionally, we use the now-prevalent self-attention architecture to bridge targeted visual clues and text correspondence. During inference, our method achieves state-of-the-art performance in a one-stage manner, reflecting the potential of both bounding-box and segmentation references in constructing REG solutions.
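One way to picture handling bounding boxes and segmentations "in a unified fashion" is to express both as a binary mask over the image grid, so that a single input pathway can accept either kind of reference. The sketch below is only an illustration under that assumption; the abstract does not say how the authors encode these signals, and the helper names `box_to_mask` and `mask_to_box` are hypothetical.

```python
from typing import List, Tuple

Mask = List[List[int]]            # height x width grid of 0/1 values
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), with x1 and y1 exclusive


def box_to_mask(box: Box, height: int, width: int) -> Mask:
    """Rasterize a bounding box into a binary mask (assumed encoding, not the paper's)."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def mask_to_box(mask: Mask) -> Box:
    """Return the tightest box enclosing all nonzero pixels of a segmentation mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        raise ValueError("mask contains no foreground pixels")
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


# A box and a segmentation end up in the same representation.
box_mask = box_to_mask((1, 1, 3, 2), height=3, width=4)
seg_mask = [[0, 1, 0], [1, 1, 0]]
seg_box = mask_to_box(seg_mask)
```

With a shared mask representation, a box is simply a coarse, rectangular mask, while a segmentation carries the finer object outline; either could then be fed to a model as an extra input channel or used as a prediction target.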
engineering, electrical & electronic