Adaptively Clustering Neighbor Elements for Image-Text Generation

Zihua Wang, Xu Yang, Hanwang Zhang, Haiyang Xu, Ming Yan, Fei Huang, Yu Zhang
2024-06-24
Abstract: We propose a novel Transformer-based image-to-text generation model, termed ACF, that adaptively clusters vision patches into object regions and language words into phrases to implicitly learn object-phrase alignments for better visual-text coherence. To achieve this, we design a novel self-attention layer that applies self-attention over the elements in a local cluster window instead of the whole sequence. The window size is softly decided by a clustering matrix that is computed from the current input data, so the process is adaptive. By stacking these revised self-attention layers to construct ACF, the small clusters in the lower layers can be grouped into bigger clusters in the higher layers; e.g., the vision/language ACF clusters small objects/phrases into bigger ones. This gradual clustering process generates a parsing tree that embeds the hierarchical knowledge of the input sequence. As a result, by using ACF to build the vision encoder and language decoder, the hierarchical object-phrase alignments are embedded and then transferred from the vision domain to the language domain in two popular image-to-text tasks: image captioning and visual question answering (VQA). Experimental results demonstrate the effectiveness of ACF, which outperforms most SOTA captioning and VQA models and achieves scores comparable to those of some large-scale pre-trained models. Our code is available at https://github.com/ZihuaEvan/ACFModel/.
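The clustered self-attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the neighbor-link score (a sigmoid of the scaled dot product of adjacent elements), the product rule used to build the clustering matrix, and the names `clustering_matrix` and `clustered_self_attention` are assumptions made for illustration, and the learned query/key/value projections of a real Transformer layer are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def clustering_matrix(x):
    """Soft clustering matrix C (n x n) computed from the input itself.

    Assumption: link[i] is the probability that neighbors i and i+1 belong
    to the same cluster, and C[i, j] is the product of all links between
    positions i and j, so distant elements are softly masked out.
    """
    n, d = x.shape
    scores = (x[:-1] * x[1:]).sum(axis=1) / np.sqrt(d)
    link = 0.5 * (1.0 + np.tanh(scores / 2.0))   # stable sigmoid
    link = np.clip(link, 1e-9, 1.0)
    # Cumulative log-links turn the products into differences of prefix sums.
    cum = np.concatenate(([0.0], np.cumsum(np.log(link))))
    return np.exp(-np.abs(cum[:, None] - cum[None, :]))


def clustered_self_attention(x):
    """Self-attention whose weights are softly restricted by C."""
    n, d = x.shape
    attn = softmax(x @ x.T / np.sqrt(d), axis=-1)  # plain self-attention
    C = clustering_matrix(x)
    attn = attn * C                                # soft local window
    attn = attn / attn.sum(axis=1, keepdims=True)  # renormalize each row
    return attn @ x, attn, C


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))        # 5 elements (patches or words), 8 features
out, attn, C = clustered_self_attention(x)
print(np.round(C, 3))              # entries near 1 mark elements in one cluster
```

Because the matrix is recomputed from each layer's input, stacking such layers lets the soft windows differ from layer to layer, which is the mechanism the abstract relies on to merge small clusters into larger ones.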
Computer Vision and Pattern Recognition