Bottom-Up Transformer Reasoning Network for Text-Image Retrieval.

Zonghao Yang, Yue Zhou, Ao Chen
DOI: https://doi.org/10.1007/978-981-99-1645-0_15
2022-01-01
Abstract: Image-text retrieval is a challenging cross-modal task on which much recent work has made substantial progress. Most existing methods either process images and text in a single pipeline or keep the two modalities highly entangled, which is neither practical nor human-friendly in real-world use. Moreover, the image regions extracted by Faster-RCNN are heavily over-sampled in the image pipeline, which introduces ambiguity into the extracted visual embeddings. To address these issues, we introduce the Bottom-up Transformer Reasoning Network (BTRN). Our method is built on transformer encoders that process the image and the text separately. We also embed the tag information generated by Faster-RCNN to strengthen the connection between the two modalities. We evaluate our model with Recall at K (Recall@K) and normalized discounted cumulative gain (NDCG). Across various experiments, we show that our model reaches state-of-the-art results.
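The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper. The function names (`recall_at_k`, `dcg`, `ndcg_at_k`) are hypothetical, Recall@K follows the common retrieval convention of a per-query hit indicator that is later averaged over queries, and NDCG uses a linear gain with a log2 rank discount. The paper may define relevance scores and gains differently.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Per-query Recall@K: 1.0 if any relevant item is in the top k, else 0.0.

    Assumption: the dataset-level score is the mean of this value over queries.
    """
    return 1.0 if any(item in relevant_ids for item in ranked_ids[:k]) else 0.0


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    # rank is 0-based, so the first position is discounted by log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances, k):
    """NDCG@K: DCG of the returned ranking divided by the DCG of the ideal one."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all; define the score as 0
    return dcg(ranked_relevances[:k]) / ideal_dcg


def mean_recall_at_k(rankings, ground_truths, k):
    """Average the per-query Recall@K over all queries."""
    hits = [recall_at_k(r, g, k) for r, g in zip(rankings, ground_truths)]
    return sum(hits) / len(hits)
```

Recall@K only checks whether a correct match appears among the top K results, while NDCG also rewards rankings that place more relevant items earlier, so the two metrics give complementary views of retrieval quality.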