RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning

Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar
2022-01-01
Abstract: [...] with two new concept-guided auxiliary tasks: 1) a global task for promoting relational reasoning, and 2) a local task for facilitating semantic object-centric correspondence learning. To examine the systematic generalization of visual reasoning models, we introduce systematic splits for the standard HICO and GQA benchmarks. We show that the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA by 16% and 13% in the original split, and by 43% and 18% in the systematic split. Our ablation analyses also reveal our model’s compatibility with multiple ViT variants and robustness to hyper-parameters.