Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection

Xian Qu, Changxing Ding, Xingao Li, Xubin Zhong, Dacheng Tao
DOI: https://doi.org/10.1109/cvpr52688.2022.01895
2022-01-01
Abstract: Transformer-based methods have achieved great success in the field of human-object interaction (HOI) detection. However, these models tend to adopt semantically ambiguous queries, which lowers the transformer's representation learning power. Moreover, there are a very limited number of labeled human-object pairs for most images in existing datasets, which constrains the transformer's set prediction power. To handle the first problem, we propose an efficient knowledge distillation model, named Distillation using Oracle Queries (DOQ), which shares parameters between teacher and student networks. The teacher network adopts oracle queries that are semantically clear and generates high-quality decoder embeddings. By mimicking both the attention maps and decoder embeddings of the teacher network, the representation learning power of the student network is significantly promoted. To address the second problem, we introduce an efficient data augmentation method, named Context-Consistent Stitching (CCS), which generates complicated images online. Each new image is obtained by stitching labeled human-object pairs cropped from multiple training images. By selecting source images with similar context, the new synthesized image is made visually realistic. Our methods significantly promote both the accuracy and training efficiency of transformer-based HOI detection models. Experimental results show that our proposed approach consistently outperforms state-of-the-art methods on three benchmarks: HICO-DET, HOI-A, and V-COCO. Code is available at https://github.com/SherlockHolmes221/DOQ.
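As a rough illustration of the mimicking idea described in the abstract, the sketch below combines an embedding term and an attention-map term into one loss. The abstract does not state the loss functions used, so the choice of mean squared error for decoder embeddings and KL divergence for attention maps, along with the names `mse`, `mean_kl`, `mimic_loss`, `w_emb`, and `w_attn`, are assumptions made for illustration only and may differ from the released code.

```python
import math


def mse(a, b):
    """Mean squared error between two equally shaped 2-D lists."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def mean_kl(p, q, eps=1e-8):
    """Mean over rows of KL(p_row || q_row); each row is a probability distribution."""
    total = 0.0
    for row_p, row_q in zip(p, q):
        total += sum(
            pi * (math.log(pi + eps) - math.log(qi + eps))
            for pi, qi in zip(row_p, row_q)
        )
    return total / len(p)


def mimic_loss(student_emb, teacher_emb, student_attn, teacher_attn,
               w_emb=1.0, w_attn=1.0):
    """Hypothetical distillation loss: the student mimics the teacher's
    decoder embeddings (MSE, assumed) and attention maps (KL, assumed)."""
    emb_term = mse(student_emb, teacher_emb)
    attn_term = mean_kl(teacher_attn, student_attn)
    return w_emb * emb_term + w_attn * attn_term
```

In a training framework the teacher outputs would typically be treated as fixed targets, so that gradients from this term update only the student branch; whether and how the paper does this is not specified in the abstract.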