Unsupervised Domain Adaptation for Video Object Grounding with Cascaded Debiasing Learning

Mengze Li, Haoyu Zhang, Juncheng Li, Zhou Zhao, Wenqiao Zhang, Shengyu Zhang, Shiliang Pu, Yueting Zhuang, Fei Wu
DOI: https://doi.org/10.1145/3581783.3612314
2023-01-01
Abstract: This paper addresses Unsupervised Domain Adaptation (UDA) for a dense frame prediction task, Video Object Grounding (VOG). The investigation is motivated by the limited generalization of data-driven approaches when they are confronted with unseen test scenarios. Our goal is to adapt a source-dominated model from a labeled source domain to an unlabeled target domain by re-training it on pseudo-labels (i.e., predicted boxes of language-described objects). Because pseudo-label generation can inherit source-domain biases, we decompose label refinement into two cascaded debiasing subroutines: (1) We develop a discarded training strategy that corrects Biased Proposal Selection by filtering out examples whose proposals, selected from the proposal (candidate box) set, are uncertain. These uncertain examples are identified by the discordance between the predictions of the source-dominated model and those of a target-domain clustered classifier, which remains free from source-domain bias. (2) With the refined proposals as a foundation, we measure the Grounding Coordinate Offset from the semantic distance between the model's predictions across domains, and use this measure to alleviate source-domain bias in the target model through adversarial learning. To verify the superiority of the proposed method, we collect two UDA-VOG datasets, I2O-VOG and R2M-VOG, by manually dividing and combining well-known VOG datasets. Extensive experiments on them show that our model outperforms state-of-the-art (SOTA) methods by a large margin.
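The discordance-based filtering in subroutine (1) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `select_proposal` and `filter_concordant`, the per-proposal score lists, and the use of a simple argmax to pick a proposal are all assumptions made for illustration. The sketch only shows the general idea of discarding an example when two scorers (standing in for the source-dominated model and the target-domain clustered classifier) select different proposals.

```python
from typing import List, Sequence


def select_proposal(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring proposal (hypothetical selection rule)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def filter_concordant(
    source_scores: Sequence[Sequence[float]],
    target_scores: Sequence[Sequence[float]],
) -> List[int]:
    """Return indices of examples where both scorers select the same proposal.

    source_scores[i][j]: assumed score of proposal j for example i from the
        source-dominated model.
    target_scores[i][j]: assumed score of proposal j for example i from a
        target-domain clustered classifier.
    Examples on which the two disagree are treated as uncertain and discarded.
    """
    kept = []
    for idx, (src, tgt) in enumerate(zip(source_scores, target_scores)):
        if select_proposal(src) == select_proposal(tgt):
            kept.append(idx)
    return kept


# Toy data: three examples, three candidate boxes each (made-up numbers).
source_scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
target_scores = [[0.2, 0.5, 0.3], [0.1, 0.8, 0.1], [0.3, 0.1, 0.6]]

kept = filter_concordant(source_scores, target_scores)
print(kept)  # examples 0 and 2 agree; example 1 is discarded
```

In this toy setting only the examples on which both scorers agree would be passed on as pseudo-labels for re-training; how the paper actually builds the clustered classifier and scores proposals is not specified in the abstract.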