Improving Fine-grained Understanding for Retrieval in Human Motion and Text

Sheng Yan, Yong Wang, Xin Du, Hongchang Jin, Mengyuan Liu
DOI: https://doi.org/10.1109/lsp.2024.3425283
2024-01-01
IEEE Signal Processing Letters
Abstract: This work focuses on human motion-text retrieval (MTR), a task recently proposed for motion understanding. Unlike traditional visual-text retrieval, human motion can be understood as the superposition of numerous atomic actions, and its descriptions are limited to human-centered themes. Given these characteristics, directly mapping similar samples into a joint embedding space and conducting naive contrastive training is suboptimal: it lacks understanding of fine-grained human language descriptions and fails to alleviate semantic conflicts between similar samples. To address this, we propose Cross-perceptual Salience Mapping, which highlights fine-grained poses or words to provide a more accurate similarity measurement. Additionally, we design a novel Drop-then-Contrast scheme for MTR, which discards false negative samples from the negative set and mines the remaining samples for contrastive training, reducing the violations that false negatives cause. Our framework, termed improving fine-grained understanding for Retrieval in Human Motion and Text, or Rehamot for short, outperforms previous works by a recall of 58.6% and 56.5% on HumanML3D and KIT-ML respectively (motion retrieval, R@10). Our code is publicly available at https://github.com/eanson023/rehamot.
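The idea of discarding false negatives before contrastive training can be illustrated with a minimal sketch. This is not the authors' implementation: the InfoNCE-style loss, the use of caption-to-caption similarity to flag false negatives, and the names `drop_then_contrast_loss`, `threshold`, and `temperature` are all assumptions made for illustration.

```python
import math


def drop_then_contrast_loss(sim, text_sim, threshold=0.9, temperature=0.1):
    """Illustrative motion-to-text contrastive loss with false-negative dropping.

    sim[i][j]      -- similarity between motion i and text j (pair i, i is positive)
    text_sim[i][j] -- similarity between caption i and caption j (assumed signal
                      for spotting false negatives; hypothetical choice)
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        # Keep the positive, plus negatives whose caption is not a near-duplicate
        # of caption i. Near-duplicates are treated as false negatives and dropped.
        kept = [j for j in range(n) if j == i or text_sim[i][j] < threshold]
        logits = [sim[i][j] / temperature for j in kept]
        # Numerically stable log-sum-exp over the remaining candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - sim[i][i] / temperature
    return total / n
```

In this sketch, two samples with nearly identical captions no longer push each other apart, so the loss is lower than when every non-matching sample is treated as a negative. Setting `threshold` above the maximum caption similarity recovers the plain contrastive loss.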