Self-Supervised Learning from Untrimmed Videos via Hierarchical Consistency
Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Changxin Gao, Rong Jin, Nong Sang
DOI: https://doi.org/10.1109/tpami.2023.3273415
IF: 23.6
2023-01-01
IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract: Natural untrimmed videos provide rich visual content for self-supervised learning. Yet most previous efforts to learn spatiotemporal representations rely on manually trimmed videos, such as the Kinetics dataset [1], resulting in limited diversity in visual patterns and limited performance gains. In this work, we aim to improve video representations by leveraging the rich information in natural untrimmed videos. For this purpose, we propose learning a hierarchy of temporal consistencies in videos, i.e., visual consistency and topical consistency. Visual consistency applies to clip pairs that are separated by a short time span and tend to be visually similar; topical consistency applies to clip pairs that are separated by a long time span and share similar topics. Specifically, we present a Hierarchical Consistency (HiCo++) learning framework, in which visually consistent pairs are encouraged to share the same feature representations through contrastive learning, while topically consistent pairs are coupled through a topical classifier that distinguishes whether they are topic-related, i.e., from the same untrimmed video. Additionally, we impose a gradual sampling algorithm for the proposed hierarchical consistency learning and demonstrate its theoretical superiority. Empirically, we show that HiCo++ not only generates stronger representations on untrimmed videos, but also improves representation quality when applied to trimmed videos. This contrasts with standard contrastive learning, which fails to learn powerful representations from untrimmed videos. Source code will be made available at https://github.com/alibaba-mmai-research/HiCo.
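The two kinds of clip pairs described in the abstract can be illustrated with a minimal sampling sketch. It is not the authors' implementation: the function names, the `short_gap` threshold, the clip length, and the uniform sampling are all assumptions made for illustration. The gradual sampling algorithm mentioned in the abstract is not modeled here. The sketch only shows how a visually consistent pair (short time span, same video), a topical positive (any time span, same video), and a topical negative (different videos) might be drawn.

```python
import random


def random_start(duration, clip_len, rng):
    """Pick a uniformly random clip start time (seconds) inside one video."""
    last_start = duration - clip_len
    if last_start < 0:
        raise ValueError("video is shorter than one clip")
    return rng.uniform(0.0, last_start)


def sample_clip_pair(duration, clip_len, max_gap, rng):
    """Return two clip start times whose distance is at most max_gap seconds."""
    last_start = duration - clip_len
    first = random_start(duration, clip_len, rng)
    low = max(0.0, first - max_gap)
    high = min(last_start, first + max_gap)
    second = rng.uniform(low, high)
    return first, second


def sample_training_pairs(durations, clip_len, short_gap, rng):
    """Draw one visual pair and two labeled topical pairs.

    durations maps a (hypothetical) video id to its length in seconds.
    short_gap is an assumed threshold for "short time span".
    """
    vid_a, vid_b = rng.sample(sorted(durations), 2)

    # Visually consistent pair: same video, short temporal distance.
    visual = sample_clip_pair(durations[vid_a], clip_len, short_gap, rng)

    # Topical positive: same video, temporal distance unconstrained.
    pos = sample_clip_pair(durations[vid_a], clip_len, durations[vid_a], rng)

    # Topical negative: one clip from each of two different videos.
    neg_a = random_start(durations[vid_a], clip_len, rng)
    neg_b = random_start(durations[vid_b], clip_len, rng)

    return {
        "visual": ((vid_a, visual[0]), (vid_a, visual[1])),
        "topical": [
            ((vid_a, pos[0]), (vid_a, pos[1]), 1),  # label 1: same video
            ((vid_a, neg_a), (vid_b, neg_b), 0),    # label 0: different videos
        ],
    }
```

In a training loop along these lines, the visual pair would feed a contrastive objective and the labeled topical pairs would feed a binary topical classifier, as the abstract describes; how the actual framework schedules and weights these terms is not specified here.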
computer science, artificial intelligence; engineering, electrical & electronic