Multi-semantic long-range dependencies capturing for efficient video representation learning

Jinhao Duan,Hua Xu,Xiaozhu Lin,Shangchao Zhu,Yuanze Du
DOI: https://doi.org/10.1016/j.imavis.2020.103988
IF: 3.86
2020-12-01
Image and Vision Computing
Abstract:<p>Capturing long-range dependencies has proven effective on video understanding tasks. However, previous works address this problem in a pixel pairs manner, which might be inaccurate since pixel pairs contain too limited semantic information. Besides, considerable computations and parameters will be introduced in those methods. Following the pattern of features aggregation in Graph Convolutional Networks (GCNs), we aggregate pixels with their neighbors into semantic units, which contain stronger semantic information than pixel pairs. We designed an efficient, parameter-free, semantic units-based dependencies capturing framework, named as Multi-semantic Long-range Dependencies Capturing (MLDC) block. We verified our methods on large-scale challenging video classification benchmark, such as Kinetics. Experiments demonstrate that our method highly outperforms pixel pairs-based methods and achieves the state-of-the-art performance, without introducing any parameters and much computations.</p>
computer science, artificial intelligence, theory & methods,engineering, electrical & electronic, software engineering,optics
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve The paper aims to address the issue of capturing long-range dependencies in video representation learning. Specifically: 1. **Problems with Existing Methods**: - Current methods typically model long-range dependencies through pixel pairs, which may not be precise enough as pixel pairs contain limited information. - Pixel-based methods require a large amount of computational resources and parameters, leading to inefficiency. 2. **Proposed New Method**: - The paper proposes a new framework called the Multi-semantic Long-range Dependencies Capturing (MLDC) block. - This framework captures more reliable long-range dependencies by aggregating adjacent pixels into semantic units. - The MLDC block not only improves the ability to capture dependencies but also does not introduce additional parameters and extensive computation. Through this method, the paper validates its effectiveness on large-scale video classification benchmarks (such as the Kinetics dataset) and demonstrates significantly better performance compared to pixel-pair-based methods and other mainstream 2D/3D networks.