U-Transformer-based Multi-Levels Refinement for Weakly Supervised Action Segmentation

Xiao Ke,Xin Miao,Wenzhong Guo
DOI: https://doi.org/10.1016/j.patcog.2023.110199
IF: 8
2024-01-01
Pattern Recognition
Abstract:Action segmentation is a research hotspot in human action analysis, which aims to split videos into segments of different actions. Recent algorithms have achieved great success in modeling based on temporal convolution, but these methods weight local or global timing information through additional modules, ignoring the existing long-term and short-term information connections between actions. This paper proposes a U-Transformer structure based on multi-level refinement, introduces neighborhood attention to learn the neighborhood information of adjacent frames, and aggregates video frame features to effectively process long-term sequence information. Then a loss optimization strategy is proposed to smooth the original classification effect and generate a more accurate calibration sequence by introducing a pairing similarity optimization method based on deep feature learning. In addition, we propose a timestamp supervised training method to generate complete information for actions based on pseudo-label predictions for action boundary predictions. Experiments on three challenging action segmentation datasets, 50Salads, GTEA, and Breakfast, show that our model performs state-of-the-art models, and our weakly supervised model also performs comparably to fully supervised performance.
What problem does this paper attempt to address?