Convolutional Transformer with Similarity-based Boundary Prediction for Action Segmentation.

Dazhao Du, Bing Su, Yu Li, Zhongang Qi, Lingyu Si, Ying Shan
DOI: https://doi.org/10.1109/ictai56018.2022.00131
2022-01-01
Abstract: Action classification has made great progress, but segmenting and recognizing actions from long videos remains a challenging problem. Recently, Transformer-based models with strong sequence modeling ability have succeeded in many sequence modeling tasks. However, the lack of inductive bias and the difficulty of handling long video sequences limit the application of the Transformer in the action segmentation task. To explore the potential of the Transformer in this task, we replace some specific linear layers in the vanilla Transformer with dilated temporal convolutions, and we use a sparse attention mechanism to reduce the time and space complexity of processing long video sequences. In addition, training the model directly with a frame-wise classification loss treats frames at action boundaries the same as frames in the middle of actions, so the learned features are not sensitive to boundaries. We propose a new local log-context attention module to predict whether each frame is at the beginning, middle, or end of an action. Since boundary frames are similar to their neighboring frames of different classes, our similarity-based boundary prediction helps learn more discriminative features. Extensive experiments on three datasets show the effectiveness of our method.