End to End Alignment Learning of Instructional Videos with Spatiotemporal Hybrid Encoding and Decoding Space Reduction

Lin Wang,Xingfu Wang,Ammar Hawbani,Yan Xiong
DOI: https://doi.org/10.3390/app11114954
2021-01-01
Applied Sciences
Abstract:We solve the problem of how to densely align actions in videos at frame level, with only the order of occurring actions available, in order to save the time-consuming efforts to accurately annotate the temporal boundaries of each action. We propose three task-specific innovations under this scenario: (1) To encode fine-grained spatiotemporal local features and long-range temporal patterns simultaneously, we test three popular backbones and compare their accuracy and training times: (i) a recurrent LSTM; (ii) a fully convolutional model; and (iii) the recently proposed Transformer model. (2) To address the absence of ground truth frame-by-frame labels during training, we apply connectionist temporal classification (CTC) on top of the temporal encoder to recursively collect all theoretically valid alignments, and further weight these alignments with frame-wise visual similarities, in order to avoid a significant number of degenerated paths and improve both recognition accuracy and computation efficiency. (3) To quantitatively assess the quality of the learned alignment, we apply a comprehensive set of frame-level, segment-level, and video-level evaluation measurements. Extensive evaluations verify the effectiveness of our proposal, with performance comparable to that of fully supervised approaches across four benchmarks of different difficulty and data scale.
What problem does this paper attempt to address?