Video Action Segmentation Via Contextually Refined Temporal Keypoints

Borui Jiang, Yang Jin, Zhentao Tan, Yadong Mu
DOI: https://doi.org/10.1109/iccv51070.2023.01272
2023-01-01
Abstract: Video action segmentation involves categorizing each frame or short snippet of an untrimmed video into predefined action categories. Despite notable advances in recent years, many current approaches still rely on frame-wise segmentation, which tends to produce fragmented results. To address this, we present an approach to video action segmentation centered on contextually refined temporal keypoints. Our method first identifies a sparse, over-complete set of temporal keypoints from non-local visual cues, with each keypoint representing a candidate action segment. These initial keypoints are then improved through iterative refining and re-assembling operations. Motivated by the notion that optimal temporal keypoints should collectively resemble the ground truth in structure, we introduce a module that performs graph matching between the keypoint-derived graph and a reference graph constructed from the ground-truth annotations. This module learns structural features that are used to further refine the initial keypoints. In addition, a set of predefined rules is applied to re-assemble all temporal keypoints. The unfiltered temporal keypoints resulting from these operations are used to generate the final action segments. We evaluate our method extensively on three video benchmarks: 50salads, GTEA, and Breakfast. The proposed approach consistently improves over existing methods, achieving F1@50 scores (one of the key performance metrics for this task) of 79.5%, 83.4%, and 60.5%, respectively, vs. the previous state-of-the-art scores of 78.5%, 79.8%, and 57.4%.
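To make the F1@50 metric and the fragmentation problem concrete, the following is a minimal sketch of a segmental F1 score at an IoU threshold. It follows one commonly used formulation of the metric, not code from the paper; the function names `to_segments` and `segmental_f1` and the toy label sequences are assumptions made for illustration only.

```python
from itertools import groupby


def to_segments(frame_labels):
    """Collapse per-frame labels into (label, start, end) runs; end is exclusive."""
    segments, start = [], 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


def segmental_f1(pred_frames, true_frames, iou_threshold=0.5):
    """Segmental F1 at a given IoU threshold (0.5 corresponds to F1@50).

    Assumed formulation: each predicted segment is compared with the
    same-label ground-truth segment it overlaps most; it counts as a true
    positive if the IoU reaches the threshold and that ground-truth segment
    has not been matched already.
    """
    pred = to_segments(pred_frames)
    true = to_segments(true_frames)
    matched = [False] * len(true)
    tp = 0
    for label, p_start, p_end in pred:
        best_iou, best_idx = 0.0, -1
        for i, (t_label, t_start, t_end) in enumerate(true):
            if t_label != label:
                continue
            inter = max(0, min(p_end, t_end) - max(p_start, t_start))
            union = max(p_end, t_end) - min(p_start, t_start)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx >= 0 and best_iou >= iou_threshold and not matched[best_idx]:
            matched[best_idx] = True
            tp += 1
    fp = len(pred) - tp
    fn = len(true) - tp
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical labels): a single mislabeled frame splits action "a".
true_frames = ["a"] * 10 + ["b"] * 10
fragmented = ["a"] * 4 + ["b"] * 1 + ["a"] * 5 + ["b"] * 10

frame_accuracy = sum(p == t for p, t in zip(fragmented, true_frames)) / len(true_frames)
f1_at_50 = segmental_f1(fragmented, true_frames, iou_threshold=0.5)
```

In this toy case only one of twenty frames is wrong, yet the prediction contains four segments where the ground truth has two, so the segmental score drops well below the frame-wise accuracy. This is the kind of over-segmentation that frame-wise methods are prone to and that segment-level metrics such as F1@50 penalize.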