Human Stone Toolmaking Action Grammar (HSTAG): A Challenging Benchmark for Fine-grained Motor Behavior Recognition

Cheng Liu, Xuyang Yan, Zekun Zhang, Cheng Ding, Tianhao Zhao, Shaya Jannati, Cynthia Martinez, Dietrich Stout
2024-10-11
Abstract: Action recognition has witnessed the development of a growing number of novel algorithms and datasets in the past decade. However, the majority of public benchmarks were constructed around activities of daily living and annotated at a rather coarse-grained level, which lacks diversity in domain-specific datasets, especially for rarely seen domains. In this paper, we introduced Human Stone Toolmaking Action Grammar (HSTAG), a meticulously annotated video dataset showcasing previously undocumented stone toolmaking behaviors, which can be used for investigating the applications of advanced artificial intelligence techniques in understanding a rapid succession of complex interactions between two hand-held objects. HSTAG consists of 18,739 video clips that record 4.5 hours of experts' activities in stone toolmaking. Its unique features include (i) brief action durations and frequent transitions, mirroring the rapid changes inherent in many motor behaviors; (ii) multiple angles of view and switches among multiple tools, increasing intra-class variability; (iii) unbalanced class distributions and high similarity among different action sequences, adding difficulty in capturing distinct patterns for each action. Several mainstream action recognition models are used to conduct experimental analysis, which showcases the challenges and uniqueness of HSTAG (https://nyu.databrary.org/volume/1697).
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is **the challenge of fine-grained action recognition, particularly as applied to the specialized domain of stone toolmaking**. Specifically, the authors introduce a new dataset named Human Stone Toolmaking Action Grammar (HSTAG), which aims to provide a challenging benchmark for complex and rapidly changing hand movements, and thereby to promote understanding of the evolution of ancient human technology.

### Main Problems and Challenges

1. **Limitations of Existing Datasets**:
   - Most publicly available datasets focus on activities of daily living, and their annotations are relatively coarse and lack diversity.
   - Domain-specific datasets (such as for stone toolmaking) are scarce, especially high-quality datasets for rarely seen domains.
2. **Uniqueness of Stone Toolmaking Actions**:
   - Stone toolmaking involves a series of complex interactive actions that change frequently within a short time.
   - The high similarity between actions makes it harder to distinguish different action categories.
   - The dataset is class-imbalanced, with fewer samples for some actions.
3. **Challenges in Visual Understanding**:
   - Rapid action transitions and high-frequency actions make it harder to find key feature representations.
   - Hand actions usually occupy a small area of the frame, and background noise can interfere with the learning process.
   - The unbalanced class distribution makes it difficult for some classes to obtain good feature representations.

### Solutions

To address the above challenges, the authors constructed the HSTAG dataset, which contains 18,739 video clips recording 4.5 hours of expert stone toolmaking activities. The main features of the HSTAG dataset include:

- **Multi-view Recording**: Each action category contains video clips recorded from the top view and the front view, increasing the intra-class variability.
- **Tool Switching**: Using different types of tools (such as stones and antlers) for the same action reduces the intra-class similarity.
- **Short Action Sequences and High-Frequency Action Transitions**: Some actions are short-lived, and the transitions between actions are fast, making it harder to capture the main features.

### Experimental Analysis

The authors used several mainstream action recognition models (VideoMAEv2, TimeSformer, ResNet+GRU) to conduct experimental analysis on the HSTAG dataset. The results show that:

- The VideoMAE model performs best in terms of overall accuracy and macro-averaged F1.
- Class imbalance is one of the biggest challenges for all models, which perform particularly poorly on the Tool Change and Grinding classes.
- In t-SNE visualizations, the VideoMAE model shows better class separation in the embedding space.

### Summary

The HSTAG dataset not only enriches domain-specific benchmarking but also promotes the development of new computer vision algorithms, especially for handling class imbalance and high-frequency action transitions. This provides new tools and methods for studying the evolution of ancient human technology and language.
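
To illustrate why the experimental analysis reports macro-averaged F1 alongside overall accuracy, the following minimal sketch compares the two metrics on a toy imbalanced label set. The class names, the 90/10 split, and the always-predict-the-majority classifier are assumptions made for illustration only; none of the numbers come from the HSTAG experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so rare classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 90 samples of a frequent class, 10 of a rare one.
y_true = ["majority_action"] * 90 + ["grinding"] * 10
# Hypothetical degenerate classifier that always predicts the frequent class.
y_pred = ["majority_action"] * 100

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro F1 = {f1:.3f}")
```

In this toy case the accuracy stays high while the macro F1 drops sharply, because the rare class contributes an F1 of zero to the unweighted average. This is the general reason macro-averaged metrics are informative on class-imbalanced benchmarks.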