Abstract: Action recognition has witnessed the development of a growing number of novel algorithms and datasets in the past decade. However, the majority of public benchmarks were constructed around activities of daily living and annotated at a rather coarse-grained level, so domain-specific datasets remain scarce, especially for rarely seen domains. In this paper, we introduce Human Stone Toolmaking Action Grammar (HSTAG), a meticulously annotated video dataset showcasing previously undocumented stone toolmaking behaviors, which can be used for investigating the applications of advanced artificial intelligence techniques in understanding a rapid succession of complex interactions between two hand-held objects. HSTAG consists of 18,739 video clips that record 4.5 hours of experts' activities in stone toolmaking. Its unique features include (i) brief action durations and frequent transitions, mirroring the rapid changes inherent in many motor behaviors; (ii) multiple angles of view and switches among multiple tools, increasing intra-class variability; (iii) unbalanced class distributions and high similarity among different action sequences, adding difficulty in capturing distinct patterns for each action. Several mainstream action recognition models are used to conduct experimental analysis, which showcases the challenges and uniqueness of HSTAG (https://nyu.databrary.org/volume/1697).
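As a rough illustration of what "brief action durations" means here, the two figures quoted in the abstract (18,739 clips, 4.5 hours) can be turned into an average clip length. This is only a back-of-the-envelope sketch: it assumes the clips tile the 4.5 hours with no overlap or gaps, which the abstract does not state, and the function name is hypothetical.

```python
# Figures quoted in the abstract.
NUM_CLIPS = 18_739
TOTAL_HOURS = 4.5


def mean_clip_seconds(num_clips: int, total_hours: float) -> float:
    """Average clip length in seconds.

    ASSUMPTION: the clips cover the footage exactly once (no overlap
    between views, no unannotated gaps). The source does not confirm this.
    """
    return total_hours * 3600 / num_clips


mean_seconds = mean_clip_seconds(NUM_CLIPS, TOTAL_HOURS)
print(f"mean clip duration: {mean_seconds:.2f} s")
```

Under that assumption the average clip lasts just under one second, which is consistent with the abstract's emphasis on rapid action transitions.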

What problem does this paper attempt to address?

The problem this paper addresses is **fine-grained action recognition, applied to the specialized domain of stone toolmaking**. Specifically, the authors introduce a new dataset named Human Stone Toolmaking Action Grammar (HSTAG), which aims to provide a challenging benchmark for complex and rapidly changing hand movements and thereby to support research on the evolution of ancient human technology.

### Main Problems and Challenges

1. **Limitations of Existing Datasets**:
   - Most publicly available datasets focus on daily-life activities, and their annotations are relatively coarse and lack diversity.
   - Domain-specific datasets (such as stone toolmaking) are lacking, especially high-quality datasets for rare domains.
2. **Uniqueness of Stone Toolmaking Actions**:
   - Stone toolmaking involves a series of complex interactive actions that change frequently within a short time.
   - The high similarity between actions makes it harder to distinguish different action categories.
   - The dataset is class-imbalanced, with fewer samples for some actions.
3. **Challenges in Visual Understanding**:
   - Rapid action transitions and high-frequency actions make it harder to find key feature representations.
   - Hand actions usually occupy a small area of the frame, and background noise can interfere with learning.
   - The unbalanced class distribution makes it difficult to learn good feature representations for some classes.

### Solutions

To address the above challenges, the authors constructed the HSTAG dataset, which contains 18,739 video clips recording 4.5 hours of expert stone toolmaking activities. The main features of the HSTAG dataset include:

- **Multi-view Recording**: Each action category contains video clips recorded from the top view and the front view, increasing intra-class variability.
- **Tool Switching**: Using different types of tools (such as stones and antlers) for the same action reduces intra-class similarity.
- **Short Action Sequences and High-Frequency Action Transitions**: Some actions are short-lived, and the transitions between actions are fast, making it harder to capture the main features.

### Experimental Analysis

The authors used several mainstream action recognition models (VideoMAEv2, TimeSformer, ResNet+GRU) to conduct experimental analysis on the HSTAG dataset. The results show that:

- The VideoMAE model performs best in terms of overall accuracy and macro-averaged F1.
- Class imbalance is one of the biggest challenges for all models, which perform particularly poorly on the Tool Change and Grinding classes.
- In t-SNE visualizations, the VideoMAE model shows better class separation in the embedding space.

### Summary

The HSTAG dataset enriches domain-specific benchmarking and encourages the development of new computer vision algorithms, especially for handling class imbalance and high-frequency action transitions. This provides new tools and methods for studying the evolution of ancient human technology and language.
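The experimental analysis above reports macro-averaged F1 alongside overall accuracy, which matters on a class-imbalanced dataset. The following minimal sketch shows why the two metrics can diverge. The labels and counts are invented toy data, not HSTAG statistics, and the helper functions are hypothetical rather than the authors' evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 score for each label, computed one-vs-rest."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data (ASSUMED, not from the paper): 90 samples of a common class,
# 10 of a rare class, and a degenerate model that always predicts the common one.
y_true = ["common_action"] * 90 + ["rare_action"] * 10
y_pred = ["common_action"] * 100

acc = accuracy(y_true, y_pred)
f1m = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro-F1 = {f1m:.3f}")
```

In this toy case accuracy stays high while macro-F1 drops sharply, because the rare class contributes an F1 of zero to the unweighted mean. That is the kind of gap a class-imbalanced benchmark is expected to expose.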

Human Stone Toolmaking Action Grammar (HSTAG): A Challenging Benchmark for Fine-grained Motor Behavior Recognition

HabitAction: A Video Dataset for Human Habitual Behavior Recognition

FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding

Human Action Recognition Using Deep Learning Methods.

Multi-Granularity Hand Action Detection

Action Genome: Actions as Composition of Spatio-temporal Scene Graphs

Action Recognition by Exploring Data Distribution and Feature Correlation

Indications, complications and results with silicone stents.

ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition

TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding

FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition in Kitchen Scenes

Learning Latent Spatio-Temporal Compositional Model for Human Action Recognition

Representing Videos As Discriminative Sub-graphs for Action Recognition*

Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric

Spatial-temporal hypergraph based on dual-stage attention network for multi-view data lightweight action recognition

Benchmarking Micro-action Recognition: Dataset, Methods, and Applications

HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization

STAIR Actions: A Video Dataset of Everyday Home Actions

STGauntlet - Recognizing Hand Gestures over Multiple Hand-Worn Motion Sensors.

Action Recognition Utilizing YGAR Dataset

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition