Abstract: Hand action recognition is a typical task in video understanding and has a wide range of applications. Existing works either focus mainly on full-body actions or define relatively coarse-grained action categories. In this paper, we propose FHA-Kitchens, a novel dataset of fine-grained hand actions in kitchen scenes. In particular, we focus on human hand interaction regions and examine them in depth to further refine the hand action information and the interaction regions. Our FHA-Kitchens dataset consists of 2,377 video clips and 30,047 images collected from 8 different types of dishes, and all hand interaction regions in each image are labeled with high-quality fine-grained action classes and bounding boxes. We represent the action information in each hand interaction region as a triplet, resulting in a total of 878 action triplets. Based on the constructed dataset, we benchmark representative action recognition and detection models on the following three tracks: (1) supervised learning for hand interaction region and object detection, (2) supervised learning for fine-grained hand action recognition, and (3) intra- and inter-class domain generalization for hand interaction region detection. The experimental results provide empirical evidence of the challenges inherent in fine-grained hand action recognition and point to potential avenues for future research, particularly in pre-training strategy, model design, and domain generalization. The dataset will be released at https://github.com/tingZ123/FHA-Kitchens.

Multi-Granularity Hand Action Detection

FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition in Kitchen Scenes

MMAD: Multi-label Micro-Action Detection in Videos

FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding

FineAction: A Fine-Grained Video Dataset for Temporal Action Localization

Multi-fingered Robotic Hand Grasping in Cluttered Environments through Hand-object Contact Semantic Mapping

Human Action Recognition Using Deep Learning Methods

MVHANet: Multi-view Hierarchical Aggregation Network for Skeleton-Based Hand Gesture Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions

MM-SEAL: A Large-scale Video Dataset of Multi-person Multi-grained Spatio-temporally Action Localization

Fine-grained Action Analysis: A Multi-modality and Multi-task Dataset of Figure Skating

Human Stone Toolmaking Action Grammar (HSTAG): A Challenging Benchmark for Fine-grained Motor Behavior Recognition

ADL4D: Towards A Contextually Rich Dataset for 4D Activities of Daily Living

PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding

Precise Affordance Annotation for Egocentric Action Video Datasets

Fine-grained Hand Gesture Recognition in Multi-viewpoint Hand Hygiene

CZU-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and 10 wearable inertial sensors

Action Recognition by Exploring Data Distribution and Feature Correlation

HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization

SEAL: A Large-scale Video Dataset of Multi-grained Spatio-temporally Action Localization