Abstract: Recognizing and comprehending human actions and gestures is a crucial perception requirement for robots to interact with humans and carry out tasks in diverse domains, including service robotics, healthcare, and manufacturing. Event cameras, with their ability to capture fast-moving objects at a high temporal resolution, offer new opportunities compared to standard action recognition in RGB videos. However, previous research on event camera action recognition has primarily focused on sensor-specific network architectures and image encoding, which may not be suitable for new sensors and limit the use of recent advancements in transformer-based architectures. In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which first acquires spatial embeddings per event-frame and then applies a temporal self-attention mechanism. To better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations. The proposed $\mathcal{L}_{EC}$ promotes learning fine-grained spatial cues in the spatial backbone of VTN by contrasting temporally misaligned frames. We evaluate our method on real-world action recognition on the N-EPIC Kitchens dataset and achieve state-of-the-art results on both protocols: testing in seen kitchen (\textbf{74.9\%} accuracy) and testing in unseen kitchens (\textbf{42.43\% and 46.66\% accuracy}). Our approach also takes less computation time than competitive prior approaches, which demonstrates the potential of our framework \textit{EventTransAct} for real-world applications of event-camera based action recognition. Project Page: \url{https://tristandb8.github.io/EventTransAct_webpage/}
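The abstract states only that $\mathcal{L}_{EC}$ contrasts temporally misaligned frames; it does not give the formulation. As a rough illustration of the general idea, the sketch below assumes an InfoNCE-style objective in which per-frame embeddings of the same time step under two augmentations form the positive pair, and embeddings from other time steps of the same clip act as negatives. The function name `temporal_contrastive_loss`, the two-view setup, and the temperature value are hypothetical choices made for this example and are not taken from the authors' implementation.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def temporal_contrastive_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss over per-frame embeddings of one clip.

    view_a[t] and view_b[t] are embeddings of frame t under two
    augmentations (an assumption of this sketch). For each anchor
    view_a[t], the positive is view_b[t]; the temporally misaligned
    frames view_b[s], s != t, serve as negatives.
    """
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for t, anchor in enumerate(a):
        # Cosine similarity of the anchor to every frame in the other view.
        logits = [dot(anchor, other) / temperature for other in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the temporally aligned frame as the target.
        total += log_denom - logits[t]
    return total / len(a)


# Two toy clips of two frames each, with 2-D embeddings.
aligned = temporal_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [0.0, 1.0]])
misaligned = temporal_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                       [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setting the loss is near zero when each frame's embedding matches its temporally aligned counterpart and large when the matching embedding sits at a different time step, which is the behavior a frame-level contrastive term is meant to encourage in the spatial backbone.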

An Evaluation of Action Recognition Models on EPIC-Kitchens

The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

A Study on Differentiable Logic and LLMs for EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2023

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

Symbiotic Attention: UTS-Baidu Submission to the EPIC-Kitchens 2020 Action Recognition Challenge

EventTransAct: A video transformer-based framework for Event-camera based action recognition

Team PyKale (xy9) Submission to the EPIC-Kitchens 2021 Unsupervised Domain Adaptation Challenge for Action Recognition

Team VI-I2R Technical Report on EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2021

EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations

Baidu-UTS Submission to the EPIC-Kitchens Action Recognition Challenge 2019

ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition

Accuracy and Performance Comparison of Video Action Recognition Approaches

Egocentric Action Recognition by Video Attention and Temporal Context

Action Recognition and Benchmark Using Event Cameras

TransAction: ICL-SJTU Submission to EPIC-Kitchens Action Anticipation Challenge 2021

Object Aware Egocentric Online Action Detection

Are current long-term video understanding datasets long-term?

KITchen: A Real-World Benchmark and Dataset for 6D Object Pose Estimation in Kitchen Environments

A Benchmark Dataset and Comparison Study for Multi-Modal Human Action Analytics

Efficient Human Vision Inspired Action Recognition using Adaptive Spatiotemporal Sampling

DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition