Abstract: Purpose: The purpose of this paper is to provide a fast and accurate network for spatiotemporal action localization in videos. The network detects human actions in both time and space simultaneously and in real time, which makes it applicable to real-world scenarios such as safety monitoring and collaborative assembly. Design/methodology/approach: This paper designs an end-to-end deep learning network called collaborator only watch once (COWO). COWO recognizes ongoing human activities in real time with enhanced accuracy. COWO inherits the architecture of you only watch once (YOWO), known to be the best-performing network for online action localization to date, but with three major structural modifications. First, COWO enhances intraclass compactness and enlarges interclass separability at the feature level: a new correlation channel fusion and attention mechanism is designed based on the Pearson correlation coefficient and, accordingly, a correction loss function is designed that minimizes the same-class distance and enhances intraclass compactness. Second, a probabilistic K-means clustering technique is used to select the initial seed points, the idea being that the initial distance between cluster centers should be as large as possible. Third, the CIoU regression loss function is applied instead of the Smooth L1 loss function to help the model converge stably. Findings: COWO outperforms the original YOWO with frame-mAP improvements of 3% and 2.1% at a speed of 35.12 fps. Compared with the two-stream, T-CNN and C3D methods, the improvement is about 5% and 14.5% when applied to the J-HMDB-21, UCF101-24 and AGOT data sets. Originality/value: COWO offers more flexibility for assembly scenarios because it perceives spatiotemporal human actions in real time. It contributes to many real-world scenarios such as safety monitoring and collaborative assembly.
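Two of the modifications named in the abstract have well-known generic forms, which the following minimal Python sketch illustrates. It is not the authors' implementation. The function names `ciou_loss` and `distance_weighted_seeds`, the `(x1, y1, x2, y2)` box format and the `eps` constant are assumptions made for illustration. The seeding routine follows the standard k-means++ rule (each new seed is drawn with probability proportional to its squared distance from the nearest existing seed), which is assumed here to be the kind of probabilistic seeding the abstract describes.

```python
import math
import random


def ciou_loss(box_a, box_b, eps=1e-9):
    """Complete-IoU loss for two boxes in (x1, y1, x2, y2) format (assumed)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    aw, ah = ax2 - ax1, ay2 - ay1
    bw, bh = bx2 - bx1, by2 - by1

    # Overlap term: plain intersection over union.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    iou = inter / (union + eps)

    # Distance term: squared center distance over the squared diagonal
    # of the smallest box enclosing both inputs.
    center_dist2 = (((ax1 + ax2) - (bx1 + bx2)) ** 2
                    + ((ay1 + ay2) - (by1 + by2)) ** 2) / 4.0
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    diag2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (
        math.atan(bw / (bh + eps)) - math.atan(aw / (ah + eps))
    ) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - iou + center_dist2 / diag2 + alpha * v


def distance_weighted_seeds(points, k, rng):
    """Pick k initial cluster centers with the k-means++ rule (assumed)."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance from each point to its nearest chosen seed.
        d2 = [
            min(sum((p - s) ** 2 for p, s in zip(pt, seed)) for seed in seeds)
            for pt in points
        ]
        # Far-away points are proportionally more likely to be chosen.
        seeds.append(rng.choices(points, weights=d2, k=1)[0])
    return seeds


if __name__ == "__main__":
    print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))
    print(distance_weighted_seeds([(0, 0), (1, 1), (10, 10)], 2, random.Random(0)))
```

Unlike Smooth L1 on raw coordinates, the CIoU loss still changes with the center distance when the boxes do not overlap, which is one common argument for its more stable convergence. The seeding routine assumes the input contains at least two distinct points, since `random.choices` raises an error when every weight is zero.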

OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos

Egocentric Audio-Visual Object Localization

Open-Vocabulary Temporal Action Localization using Multimodal Guidance

COWO: Towards Real-Time Spatiotemporal Action Localization in Videos

Video OWL-ViT: Temporally-consistent open-world localization in video

SmallTAL: Real-Time Egocentric Online Temporal Action Localization for the Data-Impoverished

VAL: Visual-Attention Action Localizer

Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection

When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs

ContextLoc++: A Unified Context Model for Temporal Action Localization

Spatiotemporal Action Recognition in Restaurant Videos

Enriching Local and Global Contexts for Temporal Action Localization

Single-Stage Visual Query Localization in Egocentric Videos

Localizing Unseen Activities in Video Via Image Query

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

OW-TAL: Learning Unknown Human Activities for Open-World Temporal Action Localization

DeTAL: Open-Vocabulary Temporal Action Localization with Decoupled Networks

Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization

Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language

Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context

Egocentric Auditory Attention Localization in Conversations