Abstract: Online action detection (OAD) is a challenging task that involves predicting the ongoing action class in real-time streaming videos, which is essential in fields such as autonomous driving and video surveillance. In this article, we propose an approach for OAD based on the Receptance Weighted Key Value (RWKV) model with temporal label smoothing (TLS). The RWKV model captures temporal dependencies while remaining computationally efficient, which makes it well-suited for real-time applications. Our TLS-RWKV model demonstrates advancements in two aspects. First, in experiments on two widely used datasets, THUMOS'14 and TVSeries, our approach achieves state-of-the-art performance with 71.8% mAP on THUMOS'14 and 89.7% cAP on TVSeries. Second, our approach is highly efficient, running at over 600 FPS while maintaining a competitive mAP of 59.9% on THUMOS'14 with RGB features alone. Notably, this is more than twice as fast as the prior state-of-the-art model, TesTra. Even when executed on a CPU, our model exceeds 200 FPS. This efficiency makes our model suitable for real-time deployment, even on resource-constrained devices. These results showcase the effectiveness and competitiveness of our proposed approach in OAD.

### What problems does this paper attempt to solve?

This paper aims to solve several key challenges in **Online Action Detection (OAD)**. Specifically, the authors propose a new method for real-time online action detection and pay special attention to the following issues:

1. **Real-time requirements**:
   - Online action detection must predict the ongoing action category in a video stream in real time, which places high demands on computational efficiency. In fields such as autonomous driving and video surveillance, real-time response is crucial.
2. **Long-term dependency capture**:
   - Action detection usually needs long-term context to accurately identify the start and end of an action. However, existing models such as RNNs and Transformers have limitations on long sequences. RNNs are difficult to parallelize during training and are prone to the vanishing gradient problem, while Transformers can handle long-range dependencies but are too computationally expensive for real-time applications.
3. **Label smoothing**:
   - In online action detection, action boundaries are often unclear, which makes the model error-prone near them. The traditional label assignment method usually assigns the label of a specific time point to the entire segment, which is coarse and may introduce noise.

To solve these problems, the authors propose a new method based on the **RWKV model** and introduce the **Temporal Label Smoothing (TLS)** technique. This method not only improves the performance of the model but also significantly improves computational efficiency, enabling real-time deployment on resource-constrained devices.

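As a rough illustration of the label-smoothing idea in item 3, the sketch below softens hard per-frame labels along the time axis so that frames near an action boundary receive mixed targets instead of an abrupt switch. The Gaussian kernel, the `sigma` and `radius` parameters, and the function name `smooth_labels_over_time` are assumptions made for illustration; the summary above does not specify the exact TLS formulation used in the paper.

```python
import math


def smooth_labels_over_time(labels, num_classes, sigma=1.0, radius=2):
    """Turn hard per-frame class ids into soft targets by averaging
    one-hot labels over neighbouring frames with Gaussian weights.

    Hypothetical helper: the kernel shape, sigma and radius are
    illustrative assumptions, not values taken from the paper.
    """
    num_frames = len(labels)
    soft = []
    for t in range(num_frames):
        row = [0.0] * num_classes
        total = 0.0
        for offset in range(-radius, radius + 1):
            s = t + offset
            if 0 <= s < num_frames:  # ignore neighbours outside the clip
                weight = math.exp(-(offset ** 2) / (2.0 * sigma ** 2))
                row[labels[s]] += weight
                total += weight
        # Renormalise so every frame's target sums to 1.
        soft.append([value / total for value in row])
    return soft


# Class 0 = background, class 1 = an action that starts at frame 4.
hard = [0, 0, 0, 0, 1, 1, 1, 1]
soft = smooth_labels_over_time(hard, num_classes=2)
```

Frames far from the boundary keep their original one-hot target, while frames 3 and 4 receive a blend of background and action, which is the kind of softened supervision that boundary-aware label assignment aims for.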
### Specific contributions:

- **Application of the RWKV model**: The RWKV model combines the long-range dependency modeling of Transformers with the efficient inference of RNNs, making it suitable for real-time online action detection.
- **Temporal Label Smoothing (TLS)**: Temporal label smoothing improves label assignment, reducing ambiguity and uncertainty near action boundaries.
- **Experimental verification**: Experiments on two commonly used datasets, THUMOS'14 and TVSeries, demonstrate the method's advantages in both performance and efficiency.

These improvements enable the TLS-RWKV model to run at over 600 FPS while maintaining competitive accuracy (59.9% mAP on THUMOS'14 with RGB features alone), and to exceed 200 FPS even on a CPU, which makes it suitable for low-resource environments such as edge computing.
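To make the RNN-style inference property concrete, the sketch below runs an RWKV-style weighted key-value (WKV) recurrence one frame at a time for a single channel. It keeps only two running sums as state, so the per-frame cost does not grow with the length of the stream. It follows the commonly published RWKV-4 form without the numerical-stability tricks used in practice. The scalar setup, the names `wkv_step`, `wkv_full_history`, `decay` and `bonus`, and the toy inputs are illustrative assumptions, not the TLS-RWKV implementation.

```python
import math


def wkv_step(state, k, v, decay, bonus):
    """One recurrent WKV step for a single channel.

    state = (num, den): exponentially decayed running sums over past frames.
    decay (w >= 0) controls how fast the past fades; bonus (u) is an extra
    weight applied to the current frame only. Names are illustrative.
    """
    num, den = state
    current = math.exp(bonus + k)
    out = (num + current * v) / (den + current)
    # Decay the past by e^{-w}, then add the current frame to the state.
    factor = math.exp(-decay)
    new_state = (factor * num + math.exp(k) * v, factor * den + math.exp(k))
    return out, new_state


def wkv_full_history(keys, values, decay, bonus):
    """Reference: the same output for the last frame, recomputed from the
    whole history (cost grows with sequence length)."""
    t = len(keys) - 1
    num = sum(math.exp(-(t - 1 - i) * decay + keys[i]) * values[i] for i in range(t))
    den = sum(math.exp(-(t - 1 - i) * decay + keys[i]) for i in range(t))
    current = math.exp(bonus + keys[t])
    return (num + current * values[t]) / (den + current)


# Toy stream of per-frame keys and values (made-up numbers).
keys = [0.1, -0.3, 0.5, 0.0]
values = [1.0, 2.0, -1.0, 0.5]

state = (0.0, 0.0)
outputs = []
for k, v in zip(keys, values):
    out, state = wkv_step(state, k, v, decay=0.5, bonus=0.2)
    outputs.append(out)
```

The streaming loop and the full-history reference produce the same values. Because the recurrent form touches only the fixed-size `state` per frame, a model built on this kind of update can process a live video stream at a constant cost per frame.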

TLS-RWKV: Real-Time Online Action Detection with Temporal Label Smoothing

Annealing Temporal-Spatial Contrastive Learning for Multi-View Online Action Detection

An empirical study on temporal modeling for online action detection

Real-time Online Video Detection with Temporal Smoothing Transformers

Temporal Distinct Representation Learning for Action Recognition

YOWOv2: A Stronger yet Efficient Multi-level Detection Framework for Real-time Spatio-temporal Action Detection

OadTR: Online Action Detection with Transformers

Long Short-Term Transformer for Online Action Detection

Spatial–Temporal Context-Aware Online Action Detection and Prediction

TKD: Temporal Knowledge Distillation for Active Perception

DFAformer: A Dual Filtering Auxiliary Transformer for Efficient Online Action Detection in Streaming Videos

Temporally Identity-Aware SSD With Attentional LSTM

MALT: Multi-scale Action Learning Transformer for Online Action Detection

Efficient Video Action Detection with Token Dropout and Context Refinement

A Tracking-Based Two-Stage Framework for Spatio-Temporal Action Detection

ODTrack: Online Dense Temporal Token Learning for Visual Tracking

TEDdet: Temporal Feature Exchange and Difference Network for Online Real-Time Action Detection

O-TALC: Steps Towards Combating Oversegmentation within Online Action Segmentation

StreamYOLO: Real-time Object Detection for Streaming Perception

ZSTAD: Zero-Shot Temporal Activity Detection

Temporal Dynamic Graph LSTM for Action-driven Video Object Detection