Video RWKV: Video Action Recognition Based RWKV

Zhuowen Yin, Chengru Li, Xingbo Dong

2024-11-08

Abstract: To address the challenges of high computational costs and long-distance dependencies in existing video understanding methods, such as CNNs and Transformers, this work introduces RWKV to the video domain in a novel way. We propose an LSTM CrossRWKV (LCR) framework, designed for spatiotemporal representation learning to tackle the video understanding task. Specifically, the proposed linear-complexity LCR incorporates a novel Cross RWKV gate to facilitate interaction between current-frame edge information and past features, enhancing the focus on the subject through edge features and globally aggregating inter-frame features over time. LCR stores long-term memory for video processing through an enhanced LSTM recurrent execution mechanism. By leveraging the Cross RWKV gate and recurrent execution, LCR effectively captures both spatial and temporal features. Additionally, the edge information serves as a forgetting gate for the LSTM, guiding long-term memory management. The tube masking strategy reduces redundant information in the video and reduces overfitting. These advantages enable LSTM CrossRWKV to set a new benchmark in video understanding, offering a scalable and efficient solution for comprehensive video analysis. All code and models are publicly available.

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

This paper addresses two problems in video understanding: high computational cost and the difficulty of modeling long-distance dependencies. Existing methods such as Convolutional Neural Networks (CNNs) and Transformers capture spatio-temporal features well, but they usually require large amounts of computation, which limits their scalability and practical deployment. The paper therefore introduces RWKV into the video domain and proposes a new LSTM CrossRWKV (LCR) framework for spatio-temporal representation learning in video understanding tasks. Specifically, the LCR framework improves video processing in the following ways:

1. **Linear Complexity**: LCR has linear complexity and uses a novel Cross RWKV gate to let the edge information of the current frame interact with past features, which sharpens the focus on the subject and aggregates inter-frame features globally over time.
2. **Long-term Memory Storage**: LCR stores long-term memory during video processing through an enhanced LSTM recurrent execution mechanism.
3. **Edge Information as Forget Gate**: Edge information acts as the forget gate of the LSTM, guiding long-term memory management and reducing the influence of redundant information.
4. **Tube Masking Strategy**: Tube masking reduces redundant information in the video and lowers the risk of overfitting.

According to the authors, these improvements enable LSTM CrossRWKV to set a new benchmark in video understanding and provide a scalable and efficient solution for comprehensive video analysis.
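The tube masking idea mentioned above can be illustrated with a minimal sketch. It assumes the common formulation in which one random spatial mask over patches is drawn once and repeated for every frame, so each masked patch forms a "tube" along the time axis. The function name `tube_mask`, the patch grid size, and the 0.75 mask ratio are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, seed=None):
    """Return mask[t][i], where True means patch i of frame t is masked.

    Illustrative assumption: the same spatial pattern is reused for all
    frames, which is what makes the mask a set of temporal "tubes".
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    # Draw the masked patch indices once, for a single frame.
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [i in masked for i in range(num_patches)]
    # Repeat the identical 2D pattern along the time axis.
    return [list(frame_mask) for _ in range(num_frames)]


# Hypothetical settings: 8 frames, a 14x14 patch grid, 75% of patches masked.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.75, seed=0)
visible_per_frame = sum(not m for m in mask[0])
```

Because a masked location is hidden in every frame, a model cannot recover it by copying the same patch from a neighboring frame, which is one plausible reading of how the strategy removes temporal redundancy.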

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

Deep RNN Framework for Visual Sequential Applications

VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models

VideoMamba: State Space Model for Efficient Video Understanding

CC-LSTM: Cross and Conditional Long-Short Time Memory for Video Captioning

Relational Long Short-Term Memory for Video Action Recognition

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Towards Long-Form Video Understanding

Deep Multi-Kernel Convolutional LSTM Networks and an Attention-Based Mechanism for Videos

RRWKV: Capturing Long-range Dependencies in RWKV

RWKV-TS: Beyond Traditional Recurrent Neural Network for Time Series Tasks

Towards Real-Time Open-Vocabulary Video Instance Segmentation

RTQ: Rethinking Video-language Understanding Based on Image-text Model

Beyond Frame-level CNN: Saliency-Aware 3-D CNN with LSTM for Video Action Recognition

Optimizing Robotic Manipulation with Decision-RWKV: A Recurrent Sequence Modeling Approach for Lifelong Learning

TLS-RWKV: Real-Time Online Action Detection with Temporal Label Smoothing

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling

Long-term Residual Recurrent Network for Human Interaction Recognition in Videos

RWKV-CLIP: A Robust Vision-Language Representation Learner