Video RWKV: Video Action Recognition Based RWKV

Zhuowen Yin, Chengru Li, Xingbo Dong
2024-11-08
Abstract: To address the challenges of high computational costs and long-distance dependencies in existing video understanding methods, such as CNNs and Transformers, this work introduces RWKV to the video domain in a novel way. We propose an LSTM CrossRWKV (LCR) framework, designed for spatiotemporal representation learning to tackle the video understanding task. Specifically, the proposed linear-complexity LCR incorporates a novel Cross RWKV gate to facilitate interaction between current-frame edge information and past features, enhancing the focus on the subject through edge features and globally aggregating inter-frame features over time. LCR stores long-term memory for video processing through an enhanced LSTM recurrent execution mechanism. By leveraging the Cross RWKV gate and recurrent execution, LCR effectively captures both spatial and temporal features. Additionally, the edge information serves as a forgetting gate for LSTM, guiding long-term memory management. A tube masking strategy reduces redundant information in the video and reduces overfitting. These advantages enable LSTM CrossRWKV to set a new benchmark in video understanding, offering a scalable and efficient solution for comprehensive video analysis. All code and models are publicly available.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses two problems in video understanding: high computational cost and the difficulty of modeling long-distance dependencies. Existing methods such as Convolutional Neural Networks (CNNs) and Transformers capture spatio-temporal features well, but they usually require large amounts of computation, which limits their scalability and practical deployment. The paper therefore introduces RWKV into the video domain and proposes the LSTM CrossRWKV (LCR) framework for spatio-temporal representation learning. Specifically, LCR improves video processing in the following ways:

1. **Linear Complexity**: LCR has linear complexity and uses a novel Cross RWKV gate to let current-frame edge information interact with past features, which sharpens the focus on the subject and aggregates inter-frame features globally over time.
2. **Long-term Memory Storage**: LCR stores long-term memory for video processing through an enhanced LSTM recurrent execution mechanism.
3. **Edge Information as Forget Gate**: Edge information serves as the forget gate of the LSTM, guiding long-term memory management and reducing the influence of redundant information.
4. **Tube Masking Strategy**: Tube masking reduces redundant information in the video and lowers the risk of overfitting.

According to the authors, these improvements enable LSTM CrossRWKV to set a new benchmark in video understanding and provide a scalable, efficient solution for comprehensive video analysis.
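As a rough illustration of the tube masking idea mentioned above, the sketch below uses the common definition of tube masking: one random spatial mask is drawn over the patch grid and reused for every frame, so each masked position forms a "tube" along the time axis. This is an assumption about the general technique, not the authors' implementation; the function names `tube_mask` and `visible_tokens`, the patch-grid layout, and the `mask_ratio` parameter are all hypothetical.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, seed=None):
    """Return a [num_frames][grid_h * grid_w] boolean mask (True = masked).

    Hypothetical helper: the same spatial patch positions are masked in
    every frame, so each masked position forms a tube through time.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    # Draw the masked positions once, for a single frame.
    masked_positions = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [p in masked_positions for p in range(num_patches)]
    # Reuse the single-frame mask for every frame in the clip.
    return [list(frame_mask) for _ in range(num_frames)]


def visible_tokens(tokens, mask):
    """Keep only the unmasked patch tokens of each frame."""
    return [
        [tok for tok, is_masked in zip(frame, frame_mask) if not is_masked]
        for frame, frame_mask in zip(tokens, mask)
    ]


if __name__ == "__main__":
    frames, gh, gw = 4, 2, 4
    mask = tube_mask(frames, gh, gw, mask_ratio=0.75, seed=0)
    # Dummy tokens labelled (frame index, patch index) for illustration.
    tokens = [[(t, p) for p in range(gh * gw)] for t in range(frames)]
    for frame in visible_tokens(tokens, mask):
        print(frame)
```

Because the mask is shared across frames, a model cannot recover a masked patch simply by copying it from a neighbouring frame, which is the usual argument for why tube masking removes temporal redundancy.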