Abstract: In this paper, we present a comprehensive study and propose several novel techniques for implementing 3D convolutional blocks using 2D and/or 1D convolutions with only 4D and/or 3D tensors. Our motivation is that 3D convolutions with 5D tensors are computationally very expensive and they may not be supported by some of the edge devices used in real-time applications such as robots. The existing approaches mitigate this by splitting the 3D kernels into spatial and temporal domains, but they still use 3D convolutions with 5D tensors in their implementations. We resolve this issue by introducing some appropriate 4D/3D tensor reshaping as well as new combination techniques for spatial and temporal splits. The proposed implementation methods show significant improvement both in terms of efficiency and accuracy. The experimental results confirm that the proposed spatio-temporal processing structure outperforms the original model in terms of speed and accuracy using only 4D tensors with fewer parameters.

What problem does this paper attempt to address?

The paper addresses two problems with 3D convolution in video analysis: its high computational cost and its limited support on edge devices. Conventional 3D convolution operates on 5D tensors of shape [B, T, X, Y, C], where B is the batch size, T is the number of frames, X and Y are the width and height, and C is the number of channels. This approach extracts spatio-temporal information effectively, but its computational complexity is very high, and it may not be supported on the edge devices used in real-time applications such as robots.

To solve this problem, the authors propose new techniques for implementing 3D convolution blocks. These techniques use only 4D or 3D tensors and emulate the effect of 3D convolution with 2D and/or 1D convolution operations. The main contributions of the paper are:

1. **Reducing computational complexity**: Decomposing 3D convolution into 2D and 1D convolutions significantly reduces the computational complexity, enabling the model to run on resource-limited edge devices.
2. **Improving efficiency and accuracy**: The experimental results show that the proposed method improves both the processing speed and the accuracy of the model. For example, compared to the baseline 3D-CNN, the "Proposed-Add" method reduces the number of parameters by 51% and the FLOPs by 51%, while increasing the inference speed by 12%.
3. **Flexible combination methods**: The authors explore different ways of combining spatial and temporal analysis (sequential, parallel, addition, concatenation, etc.) to find the most efficient and accurate combination.

### Abstract

This paper presents a comprehensive study and introduces several novel techniques for implementing 3D convolution blocks using 2D and/or 1D convolutions and 4D and/or 3D tensors. The motivation is that 3D convolution with 5D tensors is computationally very expensive and may not be supported by some of the edge devices used in real-time applications such as robots. Existing methods alleviate this problem by decomposing the 3D kernel into the spatial and temporal domains, but they still use 3D convolution with 5D tensors. The authors solve this problem by introducing appropriate 4D/3D tensor reshaping and new combination techniques for the spatial and temporal splits. The experimental results show that the proposed spatio-temporal processing structure achieves higher speed and accuracy with fewer parameters while using only 4D tensors.

### Conclusion

Through appropriate tensor reshaping, 4D tensors can be used instead of 5D tensors in the proposed parallel structure. In addition, the 4D tensors processed by the two branches have the same shape, so they can be combined efficiently by simple addition. Using only 2D kernels makes spatial and temporal processing more efficient and significantly reduces memory consumption, which is crucial for real-time applications, especially on edge devices.
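The reshaping idea summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `conv2d_same` and `spatio_temporal_add`, the choice of folding T into the batch axis for the spatial branch, and the flattening of X and Y for the temporal branch are all assumptions made for illustration. The sketch only shows that a parallel spatial/temporal block can be built from 2D kernels on 4D tensors, with the two branch outputs sharing a shape so they can be added.

```python
import numpy as np


def conv2d_same(x, w):
    """Naive zero-padded ('same') 2D cross-correlation on a 4D tensor.

    x: [N, H, W, C_in]; w: [kh, kw, C_in, C_out] with odd kh and kw.
    Illustrative helper only, not an optimized or library routine.
    """
    kh, kw, _, c_out = w.shape
    n, h, wd, _ = x.shape
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2), (0, 0)))
    out = np.zeros((n, h, wd, c_out))
    for i in range(kh):
        for j in range(kw):
            # Shifted input window times the [C_in, C_out] slice at tap (i, j).
            out += xp[:, i:i + h, j:j + wd, :] @ w[i, j]
    return out


def spatio_temporal_add(video, w_spatial, w_temporal):
    """Hypothetical parallel block: spatial and temporal 2D convs, then add.

    video:      [B, T, X, Y, C_in] (used here only as the input layout)
    w_spatial:  [k, k, C_in, C_out], applied over (X, Y)
    w_temporal: [k, 1, C_in, C_out], applied over T only
    Returns a 4D tensor of shape [B, T, X*Y, C_out].
    """
    b, t, x, y, c = video.shape
    # Spatial branch: fold T into the batch axis -> 4D tensor [B*T, X, Y, C].
    spatial = conv2d_same(video.reshape(b * t, x, y, c), w_spatial)
    spatial = spatial.reshape(b, t, x * y, -1)
    # Temporal branch: flatten space -> 4D tensor [B, T, X*Y, C]; the (k, 1)
    # kernel slides along T and leaves each spatial position separate.
    temporal = conv2d_same(video.reshape(b, t, x * y, c), w_temporal)
    # Both branches now have the same 4D shape, so they combine by addition.
    return spatial + temporal


rng = np.random.default_rng(0)
clip = rng.standard_normal((2, 4, 5, 6, 3))   # B=2, T=4, X=5, Y=6, C=3
out = spatio_temporal_add(
    clip,
    rng.standard_normal((3, 3, 3, 8)),        # spatial 3x3 kernel
    rng.standard_normal((3, 1, 3, 8)),        # temporal 3x1 kernel
)
print(out.shape)  # (2, 4, 30, 8)
```

In this sketch, with k = 3, the two branches together hold 9 + 3 = 12 kernel taps per input/output channel pair, compared with 27 for a full 3×3×3 kernel. That arithmetic applies to this single illustrative block only and is not meant to reproduce the model-level figures reported in the paper.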

Is 3D Convolution with 5D Tensors Really Necessary for Video Analysis?

2D or Not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition

Design Light-weight 3D Convolutional Networks for Video Recognition: Temporal Residual, Fully Separable Block, and Fast Algorithm

Dynamic Spatio-Temporal Feature Learning via Graph Convolution in 3D Convolutional Networks

Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification

Deformable 3D Convolution for Video Super-Resolution

Exploring Temporal Differences in 3D Convolutional Neural Networks

Video-to-Image Casting: A Flatting Method for Video Analysis

3D-TDC: A 3D temporal dilation convolution framework for video action recognition

Continual 3D Convolutional Neural Networks for Real-time Processing of Videos

V4D: 4D Convolutional Neural Networks for Video-level Representation Learning

A Real-Time Action Representation With Temporal Encoding and Deep Compression

High Performance Implementation of 3D Convolutional Neural Networks on a GPU

Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition

3D Depthwise Convolution: Reducing Model Parameters in 3D Vision Tasks

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

Convolutional Tensor-Train LSTM for Spatio-temporal Learning

Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs?

S3TC: Spiking Separated Spatial and Temporal Convolutions with Unsupervised STDP-based Learning for Action Recognition

D3D: Dual 3-D Convolutional Network for Real-Time Action Recognition

VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control