Abstract: In this paper, we present a comprehensive study and propose several novel techniques for implementing 3D convolutional blocks using 2D and/or 1D convolutions with only 4D and/or 3D tensors. Our motivation is that 3D convolutions with 5D tensors are computationally very expensive and they may not be supported by some of the edge devices used in real-time applications such as robots. The existing approaches mitigate this by splitting the 3D kernels into spatial and temporal domains, but they still use 3D convolutions with 5D tensors in their implementations. We resolve this issue by introducing some appropriate 4D/3D tensor reshaping as well as new combination techniques for spatial and temporal splits. The proposed implementation methods show significant improvement both in terms of efficiency and accuracy. The experimental results confirm that the proposed spatio-temporal processing structure outperforms the original model in terms of speed and accuracy using only 4D tensors with fewer parameters.
What problem does this paper attempt to address?
The paper addresses two problems with 3D convolution in video analysis: its high computational cost and its limited support on edge devices. A conventional 3D convolution operates on 5D tensors (Tensor Shape = [B, T, X, Y, C]), where B is the batch size, T is the number of frames, X and Y are the width and height, and C is the number of channels. This extracts spatio-temporal information effectively, but it is computationally expensive and may not be supported by some edge devices used in real-time applications, such as robots.
To solve this problem, the authors propose new techniques for implementing 3D convolution blocks. These techniques use only 4D or 3D tensors and reproduce the effect of a 3D convolution with 2D and/or 1D convolutions. The main contributions of the paper are:
1. **Reducing computational complexity**: Decomposing the 3D convolution into 2D and 1D convolutions significantly reduces the computational cost, which lets the model run on resource-limited edge devices.
2. **Improving efficiency and accuracy**: The experimental results show that the proposed method improves both processing speed and model accuracy. For example, compared with the baseline 3D-CNN, the "Proposed-Add" variant reduces the number of parameters by 51% and the FLOPs by 51%, while increasing inference speed by 12%.
3. **Flexible combination methods**: The authors explore different ways of combining spatial and temporal processing (sequential, parallel, addition, concatenation, etc.) to find the most efficient and accurate combination.
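
To make the source of the savings concrete, the following minimal sketch counts the weights of a single layer. It assumes a cubic kernel of size k, no bias terms, and a parallel split into one 2D spatial convolution and one 1D temporal convolution; the function names and the channel sizes are illustrative and are not taken from the paper. The 51% figures quoted above are for the authors' whole network, so this per-layer arithmetic is not expected to reproduce them exactly.

```python
def conv_params(kernel_shape, c_in, c_out):
    """Number of weights in a convolution layer, ignoring bias."""
    n = c_in * c_out
    for k in kernel_shape:
        n *= k
    return n


def compare(c_in=64, c_out=64, k=3):
    """Compare a full 3D kernel with a 2D-spatial plus 1D-temporal split.

    Assumption: both split branches map c_in -> c_out channels, as they
    would if their outputs were later combined by addition.
    """
    full_3d = conv_params((k, k, k), c_in, c_out)
    spatial_2d = conv_params((k, k), c_in, c_out)
    temporal_1d = conv_params((k,), c_in, c_out)
    split = spatial_2d + temporal_1d
    reduction = 1 - split / full_3d
    return full_3d, split, reduction


if __name__ == "__main__":
    full_3d, split, reduction = compare()
    print(f"3D kernel: {full_3d} weights")
    print(f"2D + 1D split: {split} weights")
    print(f"reduction: {reduction:.1%}")
```

For a 3×3×3 kernel the ratio depends only on the kernel size: 27 weights per channel pair become 9 + 3 = 12.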
### Abstract
This paper presents a comprehensive study and introduces several novel techniques for implementing 3D convolution blocks using 2D and/or 1D convolutions and 4D and/or 3D tensors. The motivation is that 3D convolution with 5D tensors is computationally very expensive and may not be supported by some edge devices used in real-time applications (such as robots). Existing methods alleviate this problem by decomposing the 3D kernel into the spatial and temporal domains, but they still use 3D convolution with 5D tensors. The authors solve this problem by introducing appropriate 4D/3D tensor reshaping and new techniques for combining the spatial and temporal splits. The experimental results show that the proposed spatio-temporal processing structure achieves higher speed and accuracy with fewer parameters while using only 4D tensors.
### Conclusion
With appropriate tensor reshaping, 4D tensors can be used instead of 5D tensors in the proposed parallel structure. In addition, the 4D tensors processed by the two branches have the same shape, so they can be combined efficiently by simple addition. Using only 2D kernels makes spatial and temporal processing more efficient and significantly reduces memory consumption, which is crucial for real-time applications, especially on edge devices.
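
The shape bookkeeping behind such a parallel structure can be sketched as follows. This summary does not give the exact reshapes the authors use, so the two 4D layouts below are assumptions chosen for illustration: frames are folded into the batch axis for the spatial branch, and pixels are flattened onto one axis for the temporal branch. All function names are hypothetical, and the code only tracks shapes; it does not perform any convolution.

```python
from math import prod


def spatial_view(shape5d):
    """[B, T, X, Y, C] -> [B*T, X, Y, C]: each frame becomes a batch item."""
    b, t, x, y, c = shape5d
    return (b * t, x, y, c)


def temporal_view(shape5d):
    """[B, T, X, Y, C] -> [B, T, X*Y, C]: pixels are flattened onto one axis."""
    b, t, x, y, c = shape5d
    return (b, t, x * y, c)


def conv2d_same(shape4d, c_out):
    """Output shape of a stride-1, 'same'-padded 2D convolution (channels last)."""
    n, h, w, _ = shape4d
    return (n, h, w, c_out)


def parallel_add_shapes(shape5d, c_out):
    """Track shapes through an assumed two-branch, 2D-kernel-only block."""
    b, t, x, y, _ = shape5d
    # Spatial branch: a k x k kernel slides over (X, Y) within each frame.
    spatial_out = conv2d_same(spatial_view(shape5d), c_out)
    # Temporal branch: a k x 1 kernel slides over T only, for every pixel.
    temporal_out = conv2d_same(temporal_view(shape5d), c_out)
    # Both outputs hold the same number of elements, so each can be
    # reshaped to one common 4D layout and then added element-wise.
    common = (b, t, x * y, c_out)
    assert prod(spatial_out) == prod(temporal_out) == prod(common)
    return spatial_out, temporal_out, common


if __name__ == "__main__":
    s, m, common = parallel_add_shapes((2, 8, 16, 16, 3), c_out=32)
    print("spatial branch output: ", s)
    print("temporal branch output:", m)
    print("common layout for add: ", common)
```

No tensor in this sketch has more than four axes, which is the property that matters for devices without 5D tensor support.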