Wavelet Convolutions for Large Receptive Fields

Shahaf E. Finder, Roy Amoyal, Eran Treister, Oren Freifeld
2024-07-15
Abstract: In recent years, there have been attempts to increase the kernel size of Convolutional Neural Nets (CNNs) to mimic the global receptive field of Vision Transformers' (ViTs) self-attention blocks. That approach, however, quickly hit an upper bound and saturated way before achieving a global receptive field. In this work, we demonstrate that by leveraging the Wavelet Transform (WT), it is, in fact, possible to obtain very large receptive fields without suffering from over-parameterization, e.g., for a $k \times k$ receptive field, the number of trainable parameters in the proposed method grows only logarithmically with $k$. The proposed layer, named WTConv, can be used as a drop-in replacement in existing architectures, results in an effective multi-frequency response, and scales gracefully with the size of the receptive field. We demonstrate the effectiveness of the WTConv layer within ConvNeXt and MobileNetV2 architectures for image classification, as well as backbones for downstream tasks, and show it yields additional properties such as robustness to image corruption and an increased response to shapes over textures. Our code is available at https://github.com/BGU-CS-VIL/WTConv.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper "Wavelet Convolutions for Large Receptive Fields" addresses how to enlarge the receptive field of Convolutional Neural Networks (CNNs) without over-parameterization, so as to approach the global receptive field of Vision Transformers (ViTs). The authors point out that simply increasing the convolution kernel size quickly reaches a limit: performance starts to saturate at a kernel size of 7×7, and further increases lead to performance degradation. Larger kernels also add a large number of trainable parameters, which causes over-parameterization. The paper therefore proposes WTConv (Wavelet Transform Convolution), a layer that uses the Wavelet Transform (WT) to achieve a very large receptive field while the number of parameters grows only logarithmically with the receptive-field size.

### Main Contributions

1. **WTConv Layer**: A new layer, WTConv, is proposed, which uses the Wavelet Transform to effectively increase the receptive field of convolutions.
2. **Plug-and-Play Design**: The WTConv layer can serve as a drop-in replacement for depth-wise convolutions without requiring additional modifications to existing architectures.
3. **Experimental Validation**: Experiments on multiple computer vision tasks, including image classification, semantic segmentation, and object detection, demonstrate the effectiveness of the WTConv layer.
4. **Analysis**: The paper analyzes the impact of the WTConv layer on the scalability, robustness, shape bias, and effective receptive field (ERF) of CNNs.

### Method Overview

- **Wavelet Transform**: The Wavelet Transform decomposes the input signal into different frequency components, so that convolutions can be performed separately in each frequency band.
- **Multi-Frequency Response**: By performing small-kernel convolutions at successive levels of the wavelet decomposition, the WTConv layer can better capture low-frequency information while the number of parameters grows only logarithmically with the receptive-field size.
- **Plug-and-Play**: The WTConv layer can be integrated into existing CNN architectures without requiring additional modifications.

### Experimental Results

- **Image Classification**: On the ImageNet-1K dataset, WTConv improves classification accuracy while adding only a small number of parameters and little computational overhead.
- **Semantic Segmentation**: On the ADE20K dataset, a WTConv-based backbone for UperNet improves the mIoU metric.
- **Object Detection**: On the COCO dataset, a WTConv-based backbone for Cascade Mask R-CNN improves the APbox and APmask metrics.

### Conclusion

By introducing the WTConv layer, the paper enlarges the receptive field of CNNs without over-parameterization and reports performance improvements on multiple computer vision tasks.
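
The logarithmic parameter growth described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation. It assumes a Haar wavelet basis (the summary does not name one), and the function names `haar_dwt2`, `cascade_ll`, and `wtconv_like_budget` are hypothetical. The parameter count is a simplified assumption: one k×k depth-wise kernel per sub-band per level, ignoring any convolution applied to the untransformed input.

```python
def haar_dwt2(x):
    """One level of an orthonormal 2D Haar transform.

    x is a list-of-lists image with even height and width. Returns four
    half-resolution sub-bands: the low-pass band (ll) and three detail
    bands (naming conventions for the detail bands vary).
    """
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(x), 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, len(x[0]), 2):
            a, b = x[i][j], x[i][j + 1]          # top row of the 2x2 block
            c, d = x[i + 1][j], x[i + 1][j + 1]  # bottom row of the 2x2 block
            r_ll.append((a + b + c + d) / 2)     # local average (low frequency)
            r_lh.append((a - b + c - d) / 2)     # difference between columns
            r_hl.append((a + b - c - d) / 2)     # difference between rows
            r_hh.append((a - b - c + d) / 2)     # diagonal difference
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def cascade_ll(x, levels):
    """Apply the transform repeatedly to the low-pass band only."""
    ll = x
    for _ in range(levels):
        ll, _lh, _hl, _hh = haar_dwt2(ll)
    return ll


def wtconv_like_budget(levels, k=3):
    """Simplified per-channel budget (an assumption, see the lead-in).

    Each level contributes four sub-bands with one k x k kernel each, so
    parameters grow linearly in `levels`. With Haar, each level-`levels`
    coefficient summarizes a 2**levels x 2**levels input block, so a k x k
    kernel there spans (2**levels) * k input pixels per side.
    """
    params_per_channel = 4 * levels * k * k
    receptive_field = (2 ** levels) * k
    return params_per_channel, receptive_field


if __name__ == "__main__":
    image = [[1.0] * 8 for _ in range(8)]
    print(cascade_ll(image, 3))  # a single coefficient summarizing all 8x8 pixels
    for levels in (1, 2, 3):
        print(levels, wtconv_like_budget(levels))
```

Under these assumptions, a receptive field of side R = 2^ℓ · k needs ℓ = log2(R / k) levels, so the parameter count 4 · ℓ · k² grows with log R. A single dense R × R kernel would instead need R² parameters per channel.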