Fractal Pyramid Networks

Zhiqiang Deng, Huimin Yu, Yangqi Long
DOI: https://doi.org/10.48550/arXiv.2106.14694
2021-06-28
Abstract: We propose a new network architecture, the Fractal Pyramid Networks (PFNs), for pixel-wise prediction tasks as an alternative to the widely used encoder-decoder structure. In the encoder-decoder structure, the input is processed by an encoding-decoding pipeline that tries to obtain a single semantic large-channel feature. In contrast, our proposed PFNs hold multiple information processing pathways and encode the information into multiple separate small-channel features. On the task of self-supervised monocular depth estimation, even without ImageNet pretraining, our models can compete with or outperform the state-of-the-art methods on the KITTI dataset with far fewer parameters. Moreover, the visual quality of the prediction is significantly improved. The semantic segmentation experiment provides evidence that PFNs can be applied to other pixel-wise prediction tasks, and demonstrates that our models can capture more global structure information.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses limitations of the widely used encoder-decoder structure in pixel-level prediction tasks, in particular the difficulty these models have in capturing global context information and restoring output resolution:

1. **Capturing global context information**: Architectures designed for classification are a poor fit for pixel-level prediction because they limit the model's ability to capture global context information, which these tasks depend on.
2. **Restoring output resolution**: In the encoder-decoder structure, low-level features are used both to generate high-level features and to restore resolution, which may lead to suboptimal results. Skip connections help restore high-resolution output, but they cannot be regarded as encoding paths because they do not generate high-level semantic features.

To solve these problems, the authors propose a new network architecture, Fractal Pyramid Networks (PFNs). The architecture fuses a pyramid structure with a fractal structure to provide multiple information processing paths, and it encodes information into multiple independent small-channel features instead of one large-channel semantic feature. According to the paper, this design lets the model better capture global structure information: on self-supervised monocular depth estimation it competes with or outperforms state-of-the-art methods on KITTI with far fewer parameters, and a semantic segmentation experiment suggests it carries over to other pixel-level prediction tasks.
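To give a rough feel for why several small-channel features can be much cheaper than one large-channel feature, the sketch below counts the weights of a standard 3x3 convolution at two widths. It is only back-of-the-envelope arithmetic under stated assumptions: the channel sizes (512 and 64), the number of pathways (8), and the helper `conv_params` are hypothetical choices made for illustration, not values or code from the paper, and the actual PFN layer layout is not described in this summary.

```python
def conv_params(in_ch, out_ch, kernel=3, bias=True):
    """Learnable parameters in one standard 2D convolution layer."""
    weights = in_ch * out_ch * kernel * kernel
    return weights + (out_ch if bias else 0)


# Hypothetical sizes, chosen only for illustration (not from the paper).
LARGE_CHANNELS = 512                              # one wide bottleneck feature
NUM_PATHWAYS = 8                                  # assumed number of pathways
SMALL_CHANNELS = LARGE_CHANNELS // NUM_PATHWAYS   # 64 channels per pathway

# One convolution operating on a single large-channel feature.
single_large = conv_params(LARGE_CHANNELS, LARGE_CHANNELS)

# One convolution per pathway, each on its own small-channel feature.
many_small = NUM_PATHWAYS * conv_params(SMALL_CHANNELS, SMALL_CHANNELS)

print(f"single large-channel conv: {single_large:,} parameters")
print(f"{NUM_PATHWAYS} small-channel convs:  {many_small:,} parameters")
print(f"ratio: {single_large / many_small:.2f}x")
```

Because convolution weights grow with the product of input and output channels, splitting the same total width across independent pathways reduces the weight count by roughly the number of pathways. Whether this accounts for the parameter savings reported in the paper is an assumption here; the real architecture also includes fusion between pathways, which this count ignores.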