Why do CNNs excel at feature extraction? A mathematical explanation

Vinoth Nandakumar,Arush Tagade,Tongliang Liu
2023-07-03
Abstract:Over the past decade deep learning has revolutionized the field of computer vision, with convolutional neural network models proving to be very effective for image classification benchmarks. However, a fundamental theoretical questions remain answered: why can they solve discrete image classification tasks that involve feature extraction? We address this question in this paper by introducing a novel mathematical model for image classification, based on feature extraction, that can be used to generate images resembling real-world datasets. We show that convolutional neural network classifiers can solve these image classification tasks with zero error. In our proof, we construct piecewise linear functions that detect the presence of features, and show that they can be realized by a convolutional network.
Computer Vision and Pattern Recognition,Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper attempts to address is why Convolutional Neural Networks (CNNs) are particularly adept at feature extraction in image classification tasks. To answer this question, the authors propose a new mathematical model to generate images similar to real datasets and demonstrate that CNNs can perfectly solve these image classification tasks. Specifically, the authors construct some piecewise linear functions to detect features in images and show how these functions can be implemented through convolutional neural networks. The main contribution of the paper is the proposal of a rigorous mathematical framework for describing a series of image classification problems. This framework is based on the observation that objects are composed of a set of constituent features and intuitively posits that an object exists when all features appear in the image. The authors prove that convolutional network classifiers can flawlessly solve such image classification tasks and, in the process of proving this, construct piecewise linear functions to detect features in images. These functions are constructed by summing functions defined on input image patches; each function detects whether the image patch contains the relevant feature. The network used by the authors includes one convolutional layer and multiple fully connected layers, with the number of required convolutional filters increasing linearly with the complexity of the constituent features.