Improving neural networks with bunches of neurons modeled by Kumaraswamy units: Preliminary study

Jakub Mikolaj Tomczak
DOI: https://doi.org/10.48550/arXiv.1505.02581
2015-05-11
Abstract: Deep neural networks have recently achieved state-of-the-art results in many machine learning problems, e.g., speech recognition or object recognition. Hitherto, work on rectified linear units (ReLU) provides empirical and theoretical evidence on performance increase of neural networks comparing to typically used sigmoid activation function. In this paper, we investigate a new manner of improving neural networks by introducing a bunch of copies of the same neuron modeled by the generalized Kumaraswamy distribution. As a result, we propose novel non-linear activation function which we refer to as Kumaraswamy unit which is closely related to ReLU. In the experimental study with MNIST image corpora we evaluate the Kumaraswamy unit applied to single-layer (shallow) neural network and report a significant drop in test classification error and test cross-entropy in comparison to sigmoid unit, ReLU and Noisy ReLU.
Machine Learning, Neural and Evolutionary Computing
What problem does this paper attempt to address?
This paper attempts to improve the performance of neural networks by introducing a new non-linear activation function, the Kumaraswamy unit. Specifically, the author poses the following question and investigates it:

### Research Questions:
1. **Can the Kumaraswamy unit be used in a single-layer neural network instead of the traditional Sigmoid, ReLU or Noisy ReLU units to obtain better training results?**

### Research Background:
- Deep neural networks have achieved state-of-the-art results in many machine-learning tasks, such as speech recognition and object recognition.
- The traditionally used Sigmoid activation function suffers from vanishing gradients, which may slow down optimization and leave it trapped in local optima.
- ReLU and its variants (such as Leaky ReLU, Parametric ReLU, and Adaptive Piecewise Linear Units (APLU)) have shown better performance than Sigmoid in multiple applications.

### Proposed Method:
- The author proposes a new modeling approach: a bunch of neurons sharing the same weights and biases is modeled through the Generalized Kumaraswamy Distribution (KUM-G), which yields the Kumaraswamy unit.
- The Kumaraswamy unit is defined as follows:

  \[ K\sigma(x|a,b) = 1-(1 - \sigma(x)^a)^b \]

  where \(\sigma(x)\) is the Sigmoid function, and \(a\) and \(b\) are shape parameters. When \(a = b = 1\), the Kumaraswamy unit reduces to the Sigmoid function.

### Experimental Verification:
- The author conducts experiments on the MNIST handwritten digit classification dataset and compares the performance of different activation functions (Sigmoid, ReLU, Noisy ReLU, Kumaraswamy(5,6), Kumaraswamy(8,30)) in a single-layer neural network.
- The cross-entropy loss is used as an evaluation metric, and training is carried out with stochastic gradient descent.

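The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not the author's implementation: the function names `sigmoid`, `kumaraswamy_unit` and `kumaraswamy_unit_grad` are hypothetical, and the gradient is obtained here by applying the chain rule to the stated formula, since gradient-based training such as stochastic gradient descent would need it.

```python
import math


def sigmoid(x):
    """Logistic function, written via tanh so large |x| does not overflow."""
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def kumaraswamy_unit(x, a=1.0, b=1.0):
    """K_sigma(x | a, b) = 1 - (1 - sigmoid(x)**a)**b, as defined in the summary."""
    s = sigmoid(x)
    return 1.0 - (1.0 - s ** a) ** b


def kumaraswamy_unit_grad(x, a=1.0, b=1.0):
    """Derivative with respect to x, derived by the chain rule (our derivation).

    d/dx [1 - (1 - s^a)^b] = a * b * s^a * (1 - s) * (1 - s^a)^(b - 1),
    using ds/dx = s * (1 - s). Assumes b >= 1 so the last factor stays
    finite when the unit saturates.
    """
    s = sigmoid(x)
    return a * b * s ** a * (1.0 - s) * (1.0 - s ** a) ** (b - 1.0)


if __name__ == "__main__":
    # With a = b = 1 the unit coincides with the plain sigmoid.
    print(kumaraswamy_unit(0.3, 1, 1), sigmoid(0.3))
    # One of the parameter settings named in the experiments.
    print(kumaraswamy_unit(0.3, 8, 30), kumaraswamy_unit_grad(0.3, 8, 30))
```

Setting `a = b = 1` recovers the sigmoid, which gives a quick sanity check for any implementation of the unit.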
### Experimental Results:
- The results show that the Kumaraswamy unit is superior to the other activation functions in both test classification error rate and test cross-entropy loss.
- In particular, Kumaraswamy(8,30) performs best in all tests, with a test classification error rate of 4.87% and a test cross-entropy loss of 0.16.

### Discussion:
- The Kumaraswamy unit can approximate the behavior of ReLU in some cases while keeping its output in the range [0, 1], which may be more biologically plausible.
- Compared with ReLU, the Kumaraswamy unit reduces neuron saturation, and the average activation value of all neurons is less than 0.5.

### Conclusion:
- This research shows that the Kumaraswamy unit performs excellently in a single-layer neural network, providing a good starting point for further research on its application in multi-layer neural networks.
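
To make the bounded-output point in the discussion concrete, the following sketch tabulates the unit at a few pre-activation values for the two parameter settings named in the experiments. It only evaluates the stated formula at fixed inputs. It does not reproduce the reported results or the average activations measured on a trained network, and the names `SETTINGS`, `INPUTS` and `tabulate` are hypothetical.

```python
import math


def sigmoid(x):
    """Plain logistic function; fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def kumaraswamy_unit(x, a, b):
    """K_sigma(x | a, b) = 1 - (1 - sigmoid(x)**a)**b."""
    return 1.0 - (1.0 - sigmoid(x) ** a) ** b


# (label, a, b): a = b = 1 is the sigmoid baseline; the other two pairs
# are the settings named in the experimental comparison.
SETTINGS = [
    ("sigmoid (a=1, b=1)", 1, 1),
    ("Kumaraswamy(5,6)", 5, 6),
    ("Kumaraswamy(8,30)", 8, 30),
]

# Illustrative pre-activation values (an arbitrary choice for this sketch).
INPUTS = [-4.0, -2.0, 0.0, 2.0, 4.0]


def tabulate():
    """Return {label: [unit output at each value in INPUTS]}."""
    return {
        label: [kumaraswamy_unit(x, a, b) for x in INPUTS]
        for label, a, b in SETTINGS
    }


if __name__ == "__main__":
    print(" " * 20 + "  " + "  ".join(f"{x:7.1f}" for x in INPUTS))
    for label, values in tabulate().items():
        print(f"{label:>20}  " + "  ".join(f"{v:7.4f}" for v in values))
```

Every tabulated value lies in [0, 1], and at a pre-activation of zero both Kumaraswamy settings give an output below the sigmoid's 0.5.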