Abstract: Deep neural networks have recently achieved state-of-the-art results in many machine learning problems, e.g., speech recognition or object recognition. Hitherto, work on rectified linear units (ReLU) has provided empirical and theoretical evidence of a performance increase of neural networks compared to the typically used sigmoid activation function. In this paper, we investigate a new manner of improving neural networks by introducing a bunch of copies of the same neuron modeled by the generalized Kumaraswamy distribution. As a result, we propose a novel non-linear activation function, which we refer to as the Kumaraswamy unit, that is closely related to ReLU. In the experimental study with the MNIST image corpus, we evaluate the Kumaraswamy unit applied to a single-layer (shallow) neural network and report a significant drop in test classification error and test cross-entropy in comparison to the sigmoid unit, ReLU and Noisy ReLU.

What problem does this paper attempt to address?

This paper attempts to improve the performance of neural networks by introducing a new non-linear activation function, the Kumaraswamy unit. Specifically, the author raises the following question and investigates it:

### Research Questions:

1. **Can the Kumaraswamy unit be used in a single-layer neural network instead of the traditional Sigmoid, ReLU or Noisy ReLU units to obtain better training results?**

### Research Background:

- Deep neural networks have achieved state-of-the-art results in many machine-learning tasks, such as speech recognition and object recognition.
- The traditionally used Sigmoid activation function suffers from vanishing gradients, which may slow down optimization and leave it trapped in local optima.
- ReLU and its variants (such as Leaky ReLU, Parametric ReLU, and Adaptive Piecewise Linear Units (APLU)) have shown better performance than Sigmoid in multiple applications.

### Proposed Method:

- The author proposes a new modeling method: a bunch of neurons sharing the same weights and biases is modeled through the Generalized Kumaraswamy Distribution (KUM-G), which yields the Kumaraswamy unit.
- The Kumaraswamy unit is defined as follows:

\[ K\sigma(x|a,b) = 1 - (1 - \sigma(x)^a)^b \]

where \(\sigma(x)\) is the Sigmoid function, and \(a\) and \(b\) are shape parameters. When \(a = b = 1\), the Kumaraswamy unit reduces to the Sigmoid function.

### Experimental Verification:

- The author conducts experiments on the MNIST handwritten digit classification dataset and compares the performance of different activation functions (Sigmoid, ReLU, Noisy ReLU, Kumaraswamy(5,6), Kumaraswamy(8,30)) in a single-layer neural network.
- The cross-entropy loss is used as an evaluation metric, and training is carried out by stochastic gradient descent.

### Experimental Results:

- The results show that the Kumaraswamy unit is superior to the other activation functions in both test classification error rate and test cross-entropy loss.
- In particular, Kumaraswamy(8,30) performs the best in all tests, with a test classification error rate of 4.87% and a test cross-entropy loss of 0.16.

### Discussion:

- The Kumaraswamy unit can approximate the behavior of ReLU in some cases while keeping its output within [0, 1], which may be more biologically plausible.
- Compared with ReLU, the Kumaraswamy unit reduces the neuron saturation phenomenon, and the average activation value of all neurons is less than 0.5.

### Conclusion:

- This research shows that the Kumaraswamy unit performs excellently in a single-layer neural network, providing a good starting point for further research on its application in multi-layer neural networks.
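The definition of the Kumaraswamy unit given in the summary, \(K\sigma(x|a,b) = 1 - (1 - \sigma(x)^a)^b\), can be illustrated with a minimal Python sketch. This is not the author's implementation: the function names `sigmoid` and `kumaraswamy_unit` are hypothetical, the code works on scalars only, and it assumes positive shape parameters. It simply evaluates the formula and compares the two parameter settings mentioned in the experiments with the plain Sigmoid.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def kumaraswamy_unit(x: float, a: float = 1.0, b: float = 1.0) -> float:
    """Evaluate K_sigma(x | a, b) = 1 - (1 - sigma(x)**a)**b.

    Assumes a > 0 and b > 0 (shape parameters); names are illustrative only.
    """
    s = sigmoid(x)
    return 1.0 - (1.0 - s ** a) ** b


if __name__ == "__main__":
    # Compare Sigmoid (a = b = 1) with the two settings named in the summary.
    for x in (-2.0, 0.0, 2.0, 4.0):
        print(
            f"x={x:+.1f}  "
            f"sigmoid={kumaraswamy_unit(x, 1, 1):.4f}  "
            f"K(5,6)={kumaraswamy_unit(x, 5, 6):.4f}  "
            f"K(8,30)={kumaraswamy_unit(x, 8, 30):.4f}"
        )
```

With \(a = b = 1\) the function returns exactly the Sigmoid value, matching the special case stated in the definition, and for any positive \(a\), \(b\) the output stays within [0, 1].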

Improving neural networks with bunches of neurons modeled by Kumaraswamy units: Preliminary study

Improving Fault Tolerance for Reliable DNN Using Boundary-Aware Activation

EraseReLU: A Simple Way to Ease the Training of Deep Convolution Neural Networks.

ReLUs Are Sufficient for Learning Implicit Neural Representations

Neural networks with ReLU powers need less depth

Improving performance of recurrent neural network with relu nonlinearity

Effects of the Nonlinearity in Activation Functions on the Performance of Deep Learning Models

Competition-based Adaptive ReLU for Deep Neural Networks

Weight initialization based‐rectified linear unit activation function to improve the performance of a convolutional neural network model

Improving Convolutional Neural Network Using Pseudo Derivative ReLU.

Smooth activations and reproducibility in deep networks

Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks

Hyperbolic Linear Units For Deep Convolutional Neural Networks

Deep Learning with S-shaped Rectified Linear Activation Units

Why ReLU Units Sometimes Die: Analysis of Single-Unit Error Backpropagation in Neural Networks

Properties of the geometry of solutions and capacity of multi-layer neural networks with Rectified Linear Units activations

Parametric Variational Linear Units (PVLUs) in Deep Convolutional Networks

FReLU: Flexible Rectified Linear Units for Improving Convolutional Neural Networks

Adaptive Blending Units: Trainable Activation Functions for Deep Neural Networks

Hidden Unit Specialization in Layered Neural Networks: ReLU vs. Sigmoidal Activation

Growing Cosine Unit: A Novel Oscillatory Activation Function That Can Speedup Training and Reduce Parameters in Convolutional Neural Networks