Neural networks with trainable matrix activation functions

Zhengqi Liu, Shuhao Cao, Yuwen Li, Ludmil Zikatanov
2024-10-28
Abstract: The training process of neural networks usually optimizes weights and bias parameters of linear transformations, while nonlinear activation functions are pre-specified and fixed. This work develops a systematic approach to constructing matrix-valued activation functions whose entries are generalized from ReLU. The activation is based on matrix-vector multiplications using only scalar multiplications and comparisons. The proposed activation functions depend on parameters that are trained along with the weights and bias vectors. Neural networks based on this approach are simple and efficient and are shown to be robust in numerical experiments.
Machine Learning
What problem does this paper attempt to address?
The paper addresses a limitation of nonlinear activation functions in conventional neural networks. When a deep neural network (DNN) is trained, usually only the weights and bias parameters of the linear transformations are optimized, while the nonlinear activation functions are pre-specified and fixed. Fixed activation functions cause two problems:

1. **Difficulty in choosing an appropriate activation function**: For a specific application, it is very difficult to determine the optimal activation function in advance.
2. **Performance bottleneck**: Although existing activation functions (such as ReLU and its variants) are effective, they can still limit performance in some cases, for example through the "dying ReLU" problem and the vanishing gradient problem.

To address these problems, the paper proposes a systematic method for constructing matrix-valued activation functions. The entries of these activation functions are generalized from ReLU and depend on trainable parameters, so the activation can adapt to the data and the task.

### Main contributions

1. **Introduction of trainable matrix-valued activation functions (TMAF)**:
   - The activation functions are based on matrix-vector multiplication and use only scalar multiplications and comparisons.
   - The activation functions depend on trainable parameters, which are trained together with the weights and bias vectors.
   - The resulting networks are simple and efficient, and they are robust in the numerical experiments.
2. **Extension of the form of activation functions**:
   - The construction extends from diagonal matrix activation functions to more general tridiagonal matrix activation functions and, in principle, to full-matrix activation functions.
   - The off-diagonal entries allow nonlinear mixing across the channel dimension, which increases the expressive power of the model.
3. **Verification of the effectiveness of the method**:
   - Experiments cover function approximation and image classification, including the MNIST and CIFAR-10 datasets.
   - In these experiments TMAF outperforms the standard ReLU activation on multiple tasks, especially when approximating highly oscillatory functions.

### Mathematical formulas

To describe the specific form of TMAF, the paper defines the following formulas.

- The diagonal matrix activation function \( D_\ell \) is:

\[
D_\ell(y)=\text{diag}(\alpha_{\ell,1}(y_1),\alpha_{\ell,2}(y_2),\ldots,\alpha_{\ell,n_\ell}(y_{n_\ell})),\quad y\in\mathbb{R}^{n_\ell}
\]

where each \( \alpha_{\ell,i}(s) \) is a piecewise constant function of the form:

\[
\alpha_{\ell,i}(s)=
\begin{cases}
t_{\ell,i,0},&s\in(-\infty,s_{\ell,i,1}]\\
t_{\ell,i,1},&s\in(s_{\ell,i,1},s_{\ell,i,2}]\\
\vdots\\
t_{\ell,i,m_{\ell,i}-1},&s\in(s_{\ell,i,m_{\ell,i}-1},s_{\ell,i,m_{\ell,i}}]\\
t_{\ell,i,m_{\ell,i}},&s\in(s_{\ell,i,m_{\ell,i}},\infty)
\end{cases}
\]

With these generalizations, the paper reports that TMAF outperforms fixed activations across the tested tasks, especially on complex, high-frequency signals.
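As a rough illustration of the piecewise constant definition above, the following sketch evaluates a diagonal activation in plain Python. It assumes the activation is applied as the matrix-vector product \( D_\ell(y)\,y \), so that entry \( i \) becomes \( \alpha_{\ell,i}(y_i)\,y_i \). The function names `alpha` and `diagonal_tmaf` and the sample breakpoints and values are hypothetical and chosen only for this example; in the paper the values \( t_{\ell,i,j} \) are trained, whereas here they are fixed numbers. With a single breakpoint at 0 and values 0 and 1, the construction reduces to ReLU.

```python
from bisect import bisect_left


def alpha(s, breakpoints, values):
    """Piecewise constant function: returns values[j] when s lies in
    (breakpoints[j-1], breakpoints[j]], with open-ended first and last pieces.
    `breakpoints` must be sorted; `values` has one more entry than `breakpoints`."""
    assert len(values) == len(breakpoints) + 1
    # bisect_left counts the breakpoints strictly below s, which is exactly
    # the index of the half-open interval (s_j, s_{j+1}] containing s.
    return values[bisect_left(breakpoints, s)]


def diagonal_tmaf(y, breakpoints, values):
    """Assumed form D(y) y: scale each entry y_i by alpha_i(y_i).
    `breakpoints[i]` and `values[i]` hold the parameters for entry i."""
    return [alpha(yi, b, t) * yi for yi, b, t in zip(y, breakpoints, values)]


y = [-2.0, 0.0, 3.0]

# ReLU as a special case: one breakpoint at 0, values t_0 = 0 and t_1 = 1.
relu_like = diagonal_tmaf(y, [[0.0]] * 3, [[0.0, 1.0]] * 3)

# A three-piece example with illustrative (not trained) values.
three_piece = diagonal_tmaf(
    [-2.0, 1.0, 2.0],
    [[-1.0, 1.0]] * 3,
    [[0.1, 0.5, 1.0]] * 3,
)
```

Each evaluation needs only comparisons (to locate the interval) and one scalar multiplication per entry, which matches the cost description in the abstract.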