Abstract: Tanh is a sigmoidal activation function that suffers from the vanishing gradient problem. Researchers have proposed alternatives such as the rectified linear unit (ReLU), but these vanishing-resistant functions introduce other problems, such as bias shift and sensitivity to noise. To overcome the vanishing gradient problem without introducing such problems, we propose a new activation function, Rectified Linear Tanh (ReLTanh), that improves on the traditional Tanh. ReLTanh is constructed by replacing Tanh's saturated waveforms in the positive and negative inactive regions with two straight lines whose slopes are given by the derivatives of Tanh at two learnable thresholds. The middle Tanh waveform gives ReLTanh its nonlinear fitting ability, and the linear parts relieve the vanishing gradient problem. Because the thresholds that determine the slopes of the linear parts are learnable, ReLTanh can tolerate variation in its inputs and help minimize the cost function and maximize data-fitting performance. Mathematical derivations show that ReLTanh diminishes the vanishing gradient problem and that its thresholds can be trained. To verify the practical feasibility and effectiveness of ReLTanh, fault diagnosis experiments for planetary gearboxes and rolling bearings are conducted with stacked autoencoder-based deep neural networks (SAE-based DNNs). ReLTanh successfully alleviates the vanishing gradient problem and learns faster, more steadily, and more precisely than Tanh, which is consistent with the theoretical analysis. In addition, ReLTanh surpasses other popular activation functions, such as the ReLU family, Hexpo, and Swish, which indicates that ReLTanh has potential for practical application and further research.
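
The construction described in the abstract can be illustrated with a minimal sketch. It assumes that each linear part is the tangent line to Tanh at its threshold, so the function and its slope stay continuous there; the abstract states only that the slopes equal the derivatives of Tanh at the thresholds, so the intercepts are an assumption. The names `reltanh`, `reltanh_grad`, `lam_pos`, and `lam_neg`, as well as the default threshold values, are hypothetical and chosen for illustration. The thresholds are plain arguments here, and no threshold-learning step is shown.

```python
import math


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def reltanh(x, lam_pos=0.5, lam_neg=-1.5):
    """Tanh between the thresholds, straight lines outside them.

    lam_pos and lam_neg are illustrative threshold values (assumptions),
    not values taken from the paper.
    """
    if x >= lam_pos:
        # Line through (lam_pos, tanh(lam_pos)) with slope tanh'(lam_pos).
        return tanh_grad(lam_pos) * (x - lam_pos) + math.tanh(lam_pos)
    if x <= lam_neg:
        # Line through (lam_neg, tanh(lam_neg)) with slope tanh'(lam_neg).
        return tanh_grad(lam_neg) * (x - lam_neg) + math.tanh(lam_neg)
    return math.tanh(x)


def reltanh_grad(x, lam_pos=0.5, lam_neg=-1.5):
    """Derivative of the sketch above with respect to x."""
    if x >= lam_pos:
        return tanh_grad(lam_pos)
    if x <= lam_neg:
        return tanh_grad(lam_neg)
    return tanh_grad(x)


if __name__ == "__main__":
    for x in (-6.0, -1.5, 0.0, 0.5, 6.0):
        print(x, reltanh(x), reltanh_grad(x), tanh_grad(x))
```

Under these assumptions, the gradient for inputs of large magnitude stays at the constant slope set by the nearer threshold, whereas the gradient of Tanh itself decays toward zero. This is the mechanism the abstract credits for relieving the vanishing gradient problem.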
