Abstract: This study discusses the negative impact of the derivative of the activation function in the output layer of artificial neural networks, in particular in continual learning. We propose Hebbian descent as a theoretical framework to overcome this limitation. It is implemented through an alternative loss function for gradient descent, which we refer to as the Hebbian descent loss. This loss is effectively the generalized log-likelihood loss and corresponds to an alternative weight update rule for the output layer in which the derivative of the activation function is disregarded. We show how this update avoids vanishing error signals during backpropagation in saturated regions of the activation function, which is particularly helpful when training shallow neural networks and deep neural networks in which saturating activation functions are used only in the output layer. In combination with centering, Hebbian descent leads to better continual learning capabilities. It provides a unifying perspective on Hebbian learning, gradient descent, and generalized linear models, and we discuss the advantages and disadvantages of each. Given activation functions with a strictly positive derivative (as is often the case in practice), Hebbian descent inherits the convergence properties of regular gradient descent. While established pairings of loss and output layer activation function (e.g., mean squared error with linear, or cross-entropy with sigmoid/softmax) are subsumed by Hebbian descent, we provide general insights for designing arbitrary combinations of loss and activation function that benefit from Hebbian descent. For shallow networks, we show that Hebbian descent outperforms Hebbian learning, performs similarly to regular gradient descent, and performs much better than all other tested update rules in continual learning.
In combination with centering, Hebbian descent implements a forgetting mechanism that prevents catastrophic interference notably better than the other tested update rules. When training deep neural networks, our experimental results suggest that Hebbian descent performs as well as or better than gradient descent.
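
The difference between the two output-layer updates described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single sigmoid output layer trained with mean squared error, it assumes that centering means subtracting a mean `mu` from the input, and the names `output_layer_updates`, `W`, `b`, `x`, `t`, and `mu` are hypothetical.

```python
import numpy as np


def sigmoid(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))


def output_layer_updates(W, b, x, t, mu=None):
    """Return two weight gradients for a sigmoid output layer.

    grad_gd: gradient of 0.5 * ||h - t||^2 (includes the sigmoid derivative).
    grad_hd: the same error signal with the derivative disregarded,
             as the abstract describes for Hebbian descent.
    mu is an assumed input mean used for centering (zeros if omitted).
    """
    if mu is None:
        mu = np.zeros_like(x)
    xc = x - mu                      # centered input (assumption)
    a = W @ xc + b                   # pre-activation
    h = sigmoid(a)                   # output
    err = h - t                      # error signal
    grad_gd = np.outer(err * h * (1.0 - h), xc)  # scaled by sigmoid'(a)
    grad_hd = np.outer(err, xc)                  # derivative dropped
    return grad_gd, grad_hd


# A saturated unit: large pre-activation, but the target is 0.
W = np.array([[10.0, 10.0]])
b = np.zeros(1)
x = np.array([1.0, 1.0])
t = np.array([0.0])

grad_gd, grad_hd = output_layer_updates(W, b, x, t)
print("gradient descent :", grad_gd)
print("Hebbian descent  :", grad_hd)
```

In this saturated case the gradient-descent term is multiplied by `h * (1 - h)`, which is close to zero, so the error signal nearly vanishes even though the output is wrong. The second term keeps the full error times the input. For a sigmoid output it coincides with the gradient of the cross-entropy loss, one of the established pairings the abstract mentions.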

Hadamard Representations: Augmenting Hyperbolic Tangents in RL

Latent Assistance Networks: Rediscovering Hyperbolic Tangents in RL

Hyperbolic Deep Reinforcement Learning

Deep Representation with ReLU Neural Networks

Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

Effect of Activation Functions on the Training of Overparametrized Neural Nets

TaLU: A Hybrid Activation Function Combining Tanh and Rectified Linear Unit to Enhance Neural Networks

Learning to Represent Action Values as a Hypergraph on the Action Vertices

Zorro: A Flexible and Differentiable Parametric Family of Activation Functions That Extends ReLU and GELU

Taming the ReLU with Parallel Dither in a Deep Neural Network

Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning

Hebbian Descent: A Unified View on Log-Likelihood Learning

Deep Neural Networks with ReLU-Sine-Exponential Activations Break Curse of Dimensionality in Approximation on Hölder Class

Leveraging Continuously Differentiable Activation Functions for Learning in Quantized Noisy Environments

Neural networks with ReLU powers need less depth

Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory

QHD: A brain-inspired hyperdimensional reinforcement learning algorithm

Can Differentiable Decision Trees Enable Interpretable Reward Learning from Human Feedback?

Adaptive Rational Activations to Boost Deep Reinforcement Learning

Leaky ReLUs That Differ in Forward and Backward Pass Facilitate Activation Maximization in Deep Neural Networks