Hidden Minima in Two-Layer ReLU Networks
Yossi Arjevani
2023-12-28
Abstract:The optimization problem associated to fitting two-layer ReLU networks having
$d$~inputs, $k$~neurons, and labels generated by a target network, is
considered. Two types of infinite families of spurious minima, giving one
minimum per $d$, were recently found. The loss at minima belonging to the first
type converges to zero as $d$ increases. In the second type, the loss remains
bounded away from zero. That being so, how may one avoid minima belonging to
the latter type? Fortunately, such minima are never detected by standard
optimization methods. Motivated by questions concerning the nature of this
phenomenon, we develop methods to study distinctive analytic properties of
hidden minima.
By existing analyses, the Hessian spectrum of both types agree modulo
$O(d^{-1/2})$-terms -- not promising. Thus, rather, our investigation proceeds
by studying curves along which the loss is minimized or maximized, generally
referred to as tangency arcs. We prove that apparently far removed group
representation-theoretic considerations concerning the arrangement of subspaces
invariant to the action of subgroups of $S_d$, the symmetry group over $d$
symbols, relative to ones fixed by the action yield a precise description of
all finitely many admissible types of tangency arcs. The general results used
for the loss function reveal that arcs emanating from hidden minima differ,
characteristically, by their structure and symmetry, precisely on account of
the $O(d^{-1/2})$-eigenvalue terms absent in previous work, indicating in
particular the subtlety of the analysis. The theoretical results, stated and
proved for o-minimal structures, show that the set comprising all tangency arcs
is topologically sufficiently tame to enable a numerical construction of
tangency arcs and so compare how minima, both types, are positioned relative to
adjacent critical points.
Machine Learning,Optimization and Control