Abstract: We show that the gradient of the cosine similarity between two points goes to zero in two under-explored settings: (1) if a point has large magnitude or (2) if the points are on opposite ends of the latent space. Counterintuitively, we prove that optimizing the cosine similarity between points forces them to grow in magnitude. Thus, (1) is unavoidable in practice. We then observe that these derivations are extremely general -- they hold across deep learning architectures and for many of the standard self-supervised learning (SSL) loss functions. This leads us to propose cut-initialization: a simple change to network initialization that helps all studied SSL methods converge faster.

What problem does this paper attempt to address?

This paper addresses several problems that arise in self-supervised learning (SSL) when cosine similarity is used as the loss function. Specifically, it reveals hidden flaws in how the cosine-similarity loss behaves under gradient descent:

1. **Vanishing gradients**:
   - When a point has large magnitude, or when two points lie in opposite directions in the latent space, the gradient of the cosine similarity approaches zero. This slows the convergence of gradient descent at least quadratically: the convergence bound in the derivations scales as \( \sin^2(\phi_{ij}) / \rho^2 \).
   - Specifically, the gradient becomes very small if the magnitude of a point is large (i.e., \( \|z_i\| \) is large) or if the angle between the two points is close to \( \pi \) (i.e., \( \phi_{ij} \approx \pi \)).

2. **Growth of the embedding magnitude**:
   - The paper proves that optimizing cosine similarity makes the embedding magnitudes grow. This works against SSL methods, which usually need the embedding magnitudes to stay small.
   - The growth is inevitable: the gradient of the cosine similarity is always orthogonal to the current embedding vector, so every gradient step increases the magnitude.

3. **Impact on existing SSL methods**:
   - These problems affect not only contrastive methods (such as SimCLR) but also non-contrastive methods (such as BYOL and SimSiam), because the loss functions of all these methods are ultimately functions of cosine similarity.
   - Experiments confirm that the growth of the embedding magnitude does slow the convergence of SSL across different architectures and training paradigms.

4. **Solutions**:
   - To alleviate these problems, the paper proposes an initialization method called "cut-initialization". Dividing the network weights by a constant \( c \) at the start of training keeps the initial magnitude of the embedding vectors under control.
   - Combined with \( \ell_2 \)-normalization, cut-initialization accelerates the convergence of all studied SSL methods.

### Mathematical derivations

1. **Gradient of the cosine similarity**:
   - Let \( z_i \) and \( z_j \) be two embedding vectors, let \( \hat{z}_i \) and \( \hat{z}_j \) be the corresponding unit vectors, and let the loss be the negative cosine similarity \( \text{LA}_i(Z) = -\hat{z}_i^\top \hat{z}_j \).
   - The gradient of \( \text{LA}_i \) with respect to \( z_i \) is:
     \[ \nabla \text{LA}_i = -\frac{1}{\|z_i\|} \left( I - \frac{z_i z_i^\top}{\|z_i\|^2} \right) \frac{z_j}{\|z_j\|} \]

2. **Growth of the embedding magnitude**:
   - After a gradient-descent step on the cosine similarity, the magnitude of the new embedding vector \( z_i' \) satisfies:
     \[ \|z_i'\| \geq \|z_i\| \]
   - This holds because the gradient is always orthogonal to the current embedding vector, so by the Pythagorean theorem the step can only add to the squared magnitude.

3. **Upper bound on the convergence speed**:
   - For embedding vectors \( z_i \) and \( z_j \) with the same magnitude \( \rho \), the change in cosine similarity after one step of gradient descent is bounded above:
     \[ \hat{z}_i'^\top \hat{z}_j' - \hat{z}_i^\top \hat{z}_j < \frac{2\gamma \sin^2(\phi_{ij})}{\rho^2} \]
   - where …
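The properties summarized above can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes the loss is the negative cosine similarity between two vectors, uses plain Python lists, and stands in for the network with a single bias-free linear layer `W`. The helper names, the step size `lr`, and the cut constant `c = 3.0` are all hypothetical choices made for illustration.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def unit(v):
    n = norm(v)
    return [a / n for a in v]

def loss(z_i, z_j):
    """Negative cosine similarity: -z_i_hat . z_j_hat."""
    return -dot(unit(z_i), unit(z_j))

def grad(z_i, z_j):
    """Analytic gradient of loss() with respect to z_i:
    -(1/||z_i||) * (I - z_i_hat z_i_hat^T) z_j_hat."""
    zi_hat, zj_hat = unit(z_i), unit(z_j)
    cos = dot(zi_hat, zj_hat)
    scale = 1.0 / norm(z_i)
    return [-scale * (b - cos * a) for a, b in zip(zi_hat, zj_hat)]

def numeric_grad(z_i, z_j, eps=1e-6):
    """Central finite differences, used only to sanity-check grad()."""
    out = []
    for k in range(len(z_i)):
        plus, minus = list(z_i), list(z_i)
        plus[k] += eps
        minus[k] -= eps
        out.append((loss(plus, z_j) - loss(minus, z_j)) / (2 * eps))
    return out

random.seed(0)
z_i = [random.gauss(0, 1) for _ in range(8)]
z_j = [random.gauss(0, 1) for _ in range(8)]
g = grad(z_i, z_j)

# 1) The analytic gradient should agree with finite differences.
max_grad_error = max(abs(a - b) for a, b in zip(g, numeric_grad(z_i, z_j)))

# 2) The gradient is orthogonal to z_i, so a descent step cannot shrink ||z_i||.
lr = 0.5  # hypothetical step size
z_new = [a - lr * b for a, b in zip(z_i, g)]
norm_before, norm_after = norm(z_i), norm(z_new)

# 3) Scaling z_i by 10 shrinks the gradient norm by a factor of 10.
grad_ratio = norm(grad([10 * a for a in z_i], z_j)) / norm(g)

# 4) The gradient vanishes when the two points are exactly opposite.
grad_opposite = norm(grad(z_i, [-a for a in z_i]))

# 5) Cut-initialization stand-in: dividing the weights of a bias-free
#    linear layer by c divides the embedding magnitude by c.
c = 3.0  # hypothetical cut constant
W = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
x = [random.gauss(0, 1) for _ in range(4)]
z_full = [dot(row, x) for row in W]
z_cut = [dot([w / c for w in row], x) for row in W]

print("max gradient error:", max_grad_error)
print("norm before / after step:", norm_before, norm_after)
print("gradient norm ratio after 10x scaling:", grad_ratio)
print("gradient norm for opposite points:", grad_opposite)
print("embedding norm, full vs cut:", norm(z_full), norm(z_cut))
```

In this single-layer stand-in the embedding magnitude shrinks by exactly `c`. In a deeper bias-free network with positively homogeneous activations such as ReLU, dividing every layer's weights by `c` would compound across layers, so the appropriate constant would likely depend on depth.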

The Hidden Pitfalls of the Cosine Similarity Loss
