The Hidden Pitfalls of the Cosine Similarity Loss

Andrew Draganov, Sharvaree Vadgama, Erik J. Bekkers
2024-06-24
Abstract: We show that the gradient of the cosine similarity between two points goes to zero in two under-explored settings: (1) if a point has large magnitude or (2) if the points are on opposite ends of the latent space. Counterintuitively, we prove that optimizing the cosine similarity between points forces them to grow in magnitude. Thus, (1) is unavoidable in practice. We then observe that these derivations are extremely general -- they hold across deep learning architectures and for many of the standard self-supervised learning (SSL) loss functions. This leads us to propose cut-initialization: a simple change to network initialization that helps all studied SSL methods converge faster.
Machine Learning
What problem does this paper attempt to address?
This paper addresses several problems that arise in self-supervised learning (SSL) when cosine similarity is used as the loss function. Specifically, it reveals hidden flaws in how the cosine-similarity loss behaves under gradient descent:

1. **The problem of vanishing gradients**:
   - When a point has large magnitude, or when two points lie in opposite directions in the latent space, the gradient of the cosine similarity approaches zero. This slows the convergence of gradient descent at least quadratically.
   - Specifically, the gradient becomes very small if \( \|z_i\| \) is large or if the angle between the two points is close to \( \pi \) (i.e., \( \phi_{ij} \approx \pi \)).

2. **The problem of the growth of the embedding-vector magnitude**:
   - The paper proves that optimizing cosine similarity causes the magnitude of the embedding vector to keep increasing. This conflicts with the requirement that SSL methods keep the embedding magnitude small.
   - This growth is inevitable: the gradient of the cosine similarity is always orthogonal to the current embedding vector, and a step orthogonal to a vector can only increase its length.

3. **The impact on existing SSL methods**:
   - These problems affect not only contrastive methods (such as SimCLR) but also non-contrastive methods (such as BYOL and SimSiam), because the loss functions of all these methods are ultimately functions of cosine similarity.
   - Experiments confirm that the growth of the embedding magnitude does slow down SSL convergence across architectures and training paradigms.

4. **Solutions**:
   - To alleviate these problems, the paper proposes an initialization method called "cut-initialization".
     By dividing the network weights by a constant \( c \) at the start of training, the initial magnitude of the embedding vectors can be controlled.
   - Combined with \( \ell_2 \)-normalization, cut-initialization can accelerate the convergence of all studied SSL methods.

### Mathematical derivations

1. **The gradient of cosine similarity**:
   - Let \( z_i \) and \( z_j \) be two embedding vectors, with unit vectors \( \hat{z}_i = z_i / \|z_i\| \) and \( \hat{z}_j = z_j / \|z_j\| \), and let the loss be the negative cosine similarity \( \mathcal{L}^A_i(Z) = -\hat{z}_i^\top \hat{z}_j \).
   - The gradient of \( \mathcal{L}^A_i \) with respect to \( z_i \) can be expressed as:
     \[ \nabla \mathcal{L}^A_i = -\frac{1}{\|z_i\|} \left( I - \frac{z_i z_i^\top}{\|z_i\|^2} \right) \frac{z_j}{\|z_j\|} \]

2. **The growth of the embedding-vector magnitude**:
   - After one gradient step on the cosine similarity, the magnitude of the new embedding vector \( z_i' \) satisfies:
     \[ \|z_i'\| \geq \|z_i\| \]
   - This is because the gradient is always orthogonal to the current embedding vector, so each step increases the magnitude.

3. **The upper bound of the convergence speed**:
   - For embedding vectors \( z_i \) and \( z_j \) with the same magnitude \( \rho \), the change in cosine similarity after one step of gradient descent is bounded above by:
     \[ \hat{z}_i'^{\top} \hat{z}_j' - \hat{z}_i^\top \hat{z}_j < \frac{2\gamma \sin^2(\phi_{ij})}{\rho^2} \]
   - where \( \
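
The gradient properties summarized above can be checked numerically. The following is a minimal sketch in plain Python, not the paper's code: the helper names (`cos_loss`, `cos_loss_grad`, `numeric_grad`, `sgd_step`) and the example vectors are hypothetical, and the gradient is taken for the loss \( -\hat{z}_i^\top \hat{z}_j \) with respect to \( z_i \) only. The last lines approximate cut-initialization by dividing the embedding itself by \( c \), which assumes the embedding scales linearly with the weights (as in a bias-free linear output layer); a real network need not behave this way.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cos_loss(z_i, z_j):
    """Negative cosine similarity between z_i and z_j."""
    return -dot(z_i, z_j) / (norm(z_i) * norm(z_j))

def cos_loss_grad(z_i, z_j):
    """Analytic gradient of cos_loss with respect to z_i:
    -(1/||z_i||) * (I - zi_hat zi_hat^T) zj_hat."""
    n_i, n_j = norm(z_i), norm(z_j)
    zi_hat = [a / n_i for a in z_i]
    zj_hat = [b / n_j for b in z_j]
    c = dot(zi_hat, zj_hat)  # cosine of the angle phi_ij
    return [-(b - c * a) / n_i for a, b in zip(zi_hat, zj_hat)]

def numeric_grad(z_i, z_j, eps=1e-6):
    """Central finite differences, used only to sanity-check the formula."""
    grad = []
    for k in range(len(z_i)):
        plus, minus = list(z_i), list(z_i)
        plus[k] += eps
        minus[k] -= eps
        grad.append((cos_loss(plus, z_j) - cos_loss(minus, z_j)) / (2 * eps))
    return grad

def sgd_step(z_i, z_j, lr):
    """One plain gradient-descent step on z_i (z_j held fixed)."""
    g = cos_loss_grad(z_i, z_j)
    return [a - lr * b for a, b in zip(z_i, g)]

z_i = [3.0, 0.0, 4.0]  # ||z_i|| = 5
z_j = [0.0, 2.0, 0.0]  # orthogonal to z_i, so phi_ij = pi/2

g = cos_loss_grad(z_i, z_j)
print("gradient norm:", norm(g))            # equals sin(phi_ij) / ||z_i||
print("gradient . z_i:", dot(g, z_i))       # orthogonal to z_i
print("norm after step:", norm(sgd_step(z_i, z_j, lr=1.0)))

# Opposite points (phi_ij = pi): the gradient vanishes.
print("opposite:", norm(cos_loss_grad(z_i, [-6.0, 0.0, -8.0])))

# Assumed stand-in for cut-initialization: shrink the embedding by c.
c = 10.0
z_cut = [a / c for a in z_i]
print("gradient gain:", norm(cos_loss_grad(z_cut, z_j)) / norm(g))
```

In this sketch the gradient norm is \( \sin(\phi_{ij}) / \|z_i\| \), so shrinking the embedding by \( c \) enlarges the gradient by the same factor, which is the intuition behind starting training with small embeddings.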