Node Similarities under Random Projections: Limits and Pathological Cases

Tvrtko Tadić, Cassiano Becker, Jennifer Neville
2024-07-30
Abstract: Random projections have been widely used to generate embeddings for various graph learning tasks due to their computational efficiency. The majority of applications have been justified through the Johnson-Lindenstrauss Lemma. In this paper, we take a step further and investigate how well dot product and cosine similarity are preserved by random projections when these are applied over the rows of the graph matrix. Our analysis provides new asymptotic and finite-sample results, identifies pathological cases, and tests them with numerical experiments. We specialize our fundamental results to a ranking application by computing the probability of random projections flipping the node ordering induced by their embeddings. We find that, depending on the degree distribution, the method produces especially unreliable embeddings for the dot product, regardless of whether the adjacency or the normalized transition matrix is used. With respect to the statistical noise introduced by random projections, we show that cosine similarity produces remarkably more precise approximations.
Social and Information Networks, Data Structures and Algorithms, Machine Learning, Probability
### What problems does this paper attempt to solve?

This paper examines how faithfully dot product and cosine similarity are preserved when node embeddings are generated with Random Projections (RP) in graph learning tasks. Specifically, it focuses on the following points:

1. **The influence of random projections on the row vectors of the graph matrix**:
   - Study how random projections affect the dot product and cosine similarity between the row vectors of the graph matrix (such as the adjacency matrix \(A\) or the transition matrix \(T\)).
   - Analyze the fidelity of these similarities in low-dimensional embeddings and identify possible pathological cases.
2. **The embedding quality of nodes with different degrees**:
   - Explore how the degree distribution affects the quality of the embeddings generated by random projections, especially for low-degree and high-degree nodes.
   - The dot product is found to be particularly unreliable for low-degree and high-degree nodes, while cosine similarity provides a more accurate approximation.
3. **Stability in ranking applications**:
   - Evaluate the stability of random projections in ranking tasks by computing the probability that a random projection flips the node ordering induced by the embeddings.
   - The results show that cosine similarity is significantly more stable in ranking tasks.
4. **Theoretical and experimental verification**:
   - Provide new asymptotic and finite-sample results to support the above findings.
   - Use numerical experiments to verify the theoretical analysis and show its effects on real datasets (such as the Wikipedia dataset).

### Main contributions of the paper

- **Reveal the influence of node degree on similarity fidelity**: By expressing the Johnson-Lindenstrauss lemma as a function of the degree distribution of the graph, the paper proves that the dot product performs poorly for low-degree and high-degree nodes.
- **Propose the RP cosine similarity method**: Prove that cosine similarity yields a more accurate approximation under random projections and is more stable in ranking tasks.
- **Combine theory and practice**: Provide rigorous mathematical proofs and verify the theoretical results on real datasets.

### Formula summary

- **Asymptotic distribution of dot product**:
  \[ X_{u*}X_{v*}^T\sim N\left(\frac{n_{uv}}{d_ud_v},\frac{1}{q}\left[\frac{n_{uu}n_{vv}}{d_u^2d_v^2}+\left(\frac{n_{uv}}{d_ud_v}\right)^2\right]\right) \]
- **Asymptotic distribution of cosine similarity**:
  \[ \cos(X_{u*},X_{v*})\sim N\left(\frac{n_{uv}}{\sqrt{n_{uu}n_{vv}}},\frac{1}{q}\left(1 - \frac{n_{uv}^2}{n_{uu}n_{vv}}\right)^2\right) \]
- **NDCG calculation formula**:
  \[ \text{NDCG}_w@l=\frac{\text{DCG}_w^R@l}{\text{DCG}_w@l} \]
  where
  \[ \text{DCG}_w^R@l=\sum_{h:\text{rank}_w^R(h)\leq l}\frac{\text{rel}_{wh}}{\log(\text{rank}_w^R(h)+1)} \]
  \[ \text{DCG}_w@l=\sum_{h:\text{rank}_w(h)\leq l}\frac{\text{rel}_{wh}}{\log(\text{rank}_w(h)+1)} \]

Together, these results give theoretical guidance and practical recommendations for using random projection methods in graph learning tasks.
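The two asymptotic variances in the formula summary can be checked numerically with a small Monte Carlo experiment. The sketch below is illustrative only and rests on assumptions the summary does not state: the projection matrix `R` has i.i.d. Gaussian entries with variance \(1/q\), \(q\) is the embedding dimension, and \(n_{uv}\) is the inner product of rows \(u\) and \(v\) of the graph matrix. All function names are hypothetical and are not taken from the paper.

```python
import math
import random
from statistics import fmean, pvariance


def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(s * t for s, t in zip(a, b))


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def rp_similarity_samples(x, y, q, trials, seed=0):
    """Project rows x and y with `trials` independent random matrices.

    Returns two lists: the projected dot products and projected cosines.
    """
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(q)  # assumption: entries of R are N(0, 1/q)
    n = len(x)
    dots, cosines = [], []
    for _ in range(trials):
        # One n-by-q matrix R per trial, shared by both rows.
        R = [[rng.gauss(0.0, sigma) for _ in range(q)] for _ in range(n)]
        px = [sum(x[i] * R[i][k] for i in range(n)) for k in range(q)]
        py = [sum(y[i] * R[i][k] for i in range(n)) for k in range(q)]
        dots.append(dot(px, py))
        cosines.append(cosine(px, py))
    return dots, cosines


def dot_variance(x, y, q):
    """Variance of the projected dot product: (|x|^2 |y|^2 + (x.y)^2) / q."""
    return (dot(x, x) * dot(y, y) + dot(x, y) ** 2) / q


def cosine_variance(x, y, q):
    """Asymptotic variance of the projected cosine: (1 - rho^2)^2 / q."""
    rho = cosine(x, y)
    return (1.0 - rho ** 2) ** 2 / q


if __name__ == "__main__":
    # Two toy adjacency rows with 20 neighbours each, 10 of them shared.
    x = [1.0] * 20 + [0.0] * 20
    y = [0.0] * 10 + [1.0] * 20 + [0.0] * 10
    q = 32
    dots, cosines = rp_similarity_samples(x, y, q, trials=500)
    print("dot:    mean", fmean(dots), "var", pvariance(dots),
          "theory var", dot_variance(x, y, q))
    print("cosine: mean", fmean(cosines), "var", pvariance(cosines),
          "theory var", cosine_variance(x, y, q))
```

To mimic rows of the transition matrix \(T\), divide each adjacency row by its degree before calling the functions. Under the assumptions above, `dot_variance` then reduces to the bracketed expression in the dot-product formula, while the cosine results are unchanged because cosine similarity is invariant to row scaling.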