Abstract: The word embedding space in neural models is skewed, and correcting this can improve task performance. We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law. Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines. From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures. By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which becomes evident from the information-geometric perspective, and in terms of the loss functions for imbalanced classification. Additionally, our theory corroborates that popular natural language processing methods, such as skip-gram negative sampling, WhiteningBERT, and headless language models, work well just because their word embeddings encode the empirical word frequency into the underlying probabilistic model.

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

The paper addresses the skew of word embedding spaces and proposes a frequency-based method to improve their spatial symmetry. Specifically, it points out that most existing methods implicitly assume that word frequency is uniformly distributed when modeling, correcting, and measuring the symmetry of embedding spaces. Actual word frequency, however, follows a highly non-uniform distribution, namely Zipf's law, so these methods are mismatched with real data.

### Main Contributions

1. **Theoretical Contributions**:
   - Proposes a new method called "Zipfian Whitening," which performs Principal Component Analysis (PCA) whitening weighted by empirical word frequency.
   - Explains from the perspective of information geometry why Zipfian Whitening better emphasizes the information content of low-frequency words.
2. **Experimental Validation**:
   - In standard sentence-level downstream tasks, such as the Semantic Textual Similarity benchmark (STS-B), Zipfian Whitening significantly outperforms traditional uniform whitening and other baseline methods.
   - Demonstrates through experiments that symmetry scores that account for word frequency correlate highly with downstream task performance.

### Method Overview

1. **Defining Embedding Symmetry**:
   - Defines the symmetry of the embedding space through the zero mean and isotropic position of random vectors.
   - Proposes two metrics, Degree of Centrality and Degree of Isotropy, to evaluate the symmetry of the embedding space.
2. **Zipfian Whitening Algorithm**:
   - Uses empirical word frequencies as weights when calculating expectations.
   - The steps are Zipfian Centering (calculating the weighted mean and subtracting it) and Zipfian Decorrelation and Normalization (using Singular Value Decomposition for decorrelation and normalization).
3. **Theoretical Explanation**:
   - Explains why Zipfian Whitening is superior to uniform whitening through the analysis of generative models, partition functions, and the whitening process.
   - Emphasizes the importance of low-frequency words in the Zipfian prior model, as these words usually contain more information.

### Experimental Results

- On multiple benchmark datasets, Zipfian Whitening significantly improves downstream task performance.
- Compared to traditional uniform whitening, Zipfian Whitening performs particularly well on the Semantic Textual Similarity benchmark (STS-B).
- Symmetry scores that account for word frequency correlate highly with downstream task performance, further validating the effectiveness of the method.

### Conclusion

By introducing Zipfian Whitening, the paper addresses the skew of word embedding spaces and achieves significant performance improvements on multiple tasks. The method is theoretically grounded and also performs well in practical applications.
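The Zipfian Whitening steps listed in the Method Overview (weighted centering, then SVD-based decorrelation and normalization) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the function name `zipfian_whitening`, the `eps` guard, and the toy data are hypothetical, and the frequencies are assumed to be given as raw counts or unnormalized weights, one per vocabulary row.

```python
import numpy as np


def zipfian_whitening(embeddings, freqs, eps=1e-12):
    """Whiten word embeddings under the empirical word-frequency distribution.

    embeddings: array of shape (vocab, dim), one row per word.
    freqs: array of shape (vocab,), counts or unnormalized frequencies.
    eps: hypothetical guard against division by a near-zero singular value.
    """
    W = np.asarray(embeddings, dtype=np.float64)
    p = np.asarray(freqs, dtype=np.float64)
    p = p / p.sum()  # normalize to a probability distribution

    # Zipfian centering: subtract the frequency-weighted mean vector.
    mu = p @ W
    W_centered = W - mu

    # Zipfian decorrelation and normalization: scaling each row by sqrt(p)
    # makes W_scaled.T @ W_scaled equal to the frequency-weighted covariance,
    # so its SVD yields the principal directions (vt) and scales (s).
    W_scaled = np.sqrt(p)[:, None] * W_centered
    _, s, vt = np.linalg.svd(W_scaled, full_matrices=False)

    # Rotate onto the principal directions and rescale each to unit variance.
    return W_centered @ vt.T / np.maximum(s, eps)


# Toy usage with synthetic data (assumed, not from the paper):
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(50, 4))
toy_freqs = 1.0 / np.arange(1, 51)  # Zipf-like: frequency ~ 1 / rank
whitened = zipfian_whitening(toy_embeddings, toy_freqs)
```

Replacing `freqs` with a constant vector recovers ordinary (uniform) whitening, so the only difference between the two methods in this sketch is the weighting used when taking expectations.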

Zipfian Whitening

Improve Word Embedding Using Both Writing and Pronunciation.

Enhanced Double-Carrier Word Embedding Via Phonetics and Writing

Data Noising as Smoothing in Neural Network Language Models

Kernel-Whitening: Overcome Dataset Bias with Isotropic Sentence Embedding

Isotropy Matters: Soft-ZCA Whitening of Embeddings for Semantic Code Search

Whitening Not Recommended for Classification Tasks in LLMs

An Exploration Of Semantic Relations In Neural Word Embeddings Using Extrinsic Knowledge

Compression and the origins of Zipf's law for word frequencies

Are ID Embeddings Necessary? Whitening Pre-trained Text Embeddings for Effective Sequential Recommendation

Investigating Language Universal and Specific Properties in Word Embeddings

Word Embedding Composition for Data Imbalances in Sentiment and Emotion Classification

Simple and Effective Dimensionality Reduction for Word Embeddings

Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective.

Covariance-corrected Whitening Alleviates Network Degeneration on Imbalanced Classification

PWESuite: Phonetic Word Embeddings and Tasks They Facilitate

Word Equations: Inherently Interpretable Sparse Word Embeddings through Sparse Coding

Debiasing Word Embeddings with Nonlinear Geometry

Deconstructing and reconstructing word embedding algorithms

Do Word Embeddings Really Understand Loughran-McDonald's Polarities?

Nurse is Closer to Woman than Surgeon? Mitigating Gender-Biased Proximities in Word Embeddings