Abstract: Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world's languages. In this paper, we extend these probing methods to a multilingual context, investigating the behaviors of LLMs across diverse languages. We conduct experiments on several open-source LLM models, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages. Our key findings reveal: (1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages. These results highlight significant disparities in LLMs' multilingual capabilities and emphasize the need for improved modeling of low-resource languages.
What problem does this paper attempt to address?
The paper asks how well probing techniques developed for English carry over to other languages in large language models (LLMs), and in particular how probe accuracy for other languages compares with English. It focuses on three questions:
1. **Probe Accuracy**: Whether languages other than English can reach the same probe accuracy in LLMs as English.
2. **Inter-layer Trends**: Whether other languages follow the same layer-wise trend as English, in particular the rise in accuracy in deeper layers.
3. **Probe Vector Similarity**: How similar probe vectors are across languages, and how that similarity differs between high-resource and low-resource languages.
### Research Background
Large language models (such as GPT-4, Claude 3.5, and Llama 3) have made significant progress on natural language processing tasks. However, most research on these models has focused on English, ignoring about 7,000 other languages in the world. This gap limits our understanding of how LLMs behave in a multilingual setting, especially on low-resource languages.
### Main Findings
1. **Performance Gap**: Probe accuracy for high-resource languages (such as French, German, Chinese, Spanish, Russian, Indonesian) is significantly higher than for low-resource languages (such as Oriya, Hindi, Burmese, Hawaiian, Kannada, Tamil, Telugu, Kazakh, Turkmen).
2. **Inter-layer Trends**: Accuracy for high-resource languages improves substantially in deeper layers, as it does for English, while accuracy for low-resource languages stays relatively flat or improves only slightly.
3. **Vector Similarity**: Probe vectors of high-resource languages are highly similar to one another, while probe vectors of low-resource languages show low similarity both among themselves and with high-resource languages.
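
The summary does not say which similarity measure is applied to the probe vectors. As a rough sketch, assuming cosine similarity (a common choice for comparing directions in representation space), a pairwise comparison could look like the following. The language labels and vector values are made up for illustration and are not results from the paper.

```python
import numpy as np


def cosine_similarity_matrix(probes):
    """Pairwise cosine similarity between the rows of `probes`."""
    unit = probes / np.linalg.norm(probes, axis=1, keepdims=True)
    return unit @ unit.T


# Hypothetical probe vectors, one row per language (values are invented).
languages = ["lang_a", "lang_b", "lang_c"]
probes = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as lang_a, different scale
    [0.0, 3.0, 0.0],  # orthogonal to lang_a and lang_b
])

sim = cosine_similarity_matrix(probes)
for name, row in zip(languages, sim):
    print(name, np.round(row, 3))
```

Cosine similarity ignores vector length, so two probes that point in the same direction score 1 even when their norms differ, which matters here because L2 regularization affects the scale of each learned probe.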
### Experimental Setup
- **Model**: Two open-source LLM families were used: Qwen and Gemma.
- **Dataset**: Two datasets were used:
  - **Cities**: 1,496 samples, each a statement about a city's location to be judged true or false.
  - **Opinion**: 1,000 samples, each an opinion about one of 20 well-known hotels to be classified by polarity.
- **Language**: 16 languages are covered, of which 7 are classified as high-resource and the remaining 9 as low-resource.
### Probe Method
- **Linear Classifier Probe**: The hidden states of each layer are extracted and a logistic regression classifier is trained on them; the classifier's accuracy indicates how well that layer encodes the target information.
- **Objective Function**: A logistic regression classifier with L2 regularization is used, and its objective function is:
\[
J(\theta)=-\frac{1}{n} \sum_{i = 1}^{n} L(h^{(i)}, y^{(i)}; \theta)+\frac{\lambda}{2n}\|\theta\|_{2}^{2}
\]
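where \(h^{(i)}\) is the hidden state of sample \(i\), \(y^{(i)}\) is its binary label, \(\sigma\) is the sigmoid function, and \(L(h^{(i)}, y^{(i)}; \theta)\) is the per-sample log-likelihood, i.e., the negative of the cross-entropy loss (the leading minus sign in \(J\) turns it back into a loss):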
\[
L(h^{(i)}, y^{(i)}; \theta)=y^{(i)} \log(\sigma(\theta^{T} h^{(i)}))+(1 - y^{(i)}) \log(1 - \sigma(\theta^{T} h^{(i)}))
\]
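
The objective above can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's implementation: the hidden states are random synthetic data standing in for one layer's activations, plain gradient descent is an assumed optimizer, and the names `objective`, `fit_probe`, `lam`, `lr`, and `steps` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, written via tanh for numerical stability."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def objective(theta, H, y, lam):
    """J(theta) = -(1/n) * sum_i L_i + (lam / (2n)) * ||theta||_2^2."""
    n = H.shape[0]
    p = np.clip(sigmoid(H @ theta), 1e-12, 1.0 - 1e-12)
    log_lik = y * np.log(p) + (1.0 - y) * np.log(1.0 - p)
    return -log_lik.mean() + lam / (2.0 * n) * (theta @ theta)


def fit_probe(H, y, lam=1.0, lr=0.5, steps=500):
    """Minimize J with plain gradient descent (an assumed optimizer)."""
    n, d = H.shape
    theta = np.zeros(d)
    for _ in range(steps):
        # Gradient of J: data term plus L2 regularization term.
        grad = H.T @ (sigmoid(H @ theta) - y) / n + (lam / n) * theta
        theta -= lr * grad
    return theta


# Synthetic stand-in for one layer's hidden states: 200 samples, 16 dims.
rng = np.random.default_rng(0)
true_direction = rng.normal(size=16)
H = rng.normal(size=(200, 16))
y = (H @ true_direction > 0).astype(float)

theta = fit_probe(H, y)
accuracy = ((sigmoid(H @ theta) > 0.5) == (y == 1.0)).mean()
print(f"J(0) = {objective(np.zeros(16), H, y, 1.0):.4f}")
print(f"J(theta) = {objective(theta, H, y, 1.0):.4f}")
print(f"training accuracy = {accuracy:.3f}")
```

Repeating the fit on the hidden states of each layer, and scoring on held-out samples rather than the training set, would give a layer-wise accuracy curve of the kind the paper compares across languages.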
### Conclusion
The paper shows experimentally that probes for high-resource languages are more accurate than probes for low-resource languages in LLMs, and that accuracy for high-resource languages improves substantially in deeper layers. In addition, probe vectors of high-resource languages are highly similar to one another, while those of low-resource languages show low similarity. These findings emphasize the importance of improving the modeling of low-resource languages.