Abstract: Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world's languages. In this paper, we extend these probing methods to a multilingual context, investigating the behaviors of LLMs across diverse languages. We conduct experiments on several open-source LLMs, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages. Our key findings reveal: (1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages. These results highlight significant disparities in LLMs' multilingual capabilities and emphasize the need for improved modeling of low-resource languages.

What problem does this paper attempt to address?

This paper examines how large language models (LLMs) behave in a multilingual setting, in particular how the probing accuracy of other languages compares with that of English. It focuses on three questions:

1. **Probe Accuracy**: Can languages other than English reach the same probe accuracy in LLMs as English?
2. **Inter-layer Trends**: Do other languages follow the same layer-wise trends as English, especially the improvement in accuracy in deeper layers?
3. **Probe Vector Similarity**: How similar are the probe vectors of different languages, and how does that similarity differ between high-resource and low-resource languages?

### Research Background

Large language models (such as GPT-4, Claude 3.5, Llama 3) have made significant progress in natural language processing tasks. However, most research on these models has focused on English, ignoring about 7,000 other languages in the world. This gap limits our understanding of LLMs in a multilingual setting, especially their performance on low-resource languages.

### Main Findings

1. **Performance Gap**: The probe accuracy of high-resource languages (such as French, German, Chinese, Spanish, Russian, Indonesian) is significantly higher than that of low-resource languages (such as Oriya, Hindi, Burmese, Hawaiian, Kannada, Tamil, Telugu, Kazakh, Turkmen).
2. **Inter-layer Trends**: The accuracy of high-resource languages improves significantly in deeper layers, similar to English, while the accuracy of low-resource languages stays relatively stable or improves only slightly.
3. **Vector Similarity**: Probe vectors of high-resource languages are highly similar to one another, while those of low-resource languages show low similarity both among themselves and with high-resource languages.

### Experimental Setup

- **Model**: Two open-source LLM families were used: Qwen and Gemma.
- **Dataset**: Two datasets were used:
  - **Cities**: 1,496 samples; the task is to judge whether statements about city locations are true or false.
  - **Opinion**: 1,000 samples; the task is to judge the opinion polarity of reviews of 20 well-known hotels.
- **Language**: 16 languages are covered, of which 7 are classified as high-resource and the rest as low-resource.

### Probe Method

- **Linear Classifier Probe**: The hidden states of each layer are extracted and a logistic regression model is trained on them, which measures how well each layer encodes the target information.
- **Objective Function**: The probe is a logistic regression classifier with L2 regularization. Its objective function is:

\[ J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} L(h^{(i)}, y^{(i)}; \theta) + \frac{\lambda}{2n} \|\theta\|_{2}^{2} \]

where \(L(h^{(i)}, y^{(i)}; \theta)\) is the per-sample log-likelihood, so that the first term of \(J(\theta)\) is the average cross-entropy loss:

\[ L(h^{(i)}, y^{(i)}; \theta) = y^{(i)} \log(\sigma(\theta^{T} h^{(i)})) + (1 - y^{(i)}) \log(1 - \sigma(\theta^{T} h^{(i)})) \]

Here \(h^{(i)}\) is the hidden state of the \(i\)-th sample, \(y^{(i)} \in \{0, 1\}\) is its label, \(\sigma\) is the sigmoid function, \(n\) is the number of samples, and \(\lambda\) is the regularization strength.

### Conclusion

The paper shows experimentally that high-resource languages achieve higher probe accuracy than low-resource languages in LLMs, and that the accuracy of high-resource languages improves significantly in deeper layers. In addition, probe vectors of high-resource languages are highly similar to one another, while those of low-resource languages show low similarity. These findings emphasize the importance of improving the modeling of low-resource languages.
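To make the probe objective and the probe vector comparison described above concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The hidden states are synthetic stand-ins generated by a hypothetical helper `make_toy_states`; the optimizer (plain gradient descent), the hyperparameters (`lam`, `lr`, `steps`), and the choice of cosine similarity as the similarity measure are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(theta, H, y, lam):
    """J(theta) = -(1/n) * sum_i L_i + lam / (2n) * ||theta||^2."""
    n = len(H)
    eps = 1e-12  # guards against log(0)
    log_lik = 0.0
    for h, label in zip(H, y):
        p = sigmoid(dot(theta, h))
        log_lik += label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return -log_lik / n + lam / (2 * n) * dot(theta, theta)


def train_probe(H, y, lam=1.0, lr=0.5, steps=300):
    """Minimize J with gradient descent (an assumed, illustrative optimizer)."""
    n, d = len(H), len(H[0])
    theta = [0.0] * d
    for _ in range(steps):
        # Gradient of the L2 term, then add the data term (sigma - y) * h / n.
        grad = [lam / n * t for t in theta]
        for h, label in zip(H, y):
            err = sigmoid(dot(theta, h)) - label
            for j in range(d):
                grad[j] += err * h[j] / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def accuracy(theta, H, y):
    preds = [1 if dot(theta, h) > 0 else 0 for h in H]
    return sum(p == label for p, label in zip(preds, y)) / len(y)


def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def make_toy_states(n, d, direction, noise, rng):
    """Hypothetical stand-in for one layer's hidden states: a label-dependent
    shift along `direction` plus Gaussian noise."""
    H, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label else -1.0
        H.append([sign * direction[j] + rng.gauss(0.0, noise) for j in range(d)])
        y.append(label)
    return H, y


rng = random.Random(0)
d = 8
direction = [1.0 if j < 4 else 0.0 for j in range(d)]

# Two "languages" whose toy states share the same underlying direction.
H_a, y_a = make_toy_states(200, d, direction, 1.0, rng)
H_b, y_b = make_toy_states(200, d, direction, 1.0, rng)

theta_a = train_probe(H_a, y_a)
theta_b = train_probe(H_b, y_b)

print("probe accuracy (a, held out on b):", accuracy(theta_a, H_b, y_b))
print("probe vector cosine similarity:", cosine_similarity(theta_a, theta_b))
```

In a real setup, `H_a` and `H_b` would be replaced by hidden states extracted from a given layer for two languages, and the training loop would be repeated per layer to obtain a layer-wise accuracy curve. A library solver would normally replace the hand-written gradient descent.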

Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis
