Unraveling Babel: Exploring Multilingual Activation Patterns of LLMs and Their Applications

Weize Liu, Yinlong Xu, Hongxia Xu, Jintai Chen, Xuming Hu, Jian Wu
2024-10-06
Abstract: Recently, large language models (LLMs) have achieved tremendous breakthroughs in the field of NLP, but we still lack an understanding of their internal neuron activities when processing different languages. We designed a method to convert dense LLMs into fine-grained MoE architectures, and then visually studied the multilingual activation patterns of LLMs through expert activation frequency heatmaps. Through comprehensive experiments on different model families, different model sizes, and different variants, we analyzed the similarities and differences in the internal neuron activation patterns of LLMs when processing different languages. Specifically, we investigated the distribution of high-frequency activated experts, multilingual shared experts, whether multilingual activation patterns are related to language families, and the impact of instruction tuning on activation patterns. We further explored leveraging the discovered differences in expert activation frequencies to guide sparse activation and pruning. Experimental results demonstrated that our method significantly outperformed random expert pruning and even exceeded the performance of unpruned models in some languages. Additionally, we found that configuring different pruning rates for different layers based on activation level differences could achieve better results. Our findings reveal the multilingual processing mechanisms within LLMs and utilize these insights to offer new perspectives for applications such as sparse activation and model pruning.
Computation and Language
What problem does this paper attempt to address?
This paper attempts to understand the internal neuron activity patterns of large language models (LLMs) when they process different languages, and to explore applications of these patterns. Specifically, the paper focuses on the following aspects:

1. **Research on Multilingual Activation Patterns**:
   - Study the distribution of high-frequency activated experts when processing different languages.
   - Explore the existence and distribution of multilingual shared experts.
   - Analyze whether multilingual activation patterns are related to language families.
2. **The Influence of Instruction Tuning on Activation Patterns**:
   - Study how instruction tuning changes the multilingual activation patterns of LLMs.
   - Determine whether instruction tuning affects multilingual activation in specific, recognizable ways.
3. **Sparse Activation and Model Pruning Based on Activation Frequency**:
   - Explore using the experts frequently activated by each language for language-specific model pruning.
   - Propose two pruning methods: threshold-based pruning and activation-frequency-ranking-based pruning.
   - Evaluate how these methods change performance, measured by metrics such as perplexity and accuracy.

### Main Research Contents

#### 1. Multilingual Activation Patterns

By converting dense LLMs into a fine-grained MoE architecture and calculating the activation frequencies of experts, the authors visualized the activation patterns of different languages during processing. The experimental results show that there are significant differences in the activation patterns of different languages in the shallow and deep layers, and these patterns are related to language families.

#### 2. The Influence of Instruction Tuning

By comparing the pre-trained model and its instruction-tuned variant, the authors found that instruction tuning significantly changes the expert activation frequencies in the last layer.
The activation frequencies of some experts increase, while those of other experts decrease. This helps clarify how instruction tuning works.

#### 3. Sparse Activation and Model Pruning

Based on multilingual activation patterns, the authors proposed two pruning methods:

- **Threshold-based Pruning**: Only use experts with activation frequencies greater than or equal to a certain threshold for inference.
- **Frequency-ranking-based Pruning**: Rank the experts in each layer according to their activation frequencies and only use the top n% of experts.

The experimental results show that these methods are significantly better than random pruning on some languages and, in some cases, outperform the unpruned model.

### Conclusion

This research not only reveals the internal mechanisms of LLMs when processing different languages but also provides a new and effective way to achieve sparse activation and model pruning, thereby improving the efficiency and performance of the model.

### Formula Representation

The formulas involved in the paper include, but are not limited to, the following:

- **Z-score Normalization**:
  \[ z = \frac{x - \mu}{\sigma} \]
  where \(x\) is the original score, \(\mu\) is the mean, and \(\sigma\) is the standard deviation.
- **Euclidean Distance**:
  \[ d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} \]
- **Kullback-Leibler Divergence**:
  \[ D_{KL}(P \parallel Q) = \sum_{i} P(i) \log\frac{P(i)}{Q(i)} \]
- **Pearson Correlation Coefficient**:
  \[ r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}} \]

These formulas are used to measure the similarity of activation patterns between different languages and to evaluate the effects of pruning methods.
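
To make the four formulas above concrete, here is a minimal pure-Python sketch that applies them to two expert activation-frequency vectors. It is an illustration, not the authors' code: the function names, the example frequencies, the use of the population standard deviation, the renormalization of frequencies into probability distributions before computing KL divergence, and the `eps` guard are all assumptions.

```python
import math


def z_score(values):
    """z = (x - mean) / std, using the population std (an assumption)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(v - mean) / std for v in values]


def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def normalize(freqs):
    """Rescale non-negative frequencies so they sum to 1 (assumed step for KL)."""
    total = sum(freqs)
    return [f / total for f in freqs]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)).

    Terms with P(i) == 0 contribute 0; eps (hypothetical) avoids division by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical activation frequencies of four experts for two languages.
freq_lang_a = [0.9, 0.1, 0.5, 0.3]
freq_lang_b = [0.8, 0.2, 0.4, 0.6]

print("z-scores (A):", z_score(freq_lang_a))
print("Euclidean:", euclidean(freq_lang_a, freq_lang_b))
print("KL(A || B):", kl_divergence(normalize(freq_lang_a), normalize(freq_lang_b)))
print("Pearson r:", pearson(freq_lang_a, freq_lang_b))
```

Note that KL divergence is asymmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, whereas Euclidean distance and Pearson correlation are symmetric in their arguments.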
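
The two pruning rules described under "Sparse Activation and Model Pruning" can be sketched as expert-selection functions over per-layer activation frequencies. This is a hedged illustration only: the function names, the example frequencies, rounding the kept count up with `math.ceil`, keeping at least one expert per layer, and breaking ties by lower expert index are assumptions not stated in the summary.

```python
import math


def threshold_prune(layer_freqs, threshold):
    """Keep experts whose activation frequency is >= threshold.

    Returns the indices of the kept experts for one layer.
    """
    return [i for i, f in enumerate(layer_freqs) if f >= threshold]


def top_percent_prune(layer_freqs, keep_percent):
    """Keep the top keep_percent% of experts in one layer, ranked by frequency.

    Assumptions: the kept count is rounded up, at least one expert is kept,
    and ties go to the lower expert index (Python's sort is stable).
    """
    if not 0 < keep_percent <= 100:
        raise ValueError("keep_percent must be in (0, 100]")
    n_keep = max(1, math.ceil(len(layer_freqs) * keep_percent / 100))
    ranked = sorted(
        range(len(layer_freqs)), key=lambda i: layer_freqs[i], reverse=True
    )
    return sorted(ranked[:n_keep])


# Hypothetical activation frequencies: 2 layers x 4 experts for one language.
freqs = [
    [0.9, 0.05, 0.4, 0.6],
    [0.2, 0.7, 0.1, 0.3],
]

for layer_idx, layer_freqs in enumerate(freqs):
    kept_by_threshold = threshold_prune(layer_freqs, threshold=0.3)
    kept_by_rank = top_percent_prune(layer_freqs, keep_percent=50)
    print(layer_idx, kept_by_threshold, kept_by_rank)
```

Because both functions operate on one layer at a time, a different `threshold` or `keep_percent` can be passed for each layer, which is one way to express the layer-specific pruning rates mentioned in the abstract.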