Abstract: Knowledge injection, a way to augment pre-trained large language models (LLMs), is critical for developing vertical-domain large models and has been widely studied. Most current approaches, including parameter-efficient fine-tuning (PEFT) and block expansion methods, apply knowledge uniformly across all LLM layers, which raises the question: are all layers equally crucial for knowledge injection? We begin by evaluating the importance of each layer to find the optimal layer range for knowledge injection. Intuitively, the more important layers should play a more critical role in knowledge injection and deserve denser injection. We observe performance dips on question-answering benchmarks after removing or expanding the shallow layers, and the degradation shrinks as the layer gets deeper, indicating that the shallow layers hold the key to knowledge injection. This insight leads us to propose the S strategy, a post-pretraining strategy that selectively enhances shallow layers while pruning the less effective deep ones. Based on this strategy, we introduce Llama Slayer-8B and Llama Slayer-8B-Instruct. Experiments on a code and math corpus demonstrate the effectiveness of our strategy. Further experiments on a different LLM, Mistral-7B, and on a legal corpus confirm the general applicability of the approach. Our code is available at: https://github.com/txchen-USTC/Llama-Slayer
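The two steps described in the abstract, probing layer importance by removal and then expanding shallow layers while pruning deep ones, can be illustrated with a toy sketch. This is not the released implementation: the function names `ablation_drops` and `plan_slayer`, the additive toy layers, and the specific choices of inserting one new block after each of the first `num_expand` layers and dropping the last `num_prune` layers are all assumptions made for illustration; the abstract does not give these details.

```python
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[float], float]


def run(layers: Sequence[Layer], x: float) -> float:
    """Apply the layers in order to the input x."""
    for layer in layers:
        x = layer(x)
    return x


def ablation_drops(layers: List[Layer],
                   score: Callable[[Sequence[Layer]], float]) -> List[float]:
    """Score drop caused by removing each layer in turn.

    A larger drop means the layer mattered more for the (hypothetical)
    benchmark represented by `score`.
    """
    base = score(layers)
    return [base - score(layers[:i] + layers[i + 1:])
            for i in range(len(layers))]


def plan_slayer(num_layers: int, num_expand: int,
                num_prune: int) -> List[Tuple[str, int]]:
    """Build a layer plan under the assumptions stated above.

    ("keep", i) keeps original layer i; ("new", i) marks a freshly added
    block placed right after original layer i. The deepest `num_prune`
    layers are omitted from the plan.
    """
    if num_expand + num_prune > num_layers:
        raise ValueError("expanded and pruned ranges must not overlap")
    plan: List[Tuple[str, int]] = []
    for i in range(num_layers - num_prune):
        plan.append(("keep", i))
        if i < num_expand:  # assumption: denser injection in shallow layers
            plan.append(("new", i))
    return plan


# Toy model: each layer adds a fixed amount, shallow layers contribute most.
toy_layers: List[Layer] = [lambda x, w=w: x + w for w in (4, 3, 2, 1)]
drops = ablation_drops(toy_layers, lambda ls: run(ls, 0))
plan = plan_slayer(num_layers=6, num_expand=2, num_prune=2)
```

In this toy setup the drop is largest for the first layer and smallest for the last, which mirrors the trend the abstract reports; in a real model, `score` would be a benchmark evaluation and the plan would be applied to transformer blocks.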

Drop Redundant, Shrink Irrelevant: Selective Knowledge Injection for Language Pretraining

Plug-and-Play Knowledge Injection for Pre-trained Language Models

Kformer: Knowledge Injection in Transformer Feed-Forward Layers

Fine-grained Pluggable Gradient Ascent for Knowledge Unlearning in Language Models

UNTER: A Unified Knowledge Interface for Enhancing Pre-trained Language Models

Revisiting the Knowledge Injection Frameworks

Structure Pre-training and Prompt Tuning for Knowledge Graph Transfer

Knowledge Efficient Deep Learning for Natural Language Processing

Named Entity Recognition Method with External Knowledge Injection in Low-resource Environments

Synthetic Knowledge Ingestion: Towards Knowledge Refinement and Injection for Enhancing Large Language Models

Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection

Llama SLayer 8B: Shallow Layers Hold the Key to Knowledge Injection

A Survey of Knowledge Enhanced Pre-trained Models

A Survey of Knowledge Enhanced Pre-trained Language Models

Knowledge Distillation of Black-Box Large Language Models

Knowledge Inheritance for Pre-trained Language Models

From Mimicking to Integrating: Knowledge Integration for Pre-Trained Language Models

A Survey on Knowledge-Enhanced Pre-trained Language Models

Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability

Knowledge Distillation of Transformer-based Language Models Revisited