Abstract: Large language models (LLMs) can store a vast amount of world knowledge, often extractable via question-answering (e.g., "What is Abraham Lincoln's birthday?"). However, do they answer such questions based on exposure to similar questions during training (i.e., cheating), or by genuinely learning to extract knowledge from sources like Wikipedia?
In this paper, we investigate this issue using a controlled biography dataset. We find a strong correlation between the model's ability to extract knowledge and various diversity measures of the training data. $\textbf{Essentially}$, for knowledge to be reliably extracted, it must be sufficiently augmented (e.g., through paraphrasing, sentence shuffling, translations) $\textit{during pretraining}$. Without such augmentation, knowledge may be memorized but not extractable, leading to 0% accuracy, regardless of subsequent instruction fine-tuning.
To understand why this occurs, we employ (nearly) linear probing to demonstrate a strong connection between the observed correlation and how the model internally encodes knowledge -- whether it is linearly encoded in the hidden embeddings of entity names or distributed across other token embeddings in the training text.
This paper provides $\textbf{several key recommendations for LLM pretraining in the industry}$: (1) rewrite the pretraining data -- using small, auxiliary models -- to provide knowledge augmentation, and (2) incorporate more instruction-finetuning data into the pretraining stage before it becomes too late.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper explores how large language models (LLMs) store knowledge during the training process and how to extract this knowledge during the inference process. Specifically, the paper aims to answer the following questions:
1. **Does the language model answer questions by being exposed to similar questions during the training process (i.e., "cheating"), or by truly learning to extract knowledge from data sources (such as Wikipedia)?**
- For example, when asked "What is Abraham Lincoln's birthday?", does the language model remember the answer by having seen similar questions during the training process, or does it answer the question by truly understanding Lincoln's biography?
2. **What is the relationship between the effectiveness of knowledge storage and extraction and the diversity of training data?**
- The authors find that a model's ability to extract knowledge correlates strongly with the diversity of its training data. When the pre-training data is sufficiently augmented (e.g., through paraphrasing, sentence shuffling, or translation), knowledge can be extracted reliably. Without such augmentation, knowledge may be memorized but not extractable, yielding 0% accuracy regardless of subsequent instruction fine-tuning.
3. **How can specific pre-training and fine-tuning strategies improve a language model's ability to extract knowledge?**
- The paper makes two key recommendations: rewrite the pre-training data (using small auxiliary models) to provide knowledge augmentation, and incorporate more instruction fine-tuning data into the pre-training stage.
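
As a rough illustration of one augmentation mentioned above (sentence shuffling), here is a minimal sketch. It is not the authors' pipeline: the biography text, the function names `split_sentences`, `shuffle_sentences`, and `augment`, and the naive period-based sentence splitter are all assumptions made for this example.

```python
import random

def split_sentences(text):
    """Naive splitter: assumes each sentence ends with a period
    and contains no other periods (an assumption of this sketch)."""
    return [part.strip() for part in text.split(".") if part.strip()]

def shuffle_sentences(text, seed):
    """Return the same sentences in a seeded random order."""
    sentences = split_sentences(text)
    random.Random(seed).shuffle(sentences)
    return ". ".join(sentences) + "."

def augment(text, n_variants):
    """Produce n_variants reorderings of one biography."""
    return [shuffle_sentences(text, seed) for seed in range(n_variants)]

# Hypothetical biography, invented for illustration only.
biography = (
    "Jane Example was born on March 3, 1990. "
    "Jane Example grew up in Springfield. "
    "Jane Example studied biology at Example University. "
    "Jane Example works for Example Labs."
)
variants = augment(biography, 3)
for variant in variants:
    print(variant)
```

Each variant carries the same facts in a different order, so the model sees the same knowledge in several surface forms. Paraphrasing and translation would need a rewriting model and are not sketched here.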
### Main findings
1. **Mixed training is helpful for knowledge extraction**:
- When the model is trained on a mixture of the biographies of all individuals and question-answer (QA) data for only some of them, it learns to extract knowledge and generalizes to individuals whose QA data it never saw.
2. **After pre-training only on biographical data, the model struggles to extract knowledge**:
- If the model is pre-trained only on biographical data and then fine-tuned on QA data for some individuals, it struggles to answer questions about the other individuals, regardless of model size, pre-training time, or fine-tuning parameters. With knowledge augmentation (such as paraphrasing or sentence shuffling), accuracy improves significantly.
3. **Linear probing techniques explain why knowledge augmentation is effective**:
- Using (nearly) linear probing, the authors find that knowledge augmentation leads the model to encode a person's attributes almost linearly in the hidden embeddings of the person's name. Without augmentation, the knowledge is spread across other tokens of the biographical text, which makes extraction very difficult.
4. **Augmentation of celebrity data is helpful for minority groups**:
- Even when knowledge augmentation is applied to only a subset of individuals (referred to as "celebrities"), test accuracy also improves significantly for the remaining, un-augmented individuals (the "minority groups"). This indicates that including celebrity data written in diverse styles improves the model's ability to extract knowledge about minority groups.
5. **Bidirectional models have difficulty in extracting knowledge**:
- BERT-like encoder models, whether trained with mixed training or pre-trained and then fine-tuned, cannot extract personal knowledge unless the answer is a single word or several independent words (such as birth month, day, and year).
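
The difference between the two training setups in findings 1 and 2 comes down to where the question-answer (QA) data is placed. The sketch below shows one way to assemble the corpora so that evaluation uses only individuals whose QA data was never seen. The 50% split, the toy records, and the names `split_individuals` and `build_corpora` are assumptions for illustration, not details taken from the paper.

```python
import random

def split_individuals(names, train_fraction, seed):
    """Randomly split individuals into a QA-training group and a held-out group."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def build_corpora(bios, qas, qa_train_names):
    """Biographies of everyone are always included; QA text only for qa_train_names."""
    bio_texts = list(bios.values())
    qa_texts = [qas[name] for name in qa_train_names]
    return {
        "mixed_training": bio_texts + qa_texts,  # one stage: bios and QA together
        "pretrain": bio_texts,                   # two stages: bios first ...
        "finetune": qa_texts,                    # ... then QA only
    }

# Hypothetical individuals and facts, invented for illustration only.
names = [f"Person {i}" for i in range(10)]
bios = {n: f"{n} was born in City {i}." for i, n in enumerate(names)}
qas = {n: f"Where was {n} born? City {i}." for i, n in enumerate(names)}

qa_train, held_out = split_individuals(names, train_fraction=0.5, seed=0)
corpora = build_corpora(bios, qas, qa_train)
print(len(corpora["mixed_training"]), len(corpora["pretrain"]), len(corpora["finetune"]))
```

In both setups the model is scored only on questions about `held_out`, so a correct answer must come from the biography rather than from a memorized question.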
### Practical implications
1. **Paraphrasing pre-training data is important**:
- This holds especially for rare but crucial data, and paraphrasing at the fine-tuning stage is usually too late. Without paraphrasing, the model may reproduce the knowledge data verbatim, yet the way it encodes that knowledge can hinder retrieval under different prompts, wasting model capacity.
2. **Introducing more instruction fine-tuning data during pre-training is beneficial**:
- The mixed-training experiments show that deferring all question-answer data to the fine-tuning stage is suboptimal. Introducing question-answer data during pre-training lets the model encode knowledge more effectively.
### Related work
- **Linear probing of knowledge**:
- Linear probing is a common method for examining how a model encodes knowledge. The authors find that the model encodes entity-attribute knowledge linearly in its embeddings only when that knowledge is augmented during pre-training (e.g., by paraphrasing or sentence shuffling). Otherwise, although the model can memorize the training data, the knowledge is not linearly encoded and is very difficult to extract through question answering.
- **Probing the knowledge of language models through question answering**:
- Question answering is a common method for probing the knowledge of pre-trained language models. However, it is unclear whether these models answer by extracting knowledge from training sources or by recognizing exact or similar questions seen in training. The authors address this with controlled experiments that test on unseen individuals.
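
To make the idea of linear probing concrete, the toy sketch below fits a linear classifier on frozen feature vectors and checks whether an attribute can be read out from them. Everything here is an assumption for illustration: the vectors are synthetic stand-ins for hidden embeddings, and a nearest-class-mean classifier (which is linear in the features) is swapped in for the trained probe layer. It is not the paper's probing setup.

```python
import random

def make_features(n, num_classes, dim, signal, rng):
    """Synthetic 'hidden states': label k adds `signal` along axis k, plus small noise."""
    data = []
    for i in range(n):
        label = i % num_classes  # cycle labels so every class appears
        vec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        vec[label] += signal
        data.append((vec, label))
    return data

def fit_class_means(data, num_classes, dim):
    """'Train' the probe: one mean vector per class, features stay frozen."""
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for vec, label in data:
        counts[label] += 1
        for j, value in enumerate(vec):
            sums[label][j] += value
    return [[s / counts[k] for s in sums[k]] for k in range(num_classes)]

def predict(means, vec):
    """Linear score per class: m_k . x - 0.5 * ||m_k||^2 (same as nearest mean)."""
    def score(mean):
        dot = sum(a * b for a, b in zip(mean, vec))
        return dot - 0.5 * sum(a * a for a in mean)
    return max(range(len(means)), key=lambda k: score(means[k]))

def probe_accuracy(signal, seed=0):
    """Fit on 200 synthetic vectors, report accuracy on 80 fresh ones."""
    rng = random.Random(seed)
    train = make_features(200, 4, 8, signal, rng)
    test = make_features(80, 4, 8, signal, rng)
    means = fit_class_means(train, 4, 8)
    hits = sum(predict(means, vec) == label for vec, label in test)
    return hits / len(test)

encoded = probe_accuracy(signal=2.0)  # attribute linearly present in the vector
absent = probe_accuracy(signal=0.0)   # attribute not present at this position
print(f"linearly encoded: {encoded:.2f}, not encoded: {absent:.2f}")
```

When the attribute is linearly present in the probed vector, the probe recovers it; when it is absent, accuracy should hover around chance. This mirrors the contrast the summary draws between augmented and un-augmented pre-training, though the real experiments probe actual model embeddings.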