Synthetic Knowledge Ingestion: Towards Knowledge Refinement and Injection for Enhancing Large Language Models

Jiaxin Zhang, Wendi Cui, Yiran Huang, Kamalika Das, Sricharan Kumar
2024-10-13
Abstract: Large language models (LLMs) are proficient in capturing factual knowledge across various domains. However, refining their capabilities on previously seen knowledge or integrating new knowledge from external sources remains a significant challenge. In this work, we propose a novel synthetic knowledge ingestion method called Ski, which leverages fine-grained synthesis, interleaved generation, and assemble augmentation strategies to construct high-quality data representations from raw knowledge sources. We then integrate Ski and its variations with three knowledge injection techniques: Retrieval Augmented Generation (RAG), Supervised Fine-tuning (SFT), and Continual Pre-training (CPT) to inject and refine knowledge in language models. Extensive empirical experiments are conducted on various question-answering tasks spanning finance, biomedicine, and open-generation domains to demonstrate that Ski significantly outperforms baseline methods by facilitating effective knowledge injection. We believe that our work is an important step towards enhancing the factual accuracy of LLM outputs by refining knowledge representation and injection capabilities.
Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
This paper addresses the challenges that large language models (LLMs) face when integrating new knowledge or refining knowledge they have already seen. Specifically, it focuses on the following issues:

1. **Out-of-date knowledge**: The knowledge in existing LLMs is usually static and does not evolve over time, so the information they provide may no longer be accurate or relevant.
2. **Lack of domain knowledge**: Although LLMs perform well across a wide range of fields, they lack sufficient expertise in specialized areas such as finance and healthcare.
3. **Catastrophic forgetting**: LLMs may forget previously learned knowledge when they learn new information, especially facts that are less common in the training data.
4. **Knowledge injection and digestion**: New knowledge from external sources must be injected into LLMs effectively, in a form the model can correctly understand and use.

To address these problems, the paper proposes a method named **Synthetic Knowledge Ingestion (Ski)**. Ski generates high-quality data representations through three strategies:

- **Fine-grained Synthesis**: Generate hypothetical questions based on n-gram knowledge contexts to ensure the relevance and diversity of the questions.
- **Interleaved Generation**: Generate questions and answers together from a specific piece of knowledge, so that each pair is directly aligned with its context.
- **Assemble Augmentation**: Combine question-answer and context pairs built from different n-gram spans to balance repetition and diversity.

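
The three strategies above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that an "n-gram knowledge context" is a window of n consecutive sentences, and it replaces the LLM call with a placeholder. All names (`split_sentences`, `ngram_contexts`, `placeholder_qa`, `build_records`) are hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def ngram_contexts(sentences: List[str], n: int) -> List[str]:
    """Fine-grained contexts: every window of n consecutive sentences (assumed unit)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return [" ".join(sentences[i:i + n]) for i in range(len(sentences) - n + 1)]


def placeholder_qa(context: str) -> Dict[str, str]:
    """Stand-in for an LLM call that would write a question and an answer
    together from the same context (the interleaved step)."""
    return {
        "question": f"What does this passage state: '{context[:40]}'?",
        "answer": context,
    }


def build_records(
    text: str,
    ns: Iterable[int],
    generate: Callable[[str], Dict[str, str]] = placeholder_qa,
) -> List[Dict[str, object]]:
    """Assemble step: pool question/answer/context records across several n."""
    sentences = split_sentences(text)
    records: List[Dict[str, object]] = []
    for n in ns:
        for context in ngram_contexts(sentences, n):
            records.append({"n": n, "context": context, **generate(context)})
    return records


doc = "Alpha is first. Beta is second. Gamma is third."
records = build_records(doc, ns=(1, 2, 3))
print(len(records))  # 3 + 2 + 1 = 6 windows
```

Pooling windows of several sizes is what trades repetition against diversity here: small n gives narrow, focused records, while larger n repeats the same sentences inside broader contexts.
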
Ski can be combined with three knowledge injection techniques:

- **Retrieval Augmented Generation (RAG)**
- **Supervised Fine-tuning (SFT)**
- **Continual Pre-training (CPT)**

According to the paper, combining Ski with these techniques significantly outperforms baseline methods on question-answering tasks spanning finance, biomedicine, and open-generation domains.

### Summary

The core objective of the paper is to improve the factual accuracy of LLM outputs, especially for specialized questions in specific domains, by refining knowledge representation and injection.