Abstract: Large language models (LLMs) are proficient in capturing factual knowledge across various domains. However, refining their capabilities on previously seen knowledge or integrating new knowledge from external sources remains a significant challenge. In this work, we propose a novel synthetic knowledge ingestion method called Ski, which leverages fine-grained synthesis, interleaved generation, and assemble augmentation strategies to construct high-quality data representations from raw knowledge sources. We then integrate Ski and its variations with three knowledge injection techniques: Retrieval Augmented Generation (RAG), Supervised Fine-tuning (SFT), and Continual Pre-training (CPT) to inject and refine knowledge in language models. Extensive empirical experiments are conducted on various question-answering tasks spanning finance, biomedicine, and open-generation domains to demonstrate that Ski significantly outperforms baseline methods by facilitating effective knowledge injection. We believe that our work is an important step towards enhancing the factual accuracy of LLM outputs by refining knowledge representation and injection capabilities.

What problem does this paper attempt to address?

This paper addresses the challenges that large language models (LLMs) face when integrating new knowledge or refining knowledge they have already seen. Specifically, it focuses on the following key issues:

1. **Out-of-date knowledge**: The knowledge stored in existing LLMs is usually static and does not evolve over time, so the information they provide may no longer be accurate or relevant.
2. **Lack of domain knowledge**: Although LLMs perform well across a wide range of fields, they lack sufficient expertise in specific areas such as finance and healthcare.
3. **Catastrophic forgetting**: LLMs may forget previously learned knowledge when learning new information, especially facts that are less common in the training data.
4. **Knowledge injection and digestion**: How to effectively inject new knowledge from external sources into LLMs and ensure that the model correctly understands and uses it.

To address these problems, the paper proposes a method named **Synthetic Knowledge Ingestion (Ski)**. Ski constructs high-quality data representations through three strategies:

- **Fine-grained Synthesis**: Generate hypothetical questions from n-gram knowledge contexts to ensure the relevance and diversity of the questions.
- **Interleaved Generation**: Generate questions and answers together from a specific piece of knowledge, so that each pair is directly aligned with its context.
- **Assemble Augmentation**: Combine question-answer and context pairs across different n-gram spans to balance repetition and diversity.

Ski can be combined with three knowledge injection techniques:

- **Retrieval Augmented Generation (RAG)**
- **Supervised Fine-tuning (SFT)**
- **Continual Pre-training (CPT)**

According to the paper, these combinations significantly outperform baseline methods on question-answering tasks spanning finance, biomedicine, and open-generation domains.

### Summary

The core objective of the paper is to improve the factual accuracy of LLM outputs, especially for specialized questions in specific domains, by refining knowledge representation and injection.
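The three Ski strategies described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes that an "n-gram knowledge context" is a window of n consecutive sentences, and the names `ngram_contexts`, `fine_grained_pairs`, `interleaved_triples`, `assemble`, `placeholder_ask`, and `placeholder_qa` are hypothetical. In practice the two placeholder functions would be replaced by calls to an LLM.

```python
from typing import Callable, List, Sequence, Tuple


def ngram_contexts(sentences: Sequence[str], n: int) -> List[str]:
    """Slide a window of n consecutive sentences over the source text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [" ".join(sentences[i:i + n]) for i in range(len(sentences) - n + 1)]


def fine_grained_pairs(
    sentences: Sequence[str], n: int, ask: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Fine-grained synthesis: one hypothetical question per n-gram context."""
    return [(ask(ctx), ctx) for ctx in ngram_contexts(sentences, n)]


def interleaved_triples(
    sentences: Sequence[str], n: int, qa: Callable[[str], Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Interleaved generation: question and answer produced together per context."""
    triples = []
    for ctx in ngram_contexts(sentences, n):
        question, answer = qa(ctx)
        triples.append((question, answer, ctx))
    return triples


def assemble(
    sentences: Sequence[str], spans: Sequence[int], ask: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Assemble augmentation: pool pairs across spans, dropping exact duplicates."""
    seen = set()
    pooled = []
    for n in spans:
        for pair in fine_grained_pairs(sentences, n, ask):
            if pair not in seen:
                seen.add(pair)
                pooled.append(pair)
    return pooled


def placeholder_ask(context: str) -> str:
    # Assumption: stands in for an LLM prompted to write a question about `context`.
    return f"What does this passage state: {context}"


def placeholder_qa(context: str) -> Tuple[str, str]:
    # Assumption: stands in for an LLM that emits a question and its answer together.
    return placeholder_ask(context), context


if __name__ == "__main__":
    doc = ["Ski builds synthetic data.", "It supports RAG.", "It supports SFT."]
    for question, context in assemble(doc, spans=(1, 2), ask=placeholder_ask):
        print(question, "|", context)
```

Pooling several span sizes is what trades repetition against diversity in this sketch: the same sentence appears in multiple contexts of different widths, so each fact is paired with several differently scoped questions.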

Synthetic Knowledge Ingestion: Towards Knowledge Refinement and Injection for Enhancing Large Language Models
