TransformLLM: Adapting Large Language Models via LLM-Transformed Reading Comprehension Text

Iftach Arbel, Yehonathan Refael, Ofir Lindenbaum
2024-10-29
Abstract: Large Language Models (LLMs) have shown promise in highly specialized domains; however, challenges remain in accuracy and cost. These limitations restrict the use of existing models in domain-specific tasks. While fine-tuning pre-trained models has shown promising results, this process can be computationally expensive and requires massive datasets from the specialized application at hand. In this work, we bridge that gap. We have developed Phi-2-Legal and Mistral-Legal-7B, which are language models specifically designed for legal applications. These models are based on Phi-2 and Mistral-7B-v0.1, and have gone through continued pre-training with over 500 million tokens of legal texts. Our approach significantly improves capabilities in legal tasks by using LLMs to convert raw training data into reading comprehension text. Our legal LLMs have demonstrated superior performance in legal benchmarks, even outperforming models trained on much larger datasets with more resources. This work emphasizes the effectiveness of continued pre-training on domain-specific texts, while using affordable LLMs for data conversion, which gives these models domain expertise while retaining general language understanding capabilities. While this work uses the legal domain as a test case, our method can be scaled and applied to any pre-training dataset, resulting in significant improvements across different tasks. These findings underscore the potential of domain-adaptive pre-training and reading comprehension for the development of highly effective domain-specific language models.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to improve the performance of large language models (LLMs) in specialized domains, particularly the legal domain, through continued pre-training. Although existing LLMs perform well at general language understanding, they still face challenges in accuracy and cost-effectiveness on domain-specific tasks in fields such as law, finance, or biomedicine, which limits their use there. The paper proposes using LLMs to convert the raw training data into reading-comprehension text and then using that text for continued pre-training. According to the authors, this significantly improves performance on legal tasks while working within limited resources. The paper demonstrates the method with two legal-domain language models, Phi-2-Legal and Mistral-Legal-7B. These models underwent continued pre-training on more than 500 million tokens of legal text and perform strongly on legal benchmarks, even outperforming models trained on larger datasets with more resources.

In summary, the main contributions of this paper are as follows:

1. **Using large language models to convert raw text into reading-comprehension text** for continued pre-training in the legal domain.
2. **Developing an extended evaluation scheme** for generative legal language models, including improvements to existing legal benchmarks.

The authors note that these results are not limited to the legal domain and that the method can be extended to other fields, such as finance and biology, to support the development of domain-specific language models.
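
The conversion step described above, turning a raw passage into reading-comprehension text with the help of an LLM, might look roughly like the following minimal sketch. It is not the authors' implementation: the prompt wording, the function names (`build_prompt`, `to_reading_comprehension`, `convert_corpus`), and the `fake_generate` stand-in for a real LLM call are all assumptions made for illustration, since this summary does not give the paper's actual prompts or pipeline.

```python
from typing import Callable, List

# Hypothetical prompt template (assumption): the paper's real prompt is not
# given in this summary.
PROMPT_TEMPLATE = (
    "Read the following legal text and write question-answer pairs "
    "that test comprehension of it.\n\n"
    "Text:\n{passage}\n\n"
    "Format each pair as 'Q: ...' and 'A: ...' on separate lines."
)


def build_prompt(passage: str) -> str:
    """Fill the template with one raw passage."""
    return PROMPT_TEMPLATE.format(passage=passage.strip())


def to_reading_comprehension(passage: str, generate: Callable[[str], str]) -> str:
    """Append LLM-written exercises to the raw passage, giving one training document."""
    exercises = generate(build_prompt(passage)).strip()
    return f"{passage.strip()}\n\n{exercises}"


def convert_corpus(passages: List[str], generate: Callable[[str], str]) -> List[str]:
    """Convert every non-empty passage; blank passages are skipped."""
    return [to_reading_comprehension(p, generate) for p in passages if p.strip()]


def fake_generate(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption), so the sketch runs offline.
    return "Q: What does the passage describe?\nA: A placeholder answer."


if __name__ == "__main__":
    docs = convert_corpus(["The lessee shall pay rent monthly."], fake_generate)
    print(docs[0])
```

In a real setup, `generate` would wrap whichever affordable LLM is used for data conversion, and the resulting documents would be tokenized and fed to an ordinary causal language modeling objective for continued pre-training.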