Abstract: Large Language Models (LLMs) have shown promise in highly specialized domains; however, challenges remain in accuracy and cost. These limitations restrict the use of existing models in domain-specific tasks. While fine-tuning pre-trained models has shown promising results, the process can be computationally expensive and requires massive datasets for the specialized application at hand. In this work, we bridge that gap. We have developed Phi-2-Legal and Mistral-Legal-7B, language models specifically designed for legal applications. These models are based on Phi-2 and Mistral-7B-v0.1 and have undergone continued pre-training on over 500 million tokens of legal text. Our approach significantly improves capabilities in legal tasks by using LLMs to convert raw training data into reading comprehension text. Our legal LLMs have demonstrated superior performance on legal benchmarks, even outperforming models trained on much larger datasets with more resources. This work emphasizes the effectiveness of continued pre-training on domain-specific texts, with affordable LLMs used for data conversion, which gives these models domain expertise while retaining general language understanding capabilities. While this work uses the legal domain as a test case, our method can be scaled and applied to any pre-training dataset, resulting in significant improvements across different tasks. These findings underscore the potential of domain-adaptive pre-training and reading comprehension for developing highly effective domain-specific language models.

What problem does this paper attempt to address?

This paper addresses how to improve the performance of large language models (LLMs) in specific domains, especially the legal domain, through continued pre-training. Although existing LLMs perform well at general language understanding, they still face challenges in accuracy and cost-effectiveness on domain-specific tasks in fields such as law, finance, and biomedicine, which limits their use for such tasks.

Specifically, the paper proposes a method that uses LLMs to convert the original training data into reading-comprehension texts and then uses these texts for continued pre-training. According to the paper, this significantly improves model performance on legal tasks while requiring only limited resources. The paper demonstrates the method by developing two language models for the legal domain, Phi-2-Legal and Mistral-Legal-7B. These models underwent continued pre-training on more than 500 million tokens of legal text and perform strongly on legal benchmarks, even outperforming models trained on larger datasets with more resources.

In summary, the main contributions of this paper are as follows:

1. **Using large language models to convert the original text into reading-comprehension texts** for continued pre-training in the legal domain.
2. **Developing an extended evaluation scheme** applicable to generative legal language models, including improvements to existing legal benchmarks.

The paper presents these results for the legal domain and notes that the method can be extended to other domains, such as finance and biology, to support the development of domain-specific language models.
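The data-conversion step described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the paper's implementation: the prompt wording, the function names (`build_prompt`, `to_reading_comprehension`, `build_corpus`), and the `fake_generate` stand-in for an LLM call are all hypothetical, and the paper's actual prompts and output format are not given in this summary.

```python
from typing import Callable, List

# Hypothetical prompt template; the paper's actual prompts are not shown here.
PROMPT_TEMPLATE = (
    "Read the following legal text and write {n} question-answer pairs "
    "that test comprehension of it.\n"
    "Format each pair as 'Q: ...' and 'A: ...' on separate lines.\n\n"
    "Text:\n{text}\n"
)


def build_prompt(raw_text: str, n_questions: int = 3) -> str:
    """Fill the template with one raw passage."""
    return PROMPT_TEMPLATE.format(n=n_questions, text=raw_text.strip())


def to_reading_comprehension(
    raw_text: str,
    generate: Callable[[str], str],
    n_questions: int = 3,
) -> str:
    """Return one training document: the raw passage followed by generated Q&A."""
    qa_block = generate(build_prompt(raw_text, n_questions)).strip()
    return f"{raw_text.strip()}\n\n{qa_block}"


def build_corpus(raw_texts: List[str], generate: Callable[[str], str]) -> List[str]:
    """Convert every non-empty passage; empty passages are skipped."""
    return [to_reading_comprehension(t, generate) for t in raw_texts if t.strip()]


def fake_generate(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline (assumption).
    return "Q: What does the clause require?\nA: Written notice within 30 days."


if __name__ == "__main__":
    corpus = build_corpus(
        ["The tenant must give written notice within 30 days."], fake_generate
    )
    print(corpus[0])
```

In practice `generate` would wrap whichever affordable LLM is used for conversion, and the resulting documents would be tokenized and fed to a standard continued pre-training loop.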

TransformLLM: Adapting Large Language Models via LLM-Transformed Reading Comprehension Text

SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain

Large Language Models are legal but they are not: Making the case for a powerful LegalLLM

Optimizing Numerical Estimation and Operational Efficiency in the Legal Domain through Large Language Models

A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction

LawLLM: Law Large Language Model for the US Legal System

SaulLM-7B: A pioneering Large Language Model for Law

Caveat Lector: Large Language Models in Legal Practice

LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development

LegaLMFiT: Efficient Short Legal Text Classification with LSTM Language Model Pre-Training

Lawma: The Power of Specialization for Legal Tasks

MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications

Exploring LLMs as a Source of Targeted Synthetic Textual Data to Minimize High Confidence Misclassifications

Can Large Language Models Grasp Legal Theories? Enhance Legal Reasoning with Insights from Multi-Agent Collaboration

Legal Documents Drafting with Fine-Tuned Pre-Trained Large Language Model

Large language models as tax attorneys: a case study in legal capabilities emergence

Legal-Tech Open Diaries: Lesson learned on how to develop and deploy light-weight models in the era of humongous Language Models

Using Large Language Models for the Interpretation of Building Regulations

InternLM-Law: An Open Source Chinese Legal Large Language Model

Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law

Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling