Improving Vietnamese Legal Document Retrieval using Synthetic Data

Son Pham Tien, Hieu Nguyen Doan, An Nguyen Dai, Sang Dinh Viet
2024-12-01
Abstract: In the field of legal information retrieval, effective embedding-based models are essential for accurate question-answering systems. However, the scarcity of large annotated datasets poses a significant challenge, particularly for Vietnamese legal texts. To address this issue, we propose a novel approach that leverages large language models to generate high-quality, diverse synthetic queries for Vietnamese legal passages. This synthetic data is then used to pre-train retrieval models, specifically bi-encoder and ColBERT, which are further fine-tuned using contrastive loss with mined hard negatives. Our experiments demonstrate that these enhancements lead to strong improvement in retrieval accuracy, validating the effectiveness of synthetic data and pre-training techniques in overcoming the limitations posed by the lack of large labeled datasets in the Vietnamese legal domain.
Information Retrieval, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses data scarcity in Vietnamese legal text retrieval. It notes that the lack of large-scale annotated datasets limits the performance of existing retrieval systems for Vietnamese legal information. To overcome this, the authors use large language models to generate high-quality, diverse synthetic queries, which are then used to pre-train and fine-tune retrieval models, specifically bi-encoder and ColBERT models. The goal is to improve the accuracy of Vietnamese legal text retrieval.

### Main contributions:

1. **Synthetic query generation**: Starting from Vietnamese legal passages, 500,000 legal queries paired with their source passages were generated using a pre-trained language model.
2. **Pre-training techniques**: The "Query-as-context Pre-training for Dense Passage Retrieval" technique was implemented to further improve the retrieval performance of the PhoBERT model.
3. **Experimental verification**: Training the bi-encoder and ColBERT models on the newly generated dataset produced a significant improvement in retrieval accuracy.

### Method overview:

1. **Data collection**: Legal documents were collected from the thuvienphapluat.vn website and split into short passages suitable for processing.
2. **Generating queries**: The Llama 3 model was used to generate synthetic queries from the legal passages, with different prompting techniques used to keep the queries diverse and relevant.
3. **Filtering low-quality queries**: Queries that directly quoted the input passage or were only shallowly relevant to it were removed.
4. **Pre-training**: The generated queries were used to pre-train the language model to enhance its ability to understand and retrieve relevant passages.
5. **Hard negative sample mining**: A dataset for fine-tuning was generated, and the accuracy of the model was improved by mining hard negative samples.
6. **Fine-tuning**: The pre-trained model was fine-tuned on the generated data to optimize its performance on the retrieval task.

### Experimental results:

- **In-domain evaluation**: On the TVPL and Legal Zalo 21 benchmark datasets, the models fine-tuned with the generated synthetic data showed significant performance improvements in all metrics.
- **Out-of-domain evaluation**: Although the model was pre-trained and fine-tuned mainly on legal text, it also outperformed the baseline method on the Vietnamese Wikipedia Q&A dataset, which indicates good generalization ability.

### Conclusions and future work:

- **Data release**: The generated dataset of 500,000 query-passage pairs has been publicly released, with the aim of promoting further research and development in Vietnamese legal text retrieval.
- **Future directions**: Use the generated queries as input to further expand the dataset; conduct a more in-depth qualitative analysis comparing the advantages and disadvantages of synthetic and real data; study the performance degradation that may occur when a model is trained mainly on synthetic data, and find mitigation strategies.

Through these methods and experiments, the paper mitigates the problem of data scarcity in Vietnamese legal text retrieval and provides data resources and technical references for future research.
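
The filtering step in the method overview removes queries that directly quote the input passage. The summary does not say how quoting is detected, so the following is only a minimal sketch under an assumed rule: a query is dropped when more than half of its word trigrams also occur in the passage. The function names, the trigram size, the 0.5 threshold, and the example texts are all illustrative assumptions, not the authors' implementation.

```python
def word_ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copied_fraction(query, passage, n=3):
    """Fraction of the query's word n-grams that also occur in the passage."""
    query_grams = word_ngrams(query, n)
    if not query_grams:
        return 0.0
    return len(query_grams & word_ngrams(passage, n)) / len(query_grams)


def filter_queries(pairs, max_copied=0.5, n=3):
    """Keep (query, passage) pairs whose query does not mostly quote the passage."""
    return [(q, p) for q, p in pairs if copied_fraction(q, p, n) <= max_copied]


# Illustrative texts, not taken from the released dataset.
passage = ("Người lao động được nghỉ hằng năm 12 ngày làm việc "
           "nếu làm đủ 12 tháng cho một người sử dụng lao động")
quoting_query = " ".join(passage.split()[:11])            # copies the passage opening
paraphrased_query = "Tôi làm đủ một năm thì có bao nhiêu ngày phép?"

kept = filter_queries([(quoting_query, passage), (paraphrased_query, passage)])
print([q for q, _ in kept])
```

Only the paraphrased query survives here. A surface-overlap rule like this cannot catch queries that are merely shallowly relevant; that part of the filter would need a different signal, which the summary does not describe.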
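
Hard negative mining and the contrastive fine-tuning loss mentioned in the abstract can be illustrated with a small, dependency-free sketch. It assumes negatives are mined by ranking passages with the current model's dot-product scores and that the loss is the common InfoNCE form with a temperature; neither detail is confirmed by the summary, and the vectors below are toy values rather than model embeddings.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mine_hard_negatives(query_vec, passage_vecs, positive_idx, k=2):
    """Indices of the k highest-scoring passages that are not the positive."""
    ranked = sorted(range(len(passage_vecs)),
                    key=lambda i: dot(query_vec, passage_vecs[i]),
                    reverse=True)
    return [i for i in ranked if i != positive_idx][:k]


def contrastive_loss(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """InfoNCE: negative log-softmax of the positive among all candidates."""
    logits = [dot(query_vec, positive_vec) / temperature]
    logits += [dot(query_vec, neg) / temperature for neg in negative_vecs]
    top = max(logits)  # subtract the max for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[0]


query = [1.0, 0.0]
passages = [[0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [-1.0, 0.0]]  # index 0 is the positive

hard_ids = mine_hard_negatives(query, passages, positive_idx=0, k=2)
hard_loss = contrastive_loss(query, passages[0], [passages[1]])
easy_loss = contrastive_loss(query, passages[0], [passages[3]])
print(hard_ids, hard_loss > easy_loss)
```

The near-miss passage yields a larger loss than the clearly unrelated one, which is the reason hard negatives give a stronger training signal than randomly sampled ones.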
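
The two retrieval architectures named in the summary differ in how they score a query against a passage: a bi-encoder compares one vector per text, while ColBERT compares per-token vectors with a "MaxSim" late-interaction sum. The sketch below shows only these scoring rules on hand-written toy vectors; it assumes plain dot-product similarity and leaves out the encoders, normalization, and indexing that a real system would need.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bi_encoder_score(query_vec, passage_vec):
    """Single-vector relevance: one embedding per query and per passage."""
    return dot(query_vec, passage_vec)


def colbert_score(query_token_vecs, passage_token_vecs):
    """Late interaction: each query token takes its best-matching passage
    token, and the per-token maxima are summed."""
    return sum(max(dot(q, p) for p in passage_token_vecs)
               for q in query_token_vecs)


# Toy token embeddings (assumed values, two query tokens, three passage tokens).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
passage_tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 0.2]]

print(bi_encoder_score([1.0, 2.0], [3.0, 4.0]))
print(colbert_score(query_tokens, passage_tokens))
```

Because each query token is matched separately, the late-interaction score can reward a passage that covers every part of a multi-part legal question, at the cost of storing one vector per passage token instead of one per passage.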