Abstract: The introduction of the Transformer neural network, along with techniques like self-supervised pre-training and transfer learning, has paved the way for advanced models like BERT. Despite BERT's impressive performance, opportunities for further enhancement exist. To our knowledge, most efforts focus on improving BERT's performance in English and in general domains, and no study specifically addresses the legal Turkish domain. Our study is primarily dedicated to enhancing the BERT model within the legal Turkish domain through modifications in the pre-training phase. In this work, we introduce a modified pre-training approach that combines diverse masking strategies. For fine-tuning, we focus on two essential downstream tasks in the legal domain: named entity recognition (NER) and multi-label text classification. To evaluate our modified pre-training approach, we fine-tuned all customized models alongside the original BERT models and compared their performance. Our modified approach demonstrated significant improvements over the original BERT model in both NER and multi-label text classification. Finally, to showcase the impact of our proposed models, we trained our best models with different corpus sizes and compared them with BERTurk models. The experimental results demonstrate that our approach, despite being pre-trained on a smaller corpus, competes with BERTurk.

What problem does this paper attempt to address?

The main goal of this paper is to improve the performance of the BERT model in the Turkish legal domain. Specifically, the authors modified the pre-training phase of the BERT model, particularly adjusting the "Next Sentence Prediction" (NSP) task and the "Masked Language Model" (MLM) task. The key issues the paper attempts to address are as follows:

1. **Enhancing BERT's performance in specific domains**: Although BERT performs excellently in various natural language processing tasks, there is still room for improvement in specific domains (such as the Turkish legal domain) and non-English environments. The paper aims to improve the performance of the BERT model in these areas through modifications in the pre-training phase.
2. **Proposing a customized approach for Turkish legal texts**: Existing research mostly focuses on general domains or improvements to BERT in English environments, with relatively few studies specifically targeting the Turkish legal domain. This paper proposes an innovative pre-training method to meet the needs of this specific domain.
3. **Improving pre-training tasks**:
   - **Replacing the NSP task**: The paper attempts to replace the original NSP task with the "Sentence Order Prediction" (SOP) task to better capture the relationships between sentences.
   - **Improving the MLM task**: The paper proposes a masking strategy that combines Term Frequency-Inverse Document Frequency (TF-IDF) values and different masking ratios to optimize the model's learning of important vocabulary.
4. **Evaluating the effectiveness of the improved methods**: To verify the effectiveness of the proposed improvements, the authors fine-tuned the modified model and tested it on Named Entity Recognition (NER) and multi-label text classification tasks. Experimental results show that these improvements significantly enhance the model's performance on these two tasks compared to the original BERT model.

In summary, this paper aims to enhance the performance of the BERT model in Named Entity Recognition and multi-label text classification tasks in the Turkish legal domain by modifying the pre-training phase. By introducing new pre-training tasks and improved masking strategies, the paper demonstrates that these methods can effectively improve the model's performance.
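
The two pre-training changes described above (SOP in place of NSP, and TF-IDF-guided masking) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the summary does not say how TF-IDF scores and masking ratios are combined, so the sketch simply assumes that the highest-scoring tokens of a document are masked first. The function names `idf_table`, `tfidf_mask`, and `sop_example` are hypothetical, and the sketch omits BERT's usual 80/10/10 replacement rule for selected tokens.

```python
import math
import random
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf_mask(tokens, idf, mask_ratio=0.15, mask_token="[MASK]"):
    """Mask the highest-TF-IDF tokens of one document.

    Assumption: selection is a plain top-k by score, ties broken by position.
    Returns the masked sequence and per-position labels (None = not masked).
    """
    tf = Counter(tokens)
    scores = [tf[t] / len(tokens) * idf.get(t, 1.0) for t in tokens]
    k = max(1, round(mask_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    chosen = set(ranked[:k])
    masked = [mask_token if i in chosen else t for i, t in enumerate(tokens)]
    labels = [t if i in chosen else None for i, t in enumerate(tokens)]
    return masked, labels


def sop_example(first, second, rng):
    """Build one SOP training pair from two consecutive segments.

    Label 1 means the segments are in their original order, 0 means swapped.
    """
    if rng.random() < 0.5:
        return first, second, 1
    return second, first, 0


if __name__ == "__main__":
    # Toy corpus of tokenized documents (illustrative only).
    docs = [
        ["mahkeme", "karar", "verdi"],
        ["mahkeme", "dava", "reddetti"],
        ["taraflar", "itiraz", "etti"],
    ]
    idf = idf_table(docs)
    print(tfidf_mask(docs[0], idf, mask_ratio=0.34))
    print(sop_example("first segment", "second segment", random.Random(0)))
```

In this toy corpus "mahkeme" appears in two documents, so it scores lower than the rarer tokens and is masked last. Unlike NSP, whose negative examples pair a segment with text from a different document, the SOP negatives here are the same two consecutive segments in swapped order, so the model cannot solve the task from topic cues alone.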

LegalTurk Optimized BERT for Multi-Label Text Classification and NER

FEDBFPT: an Efficient Federated Learning Framework for BERT Further Pre-Training

Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks

Advancing natural language processing (NLP) applications of morphologically rich languages with bidirectional encoder representations from transformers (BERT): an empirical case study for Turkish

Comparing the Performance of NLP Toolkits and Evaluation measures in Legal Tech

AraLegal-BERT: A pretrained language model for Arabic Legal text

LEGAL-BERT: The Muppets straight out of Law School

BERT2D: Two Dimensional Positional Embeddings for Efficient Turkish NLP

Comparison of Pre-trained Language Models for Turkish Address Parsing

RoBERTurk: Adjusting RoBERTa for Turkish

Developing and Evaluating Tiny to Medium-Sized Turkish BERT Models

Multi-BERT: Leveraging Adapters and Prompt Tuning for Low-Resource Multi-Domain Adaptation

The Right Model for the Job: An Evaluation of Legal Multi-Label Classification Baselines

Empirical Study of LLM Fine-Tuning for Text Classification in Legal Document Review

Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law

EstBERT: A Pretrained Language-Specific BERT for Estonian

Pre-training technique to localize medical BERT and enhance biomedical BERT

TurkishBERTweet: Fast and reliable large language model for social media analysis

NarrowBERT: Accelerating Masked Language Model Pretraining and Inference

TookaBERT: A Step Forward for Persian NLU

Turkish Medical Text Classification Using BERT