Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings

Gili Goldin, Shuly Wintner
2024-07-30
Abstract: We present Knesset-DictaBERT, a large Hebrew language model fine-tuned on the Knesset Corpus, which comprises Israeli parliamentary proceedings. The model is based on the DictaBERT architecture and demonstrates significant improvements in understanding parliamentary language according to the MLM task. We provide a detailed evaluation of the model's performance, showing improvements in perplexity and accuracy over the baseline DictaBERT model.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the lack of Hebrew natural language processing (NLP) models tailored to parliamentary records. The authors approach the problem in three ways:

1. **Model Customization**: The authors fine-tuned the existing DictaBERT model (a Hebrew language model based on the BERT architecture) to create a new model, Knesset-DictaBERT, designed to understand the language used in Israeli parliamentary records.
2. **Dataset Utilization**: To fine-tune the model, the authors used the Knesset Corpus, a large dataset of records from Israeli parliamentary meetings. It covers both plenary sessions and committee meetings.
3. **Performance Evaluation**: The authors evaluated the model on the masked language modeling (MLM) task, measuring perplexity and the accuracy of predicting masked words. By these measures, Knesset-DictaBERT significantly outperforms the original DictaBERT model on parliamentary texts.

With this work, the authors aim to provide a useful tool for researchers in Hebrew language processing and political text analysis, and they hope to further advance NLP technology for low-resource languages.
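The two metrics named in the evaluation, perplexity and masked-word prediction accuracy, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. It assumes perplexity is taken as the exponential of the mean negative log-likelihood of the true tokens at masked positions, and that accuracy counts a hit when the true token appears among the model's top-k candidates. The function names, the toy probabilities, and the placeholder tokens are all hypothetical; the masking scheme and the value of k used in the paper are not stated here.

```python
import math


def perplexity(true_token_probs):
    """Exponential of the mean negative log-likelihood.

    `true_token_probs` holds, for each masked position, the probability
    the model assigned to the correct token (hypothetical input format).
    """
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


def top_k_accuracy(ranked_predictions, gold_tokens, k=1):
    """Fraction of masked positions whose gold token is in the top-k candidates.

    `ranked_predictions[i]` is the model's candidate list for position i,
    ordered from most to least probable.
    """
    hits = sum(
        gold in candidates[:k]
        for candidates, gold in zip(ranked_predictions, gold_tokens)
    )
    return hits / len(gold_tokens)


# Toy data standing in for real model output (illustrative values only).
probs = [0.5, 0.25, 0.125]
ranked = [["a", "b"], ["c", "d"], ["e", "f"]]
gold = ["a", "d", "x"]

ppl = perplexity(probs)
acc_top1 = top_k_accuracy(ranked, gold, k=1)
acc_top2 = top_k_accuracy(ranked, gold, k=2)
print(ppl, acc_top1, acc_top2)
```

Lower perplexity and higher top-k accuracy on held-out parliamentary text would both indicate that a model predicts masked words in this domain better, which is the kind of comparison reported between Knesset-DictaBERT and the baseline DictaBERT.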