Automatic biographical information extraction from local gazetteers with Bi-LSTM-CRF model and BERT

Zhou Liu, Hongsu Wang, Peter K. Bol
DOI: https://doi.org/10.1007/s42803-022-00059-2
2022-11-23
International Journal of Digital Humanities
Abstract: Named entity information in Chinese local gazetteers supports extending the Chinese Biographical Database (CBDB) project. Instead of relying on regular expressions and manual work to tag biographical information with LoGaRT, we propose an automatic deep learning method that uses tagged data to train a Bi-LSTM-CRF model, which can then be applied to an untagged dataset without manual work. This method not only dramatically improves tagging efficiency but also overcomes the shortcomings of regular expressions in named entity justification by utilizing semantic information. Moreover, we employ the advanced pre-trained language model BERT to encode our word vectors and further improve performance. The method performed very well on our local gazetteer dataset and has extracted data for CBDB. This experiment also supports our further work on unstructured, narrative historical data and demonstrates the applicability of deep learning methods to ancient Chinese texts.
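To make the role of the CRF layer concrete, the following is a minimal sketch of Viterbi decoding over BIO tags, not the authors' implementation. It assumes per-character emission scores are already available (in the paper's setting these would come from the Bi-LSTM over BERT encodings); the tag set, the score values, and the names `TAGS`, `TRANSITIONS`, and `viterbi_decode` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical BIO tag set for a single entity type (person names).
TAGS = ["O", "B-PER", "I-PER"]

# Hypothetical transition scores: every move is neutral (0.0) except
# O -> I-PER, which is heavily penalised because an entity cannot
# continue without having started.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    prev: {cur: 0.0 for cur in TAGS} for prev in TAGS
}
TRANSITIONS["O"]["I-PER"] = -10.0


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence for one non-empty sentence."""
    # best[t][tag]: best score of any tag path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in TAGS}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in TAGS:
            # Pick the previous tag that maximises path score + transition score.
            prev = max(TAGS, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final tag.
    path = [max(TAGS, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up emission scores for a two-character span. Taken position by
# position, the top tags would be O then I-PER, which is not a valid
# BIO sequence.
toy_emissions = [
    {"O": 1.0, "B-PER": 0.8, "I-PER": 0.0},
    {"O": 0.5, "B-PER": 0.1, "I-PER": 2.0},
]
print(viterbi_decode(toy_emissions, TRANSITIONS))
```

In this toy case the transition penalty steers the decoder to `B-PER, I-PER` rather than the invalid `O, I-PER`, which illustrates how sequence-level scoring can use context that a per-token rule or pattern would miss. In a trained model the transition scores are learned from the tagged data rather than set by hand.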