Korean-Centered Cross-Lingual Parallel Sentence Corpus Construction Experiment

Danxin Cui, Yude Bi
DOI: https://doi.org/10.1109/ialp61005.2023.10336988
2023-01-01
Abstract: This paper proposes an efficient method for parallel sentence extraction based on the LASER model. For this work, we mined reports in four languages from the DONG-A ILBO website, dating from 1999 to 2022. We designed a custom sentence segmentation scheme based on the features of each language, and we regulated the quality of the selected sentences with a suitable similarity threshold. An experiment on a relatively small dataset of 2,000 articles (500 articles per language) demonstrated the strong performance of our method, which lays a solid foundation for constructing a large diachronic Korean-centered four-language parallel sentence corpus.
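The threshold-based selection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each sentence has already been encoded into a fixed-size vector (for example, by a multilingual encoder such as LASER) and that candidate pairs are scored with plain cosine similarity. The abstract does not state the scoring function or the threshold value, so the `mine_pairs` helper, the `threshold=0.8` default, and the toy vectors below are hypothetical and are not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """For each source sentence, keep its best-scoring target sentence
    only if the similarity reaches the threshold (a hypothetical value)."""
    if not tgt_vecs:
        return []
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy embeddings standing in for real sentence vectors (assumption).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, -1.0]]

pairs = mine_pairs(src, tgt, threshold=0.8)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (similarity {score:.3f})")
```

In this toy run the third source vector has no target above the threshold, so it is dropped. Raising the threshold trades recall for precision, which is the kind of quality control the abstract describes.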