A Construction Method of Multilingual Comparable Corpus in the Background of Artificial Intelligence and Internet of Things

Dong Shumin, Weng Yu, Chaomurilige
DOI: https://doi.org/10.1109/ithings-greencom-cpscom-smartdata-cybermatics60724.2023.00071
2024-01-01
Abstract: Comparable corpora are a critical component in applications of the Artificial Intelligence and Internet of Things (AIoT). AIoT provides a more extensive data source for corpora, which also presents new requirements and challenges for constructing comparable corpora that adapt to multilingual application scenarios. Comparable corpora therefore play an essential role in research on language information processing and multilingual applications. However, multilingual comparable corpora are rare, so there is an urgent need to construct multilingual corpus resources. This paper proposes a method for constructing a multilingual comparable corpus, taking a Chinese-Uighur-Tibetan news corpus as an example. The method maps the corpora of the different languages into a unified language vector space, then calculates the similarity between news texts in different languages and uses it as a comparability index to establish comparable relations. Through a decision-making mechanism that minimizes impossibility, the method selects candidate comparable pairs of multilingual news texts at the chapter (document) level, yielding a Chinese-Uighur-Tibetan news comparable corpus (CUTCC). Evaluation results show that the method is superior to existing methods in accuracy and F value. Finally, the multilingual comparable corpus constructed in this study provides valuable data resources and language services for multilingual situations and AIoT application scenarios.
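As a rough illustration of the similarity step described in the abstract, the sketch below assumes each news text has already been mapped to a vector in a shared space. It scores cross-language pairs with cosine similarity and keeps the best match per source document if it clears a threshold. The function names, the toy vectors, the choice of cosine similarity, and the threshold rule are all assumptions made for illustration. In particular, the threshold rule is a simple stand-in and not the paper's decision-making mechanism, which the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def comparable_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """For each source document, keep its best-scoring target document
    if the similarity reaches `threshold` (hypothetical selection rule)."""
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy 2-dimensional "document vectors" in a shared space (made-up values).
zh_docs = [[1.0, 0.0], [0.0, 1.0]]
ug_docs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]

pairs = comparable_pairs(zh_docs, ug_docs)
for i, j, score in pairs:
    print(f"zh[{i}] <-> ug[{j}]  similarity={score:.3f}")
```

The same function could be applied to each language pair in turn (for example Chinese-Uighur, Chinese-Tibetan, Uighur-Tibetan) to assemble three-way comparable groups, though how the paper combines the pairwise relations is not stated in the abstract.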