Is cross-linguistic advert flaw detection in Wikipedia feasible? A multilingual-BERT-based transfer learning approach

Muyan Li, Heshen Zhou, Jingrui Hou, Ping Wang, Erpei Gao
DOI: https://doi.org/10.1016/j.knosys.2022.109330
2022-09-27
Abstract: Wikipedia is one of the most prominent online platforms from which people acquire knowledge, so the quality of its articles is of great concern. Many scholars currently focus on quality assessment and quality flaw detection in Wikipedia articles. However, most consider only one language version, typically English. One major obstacle to conducting such research in non-English or multilingual scenarios is insufficient labeled data. To address this, we introduce transfer learning based on a pretrained multilingual model to verify whether cross-language flaw detection is feasible. Specifically, we chose the Advert flaw (content written like an advertisement) as our research objective; French, Spanish, and Chinese as the target-language scenarios; and English articles as the source scenario. Multilingual BERT combined with a sequential model was used to extract semantic features and build classifiers. Moreover, we compared three strategies (direct transfer, fine-tuning transfer, and nontransfer) to determine the best strategy for cross-language Advert flaw detection at different training sample scales. The experimental results demonstrated that the proposed model trained on the English dataset can identify the Advert flaw in other languages, and that fine-tuning transfer yields the best performance as the corpus size increases.
computer science, artificial intelligence
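The three strategies compared in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: synthetic 2-D points stand in for multilingual BERT features, a logistic regression stands in for the sequential classifier, and every name here (`make_data`, `train`, `accuracy`, `compare_strategies`, the `shift` parameter) is hypothetical. It only shows how the three training regimes differ in what data the classifier sees.

```python
import math
import random


def make_data(n, shift, seed):
    """Synthetic stand-in for shared embeddings; label 1 means 'advert-like'."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        # `shift` is an assumed way to mimic a source/target language gap.
        x = [rng.gauss(2.0 * y - 1.0 + shift, 1.0), rng.gauss(1.0 - 2.0 * y, 1.0)]
        data.append((x, y))
    return data


def train(data, weights=None, epochs=50, lr=0.1):
    """Logistic regression by SGD; pass `weights` to continue from a trained model."""
    w = list(weights) if weights is not None else [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            z = w[0] * x[0] + w[1] * x[1] + w[2]
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            w[2] -= lr * g
    return w


def accuracy(w, data):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(
        1 for x, y in data
        if (w[0] * x[0] + w[1] * x[1] + w[2] > 0) == (y == 1)
    )
    return correct / len(data)


def compare_strategies(n_target_train):
    source = make_data(500, shift=0.0, seed=0)  # plays the role of English
    target_train = make_data(n_target_train, shift=0.5, seed=1)
    target_test = make_data(300, shift=0.5, seed=2)
    source_w = train(source)
    return {
        # Direct transfer: source-trained model applied to the target as-is.
        "direct": accuracy(source_w, target_test),
        # Fine-tuning transfer: start from source weights, continue on target data.
        "fine_tuning": accuracy(train(target_train, weights=source_w), target_test),
        # Nontransfer: train on target data only, from scratch.
        "nontransfer": accuracy(train(target_train), target_test),
    }


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, compare_strategies(n))
```

Varying `n_target_train` mirrors the abstract's comparison across training sample scales. The numbers this toy prints say nothing about the paper's actual results.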