Joint Model Using Character and Word Embeddings for Detecting Internet Slang Words

Yihong Liu, Yohei Seki
DOI: https://doi.org/10.1007/978-3-030-91669-5_2
2021-01-01
Abstract: The language style on social media platforms is informal, and many Internet slang words are used. The presence of such out-of-vocabulary words significantly degrades the performance of language models used for linguistic analysis. This paper presents a novel corpus of Japanese Internet slang words in context and partitions them into two major types and 10 subcategories according to their definitions. Existing word-level and character-level embedding models have shown remarkable improvements on a variety of natural-language processing tasks but often struggle with out-of-vocabulary words such as slang words. We therefore propose a joint model that combines word-level and character-level embeddings as token representations of the text. We tested our model against other language models on type and subcategory recognition. The fine-grained subcategories make it possible to analyze the performance of each model in more detail according to the word formation of each Internet slang category. Our experimental results show that our joint model achieves state-of-the-art performance when dealing with Internet slang words, accurately detecting semantic changes while also locating the other type of slang, novel combinations of characters.
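To make the idea of a joint token representation concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how the two embeddings are combined, so concatenation and mean-pooling over characters are assumptions here, as are the toy vocabulary, the dimensions, the random vectors, and the names `make_table`, `char_vector`, and `joint_vector`.

```python
import random

DIM_WORD = 4   # assumed toy dimension for word embeddings
DIM_CHAR = 3   # assumed toy dimension for character embeddings
UNK = "<unk>"  # placeholder key for unknown words or characters


def make_table(keys, dim, seed):
    """Build a toy embedding table of random vectors (stand-in for trained ones)."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-1, 1) for _ in range(dim)] for k in keys}


# Hypothetical vocabularies, chosen only for illustration.
word_table = make_table(["草", "生える", UNK], DIM_WORD, seed=0)
char_table = make_table(list("草生える") + [UNK], DIM_CHAR, seed=1)


def char_vector(token):
    """Mean-pool the character embeddings of a token (pooling is an assumption)."""
    if not token:
        return [0.0] * DIM_CHAR
    vecs = [char_table.get(ch, char_table[UNK]) for ch in token]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def joint_vector(token):
    """Concatenate the word-level vector with the pooled character-level vector."""
    word_vec = word_table.get(token, word_table[UNK])
    return word_vec + char_vector(token)


# "生草" is not in the toy word vocabulary, so its word part falls back to <unk>,
# but its character part is still built from known characters.
oov = joint_vector("生草")
print(len(oov))
```

In this sketch an out-of-vocabulary token still receives a character-derived component even when its word-level component is the generic unknown vector, which is the intuition behind combining the two levels.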