Abstract: User-generated text in social media communication (SMC) is mainly characterized by non-standard forms. It may contain code-switching (CS) text, a widespread phenomenon in SMC, as well as noisy elements common in written conversations (abbreviations, symbols, emoticons) and misspelled words. All of these factors pose a barrier to text mining applications. Common text mining tools are designed for standard use of standard languages and cannot handle other forms, especially written text in social media. To overcome these problems, we present a solution for normalizing non-standard use of standard and non-standard languages (dialects) in SMC text using existing resources and tools. The main processing step is CS normalization from multiple languages to one language by means of a machine-translation-like approach. This step relies on a linguistic approach to CS, which aims to identify the translation source and target languages automatically (without human intervention). The remaining processing operations concern the normalization of SMC special expressions and the spelling correction of out-of-vocabulary words. To preserve the meaning of the code-switched sentence across translation, we adopt a knowledge-based approach for word sense translation disambiguation, reinforced with a multilingual vertical context. All of these processes are embedded in what we refer to as the machine normalization system. Our solution can be used as a front end for text mining, enabling the analysis of noisy SMC text. The conducted experiments show that our system performs better than the considered baselines.

Normalization of Lithuanian Text Using Regular Expressions

Text Normalization in Mandarin Text-to-speech System

Text Normalization in Chinese Text-to-Speech System

A Three-Stage Text Normalization Strategy for Mandarin Text-to-Speech Systems

Normalization of Non-Standard Words in Croatian Texts

Normalizing Text using Language Modelling based on Phonetics and String Similarity

Two Approaches to Diachronic Normalization of Polish Texts

Document Structure Analysis and Text Normalization for Chinese Putonghua and Cantonese Text-to-Speech Synthesis

Ship License Number Layout Normalization Based on Regional Texts Fine Localization

Transformer-based Models of Text Normalization for Speech Applications

An End-to-end Chinese Text Normalization Model Based on Rule-guided Flat-Lattice Transformer

What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations

A Unified Tagging Approach to Text Normalization

Machine Normalization

On the performance of phonetic algorithms in microtext normalization

Biomedical Text Normalization through Generative Modeling

RNN Approaches to Text Normalization: A Challenge

A Chat About Boring Problems: Studying GPT-based text normalization

A Large-Scale Comparison of Historical Text Normalization Systems

Adapting Sequence to Sequence models for Text Normalization in Social Media