Abstract: User-generated text in social media communication (SMC) is mainly characterized by non-standard forms. It may contain code-switching (CS) text, a widespread phenomenon in SMC, as well as noisy elements common in written conversations (abbreviations, symbols, emoticons) and misspelled words. All of these factors are obstacles for text mining applications: common text mining tools are designed for the standard use of standard languages and cannot handle other forms, especially text written in social media. To overcome these problems, we present a solution for normalizing the non-standard use of standard and non-standard languages (dialects) in SMC text using existing resources and tools. The main processing step is CS normalization from multiple languages to one language by means of a machine-translation-like approach. This step relies on a linguistic approach to CS that automatically identifies the translation source and target languages, without human intervention. The remaining processing operations concern the normalization of SMC special expressions and the spelling correction of out-of-vocabulary words. To preserve the meaning of the code-switched sentence across translation, we adopt a knowledge-based approach to word sense translation disambiguation, reinforced with a multilingual vertical context. All of these processes are embedded in what we refer to as the machine normalization system. Our solution can be used as a front end to text mining processing, enabling the analysis of noisy SMC text. The conducted experiments show that our system performs better than the considered baselines.
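
The two secondary operations named in the abstract, normalization of SMC special expressions and spelling correction of out-of-vocabulary words, can be illustrated with a minimal sketch. This is not the authors' system: the lexicon `SMC_EXPRESSIONS`, the word list `VOCABULARY`, the similarity cutoff, and the function names are all hypothetical, and the code-switching translation step is omitted because the abstract does not specify it.

```python
import difflib

# Hypothetical toy resources (assumptions); the abstract does not list
# the lexicons or vocabularies the actual system relies on.
SMC_EXPRESSIONS = {"u": "you", "gr8": "great", "thx": "thanks"}
VOCABULARY = ["you", "are", "great", "thanks", "for", "the", "message"]


def normalize_token(token: str) -> str:
    """Normalize one token: expand SMC expressions, then fix spelling."""
    lowered = token.lower()
    # Step 1: expand known SMC special expressions (abbreviations, etc.).
    if lowered in SMC_EXPRESSIONS:
        return SMC_EXPRESSIONS[lowered]
    # Step 2: in-vocabulary words need no correction.
    if lowered in VOCABULARY:
        return lowered
    # Step 3: out-of-vocabulary word; pick the closest vocabulary entry,
    # if any is similar enough (the 0.8 cutoff is an arbitrary assumption).
    matches = difflib.get_close_matches(lowered, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else lowered


def normalize(text: str) -> str:
    """Normalize a whitespace-tokenized message token by token."""
    return " ".join(normalize_token(t) for t in text.split())


if __name__ == "__main__":
    print(normalize("thx for the mesage u are gr8"))
```

A token that matches neither resource is passed through unchanged, so the sketch never invents a correction it cannot justify from its word list.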

ASR Normalization for Machine Translation

Pitch envelope based frame level score reweighed algorithm for emotion robust speaker recognition

The Impact of ASR on Speech-to-Speech Translation Performance

A Three-Stage Text Normalization Strategy for Mandarin Text-to-Speech Systems

DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations

Text Normalization in Mandarin Text-to-speech System

Machine Normalization

Statistical Thresholding for Robust ASR

Improving the Robustness of Speech Translation

Speaker voice normalization for end-to-end speech translation

Attentive batch normalization for lstm-based acoustic modeling of speech recognition

End-to-end Joint Punctuated and Normalized ASR with a Limited Amount of Punctuated Training Data

Text Normalization in Chinese Text-to-Speech System

Improved Long-Form Spoken Language Translation with Large Language Models

Optimizing Byte-level Representation for End-to-end ASR

Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking

Robust Neural Machine Translation with ASR Errors

Cepstral Shape Normalization (CSN) for Robust Speech Recognition

Research on Score Domain Speaking Rate Normalization for Speaker Recognition

Maximum Gaussianality training for deep speaker vector normalization