Abstract: Subword tokenization schemes are the dominant technique used in current NLP models. However, such schemes can be rigid, and tokenizers built on one corpus do not adapt well to other parallel corpora. It has also been observed that in multilingual corpora, subword tokenization schemes over-segment low-resource languages, leading to a drop in translation performance. A simple alternative to subword tokenizers is byte-based methods, i.e., tokenization into byte sequences using encoding schemes such as UTF-8. Byte tokens often represent inputs at a sub-character granularity, i.e., one character can be represented by a sequence of multiple byte tokens. This results in byte sequences that are significantly longer than character sequences. Enforcing aggregation of local information in the lower layers can guide the model to build higher-level semantic information. We propose a Local Byte Fusion (LOBEF) method for byte-based machine translation -- utilizing byte $n$-gram and word boundaries -- to aggregate local semantic information. Extensive experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional byte-based models and even over subword techniques. Further analysis also indicates that our byte-based models are parameter-efficient and can be trained faster than subword models.
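
The sub-character granularity mentioned in the abstract can be seen directly from UTF-8 itself. The following is a minimal sketch, not code from the paper: the helper names `byte_tokens` and `length_ratio` and the sample strings are assumptions chosen for illustration.

```python
def byte_tokens(text: str) -> list[int]:
    """Tokenize text into its UTF-8 byte values (one token per byte)."""
    return list(text.encode("utf-8"))


def length_ratio(text: str) -> float:
    """Byte tokens per character: 1.0 for pure ASCII, larger otherwise."""
    return len(byte_tokens(text)) / len(text)


# Escapes keep the samples unambiguous: "caf\u00e9" ends in U+00E9,
# and "\u732b" is a single CJK character.
for sample in ["cat", "caf\u00e9", "\u732b"]:
    tokens = byte_tokens(sample)
    print(f"{sample!r}: {len(sample)} chars -> {len(tokens)} byte tokens {tokens}")
```

ASCII text maps one character to one byte, while accented Latin and CJK characters expand to two or three bytes each, which is why byte sequences end up longer than the corresponding character sequences.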

What problem does this paper attempt to address?

The paper addresses two issues in Neural Machine Translation (NMT): byte-level representation and multilingual adaptability. Specifically:

1. **Problems with byte-level representation**:
   - Existing multilingual NMT models rely on pre-built subword tokenizers (such as BPE or SentencePiece), which causes two main problems. First, the models lack flexibility when adapting to other parallel corpora. Second, when the subword tokenizers are built on unbalanced multilingual data, words in low-resource languages are over-segmented into multiple single characters, which hurts translation performance.
   - Character-level models have been proposed as a solution, but they incur significant overhead when processing long character sequences during training and inference. They also still use a fixed character vocabulary, so they share the problems of fixed subword tokenizers.
2. **Problems with multilingual adaptability**:
   - In a multilingual setting, characters of different languages may be tokenized into different numbers of bytes, which produces byte sequences longer than the original sentences. Traditional byte-level NMT models can only reconstruct character-level or word-level representations implicitly, in a completely data-driven manner, which may lead to poor interpretability and sub-optimal results.

To solve these problems, the authors propose two Local Byte Fusion (LOBEF) techniques that explicitly aggregate byte-level information and learn character-level and word-level representations:

- **n-gram Convolutional Fusion (nCF)**: Uses four different 1-D convolutional layers to aggregate n-gram byte tokens, thereby learning character-level information.
- **Word-boundary-based Self-Attention Fusion (WSF)**: Uses word-boundary information in the original sentence to aggregate byte tokens within words, thereby learning word-level context.
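
The idea behind n-gram convolutional fusion can be illustrated with a small pure-Python sketch. It is not the authors' implementation and rests on several assumptions: the four convolution widths are taken to be 1 through 4, the learned kernels are replaced by fixed averaging kernels, the byte embeddings are random, padding is on the left, and the four branch outputs are concatenated. The names `embed_bytes`, `conv1d_same`, and `ngram_fusion` are hypothetical.

```python
import random


def embed_bytes(text, dim=4, seed=0):
    """Map each UTF-8 byte to a toy embedding (random here, learned in a real model)."""
    rng = random.Random(seed)
    table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(256)]
    return [table[b] for b in text.encode("utf-8")]


def conv1d_same(seq, kernel):
    """Per-channel 1-D convolution along the sequence axis.

    Left zero padding keeps the output the same length as the input
    (an assumption; the source does not specify the padding scheme).
    """
    width, dim = len(kernel), len(seq[0])
    padded = [[0.0] * dim] * (width - 1) + seq
    out = []
    for t in range(len(seq)):
        window = padded[t:t + width]
        out.append([sum(k * vec[c] for k, vec in zip(kernel, window))
                    for c in range(dim)])
    return out


def ngram_fusion(seq, max_n=4):
    """Run one convolution per n-gram width (1..max_n) and concatenate the results."""
    # Averaging kernels stand in for the learned kernels of a trained model.
    branches = [conv1d_same(seq, [1.0 / n] * n) for n in range(1, max_n + 1)]
    return [sum((branch[t] for branch in branches), []) for t in range(len(seq))]


byte_embeddings = embed_bytes("n\u00e9")   # 3 bytes: "n" plus the 2-byte U+00E9
fused = ngram_fusion(byte_embeddings)
print(len(fused), len(fused[0]))           # one fused vector per byte position
```

Each output position mixes the current byte with up to three preceding bytes, so a multi-byte character can be represented jointly rather than one byte at a time.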
Through these methods, the authors aim to improve the performance of byte-level NMT models in multilingual translation, zero-shot and few-shot cross-lingual transfer, and cross-domain adaptation tasks. The reported experiments show consistent improvements in translation quality across multiple languages and domains, over traditional byte-based models and even over subword techniques.
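
The word-boundary restriction used by WSF can be sketched as an attention mask under which a byte attends only to bytes of the same word. This is a minimal illustration, not the paper's model. It assumes that words are delimited by spaces, that a space byte is grouped with the preceding word, and that attention uses raw dot products with no learned projections. The names `byte_word_ids`, `word_mask`, and `masked_self_attention` are hypothetical.

```python
import math


def byte_word_ids(sentence: str) -> list[int]:
    """Give every UTF-8 byte the index of the space-delimited word it belongs to.

    A space byte is grouped with the word before it (an arbitrary choice).
    """
    ids, word = [], 0
    for ch in sentence:
        ids.extend([word] * len(ch.encode("utf-8")))
        if ch == " ":
            word += 1
    return ids


def word_mask(ids):
    """mask[i][j] is True when bytes i and j lie in the same word."""
    return [[a == b for b in ids] for a in ids]


def masked_self_attention(seq, mask):
    """Scaled dot-product self-attention restricted to positions allowed by mask."""
    dim = len(seq[0])
    out = []
    for i, query in enumerate(seq):
        allowed = [j for j in range(len(seq)) if mask[i][j]]
        scores = [sum(q * k for q, k in zip(query, seq[j])) / math.sqrt(dim)
                  for j in allowed]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * seq[j][c] for w, j in zip(weights, allowed)) / total
                    for c in range(dim)])
    return out


sentence = "le caf\u00e9"
ids = byte_word_ids(sentence)
mask = word_mask(ids)
# Toy 2-dimensional byte features; a real model would use learned embeddings.
seq = [[b / 255.0, 1.0] for b in sentence.encode("utf-8")]
fused = masked_self_attention(seq, mask)
print(ids)  # [0, 0, 0, 1, 1, 1, 1, 1]
```

The two bytes of the accented character share a word index with the rest of their word, so the fused representation at each byte draws only on within-word context.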

Local Byte Fusion for Neural Machine Translation

Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation

Word-Level Representation From Bytes For Language Modeling

Finding Better Subword Segmentation for Neural Machine Translation

Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes

End-to-End Multilingual Text Recognition Based on Byte Modeling

Simple Fusion: Return of the Language Model

Beyond Shared Vocabulary: Increasing Representational Word Similarities across Languages for Multilingual Machine Translation

Enhanced Neural Machine Translation by Joint Decoding with Word and POS-tagging Sequences

Training Multilingual Pre-trained Language Model with Byte-level Subwords

Optimizing Byte-level Representation for End-to-end ASR

Optimizing Segmentation Granularity for Neural Machine Translation

Multilingual Mix: Example Interpolation Improves Multilingual Neural Machine Translation

Improving Zero-shot Neural Machine Translation on Language-specific Encoders-Decoders

Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation

Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations

Neural System Combination For Machine Translation

Local Translation Prediction with Global Sentence Representation

Pivot-Based Semantic Splicing for Neural Machine Translation

Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation

Lexical-Constraint-Aware Neural Machine Translation via Data Augmentation