Local Byte Fusion for Neural Machine Translation

Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, Junjie Hu
2023-06-28
Abstract: Subword tokenization schemes are the dominant technique used in current NLP models. However, such schemes can be rigid, and tokenizers built on one corpus do not adapt well to other parallel corpora. It has also been observed that in multilingual corpora, subword tokenization schemes over-segment low-resource languages, leading to a drop in translation performance. A simple alternative to subword tokenizers is byte-based methods, i.e., tokenization into byte sequences using encoding schemes such as UTF-8. Byte tokens often represent inputs at a sub-character granularity, i.e., one character can be represented by a sequence of multiple byte tokens. This results in byte sequences that are significantly longer than character sequences. Enforcing aggregation of local information in the lower layers can guide the model to build higher-level semantic information. We propose a Local Byte Fusion (LOBEF) method for byte-based machine translation -- utilizing byte $n$-gram and word boundaries -- to aggregate local semantic information. Extensive experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional byte-based models and even over subword techniques. Further analysis also indicates that our byte-based models are parameter-efficient and can be trained faster than subword models.
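
The abstract's point that one character can map to several byte tokens can be checked directly. The following is a minimal sketch, not code from the paper; the helper names `byte_tokenize` and `bytes_per_char` and the sample strings are illustrative assumptions.

```python
from typing import List


def byte_tokenize(text: str) -> List[int]:
    """Tokenize text into its UTF-8 byte values (integers in 0..255)."""
    return list(text.encode("utf-8"))


def bytes_per_char(text: str) -> List[int]:
    """Return how many UTF-8 bytes each character of `text` occupies."""
    return [len(ch.encode("utf-8")) for ch in text]


if __name__ == "__main__":
    # Illustrative samples (hypothetical, not taken from the paper):
    # ASCII, Latin with an accent (U+00E9), and CJK text.
    for sample in ["cat", "caf\u00e9", "\u65e5\u672c\u8a9e"]:
        tokens = byte_tokenize(sample)
        print(sample, "chars:", len(sample), "bytes:", len(tokens),
              "per char:", bytes_per_char(sample))
```

ASCII characters take one byte each, while accented Latin and CJK characters take two and three bytes respectively, which is why byte sequences grow longer than character sequences for many languages.
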
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The problems that this paper attempts to solve mainly focus on byte-level representation and multilingual adaptability in Neural Machine Translation (NMT). Specifically:

1. **Problems with byte-level representation**:
   - Existing multilingual NMT models rely on pre-built subword tokenizers (such as BPE, SentencePiece), which lead to two main problems: first, the models lack flexibility when adapting to other parallel corpora; second, when the subword tokenizers are built on unbalanced multilingual data, words in low-resource languages will be over-segmented into multiple single characters, thus affecting translation performance.
   - Although character-level models have been proposed as a solution, they have significant overheads when dealing with long character sequences during training and inference, and still use a fixed character vocabulary, resulting in the same problems as the fixed subword tokenizers.
2. **Problems with multilingual adaptability**:
   - In a multilingual setting, characters of different languages may be tokenized into different numbers of bytes, which will produce byte sequences longer than the original sentences. Traditional byte-level NMT models can only implicitly reconstruct character-level or word-level representations in a completely data-driven manner, which may lead to poor interpretability and sub-optimal results.

To solve these problems, the authors propose two Local Byte Fusion (LOBEF) techniques to explicitly aggregate byte-level information and learn character-level and word-level representations. These techniques include:

- **n-gram Convolutional Fusion (nCF)**: Use four different 1-D convolutional layers to aggregate n-gram byte tokens, thereby learning character-level information.
- **Word-boundary-based Self-Attention Fusion (WSF)**: Utilize word-boundary information in the original sentence to aggregate byte tokens within words, thereby learning word-level context.

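
The n-gram convolutional fusion idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the kernel widths 1 to 4, the left zero-padding, the uniform (averaging) kernels standing in for learned weights, and the concatenation of the four outputs are all hypothetical choices, as are the function names.

```python
from typing import List, Sequence

Vector = List[float]


def conv1d_left_padded(x: Sequence[Vector], kernel: Sequence[float]) -> List[Vector]:
    """Apply one 1-D kernel along the sequence axis, per embedding dimension.

    The input is implicitly zero-padded on the left (an assumption), so the
    output has the same length as the input and position t only sees the
    bytes at positions t-k+1 .. t.
    """
    k = len(kernel)
    dim = len(x[0]) if x else 0
    out: List[Vector] = []
    for t in range(len(x)):
        acc = [0.0] * dim
        for j, w in enumerate(kernel):
            src = t - (k - 1) + j
            if src >= 0:  # positions before the sequence start count as zeros
                for d in range(dim):
                    acc[d] += w * x[src][d]
        out.append(acc)
    return out


def ngram_conv_fusion(byte_embeddings: Sequence[Vector],
                      widths: Sequence[int] = (1, 2, 3, 4)) -> List[Vector]:
    """Fuse byte n-gram information from four convolutions (hypothetical widths).

    Each width uses a uniform averaging kernel as a stand-in for learned
    weights; the per-width outputs are concatenated at every byte position.
    """
    fused: List[Vector] = [[] for _ in byte_embeddings]
    for k in widths:
        kernel = [1.0 / k] * k
        for t, vec in enumerate(conv1d_left_padded(byte_embeddings, kernel)):
            fused[t].extend(vec)
    return fused


if __name__ == "__main__":
    # Three bytes with toy 1-dimensional embeddings.
    for row in ngram_conv_fusion([[1.0], [2.0], [3.0]]):
        print(row)
```

In a trained model the kernels would be learned and the fused vectors passed to the Transformer layers; the sketch only shows how each byte position gathers information from its preceding n-gram window.
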
Through these methods, the authors aim to improve the performance of byte-level NMT models in multilingual translation, zero-shot / few-shot cross-language transfer, and cross-domain adaptation tasks. The reported experiments show consistent improvements over traditional byte-based models, and even over subword techniques, across multiple languages and domains.
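
The word-boundary fusion (WSF) described above restricts aggregation to bytes inside the same word. A minimal sketch of that restriction follows, assuming whitespace-delimited words and assuming each whitespace byte is grouped with the word that follows it; both conventions, and the names `byte_word_ids`, `word_local_mask`, and `masked_softmax`, are hypothetical and not taken from the paper.

```python
import math
from typing import List


def byte_word_ids(sentence: str) -> List[int]:
    """Give every UTF-8 byte the index of the word it belongs to.

    Assumption: a whitespace character starts a new word and is grouped
    with it, except for whitespace at the very start of the sentence.
    """
    ids: List[int] = []
    word = 0
    for ch in sentence:
        if ch.isspace() and ids:
            word += 1
        ids.extend([word] * len(ch.encode("utf-8")))
    return ids


def word_local_mask(word_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when bytes i and j fall inside the same word."""
    return [[a == b for b in word_ids] for a in word_ids]


def masked_softmax(scores: List[float], allowed: List[bool]) -> List[float]:
    """Softmax over the allowed positions only; masked positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        return [0.0] * len(scores)
    top = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    ids = byte_word_ids("caf\u00e9 noir")
    mask = word_local_mask(ids)
    # Uniform toy scores: the first byte spreads its attention over its own word.
    print(ids)
    print(masked_softmax([0.0] * len(ids), mask[0]))
```

With real attention scores in place of the uniform toy scores, each byte would attend only to bytes within its own word, which is the word-level context the summary attributes to WSF.
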