Filtering and Extended Vocabulary based Translation for Low-resource Language Pair of Sanskrit-Hindi

Piyush Jha, Rashi Kumar, Vineet Sahula
DOI: https://doi.org/10.1145/3580495
IF: 1.471
2023-04-12
ACM Transactions on Asian and Low-Resource Language Information Processing
Abstract: Neural Machine Translation (NMT) is widely employed for language translation tasks because it performs better than conventional statistical and phrase-based approaches. However, NMT techniques involve challenges, such as requiring a large and clean corpus of parallel data and being unable to deal with rare words. They also need to be faster for real-time applications. More work needs to be done using NMT to address the challenges in translating Sanskrit, one of the oldest and richest languages known to the world, with its morphological richness and limited multilingual parallel corpus. There is usually no comparable data available for a given language pair; hence, no application exists so far that can translate Sanskrit to/from other languages. This study presents an in-depth analysis to address these challenges with the help of the low-resource Sanskrit-Hindi language pair. We employ a novel training-corpus filtering approach with an extended vocabulary in a zero-shot transformer architecture. The structure of the Sanskrit language is thoroughly investigated to justify the use of each step. Furthermore, the proposed method is analyzed across variations in sentence length and is also applied to a high-resource language pair to demonstrate its efficacy.
computer science, artificial intelligence