A reproduction of Apple's bi-directional LSTM models for language identification in short strings

Mads Toftrup, Søren Asger Sørensen, Manuel R. Ciosici, Ira Assent
DOI: https://doi.org/10.48550/arXiv.2102.06282
2021-02-12
Abstract: Language Identification is the task of identifying a document's language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.
Computation and Language, Machine Learning