Abstract: Language identification is a critical component of language processing pipelines (Jauhiainen et al., 2019) and is not a solved problem in real-world settings. We present a lightweight and effective language identifier that is robust to changes of domain and to the absence of copious training data. The key idea for classification is that the reciprocal of the rank in a frequency table makes an effective additive feature score, hence the term Reciprocal Rank Classifier (RRC). The key finding for language classification is that ranked lists of words and frequencies of characters form a sufficient and robust representation of the regularities of key languages and their orthographies. We test this on two 22-language data sets and demonstrate zero-effort domain adaptation from a Wikipedia training set to a Twitter test set. When trained on Wikipedia but applied to Twitter, the macro-averaged F1-score of a conventionally trained SVM classifier drops from 90.9% to 77.7%. By contrast, the macro F1-score of RRC drops only from 93.1% to 90.6%. These classifiers are compared with those from fastText and langid. The RRC performs better than these established systems in most experiments, especially on short Wikipedia texts and Twitter. The RRC classifier can be improved for particular domains and conversational situations by adding words to the ranked lists. Using new terms learned from such conversations, we demonstrate a further 7.9% increase in accuracy of sample message classification, and a 1.7% increase for conversation classification. Surprisingly, this made results on Twitter data slightly worse. The RRC classifier is available as an open source Python package (https://github.com/LivePersonInc/lplangid).
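
The central idea of the abstract, scoring a text by summing the reciprocal ranks of its words in per-language frequency tables, can be illustrated with a minimal sketch. This is not the lplangid package's API: the function names (`build_ranks`, `score`, `classify`), the whitespace tokenization, and the toy corpus are all assumptions made for illustration, and the character-frequency component mentioned in the abstract is left out.

```python
from collections import Counter

def build_ranks(texts):
    """Map each word to its 1-based rank in a frequency table built from texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def score(text, ranks):
    """Additive score: sum of 1/rank over the words found in the ranked list."""
    return sum(1.0 / ranks[w] for w in text.lower().split() if w in ranks)

def classify(text, ranks_by_lang):
    """Return the language whose ranked list gives the highest score."""
    return max(ranks_by_lang, key=lambda lang: score(text, ranks_by_lang[lang]))

# Hypothetical toy training data; a real system would use large corpora.
TOY_CORPUS = {
    "en": ["the cat sat on the mat", "the dog and the cat"],
    "es": ["el gato y el perro", "el perro come en la casa"],
}
RANKS = {lang: build_ranks(texts) for lang, texts in TOY_CORPUS.items()}

print(classify("the dog", RANKS))
print(classify("el perro", RANKS))
```

Because the score is a plain sum over list entries, adapting to a new domain in this sketch would amount to inserting extra words into a language's rank dictionary, which mirrors the abstract's remark that the classifier can be improved by adding words to the ranked lists.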

LexFindR: A fast, simple, and extensible R package for finding similar words in a lexicon

FastLexRank: Efficient Lexical Ranking for Structuring Social Media Posts

LDAPrototype: a model selection algorithm to improve reliability of latent Dirichlet allocation

Language Identification with a Reciprocal Rank Classifier

LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors

LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation

Nearest Neighbor Search over Vectorized Lexico-Syntactic Patterns for Relation Extraction from Financial Documents

SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation

Similar Word Model for Unfrequent Word Enhancement in Speech Recognition

The English Sublexical Toolkit: Methods for indexing sound–spelling consistency

Low-frequency word enhancement with similar pairs in speech recognition

BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search

Portable BLAST-like algorithm library and its implementations for command line, Python, and R

RIscoper 2.0: a deep learning tool to extract RNA biomedical relation sentences from literature

Toward a broader – but still rigorous – definition of leader integrity: Commentary

SLiCK: Exploiting Subsequences for Length-Constrained Keyword Spotting

Lexicon-Based Approach to Sentiment Analysis of Tweets Using R Language

Prompt-based Personality Profiling: Reinforcement Learning for Relevance Filtering

Learning Word Ratings for Empathy and Distress from Document-Level User Responses

ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution

Assessing Accuracy: A Study of Lexicon and Rule-Based Packages in R and Python for Sentiment Analysis