Abstract: The likelihood ratio paradigm for quantifying the strength of evidence has been researched in many fields of forensic science. Within this paradigm, score-based approaches for estimating likelihood ratios are becoming more prevalent in the forensic science literature. In this study, a score-based approach for estimating likelihood ratios is implemented for linguistic text evidence. Text data are represented via a bag-of-words model using the Z-score normalised relative frequencies of the N most-frequent words, and the Euclidean, Manhattan and Cosine distance measures are trialled as the score-generating functions for comparing paired text samples. The score-to-likelihood-ratio conversion model was built using a common source method, and the best-fitting model was selected from among the Normal, Log-normal, Gamma and Weibull parametric distributions. Using the Amazon Product Data Authorship Verification Corpus, two groups of documents (each group including documents of approximately 700, 1400 and 2100 words) were synthesised for each author, allowing 720 same-author comparisons and 517,680 different-author comparisons to test the validity of the system. A series of experiments was conducted using combinations of the following conditions: the three score functions, different values of N for the feature vector, and the different document lengths. The validity of the system was assessed using the log-likelihood-ratio cost (Cllr), and the strength of the derived likelihood ratios was charted in the form of Tippett plots. It was demonstrated that (1) the Cosine measure consistently outperforms the other measures, with the best performance achieved at N = 260 regardless of document length (e.g., Cllr values of 0.70640, 0.45314 and 0.30692 for 700, 1400 and 2100 words, respectively), and (2) the derived likelihood ratios are very well calibrated irrespective of the distance measure and document length.
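To make the feature representation and the three score functions concrete, the following is a minimal Python sketch rather than the study's actual implementation. The function names are hypothetical. It assumes that relative frequencies are taken against the total document length, that Z-scores use the population standard deviation across the documents supplied, and that the Cosine distance is defined as one minus the cosine similarity; the abstract does not specify these details.

```python
import numpy as np

def zscore_relative_frequencies(counts, doc_lengths):
    """Turn raw counts of the N selected words into Z-scored relative frequencies.

    counts:      (n_docs, N) array of raw word counts per document.
    doc_lengths: (n_docs,) array of total word counts per document (assumption).
    """
    counts = np.asarray(counts, dtype=float)
    doc_lengths = np.asarray(doc_lengths, dtype=float)
    # Relative frequency of each selected word within its document.
    rel = counts / doc_lengths[:, None]
    # Z-score each word (column) across documents; guard against zero variance.
    std = rel.std(axis=0)
    std[std == 0] = 1.0
    return (rel - rel.mean(axis=0)) / std

def euclidean_distance(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan_distance(a, b):
    return float(np.sum(np.abs(a - b)))

def cosine_distance(a, b):
    # Assumed definition: 1 - cosine similarity.
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy illustration with two documents and N = 2 selected words (made-up counts).
features = zscore_relative_frequencies([[1, 3], [3, 1]], [10, 10])
score = cosine_distance(features[0], features[1])
```

Each distance yields one score per document pair; in a score-based system, such scores would then be converted to likelihood ratios by a separately fitted model.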
A follow-up experiment showed that the described score-based approach remains relatively robust and stable when only a limited quantity of background data is available. The likelihood ratios estimated separately with the three distance measures were then fused by logistic regression, and the fusion further improved performance (for example, a Cllr of 0.23494 for 2100 words). This study demonstrates the possibility of designing likelihood ratio-based systems that discriminate between same-author and different-author documents.
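For readers unfamiliar with the metric, the log-likelihood-ratio cost can be sketched as follows. This uses the standard definition of Cllr, in which same-author likelihood ratios are penalised for being small and different-author likelihood ratios for being large; the function name and the toy values are illustrative assumptions, not data from the study.

```python
import numpy as np

def cllr(same_author_lrs, different_author_lrs):
    """Log-likelihood-ratio cost for a set of likelihood ratios (not log LRs)."""
    ss = np.asarray(same_author_lrs, dtype=float)
    ds = np.asarray(different_author_lrs, dtype=float)
    # Same-author comparisons: cost grows as the LR falls below 1.
    penalty_ss = np.mean(np.log2(1.0 + 1.0 / ss))
    # Different-author comparisons: cost grows as the LR rises above 1.
    penalty_ds = np.mean(np.log2(1.0 + ds))
    return 0.5 * (penalty_ss + penalty_ds)

# A system that always reports LR = 1 carries no information and scores 1.0.
uninformative = cllr([1.0, 1.0], [1.0, 1.0])
```

Values below 1 indicate that the system provides useful information, and lower values are better, which is how the Cllr figures reported above should be read.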

Boosting word frequencies in authorship attribution

A Bayesian approach to uncertainty in word embedding bias estimation

Authorship Attribution through Function Word Adjacency Networks

Axiomatic Quantification of Co-authors' Relative Contributions

Score-based likelihood ratios for linguistic text evidence with a bag-of-words model

Solving Cosine Similarity Underestimation between High Frequency Words by L2 Norm Discounting

Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English

FRAGE: Frequency-Agnostic Word Representation

N-gram Boosting: Improving Contextual Biasing with Normalized N-gram Targets

Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings

A Geometric Counting Method Adaptive to the Author Number

Author contributions and allocation of authorship credit: testing the validity of different counting methods in the field of chemical biology

Enhancing ASR Performance through OCR Word Frequency Analysis: Theoretical Foundations

A model-independent redundancy measure for human versus ChatGPT authorship discrimination using a Bayesian probabilistic approach

Assessing Word Importance Using Models Trained for Semantic Tasks

Normalized Paper Credit Assignment: A Solution for the Ethical Dilemma Induced by Multiple Important Authors

Authorship attribution based on a probabilistic topic model

Maximum Entropy, Word-Frequency, Chinese Characters, and Multiple Meanings

Probabilistic Method of Measuring Linguistic Productivity

The Undesirable Dependence on Frequency of Gender Bias Metrics Based on Word Embeddings