Data modelling in corpus linguistics: How low may we go?

Marjolein H. van Velzen, Luca Nanetti, Peter P. de Deyn
DOI: https://doi.org/10.1016/j.cortex.2013.10.010
IF: 4.644
2014-06-01
Cortex
Abstract: Corpus linguistics allows researchers to process millions of words. However, the more words we analyse, i.e., the more data we acquire, the more urgent the call for correct data interpretation becomes. In recent years, a number of studies have appeared that attempt to profile the linguistic decline of some prolific authors, linking this decline to pathological conditions such as Alzheimer's Disease (AD). However, in line with the nature of the (literary) work that was analysed, numbers alone do not suffice to 'tell the story'. The one and only objective of using statistical methods for the analysis of research data is to tell a story: what happened, when, and how. In the present study we describe a computerised but individualised approach to linguistic analysis. We propose a unifying approach, firmly grounded in Information Theory, that, independently of the specific parameter being investigated, guarantees a robust model of the temporal dynamics of an author's linguistic richness over his or her lifetime. We applied this methodology to six renowned authors with an active writing life of four decades or more: Iris Murdoch, Gerard Reve, Hugo Claus, Agatha Christie, P.D. James, and Harry Mulisch. The first three were diagnosed with probable Alzheimer's Disease, confirmed post-mortem for Iris Murdoch; the same condition was hypothesised for Agatha Christie. Our analysis reveals different patterns of evolution in lexical richness, which in turn plausibly correlate with the authors' different conditions.
behavioral sciences; psychology, experimental; neurosciences