Abstract: While quantitative methods have been used to examine changes in word usage in books, studies have focused on overall trends, such as the shapes of narratives, which are independent of book length. We instead look at how words change over the course of a book as a function of the number of words, rather than the fraction of the book, completed at any given point; we define this measure as "cumulative word-time". Using ousiometrics, a reinterpretation of the valence-arousal-dominance framework of meaning obtained from semantic differentials, we convert text into time series of power and danger scores in cumulative word-time. Each time series is then decomposed using empirical mode decomposition into a sum of constituent oscillatory modes and a non-oscillatory trend. By comparing the decomposition of the original power and danger time series with those derived from shuffled text, we find that shorter books exhibit only a general trend, while longer books have fluctuations in addition to the general trend. These fluctuations typically have a period of a few thousand words regardless of the book length or library classification code, but vary depending on the content and structure of the book. Our findings suggest that, in the ousiometric sense, longer books are not expanded versions of shorter books, but are more similar in structure to a concatenation of shorter texts. Further, they are consistent with editorial practices that require longer texts to be broken down into sections, such as chapters. Our method also provides a data-driven denoising approach that works for texts of various lengths, in contrast to the more traditional approach of using large window sizes that may inadvertently smooth out relevant information, especially for shorter texts. These results open up avenues for future work in computational literary analysis, particularly the measurement of a basic unit of narrative.
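
The two ideas in the abstract that are easiest to illustrate are indexing scores by cumulative word-time and comparing the original text against shuffled copies. The sketch below is a minimal illustration under stated assumptions, not the authors' pipeline: the lexicon `TOY_LEXICON` and its values are made up, the helper names are hypothetical, and the spread of non-overlapping window means is a deliberately simple stand-in for empirical mode decomposition, which is not implemented here.

```python
import random
from statistics import mean, pstdev

# Hypothetical toy lexicon: word -> (power, danger). These values are invented
# for illustration; real ousiometric scores come from a published lexicon.
TOY_LEXICON = {
    "king": (0.6, 0.1),
    "storm": (0.2, 0.7),
    "garden": (0.1, -0.5),
    "sword": (0.5, 0.6),
    "sleep": (-0.4, -0.6),
}

def score_series(words, lexicon, axis):
    """Return (word_time, score) pairs for words found in the lexicon.

    word_time is the 1-based count of words read so far (cumulative
    word-time), not the fraction of the text completed.
    """
    return [(t, lexicon[w][axis])
            for t, w in enumerate(words, start=1) if w in lexicon]

def window_means(series, window):
    """Average scores over consecutive, non-overlapping blocks of `window` words."""
    blocks = {}
    for t, s in series:
        blocks.setdefault((t - 1) // window, []).append(s)
    return [mean(scores) for _, scores in sorted(blocks.items())]

def spread(words, lexicon, axis, window):
    """Standard deviation of the window means: a crude measure of fluctuation."""
    return pstdev(window_means(score_series(words, lexicon, axis), window))

def shuffled_spreads(words, lexicon, axis, window, n_shuffles=200, seed=0):
    """Spread of window means for randomly shuffled copies of the text."""
    rng = random.Random(seed)
    result = []
    for _ in range(n_shuffles):
        perm = list(words)
        rng.shuffle(perm)
        result.append(spread(perm, lexicon, axis, window))
    return result

# A toy "book": a calm first half followed by a dangerous second half.
book = ["garden", "sleep"] * 50 + ["storm", "sword"] * 50
DANGER = 1  # index of the danger score in each lexicon tuple

observed = spread(book, TOY_LEXICON, DANGER, window=20)
baseline = shuffled_spreads(book, TOY_LEXICON, DANGER, window=20)
print(f"observed spread: {observed:.3f}")
print(f"largest shuffled spread: {max(baseline):.3f}")
```

In this toy text the ordered version has a much larger spread than any shuffled copy, because shuffling destroys the ordering that produces the shift in danger scores while leaving the word frequencies unchanged. That is the sense in which a shuffled text serves as a baseline for structure that depends on word order.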

Two halves of a meaningful text are statistically different

Relating Zipf's law to textual information

Entropy in Different Text Types

A decomposition of book structure through ousiometric fluctuations in cumulative word-time

Quantifying the Dissimilarity of Texts

Comparison study of using semantic and syntactic network characteristics to do text clustering

Thematic Concentration As a Discriminating Feature of Text Types

CompText: Visualizing, Comparing & Understanding Text Corpus

Comparative Computational Analysis of Global Structure in Canonical, Non-Canonical and Non-Literary Texts

A Comparative Analysis of Temporal Long Text Similarity: Application to Financial Documents

Differential Analysis of Stylistic Features in Chinese-English Interpretation Based on Natural Language Processing

Study on the Differences in the Language Styles of Dream of Red Mansions Based on the Statistics of Lexical and Syntactic Features

Are Daojing and Dejing Stylistically Independent of Each Other: A Stylometric Analysis with Activity and Descriptivity

A World of Difference: Divergent Word Interpretations among People

Mastering the Measurement of Text's Frequency Structure: an Investigation on Lambda's Reliability

Document Similarity for Texts of Varying Lengths via Hidden Topics

Content Significance Distribution of Sub-Text Blocks in Articles and Its Application to Article-Organization Assessment

Multi-Level Difference Analysis of Written Discourse Based on Word Embedding

Finding Semantic Equivalence of Text Using Random Index Vectors

A Novel Discrimination Structure for Assessing Text Semantic Similarity

Probing the topological properties of complex networks modeling short written texts