BookWorm: A Dataset for Character Description and Analysis

Argyrios Papoudakis, Mirella Lapata, Frank Keller
2024-10-14
Abstract: Characters are at the heart of every story, driving the plot and engaging readers. In this study, we explore the understanding of characters in full-length books, which contain complex narratives and numerous interacting characters. We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation, including character development, personality, and social context. We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses. Using this dataset, we evaluate state-of-the-art long-context models in zero-shot and fine-tuning settings, utilizing both retrieval-based and hierarchical processing for book-length inputs. Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks. Additionally, fine-tuned models using coreference-based retrieval produce the most factual descriptions, as measured by fact- and entailment-based metrics. We hope our dataset, experiments, and analysis will inspire further research in character-based narrative understanding.
Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
This paper addresses the challenges of understanding and analyzing characters in long novels. Specifically, the authors focus on the following aspects:

1. **Complex relationships and interactions**: Long novels typically contain a large number of characters whose relationships and interactions are crucial for understanding the story.
2. **Dynamically changing characters**: Unlike those in short stories, characters in long novels usually develop over time, with their personalities, motivations, and relationships evolving as the plot progresses.
3. **Technical challenges**: Long novels exceed the input length that many current Transformer-based architectures can handle, so the full text cannot simply be passed to the model.

To address these problems, the authors propose two tasks:

- **Character Description**: Generate a concise factual character profile, including the character's actions, relationships, and attributes.
- **Character Analysis**: Provide an in-depth interpretation of the character, including the character's development, personality, motivations, and social context.

Additionally, the authors introduce a new dataset, **BookWorm**, which pairs books from the Gutenberg Project with human-written character descriptions and analyses, and use it to evaluate the performance of existing models on these two tasks. Through this dataset, the authors hope to inspire more research on character-based narrative understanding.
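
To make the input-length challenge concrete, the sketch below shows one simple way a retrieval step could shrink a book to the passages that mention a given character before they are passed to a model. It is not the authors' pipeline: the paper uses coreference-based retrieval, while this illustration substitutes plain alias matching. The function names, the alias list, and the word budget are all hypothetical.

```python
import re


def split_into_chunks(text, max_words=50):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def retrieve_character_chunks(chunks, aliases, budget_words=100):
    """Keep, in book order, chunks that mention any alias, up to a word budget.

    Assumption: alias matching stands in for a real coreference resolver,
    so mentions via pronouns ("she", "he") are missed here.
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in aliases) + r")\b",
        re.IGNORECASE,
    )
    selected, used = [], 0
    for chunk in chunks:
        if not pattern.search(chunk):
            continue
        n_words = len(chunk.split())
        if used + n_words > budget_words:
            break  # stop once the model's (hypothetical) budget is full
        selected.append(chunk)
        used += n_words
    return selected


# Toy "book" used only for illustration.
book = (
    "Elizabeth walked to Netherfield to see her sister. "
    "The weather was poor and the roads were muddy. "
    "Mr. Darcy watched Lizzy arrive with some surprise."
)
chunks = split_into_chunks(book, max_words=9)
context = retrieve_character_chunks(
    chunks, aliases=["Elizabeth", "Lizzy"], budget_words=40
)
print("\n".join(context))
```

The selected chunks would then be concatenated into the model input. A hierarchical approach would instead summarize every chunk and merge the summaries, which is the alternative the abstract reports as weaker on both tasks.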