Using cognitive models to understand multimodal processes: the case for speech and gesture production
Stefan Kopp,Kirsten Bergmann
DOI: https://doi.org/10.1145/3015783.3015791
2017-04-24
Abstract: Multimodal behavior has been studied for a long time and in many fields, e.g., in psychology, linguistics, communication studies, education, and ergonomics. One of the main motivations has been to allow humans to use technical systems intuitively, in a way that resembles and fosters human users' natural way of interacting and thinking [Oviatt 2013]. This has sparked early work on multimodal human-computer interfaces, including recent approaches to recognize communicative behavior and even subtle multimodal cues by computer systems. Those approaches, for the most part, rest on machine learning techniques applied to large sets of behavioral data. As datasets grow larger in size and coverage, and computational power increases, suitable data-driven techniques are able to detect correlational behavior patterns that support answering questions like which feature(s) to take into account or how to recognize them in specific contexts. However, natural multimodal interaction in humans entails a plethora of behavioral variations and intricacies (e.g., when to act unimodally vs. multimodally, with which specific behaviors or multi-level coordination between them). Possible underlying patterns are hard to detect, even in large datasets, and often such variations are attributed to context-dependencies or individual differences. How they come about is still hard to explain at the behavioral level. One additional level of explanation that can help to deepen our understanding, and to establish systematic and generalized accounts, involves cognitive processes that lead to a particular multimodal response in a specific situation (e.g., see Chapter 2; [James et al. 2017]). A prominent example is the concept of cognitive load or "mental workload." Many behavioral variations in spoken language have been meaningfully explained in terms of heightened or lowered cognitive load of the speaker.
For example, under high cognitive load speakers are found to speak more slowly, to produce more silent or filled pauses, and to utter more repetitions [Jameson et al. 2010]. Likewise, human users distribute information across multiple modalities in order to manage their cognitive limits [Oviatt et al. 2004, Chen et al. 2012]. In studies with elementary school children as well as adults, active manual gesturing was demonstrated to improve memory during a task that required explaining math solutions [Goldin-Meadow et al. 2001]. This effect of gesturing increased with higher task difficulty. Such behavioral phenomena are commonly explained based on cognitive concepts like cognitive load and, further, underlying processes like modality-specific working memories [Baddeley 1992] or competition for cognitive resources [Wickens et al. 1983]. Cognitive theories also provide valuable hints as to how to design multimodal human-machine interaction. To continue with the above example, it has been suggested that certain multimodal interfaces help users to minimize their cognitive load and hence improve their performance. For instance, the physical activity of manual or pen-based gesturing can play a particularly important role in organizing and facilitating people's spatial information processing, which has been shown to reduce cognitive load on tasks involving geometry, maps, and similar areas [Alibali et al. 2000, Oviatt 1997]. Other research revealed that expressively powerful interfaces not only help to cope with interaction problems, but also substantially facilitate human cognition by functioning as "thinking tools" [Oviatt 2013]. It is for these reasons that "Cognitive Science has and will continue to play an essential role in guiding the design of multimodal systems" [Oviatt and Cohen 2015, p. 33]. One particularly explicit form of a cognitive account is a cognitive model. 
In general, a cognitive model is a simplified, schematic description of cognitive processes for the purposes of understanding or predicting a certain behavior. This approach is most notably pursued in the field of cognitive modeling, which developed at the intersection of cognitive psychology, cognitive science, and artificial intelligence. In these fields, cognitive models are primarily developed within generic cognitive architectures like ACT-R [Anderson et al. 2004] or SOAR [Laird 2012], to name two prominent examples, which capture generally assumed structural and functional properties of the human mind. Yet, the term cognitive model is not restricted to the use of such architectures. It can aptly be used whenever a specific notion of cognitive or mental processes is provided in computational terms that afford simulation-based examination and evaluation. This may include symbolic, hybrid, connectionist or dynamical models [cf. Polk and Seifert 2002], and it has been proposed for single cognitive tasks (e.g., memorizing items), for the interaction of two or more processes (e.g., visual search and language comprehension), or for making specific behavioral predictions (e.g., driving under the influence of alcohol). In this chapter, we discuss how computational cognitive models can be useful for the field of multimodal and multisensory interaction. A number of arguments and prospects can readily be identified. From a basic research point of view, a cognitive model can represent a deeper level of understanding in terms of processes and mechanisms that underlie a certain behavior. Such an explanation will have to be hypothetical to some extent, but it potentially bears great scientific value for a number of reasons. First, it is more specific and detailed than most psychological theories (as discussed in Section 6.3).
Second, it is predictive rather than purely descriptive, hence affording rigorous evaluation and falsification, e.g., by deriving quantitative predictions in computational simulations and assessing those against empirical data. Finally, cognitive models can provide a common level of description that enables relating and combining insights from different fields of research. For example, they may draw on general findings from working memory or attention to address the question of multimodal or multisensory interaction. From an engineering point of view, cognitive models can help to build better multimodal systems. This holds especially true for cases where data is lacking and inspiration needs to be drawn from theoretical concepts. A computational cognitive model that has proven itself useful in evaluation can provide principles and criteria for the development of algorithms and systems. First, it may be employed in computational simulations to actively explore distributions of behavioral variations, in order to produce additional training data or help confine a domain to its relevant contingencies, thresholds, or functions. Second, a cognitive model should provide an informed model of the human user and thus can guide the design of multimodal interaction systems. For instance, recognizing and interpreting the user's cognitive processes provides a basis for more adequate system adaptation (a capability that is becoming increasingly important, cf. Section 6.3). Simplistic assumptions about the user are not realistic here. Cognitive models can be used to formulate and test more detailed and substantiated notions of human processing to improve the design of system algorithms.
Likewise, a cognitive model may support identifying situations in which multimodal interaction can enable users to achieve their goals more effectively (e.g., with fewer errors or less cognitive load), where consequences of multimodality for cognitive processing can be more or less directly observed in the simulations. Clearly, many of the listed potential benefits of computational cognitive models verge on or even go beyond the current state of established knowledge. Thus, we will focus our discussion on one case of natural multimodal behavior that has been extensively researched in this regard: the use of spontaneous speech and gesture in dialogue. We will start by reviewing speech and gesture as a pervasive case of natural multimodal behavior and motivate its relevance for practical multimodal interfaces, virtual characters, or social robotics (Section 6.2). Then we will discuss existing theoretical and computational models of the cognitive underpinnings (Section 6.3), and we will elaborate on one particular cognitive model of speech-gesture production that explicates the role of mental representation and memory processes to a degree that affords computational simulation under varying conditions (Section 6.4). Using this model, we demonstrate in Section 6.5 how cognitive modeling can be used to gain a better understanding of multimodal production processes and to inform the design of multimodal interactive systems. As an aid to comprehension, readers are referred to this chapter's Focus Questions and to the Glossary for a definition of terminology.