Multimodal Communication in LFG: Gestures and the Correspondence Architecture

Gianluca Giorgolo, A. Asudeh
Abstract: In this paper we describe how to integrate paralinguistic modes of communication in the LFG framework. In particular, we investigate how the communicative contribution of spontaneous hand gestures can be framed in the Correspondence Architecture, and how the same architecture provides substantial advantages over other approaches to interpreting hand gestures. That verbal language and gestures co-operate in the exchange of information in a communicative setting is now well accepted, thanks to a number of descriptive studies [Kendon, 2004] and psychological and neurological studies [Willems and Hagoort, 2007]. However, formal attempts to specify the details of the process of crossmodal integration are still rare. We believe that LFG provides the right formal tools to describe the process that results in the integrated interpretation of verbal language and gestures. Our starting point will be the theory of semantic interaction between verbal language and hand gestures presented in [Giorgolo, 2010]. The theory is restricted to representational gestures (so-called iconic gestures) and is based on two principles: intersectivity and iconicity. According to Giorgolo, the different forms of interaction between the two modalities can be reconstructed in terms of these two primitives. Intersectivity acts as the compositional ‘glue’ that integrates the gestural meaning in the verbal assertion, and is implemented in the theory as a generalized meet operation. Iconicity represents the ‘real’ meaning of the gesture and takes the shape of an equivalence relation described by the virtual space depicted by the hands. In this way, Giorgolo associates with a gesture, as its primary meaning, a function that acts as a filter. This filter adds constraints on the interpretation of the referents that populate the frame of reference of the discourse.
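The two principles can be illustrated with a minimal sketch. It assumes that meanings are modeled as truth values or curried Python functions, that the generalized meet is conjunction on truth values and pointwise on functions, and that iconicity is approximated by a toy equivalence relation on spatial descriptions. The names `meet`, `iconic_filter`, `same_shape`, and the dictionary encoding of referents are hypothetical and are not taken from [Giorgolo, 2010].

```python
def meet(a, b):
    """Generalized meet (sketch): conjunction on truth values,
    defined pointwise on functions of the same arity."""
    if isinstance(a, bool) and isinstance(b, bool):
        return a and b
    if callable(a) and callable(b):
        return lambda x: meet(a(x), b(x))
    raise TypeError("meet is only defined for two values of the same type")


def iconic_filter(virtual_space, equivalent):
    """Gesture meaning as a filter: true of a referent whose spatial
    description is equivalent to the space depicted by the hands."""
    return lambda referent: equivalent(referent["space"], virtual_space)


def same_shape(space_a, space_b):
    """Toy equivalence relation (assumption): identity of shape labels."""
    return space_a["shape"] == space_b["shape"]


def tower(x):
    """Toy verbal predicate (assumption)."""
    return x["kind"] == "tower"


# The gesture depicts a vertically oriented prism.
gesture = iconic_filter({"shape": "vertical prism"}, same_shape)

# Intersectivity: the gesture meaning is met with the verbal predicate.
tower_and_gesture = meet(tower, gesture)

prism_tower = {"kind": "tower", "space": {"shape": "vertical prism"}}
round_tower = {"kind": "tower", "space": {"shape": "cylinder"}}

print(tower_and_gesture(prism_tower))  # the filter accepts this referent
print(tower_and_gesture(round_tower))  # the filter rejects this referent
```

In this sketch the gesture never introduces a referent of its own; it only narrows the set of referents that satisfy the verbal predicate, which is the sense in which it acts as a filter.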
Giorgolo starts from the assumption that gesture meaning is an integral part of communication and is best modeled as interacting with the compositional semantic structure of the verbal fragment it accompanies. We will also keep this level as the central one for modeling the interaction between the two modalities, and will therefore concentrate on the Glue logic portion of the Correspondence Architecture. However, we will also argue for the importance of connecting the gestural component with the functional structure of the sentence it accompanies. We will demonstrate how taking this additional level of information into account provides a better explanation for some phenomena that depend not only on the semantic content of verbal language, but also on the way this content is expressed in its surface realization. We will also demonstrate that employing a more abstract data structure like f-structure in relating gesture interpretation to verbal interpretation has advantages over other approaches that encode the necessary information in a data structure closer to its surface realization. To integrate the gestural contribution, we switch from a verbal utterance perspective to one in which the two modes are aligned, which we will call a “multimodal utterance.” The architecture is shown in Figure 1, which leaves aside certain aspects of the Correspondence Architecture; see [Asudeh, 2011] for further discussion of this ‘pipeline’ version [Bögel et al., 2009] of the Correspondence Architecture. We leave the verbal component of the architecture untouched, by mapping from the multimodal utterance to the phonological string with the projection function υ. We assume that the multimodal utterance projects a gestural structure via a projection function γ. A gestural structure is a feature structure describing the physical appearance of the gesture (typical features include hand shape, trajectory, orientation, and so on).
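As a rough illustration of the two projections out of the multimodal utterance, the following sketch represents an utterance as time-stamped words plus time-stamped gestures. The class names, the functions `upsilon` and `gamma`, the timings, and the feature values are all assumptions made for the example; only the feature names (hand shape, trajectory, orientation) come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class TimedWord:
    form: str
    start: float  # onset in seconds (hypothetical timing)
    end: float    # offset in seconds


@dataclass
class Gesture:
    start: float
    end: float
    # Feature structure describing the physical appearance of the gesture.
    features: dict = field(default_factory=dict)


@dataclass
class MultimodalUtterance:
    words: list
    gestures: list


def upsilon(utterance):
    """υ (sketch): project the phonological string, ignoring gestures,
    so the verbal side of the architecture is left untouched."""
    return [w.form for w in utterance.words]


def gamma(utterance):
    """γ (sketch): project one gestural structure per gesture."""
    return [dict(g.features) for g in utterance.gestures]


utterance = MultimodalUtterance(
    words=[TimedWord("zwei", 0.9, 1.2), TimedWord("Türme", 1.2, 1.8)],
    gestures=[
        Gesture(
            1.0,
            1.7,
            {
                "hand_shape": "flat",       # illustrative value
                "trajectory": "upward",     # illustrative value
                "orientation": "vertical",  # illustrative value
            },
        )
    ],
)

print(upsilon(utterance))
print(gamma(utterance))
```

Keeping the two projections as separate functions over one shared object mirrors the idea that the verbal pipeline does not need to change when the gestural modality is added.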
The gestural structure is then mapped, via the κ projection function, to a time structure, which is a time-indexed power set of the substrings in the phonological string. The time structure is populated by a function τ from the phonological string. Time structure then feeds c-structure, such that c-structure nodes are time-indexed, with the gesture added to a suitable node determined by the temporal indices. We follow [Alahverdzhieva and Lascarides, 2010] in not imposing any restriction on the category of the node to which a gesture can attach. In line with Alahverdzhieva and Lascarides and with Giorgolo, we assume that prosodic considerations guide the precise alignment between the two modalities. This is consistent with the approach of [Bögel et al., 2009]. Following Giorgolo, we assume that there are multiple compatible attachment points for the gesture, all producing the same interpretation. The gestural structure, combined in this way with the c-structure, contributes to the f-structure as a co-head of the projection of the node that directly dominates the gesture. The “syntactic” behavior of gestures can then be captured by the following rule: $${}_{i}X_{j} \;\rightarrow\; \begin{array}{cc} {}_{h}G_{k} & {}_{i}X_{j} \\ \uparrow = \downarrow & \uparrow = \downarrow \end{array}$$ where the time intervals [i, j] and [h, k] overlap, G is the category of gestural nodes, and X is a metavariable for syntactic categories. Finally, we can take the ω projection to perform the mapping from the bundle of kinetic, physical features to what Giorgolo calls the virtual space, an abstract model-theoretic object that forms the core of the gesture interpretation. To demonstrate the advantages offered by the projection architecture in modeling the integrated interpretation of gesture and speech, we will here briefly re-analyze an example presented in [Giorgolo, 2010], which is extracted from the Speech and Gesture Alignment Corpus [Lücking et al., 2010], a corpus of spontaneous conversations annotated for gestural information.
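The temporal side of the attachment rule can be sketched as follows. The sketch simplifies the time structure to a map from contiguous substrings to closed time intervals and treats every such substring as a candidate attachment site; an actual grammar would only consider c-structure nodes, and prosody would narrow the choice further. The function names, the example sentence, and all timings are hypothetical.

```python
def overlaps(i, j, h, k):
    """Closed intervals [i, j] and [h, k] overlap iff each one starts
    no later than the other one ends."""
    return i <= k and h <= j


def time_structure(words):
    """τ (simplified sketch): index every contiguous substring of the
    phonological string by its time interval.
    `words` is a list of (form, start, end) triples."""
    spans = {}
    for a in range(len(words)):
        for b in range(a, len(words)):
            forms = tuple(form for form, _, _ in words[a:b + 1])
            spans[forms] = (words[a][1], words[b][2])
    return spans


def attachment_sites(spans, gesture_interval):
    """Return every span whose interval overlaps the gesture interval,
    i.e. every X for which the rule's overlap condition holds."""
    h, k = gesture_interval
    return [forms for forms, (i, j) in spans.items() if overlaps(i, j, h, k)]


# Hypothetical timings (in seconds) for a hypothetical sentence.
words = [
    ("die", 0.0, 0.2),
    ("Kirche", 0.2, 0.7),
    ("hat", 0.7, 0.9),
    ("zwei", 0.9, 1.2),
    ("Türme", 1.2, 1.8),
]
gesture_interval = (1.0, 1.7)

sites = attachment_sites(time_structure(words), gesture_interval)
for forms in sites:
    print(" ".join(forms))
```

Several spans satisfy the overlap condition at once, which matches the assumption that a gesture has multiple compatible attachment points.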
In the example under discussion, the speaker describes a church with two towers and accompanies the utterance of the DP “zwei Türme” (“two towers”) with a gesture depicting some spatial information about the towers, namely that they are shaped like vertically oriented prisms and that the two towers are disconnected. This simple example presents some challenges to the intersective interpretation of gestures we assume. In this case we have, in fact, a mismatch between the number of regions the gesture characterizes and the number of entities the denotation of the noun presupposes. This mismatch prevents the use of the generalized meet operation that creates the joint