Abstract: Starting from the famous "Put That There!" demonstration prototype, developed by the Architecture Machine Group at MIT in the late 1970s, the growing potential of multimodal gesture interfaces in natural human-machine communication setups has stimulated people's imagination and motivated significant research efforts in the fields of computer vision, speech recognition, multimodal sensing, fusion, and human-computer interaction (HCI). In the words of Bolt [1980, p. 1]: "Because voice can be augmented with simultaneous pointing, the free usage of pronouns becomes possible, with a corresponding gain in naturalness and economy of expression. Conversely, gesture aided by voice gains precision in its power to reference." Multimodal gesture recognition lies at the heart of such interfaces. As also defined in the Glossary, the term refers to the complex computational task comprising three main modules: (a) tracking of human movements, primarily of the hands and arms, and recognition of characteristic motion patterns; (b) detection of accompanying speech activity and recognition of what is spoken; and (c) combination of the available audio-visual information streams to identify the multimodally communicated message. To successfully perform such tasks, the original "Put That There!" system of Bolt [1980] imposed certain limitations on the interaction. Specifically, it required that the user be tethered by wearing a position-sensing device on the wrist to capture gesturing and a headset microphone to record speech, and it allowed multimodal manipulation via speech and gestures of only a small set of shapes on a rather large screen (see also Figure 11.1). Since then, however, research efforts in the field of multimodal gesture recognition have moved beyond such limited scenarios, capturing and processing the multimodal data streams by employing distant audio and visual sensors that are unobtrusive to humans.
In particular, in recent years, the introduction of affordable and compact multimodal sensors like the Microsoft Kinect has enabled robust capturing of human activity. This is due to the wealth of raw and metadata streams that the device provides beyond traditional planar RGB video, such as depth scene information, multiple audio channels, and human skeleton and facial tracking, among others [Kinect 2016]. Such advancements have led to intensified efforts to integrate multimodal gesture interfaces in real-life applications. Indeed, the field of multimodal gesture recognition has been attracting increasing interest, driven by novel HCI paradigms on a continuously expanding range of devices equipped with multimodal sensors and ever-increasing computational power, for example smartphones and smart television sets. Nevertheless, the capabilities of modern multimodal gesture systems remain limited. In particular, the set of gestures accounted for in typical setups is mostly constrained to pointing gestures, a number of emblematic ones like an open palm, and gestures corresponding to some sort of interaction with a physical object, e.g., pinching for zooming. At the same time, fusion with speech remains in most cases just an experimental feature. When compared to the abundance and variety of gestures and their interaction with speech in natural human communication, the corresponding HCI research and development clearly still has a long way to go [Kopp 2013]. Multimodal gesture recognition constitutes a wide multi-disciplinary field. This chapter aims to provide a comprehensive overview of it, in both theoretical and application terms. More specifically, basic concepts related to gesturing, the multifaceted interplay of gestures and speech, and the importance of gestures in HCI are discussed in Section 11.2.
An overview of the current trends in the field of multimodal gesture recognition is provided in Section 11.3, separately focusing on gestures, speech, and multimodal fusion. Furthermore, a state-of-the-art recognition setup developed by the authors is described in detail in Section 11.4, to facilitate a better understanding of the practical considerations involved in such a system. The future of multimodal gesture recognition and related challenges are discussed in Section 11.5. Finally, a set of Focus Questions is provided to aid comprehension of the material.
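
As an informal illustration of module (c) above, and of Bolt's remark that simultaneous pointing makes free use of pronouns possible, the following minimal sketch binds deictic words in a time-stamped speech transcript to the pointing gesture closest in time. It is not the method of Bolt [1980] or of the chapter's authors: the `Word` and `Pointing` types, the `resolve` function, the deictic word list, and the 0.5-second `max_gap` threshold are all hypothetical choices made only for this example, and the outputs of modules (a) and (b) are assumed to be given.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    time: float  # seconds; assumed output of a speech recognizer


@dataclass
class Pointing:
    target: str  # hypothetical label of the object or location pointed at
    time: float  # seconds; assumed output of a gesture tracker


# Illustrative (not exhaustive) set of words that need a gesture to be resolved.
DEICTIC = {"this", "that", "here", "there"}


def resolve(words: List[Word], pointings: List[Pointing],
            max_gap: float = 0.5) -> List[str]:
    """Replace each deictic word by the target of the nearest-in-time pointing.

    A deictic word is left unchanged when no pointing gesture occurs within
    max_gap seconds of it (max_gap is an assumed, untuned threshold).
    """
    resolved = []
    for w in words:
        if w.text.lower() in DEICTIC:
            nearest = min(pointings, key=lambda p: abs(p.time - w.time),
                          default=None)
            if nearest is not None and abs(nearest.time - w.time) <= max_gap:
                resolved.append(f"<{nearest.target}>")
                continue
        resolved.append(w.text)
    return resolved


# Toy input: the utterance "put that there" with two pointing gestures.
words = [Word("put", 0.2), Word("that", 0.6), Word("there", 1.4)]
pointings = [Pointing("blue square", 0.7), Pointing("top-left corner", 1.5)]
print(resolve(words, pointings))
# prints ['put', '<blue square>', '<top-left corner>']
```

Pairing by temporal proximity is only the simplest possible fusion rule; it ignores recognition confidence and the fact that gestures and the words they accompany are not always exactly synchronous.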
