GazePointAR: A Context-Aware Multimodal Voice Assistant for Pronoun Disambiguation in Wearable Augmented Reality

Jaewook Lee, Jun Wang, Elizabeth Brown, Liam Chu, Sebastian S. Rodriguez, Jon E. Froehlich
2024-04-12
Abstract: Voice assistants (VAs) like Siri and Alexa are transforming human-computer interaction; however, they lack awareness of users' spatiotemporal context, resulting in limited performance and unnatural dialogue. We introduce GazePointAR, a fully-functional context-aware VA for wearable augmented reality that leverages eye gaze, pointing gestures, and conversation history to disambiguate speech queries. With GazePointAR, users can ask "what's over there?" or "how do I solve this math problem?" simply by looking and/or pointing. We evaluated GazePointAR in a three-part lab study (N=12): (1) comparing GazePointAR to two commercial systems; (2) examining GazePointAR's pronoun disambiguation across three tasks; and (3) an open-ended phase where participants could suggest and try their own context-sensitive queries. Participants appreciated the naturalness and human-like nature of pronoun-driven queries, although sometimes pronoun use was counter-intuitive. We then iterated on GazePointAR and conducted a first-person diary study examining how GazePointAR performs in-the-wild. We conclude by enumerating limitations and design considerations for future context-aware VAs.
Human-Computer Interaction
What problem does this paper attempt to address?
This paper addresses two weaknesses of voice assistants (VAs) in real-world use: unnatural dialogue and limited performance, both caused by the assistant's lack of awareness of the user's spatiotemporal context. The authors propose GazePointAR, a context-aware multimodal voice assistant for wearable augmented reality devices that uses eye gaze, pointing gestures, and conversation history to resolve ambiguous pronouns in spoken queries. The goal is to let users talk to the assistant more naturally.

For example, a user can look at or point to something and ask, "What is this?" or "How do I solve this math problem?" The system infers what the pronoun refers to, rewrites the ambiguous query as an explicit one, and sends it to a large language model. The answer is then read aloud using text-to-speech.

To evaluate GazePointAR, the authors conducted a three-part laboratory study (N=12):

1. **Comparative Study**: GazePointAR was compared with two commercial voice assistant systems.
2. **Pronoun Disambiguation Study**: GazePointAR's ability to disambiguate pronouns was examined across three tasks.
3. **Open Query Phase**: Participants suggested and tried their own context-sensitive queries, which helped reveal user needs and system limitations.

Participants appreciated how natural and human-like pronoun-driven queries felt, although pronoun use was sometimes counter-intuitive. The authors then iterated on GazePointAR and ran a first-person diary study of how it performs in the wild. The paper closes by listing limitations and design considerations for future context-aware voice assistants.
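The query-rewriting step described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the pronoun list, the function names `pick_referent` and `resolve_query`, and the gaze-then-pointing-then-history priority order are all assumptions made for illustration, and the referent descriptions are assumed to arrive as plain strings from some upstream vision component.

```python
import re
from typing import List, Optional

# Hypothetical pronoun list; the paper's actual set is not stated here.
PRONOUNS = ("this", "that", "these", "those", "it")
_PRONOUN_RE = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def pick_referent(
    gaze: Optional[str],
    pointing: Optional[str],
    history: List[str],
) -> Optional[str]:
    """Choose a referent description from the available context.

    Assumed priority: gaze target, then pointing target, then the most
    recent item in the conversation history.
    """
    for candidate in (gaze, pointing):
        if candidate:
            return candidate
    return history[-1] if history else None


def resolve_query(query: str, referent: Optional[str]) -> str:
    """Replace the first ambiguous pronoun with the referent description."""
    if referent is None:
        return query  # nothing to substitute; leave the query unchanged
    # A function replacement avoids backslash-escape handling in `referent`.
    return _PRONOUN_RE.sub(lambda _match: referent, query, count=1)


if __name__ == "__main__":
    referent = pick_referent(gaze="a red coffee mug", pointing=None, history=[])
    print(resolve_query("What is this?", referent))
```

In this sketch the rewritten string is what would be passed on to the language model. Plain substitution only works when the pronoun stands alone; a query such as "how do I solve this math problem?" uses "this" as a determiner and would need a different rewriting strategy.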