Abstract: Voice assistants (VAs) like Siri and Alexa are transforming human-computer interaction; however, they lack awareness of users' spatiotemporal context, resulting in limited performance and unnatural dialogue. We introduce GazePointAR, a fully-functional context-aware VA for wearable augmented reality that leverages eye gaze, pointing gestures, and conversation history to disambiguate speech queries. With GazePointAR, users can ask "what's over there?" or "how do I solve this math problem?" simply by looking and/or pointing. We evaluated GazePointAR in a three-part lab study (N=12): (1) comparing GazePointAR to two commercial systems; (2) examining GazePointAR's pronoun disambiguation across three tasks; (3) and an open-ended phase where participants could suggest and try their own context-sensitive queries. Participants appreciated the naturalness and human-like nature of pronoun-driven queries, although sometimes pronoun use was counter-intuitive. We then iterated on GazePointAR and conducted a first-person diary study examining how GazePointAR performs in-the-wild. We conclude by enumerating limitations and design considerations for future context-aware VAs.

What problem does this paper attempt to address?

This paper addresses two shortcomings of voice assistants (VAs) in real-world use: unnatural dialogue and limited performance, both stemming from a lack of awareness of the user's spatiotemporal context. It proposes GazePointAR, a context-aware multimodal voice assistant for wearable augmented reality that uses eye gaze, pointing gestures, and conversation history to resolve ambiguous pronouns in spoken queries. The goal is to let users interact with the assistant more naturally. For example, a user can look or point at something and ask, "What is this?" or "How do I solve this math problem?" The system infers what the pronoun refers to, rewrites the ambiguous query as an explicit one, and sends it to a large language model for processing. The answer is then read aloud using text-to-speech.

To evaluate GazePointAR, the authors conducted a three-part lab study (N=12):

1. **Comparative Study**: GazePointAR was compared with two commercial voice assistant systems.
2. **Pronoun Disambiguation Study**: GazePointAR's ability to disambiguate pronouns was examined across three tasks.
3. **Open Query Phase**: Participants suggested and tried their own context-sensitive queries, which helped reveal user needs and system limitations.

The authors then iterated on GazePointAR and ran a first-person diary study to examine how it performs in the wild. The paper concludes by listing limitations and design considerations for future context-aware voice assistants. Overall, the work contributes a fully functional prototype together with insights on how to improve such assistants.
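The query-rewriting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Context` fields, the pronoun list, and the priority order (pointing, then gaze, then conversation history) are all assumptions made for illustration, and the object labels are assumed to come from some upstream vision component that is not shown.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical set of pronouns to resolve; the summary does not list
# the ones GazePointAR actually handles.
_PRONOUN_RE = re.compile(r"\b(this|that|these|those|it)\b", re.IGNORECASE)


@dataclass
class Context:
    """Illustrative container for what the user is attending to."""
    gaze_label: Optional[str] = None    # label of the object being looked at
    point_label: Optional[str] = None   # label of the object being pointed at
    history: List[str] = field(default_factory=list)  # earlier referents


def pick_referent(ctx: Context) -> Optional[str]:
    """Choose a referent. The priority order here is an assumption."""
    if ctx.point_label:
        return ctx.point_label
    if ctx.gaze_label:
        return ctx.gaze_label
    if ctx.history:
        return ctx.history[-1]  # most recently mentioned referent
    return None


def rewrite_query(query: str, ctx: Context) -> str:
    """Replace the first pronoun in the query with an explicit referent.

    Returns the query unchanged when no referent is available.
    """
    referent = pick_referent(ctx)
    if referent is None:
        return query
    # A function replacement avoids backslash escapes in the label
    # being interpreted by re.sub.
    return _PRONOUN_RE.sub(lambda m: f"the {referent}", query, count=1)


if __name__ == "__main__":
    ctx = Context(gaze_label="red coffee mug")
    print(rewrite_query("What is this?", ctx))
```

The rewritten string is what would be passed on to the language model in place of the original query. Plain substitution is deliberately naive: it mishandles pronouns followed by a noun ("this math problem") and uses of "that" as a conjunction, so a real system would need a more careful strategy.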

GazePointAR: A Context-Aware Multimodal Voice Assistant for Pronoun Disambiguation in Wearable Augmented Reality
