User Interaction Patterns and Breakdowns in Conversing with LLM-Powered Voice Assistants

Amama Mahmood, Junxiang Wang, Bingsheng Yao, Dakuo Wang, Chien-Ming Huang
DOI: https://doi.org/10.1016/j.ijhcs.2024.103406
2024-11-29
Abstract: Conventional Voice Assistants (VAs) rely on traditional language models to discern user intent and respond to their queries, leading to interactions that often lack a broader contextual understanding, an area in which Large Language Models (LLMs) excel. However, current LLMs are largely designed for text-based interactions, thus making it unclear how user interactions will evolve if their modality is changed to voice. In this work, we investigate whether LLMs can enrich VA interactions via an exploratory study with participants (N=20) using a ChatGPT-powered VA for three scenarios (medical self-diagnosis, creative planning, and discussion) with varied constraints, stakes, and objectivity. We observe that LLM-powered VA elicits richer interaction patterns that vary across tasks, showing its versatility. Notably, LLMs absorb the majority of VA intent recognition failures. We additionally discuss the potential of harnessing LLMs for more resilient and fluid user-VA interactions and provide design guidelines for tailoring LLMs for voice assistance.
Human-Computer Interaction
What problem does this paper attempt to address?
This paper addresses a gap in how voice assistants (VAs) are built. Current VAs rely on traditional language models, which are weak at understanding user intent and at maintaining coherent multi-turn conversations. Large language models (LLMs) excel at text generation and context understanding, but they are designed mainly for text-based interaction, so it is unclear how user interaction with an LLM-driven voice assistant will evolve when the modality changes from text to voice. Specifically, the paper explores two questions:

1. **New interaction patterns**: When users interact with LLM-driven voice assistants, do new interaction patterns emerge beyond single-turn inquiries?
2. **Reducing errors and conversation breakdowns**: Can the context-understanding ability of LLMs help reduce the errors and conversation breakdowns common in current commercial voice assistants?

To answer these questions, the researchers conducted an exploratory study in which participants (N = 20) used a ChatGPT-driven voice assistant to complete tasks in three scenarios (medical self-diagnosis, creative planning, and discussion). They observed the users' interaction patterns and any conversation breakdowns that occurred.

### Research background

Traditional voice assistants such as Alexa and Siri rely on traditional language models and mainly use rule-based keyword recognition to determine user intent. This makes it difficult for them to maintain coherent multi-turn conversations and leaves them prone to errors such as transcription errors and intent recognition errors. In contrast, LLMs can generate coherent, context-aware text and show great potential in text-centered applications such as healthcare, education, and collaborative writing. However, empirical research on how users interact with LLM-driven voice assistants is still limited.

### Research method

The researchers first integrated ChatGPT into an Alexa skill and designed a conversation framework to handle ChatGPT API latency and Alexa timeout issues. They then conducted an exploratory qualitative study in which 20 participants interacted with this ChatGPT-driven voice assistant. The tasks were medical self-diagnosis, creative travel planning, and discussion with an opinionated AI. Through thematic analysis, the researchers identified common and scenario-specific interaction patterns.

### Main contributions

1. **Interaction patterns**: Demonstrated the diverse ways people interact with LLM-driven voice assistants across scenarios, and presented the conversation recovery patterns initiated by both voice assistants and users.
2. **Opportunities and challenges**: Discussed the advantages (such as context retention, adaptability, and fewer conversation breakdowns) and limitations (such as repetitiveness, over-sharing, and mismatched mental models) of LLM-driven voice assistants.
3. **Design guidelines**: Provided design guidelines for adapting text-centered LLMs to voice interaction, such as adopting a hierarchical response structure, redesigning voice assistant prompts, and balancing advantages against challenges.

Through this study, the authors hope to provide insights for understanding and improving future LLM-driven voice assistants.
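
The summary mentions a conversation framework for handling ChatGPT API latency and Alexa timeouts but does not describe how it works. As a rough illustration of the general problem only, the sketch below shows one common way to keep a voice turn within a response deadline: wait for the model up to a budget, and if the budget expires, return a short holding message while the request keeps running in the background. Every name here (`respond`, `resume`, `slow_llm_reply`, `RESPONSE_BUDGET_S`) and the budget value are hypothetical; this is not the authors' implementation and it does not call any real Alexa or ChatGPT API.

```python
import asyncio

# Hypothetical response budget in seconds; the summary gives no figure.
RESPONSE_BUDGET_S = 0.05


async def slow_llm_reply(prompt: str, delay_s: float) -> str:
    """Stand-in for a remote LLM call (assumption: not a real API)."""
    await asyncio.sleep(delay_s)
    return f"Answer to: {prompt}"


async def respond(session: dict, prompt: str, delay_s: float) -> str:
    """Return the LLM reply if it arrives in time, else a holding message."""
    task = asyncio.ensure_future(slow_llm_reply(prompt, delay_s))
    try:
        # shield() keeps the underlying task alive if the wait times out.
        return await asyncio.wait_for(asyncio.shield(task), RESPONSE_BUDGET_S)
    except asyncio.TimeoutError:
        # Remember the in-flight request so a later turn can collect it.
        session["pending"] = task
        return "Still thinking. Say 'continue' to hear the answer."


async def resume(session: dict) -> str:
    """Collect the reply that was still in flight on the previous turn."""
    task = session.pop("pending")
    return await task


async def demo() -> tuple:
    session = {}
    fast = await respond(session, "hi", 0.0)            # arrives within budget
    slow = await respond(session, "plan a trip", 0.15)  # exceeds the budget
    later = await resume(session)                       # collected next turn
    return fast, slow, later


if __name__ == "__main__":
    for line in asyncio.run(demo()):
        print(line)
```

The design choice illustrated is to decouple the spoken turn from the model call: the user always hears something before the platform deadline, and the full answer is delivered on a follow-up turn.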