Abstract:
Purpose: As humans convey information about emotions through speech signals, emotion recognition via auditory information is often employed to assess a person's affective state. There are numerous ways of applying knowledge of emotional vocal expressions to system designs that accommodate users' needs adequately. Yet little is known about how people with visual disabilities infer emotions from speech stimuli, especially via online platforms (e.g., Zoom). This study examined how strongly or weakly they perceive emotions (i.e., perceived intensity) and the degree to which their sociodemographic backgrounds affect the intensity levels they perceive when exposed to a set of emotional speech stimuli via Zoom.
Materials and methods: A convenience sample of 30 individuals with visual disabilities participated in Zoom interviews. Participants were given a set of emotional speech stimuli and reported the intensity level of the perceived emotions on a rating scale from 1 (weak) to 8 (strong).
Results: When the participants were exposed to the emotional speech stimuli (calm, happy, fearful, sad, and neutral), they reported perceiving neutral with the greatest intensity. Individual differences were also observed in the perceived intensity of emotions, associated with sociodemographic backgrounds such as health, vision, job, and age.
Conclusions: The results of this study are anticipated to contribute to the fundamental knowledge that will be helpful for many stakeholders, such as voice technology engineers, user experience designers, health professionals, and social workers providing support to people with visual disabilities.
IMPLICATIONS FOR REHABILITATION
Technologies equipped with alternative user interfaces (e.g., Siri, Alexa, and Google Voice Assistant) that meet the needs of people with visual disabilities can promote independent living and quality of life.
Such technologies can also be equipped with systems that recognize emotions from users' voices, so that users can obtain services customized to fit their emotional needs or to adequately address their emotional challenges (e.g., early detection of onset, provision of advice, and so on).
The results of this study can be beneficial to health professionals (e.g., social workers) who work closely with clients who have visual disabilities (e.g., in virtual telehealth sessions), as they could gain insights into how to recognize and understand the clients' emotional struggles by hearing their voices, which contributes to enhancing their emotional intelligence. Thus, they can provide better services to their clients, building a strong bond and trust between health professionals and clients with visual disabilities even when they meet virtually (e.g., via Zoom).

Visual Facial Enhancements Can Significantly Improve Speech Perception in the Presence of Noise

The effects of temporal cues, point-light displays, and faces on speech identification and listening effort

EXPRESS: Prior multisensory learning can facilitate auditory-only voice-identity and speech recognition in noise

The Effect on Speech-in-Noise Perception of Real Faces and Synthetic Faces Generated with either Deep Neural Networks or the Facial Action Coding System

Speech-Driven Facial Animations Improve Speech-in-Noise Comprehension of Humans

Decreasing hearing ability does not lead to improved visual speech extraction as revealed in a neural speech tracking paradigm

Vision Perceptually Restores Auditory Spectral Dynamics in Speech

Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis

The impact of face coverings on audio-visual contributions to communication with conversational speech

Concurrent talking in immersive virtual reality: on the dominance of visual speech cues

Spatial alignment between faces and voices improves selective attention to audio-visual speech

Multisensory benefits for speech recognition in noisy environments

Validating a Method to Assess Lipreading, Audiovisual Gain, and Integration During Speech Reception With Cochlear-Implanted and Normal-Hearing Subjects Using a Talking Head

Vision-referential speech enhancement of an audio signal using mask information captured as visual data

A Wearable Vision-To-Audio Sensory Substitution Device for Blind Assistance and the Correlated Neural Substrates

Effect of acoustic scene complexity and visual scene representation on auditory perception in virtual audio-visual environments

Synthetic faces generated with the facial action coding system or deep neural networks improve speech-in-noise perception, but not as much as real faces

Context-Aware Audio-Visual Speech Enhancement Based on Neuro-Fuzzy Modeling and User Preference Learning

Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert

Head-Mounted Display Visualizations to Support Sound Awareness for the Deaf and Hard of Hearing

Differences of people with visual disabilities in the perceived intensity of emotion inferred from speech of sighted people in online communication settings