Abstract: Alzheimer's dementia (AD) entails negative psychological, social, and economic consequences not only for patients but also for their families, relatives, and society in general. Despite the significance of the disease and the importance of early diagnosis, existing detection approaches still have limitations. The main limitation concerns the way the modalities of speech and transcripts are combined in a single neural network. Existing research works add or concatenate the image and text representations, employ majority voting approaches, or average the predictions after training many textual and speech models separately. To address these limitations, in this article we present new methods to detect AD patients and predict Mini-Mental State Examination (MMSE) scores in an end-to-end trainable manner, combining BERT, Vision Transformer, Co-Attention, a Multimodal Shifting Gate, and a variant of the self-attention mechanism. Specifically, we convert audio to Log-Mel spectrograms, their delta, and delta-delta (acceleration values), which serve as the image input. First, we pass each transcript and image through a BERT model and a Vision Transformer, respectively, adding a co-attention layer on top, which generates image and word attention simultaneously. Second, we propose an architecture that integrates multimodal information into a BERT model via a Multimodal Shifting Gate. Finally, we introduce an approach that captures both the inter- and intra-modal interactions by concatenating the textual and visual representations and applying a self-attention mechanism that includes a gate model. Experiments conducted on the ADReSS Challenge dataset indicate that the introduced models offer advantages over existing approaches, achieving competitive results in both the AD classification and MMSE regression tasks. Our best performing model attains an accuracy of 90.00% in the AD classification task and a Root Mean Squared Error (RMSE) of 3.61 in the MMSE regression task, the latter being a new state-of-the-art result.
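The audio-to-image step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a log-Mel spectrogram has already been computed as a NumPy array of shape (n_mels, n_frames), and it uses the common regression-based delta formula, which may differ from the one used by the authors. The function names `delta` and `stack_three_channels` and the `width` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def delta(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based delta along the time axis (last axis).

    d_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..width} n^2)
    Edge frames are handled by repeating the first and last frame.
    """
    n_frames = features.shape[-1]
    pad_spec = [(0, 0)] * (features.ndim - 1) + [(width, width)]
    padded = np.pad(features, pad_spec, mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[..., width + n : width + n + n_frames]
        behind = padded[..., width - n : width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom


def stack_three_channels(log_mel: np.ndarray) -> np.ndarray:
    """Stack log-Mel, delta, and delta-delta into a 3-channel image."""
    d1 = delta(log_mel)
    d2 = delta(d1)  # acceleration values
    return np.stack([log_mel, d1, d2], axis=0)


# Toy input (an assumption): every Mel band rises linearly over time.
n_mels, n_frames = 4, 20
toy_log_mel = np.tile(np.arange(n_frames, dtype=float), (n_mels, 1))
image = stack_three_channels(toy_log_mel)
print(image.shape)
```

The resulting array has one channel per feature type, which is the layout an image model such as a Vision Transformer typically expects after resizing. Resizing and normalization are omitted here.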
