Abstract:Alzheimer's dementia (AD) entails negative psychological, social, and economic consequences not only for the patients but also for their families, relatives, and society in general. Despite the significance of this phenomenon and the importance for an early diagnosis, there are still limitations. Specifically, the main limitation is pertinent to the way the modalities of speech and transcripts are combined in a single neural network. Existing research works add/concatenate the image and text representations, employ majority voting approaches or average the predictions after training many textual and speech models separately. To address these limitations, in this article we present some new methods to detect AD patients and predict the Mini-Mental State Examination (MMSE) scores in an end-to-end trainable manner consisting of a combination of BERT, Vision Transformer, Co-Attention, Multimodal Shifting Gate, and a variant of the self-attention mechanism. Specifically, we convert audio to Log-Mel spectrograms, their delta, and delta-delta (acceleration values). First, we pass each transcript and image through a BERT model and Vision Transformer, respectively, adding a co-attention layer at the top, which generates image and word attention simultaneously. Secondly, we propose an architecture, which integrates multimodal information to a BERT model via a Multimodal Shifting Gate. Finally, we introduce an approach to capture both the inter- and intra-modal interactions by concatenating the textual and visual representations and utilizing a self-attention mechanism, which includes a gate model. Experiments conducted on the ADReSS Challenge dataset indicate that our introduced models demonstrate valuable advantages over existing research initiatives achieving competitive results in both the AD classification and MMSE regression task. Specifically, our best performing model attains an accuracy of 90.00% and a Root Mean Squared Error (RMSE) of 3.61 in the AD classification task and MMSE regression task, respectively, achieving a new state-of-the-art performance in the MMSE regression task.

Detecting Alzheimer’s Disease from Speech Using Neural Networks with Bottleneck Features and Data Augmentation

Identification of Alzheimer's Disease Patients Based on Oral Speech Features

Exploring Multimodal Approaches for Alzheimer's Disease Detection Using Patient Speech Transcript and Audio Data

A Transfer Learning Method for Detecting Alzheimer's Disease Based on Speech and Natural Language Processing

Exploiting Pre-Trained ASR Models for Alzheimer's Disease Recognition Through Spontaneous Speech

Noninvasive automatic detection of Alzheimer's disease from spontaneous speech: a review

Detecting Alzheimer's Disease Based on Acoustic Features Extracted from Pre-trained Models

Leveraging Pretrained Representations with Task-related Keywords for Alzheimer's Disease Detection

Leveraging Pretrained Representations with Task-Related Keywords for Alzheimer’s Disease Detection

Artificial Intelligence-Enabled End-To-End Detection and Assessment of Alzheimer's Disease Using Voice

Exploiting Fully Convolutional Network and Visualization Techniques on Spontaneous Speech for Dementia Detection

Classifying Alzheimer's Disease Using Audio and Text-Based Representations of Speech

An approach for assisting diagnosis of Alzheimer's disease based on natural language processing

Exploring the Topics of Audio Words for Detecting Alzheimer's Disease from Spontaneous Speech

Towards Computer-Based Automated Screening of Dementia Through Spontaneous Speech

Temporal Integration of Text Transcripts and Acoustic Features for Alzheimer's Diagnosis Based on Spontaneous Speech

Explainable Alzheimer's Disease Detection Using Linguistic Features from Automatic Speech Recognition

ML-Based Analysis to Identify Speech Features Relevant in Predicting Alzheimer's Disease

Multimodal Deep Learning Models for Detecting Dementia From Speech and Transcripts

Exploring linguistic feature and model combination for speech recognition based automatic AD detection

Detecting Linguistic Characteristics of Alzheimer's Dementia by Interpreting Neural Models