Abstract: Alzheimer's dementia (AD) entails negative psychological, social, and economic consequences not only for patients but also for their families, relatives, and society in general. Despite the significance of this condition and the importance of early diagnosis, existing approaches still have limitations. The main limitation concerns the way the speech and transcript modalities are combined in a single neural network. Existing works add or concatenate the image and text representations, employ majority voting, or average the predictions of many separately trained textual and speech models. To address these limitations, in this article we present new methods to detect AD patients and predict Mini-Mental State Examination (MMSE) scores in an end-to-end trainable manner, combining BERT, Vision Transformer, Co-Attention, a Multimodal Shifting Gate, and a variant of the self-attention mechanism. Specifically, we convert audio to Log-Mel spectrograms, their delta, and delta-delta (acceleration) values. First, we pass each transcript and image through a BERT model and a Vision Transformer, respectively, adding a co-attention layer on top, which generates image and word attention simultaneously. Second, we propose an architecture that integrates multimodal information into a BERT model via a Multimodal Shifting Gate. Finally, we introduce an approach that captures both inter- and intra-modal interactions by concatenating the textual and visual representations and applying a self-attention mechanism that includes a gate model. Experiments conducted on the ADReSS Challenge dataset indicate that our models offer advantages over existing research initiatives, achieving competitive results in both the AD classification and MMSE regression tasks. Specifically, our best-performing model attains an accuracy of 90.00% in the AD classification task and a Root Mean Squared Error (RMSE) of 3.61 in the MMSE regression task, the latter being a new state-of-the-art result.
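
The audio preprocessing step mentioned in the abstract (a Log-Mel spectrogram plus its delta and delta-delta) can be illustrated with a minimal sketch. It assumes the Log-Mel spectrogram has already been computed and is given as a matrix of Mel bands by time frames. The function names `delta` and `stack_channels` are hypothetical, the three-channel stacking is an assumption about how the "image" is formed, and the simple finite-difference formula is a stand-in that may differ from the delta computation the authors actually used.

```python
from typing import List

# Rows are Mel bands, columns are time frames (assumed layout).
Matrix = List[List[float]]


def delta(features: Matrix) -> Matrix:
    """Approximate the time derivative of each row.

    Uses central differences in the interior and one-sided differences
    at the two edges. This is an illustrative choice, not necessarily
    the filter used in the paper.
    """
    out: Matrix = []
    for row in features:
        n = len(row)
        if n < 2:
            # A single frame has no temporal change to measure.
            out.append([0.0] * n)
            continue
        d = [0.0] * n
        d[0] = row[1] - row[0]
        d[-1] = row[-1] - row[-2]
        for t in range(1, n - 1):
            d[t] = (row[t + 1] - row[t - 1]) / 2.0
        out.append(d)
    return out


def stack_channels(log_mel: Matrix) -> List[Matrix]:
    """Return [log_mel, delta, delta-delta] as three image-like channels."""
    d1 = delta(log_mel)
    d2 = delta(d1)  # acceleration: the delta of the delta
    return [log_mel, d1, d2]


if __name__ == "__main__":
    # Toy 2-band, 4-frame "spectrogram" (made-up values).
    toy = [[0.0, 1.0, 4.0, 9.0], [2.0, 2.0, 2.0, 2.0]]
    channels = stack_channels(toy)
    print(len(channels), len(channels[0]), len(channels[0][0]))
```

Under this assumption, the three stacked channels play the role of the colour channels of an ordinary image, so the result can be resized and fed to an image model such as a Vision Transformer.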
