Abstract: Audio-based disease prediction is emerging as a promising supplement to traditional medical diagnosis methods, facilitating early, convenient, and non-invasive disease detection and prevention. Multimodal fusion, which integrates features from various domains within or across bio-acoustic modalities, has proven effective in enhancing diagnostic performance. However, most existing methods in the field employ unilateral fusion strategies that focus solely on either intra-modal or inter-modal fusion. This approach limits the full exploitation of the complementary nature of diverse acoustic feature domains and bio-acoustic modalities. Additionally, the inadequate and isolated exploration of latent dependencies within modality-specific and modality-shared spaces curtails their capacity to manage the inherent heterogeneity in multimodal data. To fill these gaps, we propose a transformer-based hierarchical fusion network designed for general multimodal audio-based disease prediction. Specifically, we seamlessly integrate intra-modal and inter-modal fusion in a hierarchical manner and proficiently encode the necessary intra-modal and inter-modal complementary correlations, respectively. Comprehensive experiments demonstrate that our model achieves state-of-the-art performance in predicting three diseases: COVID-19, Parkinson's disease, and pathological dysarthria, showcasing its promising potential in a broad context of audio-based disease prediction tasks. Additionally, extensive ablation studies and qualitative analyses highlight the significant benefits of each main component within our model.

What problem does this paper attempt to address?

This paper addresses the limitations of existing methods for audio-based multimodal disease prediction, specifically the following:

1. **Limitations of Unilateral Fusion Strategies**:
   - Most existing research adopts a **unilateral fusion strategy**, focusing on either **intra-modal** or **inter-modal** fusion alone. This limits the ability to fully exploit complementary information across feature domains and bio-acoustic modalities.
   - Intra-modal fusion can capture multiple characteristics within a specific bio-acoustic modality, but it often ignores the synergy gained by integrating multiple modalities. Conversely, inter-modal fusion provides that synergy but may not model the complex feature-domain relationships within each modality in sufficient depth.
2. **Insufficient Exploration of Latent Dependencies**:
   - Bio-acoustic features and modalities are inherently heterogeneous and potentially complementary, and each modality provides unique insights crucial for disease diagnosis. Learning the dependencies among these features efficiently, in both the modality-specific space and the modality-shared space, requires exploring latent dependencies more deeply.
   - At present, most research fuses features through simple alignment and concatenation, or processes each feature or modality separately, which is insufficient to capture the complex dependencies within and between modalities.
3. **Limited Applicability in General Scenarios**:
   - Different bio-acoustic features and modalities vary in sensitivity and effectiveness across diseases and task scenarios. Existing research therefore usually requires a meticulous feature selection process to tailor the model to a specific task setting. This process demands substantial prior knowledge and cross-validation, which limits its application to a wider range of diseases and scenarios.

To solve these problems, the paper proposes a Transformer-based hierarchical fusion network (AuD-Former) for general multimodal audio-based disease prediction. Its hierarchical fusion strategy performs both intra-modal and inter-modal fusion, so the model exploits the complementarity between different feature domains and bio-acoustic modalities. The model also introduces a modality-specific representation learning module and an inter-modal representation learning module to better capture the latent dependencies within and between modalities, thereby improving its generalization ability and prediction performance.

### Specific Contributions

- **Hierarchical Fusion Strategy**: Proposes a hierarchical fusion strategy that combines intra-modal and inter-modal fusion, exploiting the complementarity between different feature domains and bio-acoustic modalities.
- **Learning of Modality-Specific and Shared Spaces**: Introduces intra-modal and inter-modal representation learning modules, enabling hierarchical fusion to query more informative multimodal representations, reducing the need for meticulous feature selection, and enhancing the overall generalization ability of the model.
- **Extensive Experimental Verification**: Evaluation on five datasets covering three diseases (COVID-19, Parkinson's disease, and dysarthria) demonstrates the model's superior performance in multimodal audio-based disease prediction tasks. Ablation studies and qualitative analysis further examine the contribution of each major component of the framework.

Through these improvements, AuD-Former provides a more comprehensive and effective solution for multimodal audio-based disease prediction.
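The two-stage idea described above (fuse feature domains within each modality first, then fuse across modalities) can be illustrated with a minimal sketch. This is not the AuD-Former implementation: it uses plain scaled dot-product attention with identity projections and no learned weights, and the function names, the modality names `cough` and `speech`, and the toy feature vectors are all hypothetical, chosen only to show the order of operations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Scaled dot-product attention with identity projections.

    Assumption: no learned query/key/value weights, so each output is a
    convex combination of the vectors in keys_values.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, keys_values))
                    for i in range(d)])
    return out

def mean_pool(tokens):
    """Average a list of equal-length vectors into one vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def hierarchical_fusion(modalities):
    """modalities: dict mapping a modality name to a list of
    feature-domain vectors, all of the same dimension (hypothetical layout).
    """
    # Stage 1, intra-modal fusion: self-attention among the feature
    # domains of a single modality.
    intra = {name: attend(tokens, tokens)
             for name, tokens in modalities.items()}

    # Stage 2, inter-modal fusion: each modality's tokens attend to the
    # tokens of all other modalities (cross-attention).
    fused = []
    for name, tokens in intra.items():
        others = [t for other, toks in intra.items()
                  if other != name for t in toks]
        cross = attend(tokens, others) if others else tokens
        fused.append(mean_pool(cross))

    # Concatenate the per-modality summaries into one vector that a
    # downstream classifier could consume.
    return [x for vec in fused for x in vec]

# Toy input: two hypothetical modalities, each with two 3-dimensional
# feature-domain vectors.
example = {
    "cough": [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1]],
    "speech": [[0.3, 0.3, 0.3], [0.0, 1.0, 0.0]],
}
fused = hierarchical_fusion(example)
print(len(fused))  # 2 modalities x 3 dimensions = 6
```

A real model would add learned projections, multiple attention heads, positional information, and a classification head; the sketch only fixes the ordering of intra-modal before inter-modal fusion.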

Multimodal Audio-based Disease Prediction with Transformer-based Hierarchical Fusion Network

Multimodal Fusion with Cross-attention Transformer for HCC Early Recurrence Prediction from Multi-Phase CT and Clinical Data

TTMFN: Two-stream Transformer-based Multimodal Fusion Network for Survival Prediction

Transformer-Based Multi-Modal Data Fusion Method for COPD Classification and Physiological and Biochemical Indicators Identification

Application of Multimodal Fusion Deep Learning Model in Disease Recognition

Multimodal Data Hybrid Fusion and Natural Language Processing for Clinical Prediction Models

A multimodal cross-transformer-based model to predict mild cognitive impairment using speech, language and vision

Incomplete Multimodal Learning for Complex Brain Disorders Prediction

Attentive-based Multi-level Feature Fusion for Voice Disorder Diagnosis

Multimodal Deep Learning for Mental Disorders Prediction from Audio Speech Samples

DeAF: A Multimodal Deep Learning Framework for Disease Prediction

Multi-modal Fusion Network with Intra- and Inter-Modality Attention for Prognosis Prediction in Breast Cancer

A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics

TriFusion enables accurate prediction of miRNA-disease association by a tri-channel fusion neural network

Research on Multimodal Fusion of Temporal Electronic Medical Records

A Multimodal Affinity Fusion Network for Predicting the Survival of Breast Cancer Patients

Mechanism of the inhibition of aldehyde dehydrogenase in vivo by disulfiram and diethyldithiocarbamate.

Missing-modality Enabled Multi-modal Fusion Architecture for Medical Data

Transformer-Based Classification Outcome Prediction for Multimodal Stroke Treatment

Hybrid Multimodality Fusion with Cross-Domain Knowledge Transfer to Forecast Progression Trajectories in Cognitive Decline