Jinjin Cai, Ruiqi Wang, Dezhong Zhao, Ziqin Yuan, Victoria McKenna, Aaron Friedman, Rachel Foot, Susan Storey, Ryan Boente, Sudip Vhaduri, Byung-Cheol Min
Abstract: Audio-based disease prediction is emerging as a promising supplement to traditional medical diagnosis methods, facilitating early, convenient, and non-invasive disease detection and prevention. Multimodal fusion, which integrates features from various domains within or across bio-acoustic modalities, has proven effective in enhancing diagnostic performance. However, most existing methods in the field employ unilateral fusion strategies that focus solely on either intra-modal or inter-modal fusion. This approach limits the full exploitation of the complementary nature of diverse acoustic feature domains and bio-acoustic modalities. Additionally, the inadequate and isolated exploration of latent dependencies within modality-specific and modality-shared spaces curtails their capacity to manage the inherent heterogeneity in multimodal data. To fill these gaps, we propose a transformer-based hierarchical fusion network designed for general multimodal audio-based disease prediction. Specifically, we seamlessly integrate intra-modal and inter-modal fusion in a hierarchical manner and proficiently encode the necessary intra-modal and inter-modal complementary correlations, respectively. Comprehensive experiments demonstrate that our model achieves state-of-the-art performance in predicting three diseases: COVID-19, Parkinson's disease, and pathological dysarthria, showcasing its promising potential in a broad context of audio-based disease prediction tasks. Additionally, extensive ablation studies and qualitative analyses highlight the significant benefits of each main component within our model.
### What problems does this paper attempt to solve?
This paper addresses limitations of existing methods for audio-based multimodal disease prediction, specifically:
1. **Limitations of Unilateral Fusion Strategies**:
- Most existing work adopts a **unilateral fusion strategy**, focusing on either **intra-modal** or **inter-modal** fusion alone. This limits how fully complementary information from different acoustic feature domains and bio-acoustic modalities can be exploited.
- Intra-modal fusion captures multiple feature domains within a single bio-acoustic modality but ignores the synergy gained by integrating several modalities. Conversely, inter-modal fusion gains that synergy but may not model the complex relationships among feature domains within each modality in enough depth.
2. **Insufficient Exploration of Latent Dependencies**:
- Bio-acoustic features and modalities are inherently heterogeneous and potentially complementary, and each modality provides unique insights for disease diagnosis. Learning these dependencies efficiently requires deeper exploration of both the modality-specific space and the modality-shared space.
- Most current work fuses features by simple alignment and concatenation, or processes each feature or modality separately, which is not sufficient to capture the complex dependencies within and between modalities.
3. **Limited Applicability in General Scenarios**:
- Bio-acoustic features and modalities differ in sensitivity and effectiveness across diseases and task scenarios. Existing work therefore usually needs a meticulous feature selection process to tune a model for a specific task setting. That process demands substantial prior knowledge and cross-validation, which limits its use across a wider range of diseases and scenarios.
To address these problems, the paper proposes AuD-Former, a Transformer-based hierarchical fusion network for general multimodal audio-based disease prediction. Its hierarchical fusion strategy performs both intra-modal and inter-modal fusion, exploiting the complementarity between different feature domains and bio-acoustic modalities. The model also introduces an intra-modal (modality-specific) representation learning module and an inter-modal representation learning module to better capture latent dependencies within and between modalities, thereby improving generalization and prediction performance.
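The two-stage idea can be illustrated with a minimal sketch. This is not the AuD-Former implementation: the summary does not give the layer details, so the code below assumes plain scaled dot-product attention with identity projections (no learned weights), mean pooling, and concatenation of the pooled vectors. The modality names (`cough`, `speech`), the token shapes, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, context):
    """Scaled dot-product attention with a residual connection.

    Assumption: identity Q/K/V projections, i.e. no learned weights.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        out.append([a + b for a, b in zip(q, mixed)])  # residual add
    return out


def mean_pool(tokens):
    """Average a token sequence into a single vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def intra_modal_fusion(feature_domains):
    """Stage 1: self-attention over all feature-domain tokens of ONE modality."""
    tokens = [tok for domain in feature_domains for tok in domain]
    return attention(tokens, tokens)


def inter_modal_fusion(modalities):
    """Stage 2: each modality queries every other modality via cross-attention."""
    pooled = []
    for i, target in enumerate(modalities):
        for j, source in enumerate(modalities):
            if i != j:
                target = attention(target, source)
        pooled.append(mean_pool(target))
    return [v for vec in pooled for v in vec]  # concatenate pooled vectors


def hierarchical_fusion(sample):
    """sample maps a modality name to a list of feature domains (token lists)."""
    stage1 = [intra_modal_fusion(domains) for domains in sample.values()]
    return inter_modal_fusion(stage1)


def rand_tokens(n, dim=4):
    """Random stand-in for extracted acoustic feature tokens."""
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]


random.seed(0)
sample = {
    "cough": [rand_tokens(3), rand_tokens(2)],   # two hypothetical feature domains
    "speech": [rand_tokens(4), rand_tokens(2)],
}
embedding = hierarchical_fusion(sample)
print(len(embedding))  # 2 modalities x 4 dims = 8
```

In this sketch, the first stage only mixes tokens within a modality and the second stage only mixes across modalities, which is the ordering the hierarchical strategy describes. A real model would replace the identity projections with learned Transformer layers and feed the fused vector to a classifier.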
### Specific Contributions
- **Hierarchical Fusion Strategy**: Proposes a hierarchical fusion strategy that combines intra-modal and inter-modal fusion, exploiting the complementarity between different feature domains and bio-acoustic modalities.
- **Learning of Modality-Specific and Shared Spaces**: Introduces intra-modal and inter-modal representation learning modules, which let the hierarchical fusion query more informative multimodal representations, reduce the need for meticulous feature selection, and improve the model's overall generalization ability.
- **Extensive Experimental Verification**: Evaluates the model on five datasets covering three diseases (COVID-19, Parkinson's disease, and dysarthria) and reports state-of-the-art performance on multimodal audio-based disease prediction tasks. Ablation studies and qualitative analyses further examine the contribution of each major component in the framework.
Through these contributions, AuD-Former provides a more comprehensive and effective solution for multimodal audio-based disease prediction.