Abstract: Heterogeneous graphs provide a compact, efficient, and scalable way to model data involving multiple disparate modalities. This makes modeling audiovisual data using heterogeneous graphs an attractive option. However, graph structure does not appear naturally in audiovisual data. Graphs for audiovisual data are constructed manually, which is both difficult and sub-optimal. In this work, we address this problem by (i) proposing a parametric graph construction strategy for the intra-modal edges, and (ii) learning the crossmodal edges. To this end, we develop a new model, heterogeneous graph crossmodal network (HGCN), that learns the crossmodal edges. Our proposed model can adapt to various spatial and temporal scales owing to its parametric construction, while the learnable crossmodal edges effectively connect the relevant nodes across modalities. Experiments on a large benchmark dataset (AudioSet) show that our model is state-of-the-art (0.53 mean average precision), outperforming transformer-based models and other graph-based models.

What problem does this paper attempt to address?

This paper attempts to solve the problem of how to effectively use heterogeneous graphs to model multimodal data in acoustic event classification. Specifically, the paper focuses on how to construct and learn cross-modal edges between audio and visual modalities to improve the performance of acoustic event classification in the absence of a naturally existing graph structure.

### Core Problems of the Paper

1. **Graph Structure Construction of Multimodal Data**:
   - Audio and video data do not have a natural graph structure by themselves, so the graph must be constructed manually, which is both difficult and sub-optimal.
   - The paper proposes a parameterized graph construction strategy for constructing intra-modal edges and solves this problem by learning cross-modal edges.
2. **Cross-Modal Learning**:
   - Existing multimodal learning methods usually rely on models from computer vision tasks, which cannot capture the temporal relationships between the two modalities well.
   - The paper proposes a new model, the Heterogeneous Graph Cross-Modal Network (HGCN), which effectively connects relevant nodes of different modalities by learning cross-modal edges, thereby better capturing multimodal information.

### Solutions

1. **Parameterized Graph Construction Strategy**:
   - Two parameters (time span and dilation coefficient) control the construction of intra-modal edges, so the graph structure can adapt to different spatio-temporal scales.
   - This parameterized construction strategy makes the graph structure more flexible and controllable.
2. **HGCN Model**:
   - **Modality-Specific Layers**: Process the audio and video sub-graphs separately and extract the features of each node.
   - **Cross-Modal Graph Learning Layer**: Fuse information from different modalities by learning cross-modal edges, thereby better capturing the interactions among modalities.
   - **Learnable Pooling Layer**: Obtain the representation of the entire graph through the learned pooling function to further improve the performance of the model.

### Experimental Results

- The paper conducted experiments on the large-scale benchmark dataset AudioSet. The results show that the HGCN model achieves state-of-the-art performance in terms of mean Average Precision (mAP) and Area Under the Receiver Operating Characteristic Curve (ROC-AUC).
- Compared with existing self-supervised and supervised models, the HGCN model improves mAP by more than 3% and reaches 0.94 in ROC-AUC, indicating higher prediction reliability.

### Conclusion

The paper addresses the graph structure construction and cross-modal learning problems of multimodal data in acoustic event classification by proposing an end-to-end graph learning method, significantly improving the performance of the model. This method is not only superior to existing methods in performance but also more efficient in the number of parameters.
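The parameterized intra-modal construction described under Solutions can be illustrated with a minimal sketch. The summary names only the two parameters (time span and dilation coefficient), so the exact connection rule below is an assumption: each node is linked to its next `span` neighbors, stepping `dilation` positions at a time. The function name `intra_modal_edges` and its arguments are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def intra_modal_edges(num_nodes: int, span: int, dilation: int) -> List[Tuple[int, int]]:
    """Build undirected temporal edges inside one modality.

    Assumed rule (not confirmed by the source): node i is connected to
    nodes i + k * dilation for k = 1..span, as long as they exist.
    Each edge is returned once as (i, j) with i < j.
    """
    if num_nodes < 0 or span < 1 or dilation < 1:
        raise ValueError("num_nodes must be >= 0; span and dilation must be >= 1")

    edges: List[Tuple[int, int]] = []
    for i in range(num_nodes):
        for k in range(1, span + 1):
            j = i + k * dilation
            if j < num_nodes:  # skip neighbors that fall past the last segment
                edges.append((i, j))
    return edges


# Five audio segments, each linked to its next two neighbors (dense, short range).
dense = intra_modal_edges(num_nodes=5, span=2, dilation=1)

# Same five segments, one neighbor each but skipping a step (sparse, longer range).
dilated = intra_modal_edges(num_nodes=5, span=1, dilation=2)
```

Under this assumed rule, a larger span densifies the local neighborhood, while a larger dilation reaches further in time with the same number of edges per node, which is one way a graph could adapt to different temporal scales.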

Heterogeneous Graph Learning for Acoustic Event Classification
