Heterogeneous Graph Learning for Acoustic Event Classification

Amir Shirian, Mona Ahmadian, Krishna Somandepalli, Tanaya Guha
2023-03-12
Abstract: Heterogeneous graphs provide a compact, efficient, and scalable way to model data involving multiple disparate modalities. This makes modeling audiovisual data using heterogeneous graphs an attractive option. However, graph structure does not appear naturally in audiovisual data. Graphs for audiovisual data are constructed manually, which is both difficult and sub-optimal. In this work, we address this problem by (i) proposing a parametric graph construction strategy for the intra-modal edges, and (ii) learning the crossmodal edges. To this end, we develop a new model, heterogeneous graph crossmodal network (HGCN), that learns the crossmodal edges. Our proposed model can adapt to various spatial and temporal scales owing to its parametric construction, while the learnable crossmodal edges effectively connect the relevant nodes across modalities. Experiments on a large benchmark dataset (AudioSet) show that our model is state-of-the-art (0.53 mean average precision), outperforming transformer-based models and other graph-based models.
Sound, Machine Learning, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how to use heterogeneous graphs effectively to model multimodal data for acoustic event classification. Specifically, it focuses on how to construct intra-modal edges and learn cross-modal edges between the audio and visual modalities, improving classification performance even though the data has no naturally existing graph structure.

### Core Problems of the Paper

1. **Graph Structure Construction of Multimodal Data**:
   - Audio and video data have no natural graph structure of their own. Graphs must be constructed manually, which is both difficult and sub-optimal.
   - The paper proposes a parameterized graph construction strategy for the intra-modal edges and learns the cross-modal edges instead of hand-crafting them.
2. **Cross-Modal Learning**:
   - Existing multimodal learning methods usually rely on models designed for computer vision tasks, which do not capture the temporal relationships between the two modalities well.
   - The paper proposes a new model, the Heterogeneous Graph Cross-Modal Network (HGCN), which connects relevant nodes of different modalities by learning cross-modal edges, thereby capturing multimodal information more effectively.

### Solutions

1. **Parameterized Graph Construction Strategy**:
   - Two parameters (time span and dilation coefficient) control the construction of intra-modal edges, so the graph structure can adapt to different spatio-temporal scales.
   - This parameterized construction makes the graph structure more flexible and controllable.
2. **HGCN Model**:
   - **Modality-Specific Layers**: Process the audio and video subgraphs separately and extract the features of each node.
   - **Cross-Modal Graph Learning Layer**: Fuses information from the different modalities by learning cross-modal edges, thereby better capturing the interactions between modalities.
   - **Learnable Pooling Layer**: Obtains a representation of the entire graph through a learned pooling function to further improve model performance.

### Experimental Results

- The paper reports experiments on the large-scale benchmark dataset AudioSet. The results show that HGCN achieves state-of-the-art performance in terms of mean Average Precision (mAP) and Area Under the Receiver Operating Characteristic Curve (ROC-AUC).
- Compared with existing self-supervised and supervised models, HGCN improves mAP by more than 3% and reaches 0.94 ROC-AUC, indicating higher prediction reliability.

### Conclusion

The paper addresses graph structure construction and cross-modal learning for multimodal data in acoustic event classification by proposing an end-to-end graph learning method that significantly improves model performance. The method not only outperforms existing methods but is also more efficient in the number of parameters.
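
To make the parameterized intra-modal construction more concrete, here is a minimal Python sketch. It is not the authors' implementation: the function name `intra_modal_edges` and the exact reading of the two parameters are assumptions made for illustration. The sketch assumes each node is a time segment of one modality, that the time span is the number of later neighbours each node connects to, and that the dilation coefficient is the step (in segments) between those neighbours. The learned cross-modal edges are not covered here.

```python
def intra_modal_edges(num_nodes, span, dilation):
    """Build undirected intra-modal edges over a sequence of time segments.

    Assumed semantics (illustrative, not taken from the paper's code):
    node i is linked to nodes i + k * dilation for k = 1..span,
    whenever that index still lies inside the sequence.
    """
    if num_nodes < 0 or span < 0 or dilation < 1:
        raise ValueError("need num_nodes >= 0, span >= 0 and dilation >= 1")

    edges = set()
    for i in range(num_nodes):
        for k in range(1, span + 1):
            j = i + k * dilation
            if j < num_nodes:
                # Store each undirected edge once, smaller index first.
                edges.add((i, j))
    return sorted(edges)


# Dense local connectivity: every node linked to its next segment.
local_edges = intra_modal_edges(num_nodes=4, span=1, dilation=1)

# Coarser temporal scale: two neighbours per node, skipping every other segment.
dilated_edges = intra_modal_edges(num_nodes=6, span=2, dilation=2)
```

Under these assumptions, raising the span widens each node's temporal neighbourhood, while raising the dilation spreads the same number of edges over a longer time range, which is one way a graph could be made to adapt to different temporal scales.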