Abstract: The speed and reach of information dissemination on social media also allow false information and rumors to spread rapidly. Attackers can use false information to trigger public panic and disrupt social stability. Traditional multimodal sentiment analysis methods suffer from suboptimal fusion of multimodal features, which reduces classification accuracy. To address these issues, this study introduces a novel emotion classification model. The model accounts for the interaction between modalities, which direct fusion of multimodal features neglects. The Transformer's encoding layer is applied to extract sentiment semantic encodings from audio and textual sequences. Subsequently, a bimodal feature interaction fusion attention mechanism models intramodal and intermodal correlations and captures contextual dependencies. This approach improves the model's ability to understand and generalize sentiment semantics. The cross-modal fused features are then passed to the classification layer for sentiment prediction. Experiments on the IEMOCAP dataset show that the proposed model achieves an emotion recognition accuracy of 78.5% and an F1-score of 77.6%. Compared with other mainstream multimodal emotion recognition methods, the proposed model shows significant improvements in all metrics. These results indicate that the proposed method, based on the Transformer and an interactive attention mechanism, captures the emotional features of utterances more fully. This research provides technical support for public sentiment security monitoring on social networks.
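
The pipeline described in the abstract (per-modality encodings, intramodal and intermodal attention, fusion, classification) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the text and audio sequences have already been encoded to vectors of the same width, uses plain scaled dot-product attention without learned projections, and fuses by mean-pooling and concatenation. The function names (`attend`, `fuse`, `classify`), the dimensions, and the four-class output are all hypothetical choices made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention where `context` serves as keys and values.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(context))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, context)


def mean_pool(seq):
    """Average a sequence of vectors into a single vector."""
    return [sum(col) / len(seq) for col in zip(*seq)]


def fuse(text, audio):
    """Hypothetical bimodal interaction fusion.

    Intramodal: each modality attends over itself.
    Intermodal: each modality attends over the other.
    The four pooled results are concatenated into one feature vector.
    """
    text_self = attend(text, text)
    audio_self = attend(audio, audio)
    text_cross = attend(text, audio)   # text queries over audio context
    audio_cross = attend(audio, text)  # audio queries over text context
    return (mean_pool(text_self) + mean_pool(text_cross)
            + mean_pool(audio_self) + mean_pool(audio_cross))


def classify(features, weights):
    """Linear layer followed by softmax, returning class probabilities."""
    logits = [sum(f * w for f, w in zip(features, row)) for row in weights]
    return softmax(logits)


random.seed(0)
d_model, n_classes = 8, 4  # assumed sizes, not taken from the paper

# Stand-ins for encoder outputs: 5 text tokens and 7 audio frames.
text = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(5)]
audio = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(7)]

fused = fuse(text, audio)
w = [[random.gauss(0, 0.1) for _ in range(len(fused))] for _ in range(n_classes)]
probs = classify(fused, w)
print(len(fused), probs)
```

The two sequences may have different lengths, because cross-attention only requires that the feature width match. In a trained model the random weights and stand-in inputs would be replaced by learned parameters and real encoder outputs.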
