Abstract: Information spreads quickly and widely on social media, which also allows false information and rumors to propagate rapidly. Attackers can exploit false information to trigger public panic and disrupt social stability. Traditional multimodal sentiment analysis methods fuse multimodal features suboptimally, which reduces classification accuracy. To address these issues, this study introduces a novel emotion classification model that explicitly models the interaction between modalities, which direct fusion of multimodal features neglects. The Transformer encoding layer first extracts sentiment semantic encodings from audio and text sequences. A bimodal feature interaction fusion attention mechanism then models intramodal and intermodal correlations and captures contextual dependencies, improving the model's ability to understand and generalize sentiment semantics. The cross-modal fused features are passed to the classification layer for sentiment prediction. On the IEMOCAP dataset, the proposed model achieves an emotion recognition accuracy of 78.5% and an F1-score of 77.6%. Compared with other mainstream multimodal emotion recognition methods, it shows significant improvements in all metrics. These results indicate that combining the Transformer with an interactive attention mechanism captures the emotional features of utterances more fully. This research provides technical support for monitoring public sentiment security on social networks.
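The pipeline summarized in the abstract (per-modality encodings, intramodal and intermodal attention, fusion, classification) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random arrays stand in for Transformer encoder outputs, the attention has no learned projections or multiple heads, and all names (`cross_attention`, `fuse_and_classify`, `w_out`, `b_out`), the additive fusion with mean pooling, and the four-class output are assumptions made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_seq, context_seq):
    """Scaled dot-product attention in which query_seq attends over context_seq.

    Passing the same sequence twice gives intramodal (self) attention;
    passing two different modalities gives intermodal attention.
    """
    d = query_seq.shape[-1]
    scores = query_seq @ context_seq.T / np.sqrt(d)   # (T_query, T_context)
    weights = softmax(scores, axis=-1)                # each row sums to 1
    return weights @ context_seq, weights             # (T_query, d)

def fuse_and_classify(text_enc, audio_enc, w_out, b_out):
    """Hypothetical fusion: intramodal + intermodal attention, pooling, linear layer."""
    text_self, _ = cross_attention(text_enc, text_enc)
    audio_self, _ = cross_attention(audio_enc, audio_enc)
    text_from_audio, _ = cross_attention(text_enc, audio_enc)
    audio_from_text, _ = cross_attention(audio_enc, text_enc)
    # Assumed fusion rule: add intra- and intermodal views, mean-pool over time,
    # then concatenate the two modality vectors.
    fused = np.concatenate([
        (text_self + text_from_audio).mean(axis=0),
        (audio_self + audio_from_text).mean(axis=0),
    ])                                                # (2 * d,)
    logits = fused @ w_out + b_out
    return softmax(logits)                            # class probabilities

# Stand-ins for encoder outputs: 12 text tokens and 30 audio frames, width 16.
rng = np.random.default_rng(0)
d_model, n_classes = 16, 4                            # 4 classes is an assumption
text_enc = rng.normal(size=(12, d_model))
audio_enc = rng.normal(size=(30, d_model))
w_out = rng.normal(size=(2 * d_model, n_classes)) * 0.1
b_out = np.zeros(n_classes)

probs = fuse_and_classify(text_enc, audio_enc, w_out, b_out)
predicted_class = int(np.argmax(probs))
```

In a trained model, the encodings would come from the Transformer encoding layers and the attention and output weights would be learned; the sketch only shows how each modality can query the other before the fused vector reaches the classifier.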
