Abstract: Various linguistic and non-linguistic clues, such as excessive emphasis on a word, a shift in the tone of voice, or an awkward expression, frequently convey sarcasm. The computer vision problem of sarcasm recognition in conversation aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue. Previously, sarcasm recognition has focused mainly on text. Still, reliable sarcasm identification requires considering textual information, the audio stream, facial expressions, and body position together. Hence, we propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of the visual data, and an attentional tokenizer-based strategy to extract the most critical context-specific information from the textual data. The key contributions of our work on Multi-modal Sarcasm Recognition are: an attentional tokenizer branch to extract useful features from the glossary content provided by the subtitles; a visual branch to acquire the most prominent features from the video frames; utterance-level feature extraction from the acoustic content; and a multi-headed attention-based feature fusion branch to blend the features obtained from the multiple modalities. Extensive testing on one of the benchmark video datasets, MUStARD, yielded an accuracy of 79.86% for the speaker-dependent configuration and 76.94% for the speaker-independent configuration, demonstrating that our approach is superior to the existing methods. We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net to unseen samples from another dataset, MUStARD++.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the problem of Multi-Modal Sarcasm Recognition (MSR). Specifically, the researchers propose a novel multi-modal sarcasm recognition framework named "VyAnG-Net," which improves recognition accuracy by integrating visual, auditory, and textual (subtitle) features.

#### Research Background and Motivation

1. **Importance of Multi-Modal Data**: With the growth of video and other multi-modal content on social media, relying solely on textual information is insufficient for accurate sarcasm recognition. Combining visual, auditory, and textual information is therefore crucial for sarcasm detection.
2. **Limitations of Existing Methods**: Although many unimodal methods exist for sarcasm recognition, they struggle to capture all critical features in multi-modal scenarios.

#### Main Contributions

1. **Proposing a Novel Framework**: VyAnG-Net combines a lightweight depth attention module with a self-regulated ConvNet, and integrates features from different modalities through a multi-head attention mechanism.
2. **Experimental Validation**: Tested on the benchmark video dataset MUStARD, reaching 79.86% accuracy in the speaker-dependent configuration and 76.94% in the speaker-independent configuration, which the authors report as better than existing methods.
3. **Robustness Verification**: Conducted a series of ablation experiments to verify the robustness of the proposed method, and performed cross-dataset testing on MUStARD++ to validate its generalization capability.

### Technical Details

1. **Text Branch**: Uses an attentional tokenizer-based method to extract context-specific features from the subtitles.
2. **Visual Branch**: Employs a lightweight depth attention module with a self-regulated ConvNet to extract the most prominent features from video frames.
3. **Auditory Feature Extraction**: Extracts utterance-level features from the audio content.
4. **Feature Fusion**: Uses a multi-head attention mechanism to integrate features from the different modalities.

Through these technical means, the researchers aim to accurately recognize sarcasm in complex multi-modal environments, thereby enhancing the accuracy and reliability of opinion mining.
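To make the fusion step more concrete, here is a minimal sketch of multi-head attention over three modality feature vectors. It is not the authors' implementation: the function name `multi_head_fusion`, the 8-dimensional features, the two heads, the random (untrained) projection matrices, and the final mean pooling are all assumptions chosen for illustration. In a real model the projections would be learned and an output projection would normally follow.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for a learned projection matrix."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def multi_head_fusion(tokens, num_heads, rng):
    """Fuse equally sized modality vectors with multi-head self-attention.

    Returns the fused vector (mean over attended modality tokens) and the
    attention weights, indexed as weights[head][query_token][key_token].
    """
    dim = len(tokens[0])
    assert dim % num_heads == 0, "feature size must be divisible by num_heads"
    head_dim = dim // num_heads
    attended = [[] for _ in tokens]  # concatenated head outputs per token
    weights = []
    for _ in range(num_heads):
        w_q, w_k, w_v = (random_matrix(head_dim, dim, rng) for _ in range(3))
        queries = [matvec(w_q, t) for t in tokens]
        keys = [matvec(w_k, t) for t in tokens]
        values = [matvec(w_v, t) for t in tokens]
        head_weights = []
        for i, q in enumerate(queries):
            # Scaled dot-product scores of this modality against all modalities.
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim)
                      for k in keys]
            attn = softmax(scores)
            head_weights.append(attn)
            context = [sum(attn[j] * values[j][c] for j in range(len(values)))
                       for c in range(head_dim)]
            attended[i].extend(context)
        weights.append(head_weights)
    # Assumption: mean-pool the attended tokens into one fused representation.
    fused = [sum(tok[c] for tok in attended) / len(attended) for c in range(dim)]
    return fused, weights


# Toy features standing in for the text, visual, and acoustic branch outputs.
rng = random.Random(0)
text_feat = [rng.gauss(0, 1) for _ in range(8)]
visual_feat = [rng.gauss(0, 1) for _ in range(8)]
audio_feat = [rng.gauss(0, 1) for _ in range(8)]

fused, weights = multi_head_fusion([text_feat, visual_feat, audio_feat],
                                   num_heads=2, rng=rng)
print(len(fused))
```

Each row of `weights[h]` shows how strongly one modality attends to the others in head `h`, which is one way such a fusion layer could be inspected. The fused vector would then be passed to a classifier that predicts sarcastic versus non-sarcastic.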

VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features
