SoMeLVLM: A Large Vision Language Model for Social Media Processing

Xinnong Zhang, Haoyu Kuang, Xinyi Mou, Hanjia Lyu, Kun Wu, Siming Chen, Jiebo Luo, Xuanjing Huang, Zhongyu Wei
2024-02-20
Abstract: The growth of social media, characterized by its multimodal nature, has led to the emergence of diverse phenomena and challenges, which calls for an effective approach to uniformly solve automated tasks. The powerful Large Vision Language Models make it possible to handle a variety of tasks simultaneously, but even with carefully designed prompting methods, the general domain models often fall short in aligning with the unique speaking style and context of social media tasks. In this paper, we introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM), which is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation. SoMeLVLM is designed to understand and generate realistic social media behavior. We have developed a 654k multimodal social media instruction-tuning dataset to support our cognitive framework and fine-tune our model. Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks. Further analysis shows its significant advantages over baselines in terms of cognitive abilities.
Computation and Language, Multimedia
What problem does this paper attempt to address?
The paper addresses several challenges that Large Vision-Language Models (LVLMs) face when processing social media data, with the goal of improving model performance and adaptability in this domain. It identifies three main issues:

1. **Limitations of Multimodal Understanding**: Existing Large Language Models (LLMs) and LVLMs often focus more on textual information while neglecting other modalities such as images. This does not match how social media is actually used, where grasping a user's intent typically requires understanding images and text together.
2. **Challenges in Understanding Informal Language**: Language on social media is highly informal and varied, including emotional expressions, humor, and figurative language. Existing general-domain LLMs and LVLMs have difficulty recognizing these elements in such an informal setting.
3. **Complex Cognitive Demands in Social Media Tasks**: Social media tasks often require several advanced cognitive abilities at once, such as detecting hate speech and rewriting the content in the same task. Existing models are weak in this respect, which leads to unsatisfactory outputs.

To address these issues, the paper proposes SoMeLVLM, a Large Vision-Language Model adapted to social media processing through comprehensive supervised fine-tuning. SoMeLVLM is built around a framework with five cognitive levels (knowledge and comprehension, application, analysis, evaluation, and creation), and the authors construct a multimodal instruction-tuning dataset of 654,000 instances to train the model in these abilities.

Experimental results show that SoMeLVLM achieves state-of-the-art performance on multiple social media tasks and has significant advantages over baseline models in terms of cognitive abilities.
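To make the five-level framework more concrete, the sketch below shows one way an instruction-tuning sample tagged with a cognitive level could be represented. It is a minimal illustration and not the authors' actual data format: the names `CognitiveLevel`, `InstructionExample`, and `count_by_level`, as well as all field names, are hypothetical. Only the five level names come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class CognitiveLevel(Enum):
    """The five cognitive levels named in the abstract."""
    KNOWLEDGE_AND_COMPREHENSION = "knowledge & comprehension"
    APPLICATION = "application"
    ANALYSIS = "analysis"
    EVALUATION = "evaluation"
    CREATION = "creation"


@dataclass
class InstructionExample:
    """Hypothetical record layout for one instruction-tuning sample."""
    instruction: str                   # prompt given to the model
    response: str                      # target output
    level: CognitiveLevel              # cognitive level the sample exercises
    image_path: Optional[str] = None   # None for a text-only sample


def count_by_level(examples: List[InstructionExample]) -> Dict[CognitiveLevel, int]:
    """Count samples per cognitive level, including levels with zero samples."""
    counts = Counter(example.level for example in examples)
    return {level: counts.get(level, 0) for level in CognitiveLevel}
```

A helper like `count_by_level` could be used to check how evenly a dataset covers the five levels before fine-tuning. Which level a given task belongs to is a design decision of the dataset and is not specified here.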