Abstract: The growth of social media, characterized by its multimodal nature, has led to the emergence of diverse phenomena and challenges, which calls for an effective approach to uniformly solve automated tasks. The powerful Large Vision Language Models make it possible to handle a variety of tasks simultaneously, but even with carefully designed prompting methods, the general domain models often fall short in aligning with the unique speaking style and context of social media tasks. In this paper, we introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM), which is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation. SoMeLVLM is designed to understand and generate realistic social media behavior. We have developed a 654k multimodal social media instruction-tuning dataset to support our cognitive framework and fine-tune our model. Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks. Further analysis shows its significant advantages over baselines in terms of cognitive abilities.

What problem does this paper attempt to address?

The paper addresses three challenges that Large Vision-Language Models (LVLMs) face when processing social media data, with the goal of improving model performance and adaptability in the social media domain:

1. **Limitations of Multimodal Understanding**: Existing Large Language Models (LLMs) and LVLMs often focus on textual information and neglect other modalities such as images. This does not match how people actually use social media, where a post's intent can often be grasped only by reading image and text together.
2. **Challenges in Understanding Informal Language**: Language on social media is highly informal and varied, including emotional expressions, humor, and figurative language. General-domain LLMs and LVLMs struggle to recognize these elements in such an informal setting.
3. **Complex Cognitive Demands in Social Media Tasks**: Social media tasks often require several advanced cognitive abilities at once, such as detecting hate speech and rewriting the content in the same pass. Existing models lack sufficient capability here, which leads to unsatisfactory outputs.

To address these issues, the paper proposes SoMeLVLM, a Large Vision-Language Model adapted to social media processing through comprehensive supervised fine-tuning. SoMeLVLM establishes a framework with five cognitive levels (knowledge & comprehension, application, analysis, evaluation, and creation) and constructs a large-scale multimodal instruction-tuning dataset of 654,000 instances to train the model in these abilities.

Experimental results show that SoMeLVLM achieves state-of-the-art performance on multiple social media tasks and outperforms baselines across the different cognitive levels.
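
To make the idea of an instruction-tuning dataset organized by cognitive level more concrete, here is a minimal sketch of how one record might be represented. Only the five level names come from the paper. The class names, field names, file path, and the three sample records are hypothetical placeholders for illustration and do not reflect the actual SoMeLVLM data format.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CognitiveLevel(Enum):
    """The five cognitive levels named in the abstract."""
    KNOWLEDGE_AND_COMPREHENSION = "knowledge & comprehension"
    APPLICATION = "application"
    ANALYSIS = "analysis"
    EVALUATION = "evaluation"
    CREATION = "creation"


@dataclass
class InstructionInstance:
    """Hypothetical schema for one instruction-tuning record (an assumption)."""
    instruction: str
    response: str
    level: CognitiveLevel
    image_path: Optional[str] = None  # None marks a text-only record


def count_by_level(instances):
    """Return the number of records per cognitive level, including zeros."""
    counts = Counter(inst.level for inst in instances)
    return {level: counts.get(level, 0) for level in CognitiveLevel}


# Made-up placeholder records, not taken from the real dataset.
examples = [
    InstructionInstance(
        instruction="Does this post contain hate speech? Answer yes or no.",
        response="no",
        level=CognitiveLevel.EVALUATION,
        image_path="images/post_001.jpg",  # hypothetical path
    ),
    InstructionInstance(
        instruction="Rewrite the caption so that it is no longer offensive.",
        response="A friendlier caption.",
        level=CognitiveLevel.CREATION,
    ),
    InstructionInstance(
        instruction="What emotion does the image and text pair express?",
        response="joy",
        level=CognitiveLevel.ANALYSIS,
        image_path="images/post_002.jpg",  # hypothetical path
    ),
]

level_counts = count_by_level(examples)
```

Tagging each record with its level in this way would let an evaluation script report results per cognitive ability, which is the kind of breakdown the summary describes.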

SoMeLVLM: A Large Vision Language Model for Social Media Processing

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms

VideoLLM-online: Online Video Large Language Model for Streaming Video

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

DeepSeek-VL: Towards Real-World Vision-Language Understanding

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

Video Understanding with Large Language Models: A Survey

OCC-MLLM-Alpha: Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning

A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks

HumanVLM: Foundation for Human-Scene Vision-Language Model

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

Large Language Models for Social Networks: Applications, Challenges, and Solutions

An Introduction to Vision-Language Modeling

Improving Visual Storytelling with Multimodal Large Language Models

LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation

Multimodal Large Language Models: A Survey

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A Survey