Abstract: Recent research has offered insights into the extraordinary capabilities of Large Multimodal Models (LMMs) in various general vision and language tasks. There is growing interest in how LMMs perform in more specialized domains. Social media content, inherently multimodal, blends text, images, videos, and sometimes audio. Understanding social multimedia content remains a challenging problem for contemporary machine learning frameworks. In this paper, we explore GPT-4V(ision)'s capabilities for social multimedia analysis. We select five representative tasks, including sentiment analysis, hate speech detection, fake news identification, demographic inference, and political ideology detection, to evaluate GPT-4V. Our investigation begins with a preliminary quantitative analysis for each task using existing benchmark datasets, followed by a careful review of the results and a selection of qualitative samples that illustrate GPT-4V's potential in understanding multimodal social media content. GPT-4V demonstrates remarkable efficacy in these tasks, showcasing strengths such as joint understanding of image-text pairs, contextual and cultural awareness, and extensive commonsense knowledge. Despite the overall impressive capacity of GPT-4V in the social media domain, there remain notable challenges. GPT-4V struggles with tasks involving multilingual social multimedia comprehension and has difficulties in generalizing to the latest trends in social media. Additionally, it exhibits a tendency to generate erroneous information in the context of evolving celebrity and politician knowledge, reflecting the known hallucination problem. The insights gleaned from our findings underscore a promising future for LMMs in enhancing our comprehension of social media content and its users through the analysis of multimodal information.

Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation

An Early Evaluation of GPT-4V(ision)

GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks

A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging

Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?

A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis

Exploring Recommendation Capabilities of GPT-4V(ision): A Preliminary Case Study

Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis

Exploring Visual Culture Awareness in GPT-4V: A Comprehensive Probing

Holistic Evaluation of GPT-4V for Biomedical Imaging

GPT-4V(ision) as A Social Media Analysis Engine

On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications

Map Reading and Analysis with GPT-4V(ision)

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

From Text to Image: Exploring GPT-4Vision's Potential in Advanced Radiological Analysis across Subspecialties

Notes on Applicability of GPT-4 to Document Understanding

Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise