Abstract: Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. While current methods adopt the paradigm of training a dedicated network end-to-end using labeled image data, they are limited in terms of generalizability and interpretability. To address these issues, we first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework, providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then utilize LLMs for text-based reasoning. SocialGPT introduces systematic design principles to adapt VFMs and LLMs separately and bridge their gaps. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, as LLMs can generate language-based explanations for the decisions. The manual prompt design process for LLMs at the reasoning phase is tedious, and an automated prompt optimization method is desired. As we essentially convert a visual classification task into a generative task of LLMs, automatic prompt optimization encounters a unique long prompt optimization issue. To address this issue, we further propose the Greedy Segment Prompt Optimization (GSPO), which performs a greedy search by utilizing gradient information at the segment level. Experimental results show that GSPO significantly improves performance, and our method also generalizes to different image styles. The code is available at https://github.com/Mengzibin/SocialGPT.
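The two-stage design described in the abstract (perception output turned into a textual social story, followed by text-only reasoning with an LLM) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the prompt wording, the `Answer:` convention, and the stubbed `fake_llm` are hypothetical and are not taken from the SocialGPT code base.

```python
from typing import Callable, Dict, List, Tuple

# Example relation categories named in the abstract; the real label set depends on the dataset.
RELATIONS = ["friends", "spouses", "colleagues"]


def build_social_story(scene_caption: str, person_notes: Dict[str, str]) -> str:
    """Fuse hypothetical perception outputs (scene caption, per-person notes) into one text story."""
    lines = [f"Scene: {scene_caption}"]
    for person, note in sorted(person_notes.items()):
        lines.append(f"{person}: {note}")
    return "\n".join(lines)


def build_prompt(story: str, pair: Tuple[str, str], relations: List[str]) -> str:
    """Assemble a text-only reasoning prompt for the LLM (wording is illustrative)."""
    options = ", ".join(relations)
    return (
        f"{story}\n\n"
        f"Question: What is the most likely social relation between {pair[0]} and {pair[1]}?\n"
        f"Choose one of: {options}. Explain your reasoning, then end with 'Answer: <relation>'."
    )


def parse_answer(reply: str, relations: List[str]) -> str:
    """Look for a known label after the last 'Answer:' marker (or in the whole reply if absent)."""
    tail = reply.rsplit("Answer:", 1)[-1].strip().lower()
    for relation in relations:
        if relation in tail:
            return relation
    return "unknown"


def recognize(story: str, pair: Tuple[str, str], llm: Callable[[str], str],
              relations: List[str] = RELATIONS) -> str:
    """Run the reasoning stage: prompt the LLM with the story and parse its label."""
    return parse_answer(llm(build_prompt(story, pair, relations)), relations)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned explanation plus an answer."""
    return "Both wear office badges and share a desk. Answer: colleagues"
```

Because the LLM is passed in as a plain callable, the perception and reasoning stages stay decoupled, and the explanation text that precedes the answer is what makes the prediction inspectable.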

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses two major issues in social relation recognition: **limited generalization ability** and **poor interpretability**.

#### Lack of Generalization Ability

Existing social relation recognition methods typically train a dedicated neural network end-to-end on a labeled dataset. This approach performs well on the dataset it was trained on, but it generalizes poorly to unseen data or to images of different styles.

#### Poor Interpretability

Current methods do not explain their decisions, so users cannot tell why the model made a particular prediction. This black-box behavior limits the credibility and acceptance of these methods in practical applications.

### Solution

To address these issues, the authors propose the **SocialGPT** framework, which combines the capabilities of Vision Foundation Models (VFMs) and Large Language Models (LLMs) in a modular design. Specifically:

1. **Perception Stage**: Use VFMs to convert image content into a textual social story.
2. **Reasoning Stage**: Use LLMs for text-based reasoning and generate interpretable answers.

Additionally, because manual prompt design for the reasoning stage is tedious, the authors propose **Greedy Segment Prompt Optimization (GSPO)**, an algorithm that automatically adjusts the prompt content to improve performance.

### Main Contributions

1. **Modular Framework**: A simple modular framework that combines VFMs and LLMs, providing a strong zero-shot baseline for social relation recognition.
2. **Long Prompt Optimization**: For the long prompt optimization problem that arises when a visual classification task is converted into an LLM generation task, the GSPO algorithm performs a greedy search guided by gradient information at the segment level.
3. **Experimental Results**: Without additional model training, the method achieves competitive and interpretable zero-shot results on two databases, GSPO significantly improves performance, and the method generalizes to different image styles.

### Conclusion

By converting visual information into text and using LLMs for reasoning, SocialGPT improves the generalization ability and interpretability of social relation recognition and provides a strong zero-shot baseline. The GSPO algorithm further optimizes the prompt content, significantly improving performance.
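The idea of a greedy search at the segment level can be sketched in a few lines. This is a simplified illustration, not the authors' GSPO: the paper uses gradient information to guide the search, whereas this sketch swaps that for directly scoring every candidate with a black-box function. The names `greedy_segment_search` and `toy_score`, the candidate lists, and the keyword-based score are all hypothetical.

```python
from typing import Callable, List


def greedy_segment_search(
    segments: List[str],
    candidates: List[List[str]],
    score: Callable[[List[str]], float],
) -> List[str]:
    """Greedily improve a prompt one segment at a time.

    candidates[i] holds replacement options for segments[i]. Each option is
    tried with all other segments held fixed, and it is kept only if it
    strictly raises the score of the whole prompt.
    """
    best = list(segments)
    best_score = score(best)
    for i, options in enumerate(candidates):
        for option in options:
            trial = best[:i] + [option] + best[i + 1:]
            trial_score = score(trial)
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best


def toy_score(prompt_segments: List[str]) -> float:
    """Toy stand-in for validation accuracy: counts useful phrases in the prompt."""
    text = " ".join(prompt_segments).lower()
    return sum(phrase in text for phrase in ("step by step", "relation", "story"))


segments = ["You are an assistant.", "Read the text.", "Give an answer."]
candidates = [
    ["You are an expert in social relation reasoning."],
    ["Read the social story carefully.", "Skim the text."],
    ["Think step by step, then give an answer."],
]
optimized = greedy_segment_search(segments, candidates, toy_score)
```

Working segment by segment keeps the number of evaluations linear in the number of candidates, which is what makes the search tractable for long prompts; in a real setting the score would be measured on held-out labeled examples rather than by keyword matching.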

SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization
