MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines

Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, Hongsheng Li

2024-09-20

Abstract: The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting the multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet, whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a delicate pipeline, MMSearch-Engine, to empower any LMMs with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark to assess the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields, which involves no overlap with the current LMMs' training data, ensuring the correct answer can only be obtained within searching. By using MMSearch-Engine, the LMMs are evaluated by performing three individual tasks (requery, rerank, and summarization), and one challenging end-to-end task with a complete searching process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, which surpasses the commercial product, Perplexity Pro, in the end-to-end task, demonstrating the effectiveness of our proposed pipeline. We further present error analysis to unveil current LMMs still struggle to fully grasp the multimodal search tasks, and conduct ablation study to indicate the potential of scaling test-time computation for AI search engine. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engine. Project Page: https://mmsearch.github.io

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Information Retrieval

What problem does this paper attempt to address?

This paper addresses the fact that the potential of current large multimodal models (LMMs) in multimodal search has not been fully explored. Existing AI search engines are mostly limited to text-only queries, ignoring both the multimodal queries users may submit (for example, queries combining images and text) and the interleaved nature of text and images in web content. To evaluate how well LMMs can act as multimodal search engines, the researchers designed a benchmark called MMSearch and proposed a multimodal search pipeline, MMSearch-Engine. The MMSearch benchmark contains 300 manually collected instances covering 14 subfields, with no overlap with the training data of current LMMs, so that the correct answers can only be obtained by actually searching. The paper evaluates LMMs on three individual tasks (requery, rerank, and summarization) and on one end-to-end task that covers the complete search process. Experimental results show that GPT-4o equipped with MMSearch-Engine outperforms the commercial product Perplexity Pro on the end-to-end task, demonstrating the effectiveness of the proposed pipeline. However, error analysis also reveals that current LMMs still struggle with multimodal search tasks, and an ablation study points to the potential of scaling test-time computation for AI search engines. Overall, the study offers insights for the future development of multimodal AI search engines.
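To make the relationship between the three individual tasks and the end-to-end task concrete, the following is a minimal sketch of how requery, rerank, and summarization could be chained. It is not the MMSearch-Engine implementation: the names `Query`, `Page`, `run_pipeline`, and all `toy_*` functions are hypothetical, the retrieval step is a stub, and the toy functions stand in for calls to an LMM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Query:
    """A user query; the image is optional (hypothetical structure)."""
    text: str
    image_path: Optional[str] = None


@dataclass
class Page:
    """A retrieved web page (hypothetical structure)."""
    url: str
    content: str


def run_pipeline(
    query: Query,
    requery: Callable[[Query], str],
    search: Callable[[str], List[Page]],
    rerank: Callable[[Query, List[Page]], int],
    summarize: Callable[[Query, Page], str],
) -> str:
    """Chain the three tasks into one end-to-end search (assumed ordering)."""
    search_text = requery(query)               # task 1: reformulate the query
    candidates = search(search_text)           # retrieval, stubbed in this sketch
    if not candidates:
        return ""
    best = rerank(query, candidates)           # task 2: index of the most useful page
    return summarize(query, candidates[best])  # task 3: answer from that page


# Toy stand-ins for the LMM and the search backend (assumptions, not real APIs).
def toy_requery(q: Query) -> str:
    return q.text.lower().strip("?")


def toy_search(text: str) -> List[Page]:
    return [
        Page("https://example.com/a", "unrelated"),
        Page("https://example.com/b", f"notes about {text}"),
    ]


def toy_rerank(q: Query, pages: List[Page]) -> int:
    # Score each page by word overlap with the query and return the best index.
    words = set(q.text.lower().strip("?").split())
    scores = [len(words & set(p.content.split())) for p in pages]
    return scores.index(max(scores))


def toy_summarize(q: Query, page: Page) -> str:
    return f"Based on {page.url}: {page.content}"


answer = run_pipeline(Query("Who won?"), toy_requery, toy_search, toy_rerank, toy_summarize)
```

Passing each stage in as a callable mirrors the evaluation design described above: a single stage can be swapped or scored in isolation, while calling `run_pipeline` exercises the full end-to-end path.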

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

A Survey on Benchmarks of Multimodal Large Language Models

GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

A Survey on Evaluation of Multimodal Large Language Models

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Needle In A Multimodal Haystack

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models