Abstract: The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a carefully crafted pipeline, MMSearch-Engine, to equip any LMM with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark to assess the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields, with no overlap with current LMMs' training data, ensuring that the correct answer can only be obtained by searching. Using MMSearch-Engine, the LMMs are evaluated on three individual tasks (requery, rerank, and summarization) and one challenging end-to-end task with a complete search process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, surpassing the commercial product Perplexity Pro in the end-to-end task and demonstrating the effectiveness of our proposed pipeline. We further present an error analysis revealing that current LMMs still struggle to fully grasp multimodal search tasks, and conduct an ablation study indicating the potential of scaling test-time computation for AI search engines. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engines. Project Page: https://mmsearch.github.io

MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Sogou-ST: A New Dataset with Large-scale Refined Real-world Web Search Sessions

Tiangong-ST: A New Dataset With Large-Scale Refined Real-World Web Search Sessions

MIRA: Leveraging Multi-Intention Co-click Information in Web-scale Document Retrieval using Deep Neural Networks

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

Click chain model in web search

Clickage: towards bridging semantic and intent gaps via mining click logs of search engines

Event-driven Real-time Retrieval in Web Search

MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines

SogouQ: The First Large-Scale Test Collection with Click Streams Used in a Shared-Task Evaluation

Efficient Multiple-Click Models in Web Search

Understanding the User: An Intent-Based Ranking Dataset

A Context-Aware Click Model for Web Search

BASES: Large-scale Web Search User Simulation with Large Language Model based Agents

The tale of two MS MARCO -- and their unfair comparisons

CWRCzech: 100M Query-Document Czech Click Dataset and Its Application to Web Relevance Ranking

Optimizing web search using web click-through data

User Behavior Modeling for Better Web Search Ranking

RedStone: Curating General, Code, Math, and QA Data for Large Language Models

CMOSE: Comprehensive Multi-Modality Online Student Engagement Dataset with High-Quality Labels