GIM: Learning Generalizable Image Matcher From Internet Videos

Xuelun Shen,Zhipeng Cai,Wei Yin,Matthias Müller,Zijun Li,Kaixuan Wang,Xiaozhi Chen,Cheng Wang

2024-02-17

Abstract: Image matching is a fundamental computer vision problem. While learning-based methods achieve state-of-the-art performance on existing benchmarks, they generalize poorly to in-the-wild images. Such methods typically need to train separate models for different scene types and are impractical when the scene type is unknown in advance. One of the underlying problems is the limited scalability of existing data construction pipelines, which limits the diversity of standard image matching datasets. To address this problem, we propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture using internet videos, an abundant and diverse data source. Given an architecture, GIM first trains it on standard domain-specific datasets and then combines it with complementary matching methods to create dense labels on nearby frames of novel videos. These labels are filtered by robust fitting, and then enhanced by propagating them to distant frames. The final model is trained on propagated data with strong augmentations. We also propose ZEB, the first zero-shot evaluation benchmark for image matching. By mixing data from diverse domains, ZEB can thoroughly assess the cross-domain generalization performance of different methods. Applying GIM consistently improves the zero-shot performance of 3 state-of-the-art image matching architectures; with 50 hours of YouTube videos, the relative zero-shot performance improves by 8.4%-18.1%. GIM also enables generalization to extreme cross-domain data such as Bird Eye View (BEV) images of projected 3D point clouds (Fig. 1(c)). More importantly, our single zero-shot model consistently outperforms domain-specific baselines when evaluated on downstream tasks inherent to their respective domains. The video presentation is available at

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the poor generalization of learned image matchers to in-the-wild images. Specifically:

1. **Limitations of Existing Methods**:
   - Current learning-based methods perform well on existing benchmarks but generalize poorly to in-the-wild images.
   - Existing methods typically require separate models for different scene types (e.g., indoor and outdoor), which is impractical when the scene type is unknown in advance.
2. **Limitations of Datasets**:
   - Current data construction pipelines rely on RGB-D scanning or Structure from Motion (SfM) + Multi-View Stereo (MVS), which are inefficient and difficult to scale to large, diverse datasets.
3. **Proposed New Framework**:
   - The paper proposes GIM (Generalizable Image Matcher), a self-training framework that uses internet videos, an abundant and diverse data source, to learn a single generalizable image matcher.
   - GIM first trains the model on standard domain-specific datasets, then generates candidate correspondences on nearby video frames by combining the trained model with complementary image matching methods, removes outliers through robust fitting, and finally strengthens the labels by propagating the surviving correspondences to distant frames. The final model is trained on the propagated data with strong augmentations.
4. **Zero-shot Evaluation Benchmark**:
   - The paper constructs ZEB (Zero-shot Evaluation Benchmark), the first benchmark for evaluating the zero-shot generalization of image matching methods. It mixes data from multiple real-world and simulated domains to assess cross-domain generalization.

With these components, GIM improves the zero-shot performance of three state-of-the-art image matching architectures (SuperGlue, LoFTR, and DKM); with 50 hours of YouTube videos, the relative improvement is 8.4%-18.1%. The single zero-shot model also outperforms domain-specific baselines on downstream tasks in their respective domains.
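The label propagation step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `propagate`, the point-pair data layout, and the 1-pixel linking tolerance `tol` are all assumptions made for illustration. The idea shown is only that a match from frame A to frame B and a match from frame B to frame C can be chained into an A-to-C match when they land on (nearly) the same pixel in the shared frame B.

```python
from math import hypot

# A point is (x, y) in pixels; a match pairs a point in one frame
# with its corresponding point in another frame.
Point = tuple[float, float]
Match = tuple[Point, Point]


def propagate(matches_ab: list[Match], matches_bc: list[Match],
              tol: float = 1.0) -> list[Match]:
    """Chain A->B and B->C matches into A->C matches.

    Two matches are linked when their points in the shared frame B lie
    within `tol` pixels of each other. The tolerance value is an
    assumption for this sketch, not a value taken from the paper.
    """
    chained: list[Match] = []
    for point_a, point_b in matches_ab:
        # Find the B->C match whose B-side point is closest to point_b.
        best = min(
            matches_bc,
            key=lambda m: hypot(m[0][0] - point_b[0], m[0][1] - point_b[1]),
            default=None,
        )
        if best is None:
            continue  # no B->C matches at all
        distance = hypot(best[0][0] - point_b[0], best[0][1] - point_b[1])
        if distance <= tol:
            # Keep the A-side point and the C-side point; drop frame B.
            chained.append((point_a, best[1]))
    return chained


if __name__ == "__main__":
    ab = [((10.0, 10.0), (12.0, 11.0)), ((50.0, 40.0), (55.0, 42.0))]
    bc = [((12.3, 11.2), (15.0, 13.0)), ((80.0, 80.0), (85.0, 83.0))]
    print(propagate(ab, bc))
```

In this toy input, only the first A-to-B match has a B-to-C partner within one pixel, so a single A-to-C correspondence survives. Repeating the chaining over successive frames would carry labels to frames that are further apart, which is where the harder training pairs come from. The brute-force nearest-point search is quadratic; a real pipeline would likely use a spatial index.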

Video object matching across multiple non-overlapping camera views based on multi-feature fusion and incremental learning.

GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization

RGM: A Robust Generalizable Matching Model

OmniGlue: Generalizable Feature Matching with Foundation Model Guidance

Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching

GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding

Learning Image Matching by Simply Watching Video

Video Instance Segmentation Using Graph Matching Transformer

General Object Foundation Model for Images and Videos at Scale

Omni-IML: Towards Unified Image Manipulation Localization

RGM: A Robust Generalist Matching Model.

MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer

Image Matching: An Application-oriented Benchmark

GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching

VideoGLUE: Video General Understanding Evaluation of Foundation Models

GIMS: Image Matching System Based on Adaptive Graph Construction and Graph Neural Network

Cross-Domain Visual Matching via Generalized Similarity Measure and Feature Learning

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

ZIM: Zero-Shot Image Matting for Anything