Internet Explorer: Targeted Representation Learning on the Open Web

Alexander C. Li, Ellis Brown, Alexei A. Efros, Deepak Pathak

2023-09-07

Abstract: Modern vision models typically rely on fine-tuning general-purpose models pre-trained on large, static datasets. These general-purpose models only capture the knowledge within their pre-training datasets, which are tiny, out-of-date snapshots of the Internet -- where billions of images are uploaded each day. We suggest an alternate approach: rather than hoping our static datasets transfer to our desired tasks after large-scale pre-training, we propose dynamically utilizing the Internet to quickly train a small-scale model that does extremely well on the task at hand. Our approach, called Internet Explorer, explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images were useful, and prioritizing what to search for next. We evaluate Internet Explorer across several datasets and show that it outperforms or matches CLIP oracle performance by using just a single GPU desktop to actively query the Internet for 30--40 hours. Results, visualizations, and videos at https://internet-explorer-ssl.github.io/

Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition, Neural and Evolutionary Computing, Robotics

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper "Internet Explorer: Targeted Representation Learning on the Open Web" addresses how to leverage dynamic data from the internet to improve visual model representations for specific tasks. Traditional deep learning models typically rely on large-scale, static datasets for pre-training, followed by fine-tuning on small-scale datasets for specific tasks. However, these static datasets often fail to capture the rich, continuously updated information available on the internet, leading to suboptimal performance when models encounter new data.

Specifically, the paper proposes a method called **Internet Explorer** to address this issue through the following steps:

1. **Dynamic Utilization of Internet Data**: Unlike traditional static datasets, Internet Explorer views the internet as a dynamic, open data source. It incrementally finds image data relevant to the target task through a self-supervised approach.
2. **Self-Supervised Exploration**: The method issues text queries to search engines, downloads relevant images, and performs self-supervised training to enhance performance on the target dataset.
3. **Continuous Query Optimization**: Internet Explorer continuously evaluates the contribution of downloaded images to the target dataset and adjusts subsequent query strategies based on this feedback, thereby gradually improving the quality of model representations.

Through this approach, the paper aims to overcome the limitations of static datasets and efficiently enhance model performance for specific tasks by leveraging the rich resources available on the internet.
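The search, score, and reprioritize cycle described above can be illustrated with a toy simulation. This is a minimal sketch, not the paper's implementation: the function `explore`, the `relevance` table, the Gaussian noise, and the softmax-over-running-mean sampling rule are all assumptions made for illustration. No web search or model training takes place; a noisy lookup stands in for "download images, train on them, and measure how much they helped the target dataset".

```python
import math
import random


def explore(concepts, relevance, rounds=50, queries_per_round=4,
            temperature=0.1, seed=0):
    """Toy query-prioritization loop (hypothetical, for illustration only).

    concepts:  list of text queries that could be sent to a search engine.
    relevance: dict mapping each concept to a hidden usefulness in [0, 1];
               a stand-in for how much its images help the target task.
    """
    rng = random.Random(seed)
    estimate = {c: 0.0 for c in concepts}  # running mean of observed usefulness
    counts = {c: 0 for c in concepts}      # how often each concept was queried

    def run_query(concept):
        # Stand-in for: search, download, self-supervised update, and scoring.
        reward = relevance[concept] + rng.gauss(0.0, 0.05)
        counts[concept] += 1
        estimate[concept] += (reward - estimate[concept]) / counts[concept]

    # Warm-up: try every concept once so each estimate is grounded in data.
    for concept in concepts:
        run_query(concept)

    for _ in range(rounds):
        # Prioritize: concepts that looked useful are sampled more often.
        weights = [math.exp(estimate[c] / temperature) for c in concepts]
        for concept in rng.choices(concepts, weights=weights,
                                   k=queries_per_round):
            run_query(concept)

    return estimate, counts


if __name__ == "__main__":
    hidden = {"bird": 0.9, "car": 0.1, "chair": 0.2}
    est, counts = explore(list(hidden), hidden)
    print(counts)
```

With a hidden relevance table that favors one concept, the query counts concentrate on that concept after the warm-up pass, which mirrors the feedback idea in step 3. The `temperature` parameter controls how strongly the loop exploits its current estimates versus continuing to sample less promising queries.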

RECL: Responsive Resource-Efficient Continuous Learning for Video Analytics

AI Online Filters to Real World Image Recognition

Selective Visual Representations Improve Convergence and Generalization for Embodied AI

Modeling Human Visual Search Performance on Realistic Webpages Using Analytical and Deep Learning Methods

Enabling the Network to Surf the Internet

INTERN: A New Learning Paradigm Towards General Vision

Open Long-Tailed Recognition In A Dynamic World

Exploiting Web Images for Fine-Grained Visual Recognition by Eliminating Open-Set Noise and Utilizing Hard Examples

Exploring Simple and Transferable Recognition-Aware Image Processing

Effectiveness of Vision Language Models for Open-world Single Image Test Time Adaptation

Vision Learners Meet Web Image-Text Pairs

A High-Throughput Screening Approach to Discovering Good Forms of Biologically Inspired Visual Representation

A Novel Deep Learning-Based Visual Search Engine in Digital Marketing for Tourism E-Commerce Platforms

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

Addressing Sample Inefficiency in Multi-View Representation Learning

Off-policy Imitation Learning from Visual Inputs

Interactive Classification for Deep Learning Interpretation

DeIl: Direct and Inverse CLIP for Open-World Few-Shot Learning

Real-World Robot Learning with Masked Visual Pre-training