Abstract: Deep networks are frequently tuned to novel tasks and continue learning from ongoing data streams. Such sequential training requires consolidation of new and past information, a challenge predominantly addressed by retaining the most important data points - formally known as coresets. Traditionally, these coresets consist of entire samples, such as images or sentences. However, recent transformer architectures operate on tokens, leading to the famous assertion that an image is worth 16x16 words. Intuitively, not all of these tokens are equally informative or memorable. Going beyond coresets, we thus propose to construct a deeper-level data summary on the level of tokens. Our respectively named core tokensets both select the most informative data points and leverage feature attribution to store only their most relevant features. We demonstrate that core tokensets yield significant performance retention in incremental image classification, open-ended visual question answering, and continual image captioning with significantly reduced memory. In fact, we empirically find that a core tokenset of 1% of the data performs comparably to at least a twice as large and up to 10 times larger coreset.

### What problem does this paper attempt to solve?

This paper addresses how to retain and reuse information from past tasks while a deep learning model is trained sequentially, so that catastrophic forgetting is avoided and memory usage is significantly reduced. Specifically, the authors propose a new data summarization method, **core tokensets**, to address the following challenges:

1. **Information retention in continual learning**: When a deep learning model must keep adapting to new tasks, a key issue is ensuring that it does not forget previously learned knowledge. The traditional solution is to retain the most important data points (a coreset), but such coresets store entire samples and do not exploit the token-level structure of modern transformer architectures.
2. **Reduction of memory usage**: As the amount of data grows, storing large amounts of historical data for replay becomes impractical. Reducing memory usage while maintaining performance is therefore an important research direction.
3. **Characteristics of the Transformer architecture**: Modern Transformer models represent input data as a sequence of tokens (for example, an image is divided into multiple patches). Not all tokens are equally important, so the question becomes how to select the most representative tokens to construct an efficient data summary.

### Main contributions of the paper

- **Introduction of core tokensets**: A new data summarization method that not only selects the most important data instances but also selects the most relevant tokens within those instances.
- **Use of the attention mechanism to select important tokens**: The importance of each token is determined by analyzing the attention weights of each layer of the Transformer, and the core tokens are selected on that basis.
- **Experimental verification**: The effectiveness and memory advantages of core tokensets are demonstrated on multiple tasks (incremental image classification, open-ended visual question answering, and continual image captioning). The experiments show that a core tokenset of 1% of the data performs comparably to a traditional coreset that is at least twice as large and up to 10 times larger.

### Formula summary

- **Definition of Coreset**:
  \[ |\mathrm{cost}(D, Q) - \mathrm{cost}(C_x, Q)| \leq \epsilon \cdot \mathrm{cost}(D, Q) \]
  where \(C_x\) is a subset selected from the dataset \(D\), \(Q\) denotes the model parameters, and \(\epsilon\) is a positive number less than 1.
- **Definition of Core Tokenset**:
  \[ |\mathrm{cost}(T, Q) - \mathrm{cost}(C_t, Q)| \leq \epsilon \cdot \mathrm{cost}(T, Q) \]
  where \(C_t\) is a subset selected from the token set \(T\).
- **Training loss function**:
  \[ \mathcal{L} = \mathbb{E}_{(x, y)}[\ell(f(x), y)] + \mathbb{E}_{(x_c, y) \in C_t}[\ell(f(x_c), y)] \]
  where \(\ell\) is the per-sample loss function, \(f(x)\) is the model prediction, and \(C_t\) is the core tokenset.

With these components, the paper provides a more memory-efficient and more expressive data summarization method that is especially suited to sequential training of modern Transformer architectures.
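The token-selection idea and the two-term training loss summarized above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the per-token relevance scores are assumed to be given (the attribution method itself is not reproduced here), and the names `select_core_tokens`, `replay_loss`, and `keep_ratio` are hypothetical, chosen only for illustration.

```python
import numpy as np


def select_core_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens of a single sample.

    tokens:     (num_tokens, dim) array, e.g. patch embeddings.
    scores:     (num_tokens,) relevance per token. ASSUMPTION: these come
                from some attribution method that is not shown here.
    keep_ratio: fraction of tokens to store, in (0, 1].

    Returns the kept tokens and their indices in the original order,
    so positional information can be restored when replaying.
    """
    num_tokens = tokens.shape[0]
    # Always keep at least one token and never more than are available.
    k = min(num_tokens, max(1, int(round(keep_ratio * num_tokens))))
    # argpartition places the k largest scores in the last k slots.
    top = np.argpartition(scores, -k)[-k:]
    kept = np.sort(top)
    return tokens[kept], kept


def replay_loss(losses_new, losses_memory):
    """Two-term objective: mean loss on the incoming data plus mean loss
    on samples replayed from the stored core tokenset."""
    return float(np.mean(losses_new) + np.mean(losses_memory))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(196, 8))   # 14x14 patches, toy embedding size
    scores = rng.random(196)             # stand-in for attribution scores
    core, idx = select_core_tokens(tokens, scores, keep_ratio=0.25)
    print(core.shape, idx[:5])
    print(replay_loss(np.array([0.7, 0.9]), np.array([0.4])))
```

Returning the sorted indices alongside the tokens is a design choice of this sketch: a replayed sample then carries enough information to assign the correct positional embeddings to the tokens that were kept.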

Core Tokensets for Data-efficient Sequential Training of Transformers

Data-Efficient Training of CNNs and Transformers with Coresets: A Stability Perspective

Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning

An Integrated Transformer with Collaborative Tokens Mining for Fine-Grained Recognition

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

Not All Images Are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length

Vision Transformer with Super Token Sampling

Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot

Transformer with token attention and attribute prediction for image captioning

Towards Sustainable Learning: Coresets for Data-efficient Deep Learning

Remote Sensing Scene Classification via Second-Order Differentiable Token Transformer Network

Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning

Transformer Compressed Sensing via Global Image Tokens

Demystify Transformers & Convolutions in Modern Image Deep Networks

Hybrid Token Transformer for Deep Face Recognition

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Looking Beyond The Top-1: Transformers Determine Top Tokens In Order

A General and Efficient Training for Transformer via Token Expansion

A Spitting Image: Modular Superpixel Tokenization in Vision Transformers