Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT

Aman Jaiswal, Evangelos Milios
2023-10-31
Abstract: Transformer-based models, specifically BERT, have propelled research in various NLP tasks. However, these models are limited to a maximum input length of 512 tokens, which makes them non-trivial to apply in practical settings with long input. Various complex methods have claimed to overcome this limit, but recent research questions the efficacy of these models across different classification tasks. These complex architectures, evaluated on carefully curated long datasets, perform on par with or worse than simple baselines. In this work, we propose a relatively simple extension to the vanilla BERT architecture called ChunkBERT that allows finetuning of any pre-trained model to perform inference on arbitrarily long text. The proposed method is based on chunking token representations and CNN layers, making it compatible with any pre-trained BERT. We evaluate ChunkBERT exclusively on a benchmark for comparing long-text classification models across a variety of tasks (including binary classification, multi-class classification, and multi-label classification). A BERT model finetuned using the ChunkBERT method performs consistently across long samples in the benchmark while utilizing only a fraction (6.25%) of the original memory footprint. These findings suggest that efficient finetuning and inference can be achieved through simple modifications to pre-trained BERT models.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses the limitations of the BERT model in long-text classification. Specifically:

1. **Input Length Limitation of the BERT Model**: BERT accepts at most 512 tokens per input, which makes it difficult to process long texts in practice.
2. **Effectiveness of Existing Methods**: Although various complex methods claim to overcome this limit, recent studies show that, across different classification tasks, these models perform on par with or worse than simple baselines (such as truncating long texts).

The paper proposes a relatively simple extension, ChunkBERT, that allows any pre-trained BERT model to be finetuned and run on texts of arbitrary length. ChunkBERT chunks the token representations and processes the chunks with Convolutional Neural Network (CNN) layers.

Evaluated on a benchmark for long-text classification covering binary, multi-class, and multi-label tasks, a BERT model finetuned with ChunkBERT performs consistently across long samples while using only a fraction (6.25%) of the original memory footprint.
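
The chunk-then-convolve idea can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the function names (`chunk_tokens`, `conv1d_max_pool`, `attention_matrix_ratio`), the padding id, and the chunk size of 128 are all assumptions made for illustration, and the BERT encoder that would turn each chunk into vectors is left out. The last function shows one reading that is numerically consistent with the 6.25% figure, assuming memory is dominated by the quadratic self-attention score matrix; the summary above does not state how that figure is derived.

```python
from typing import List, Sequence


def chunk_tokens(token_ids: Sequence[int], chunk_size: int, pad_id: int = 0) -> List[List[int]]:
    """Split token ids into consecutive fixed-size chunks, padding the last one.

    Hypothetical helper: each chunk would then be encoded separately by BERT.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for start in range(0, len(token_ids), chunk_size):
        chunk = list(token_ids[start:start + chunk_size])
        chunk += [pad_id] * (chunk_size - len(chunk))  # pad the final chunk
        chunks.append(chunk)
    return chunks


def conv1d_max_pool(vectors: Sequence[Sequence[float]], kernel: Sequence[Sequence[float]]) -> float:
    """Apply one 1D convolution filter over a sequence of vectors, then max-pool.

    `vectors` holds T vectors of dimension D (stand-ins for encoder outputs);
    `kernel` holds K weight rows of dimension D. Max-pooling over positions
    yields a single feature no matter how long the sequence is, which is what
    lets a fixed-size classifier sit on top of arbitrarily long input.
    """
    width = len(kernel)
    if len(vectors) < width:
        raise ValueError("sequence is shorter than the kernel width")
    scores = []
    for t in range(len(vectors) - width + 1):
        score = sum(
            w * x
            for k in range(width)
            for w, x in zip(kernel[k], vectors[t + k])
        )
        scores.append(score)
    return max(scores)


def attention_matrix_ratio(chunk_size: int, full_size: int = 512) -> float:
    """Ratio of per-sequence attention-score entries: chunk_size^2 / full_size^2.

    Assumption: memory scales with the square of the sequence length.
    """
    return (chunk_size / full_size) ** 2


# A 1300-token document becomes three 512-token chunks (the last one padded).
chunks = chunk_tokens(list(range(1, 1301)), chunk_size=512)

# Under the quadratic-attention assumption, a 128-token chunk needs 1/16 of
# the attention-score entries of a 512-token input.
ratio = attention_matrix_ratio(128)
```

Because the pooled feature has a fixed size, the same classifier head can be used for a 600-token document and a 6,000-token one; only the number of chunks changes.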