Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT

Aman Jaiswal, Evangelos Milios

2023-10-31

Abstract: Transformer-based models, specifically BERT, have propelled research in various NLP tasks. However, these models accept at most 512 input tokens, which makes them non-trivial to apply in practical settings with long inputs. Various complex methods have claimed to overcome this limit, but recent research questions their efficacy across different classification tasks: evaluated on carefully curated long-text datasets, these complex architectures perform on par with or worse than simple baselines. In this work, we propose ChunkBERT, a relatively simple extension to the vanilla BERT architecture that allows fine-tuning any pretrained model to perform inference on arbitrarily long text. The proposed method is based on chunking token representations and CNN layers, making it compatible with any pretrained BERT. We evaluate ChunkBERT exclusively on a benchmark for comparing long-text classification models across a variety of tasks (including binary, multi-class, and multi-label classification). A BERT model fine-tuned using the ChunkBERT method performs consistently across long samples in the benchmark while utilizing only a fraction (6.25%) of the original memory footprint. These findings suggest that efficient fine-tuning and inference can be achieved through simple modifications to pretrained BERT models.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses the limitations of the BERT model on long-text classification tasks. Specifically:

1. **Input Length Limitation of the BERT Model**: BERT accepts at most 512 input tokens, which makes it difficult to process long texts in practice.
2. **Effectiveness of Existing Methods**: Although various complex methods claim to overcome this limitation, recent studies have shown that these models do not necessarily perform better than simple baselines (such as truncating long texts) across different classification tasks.

The paper proposes a relatively simple extension called ChunkBERT, which lets a pretrained BERT model handle text of any length without significantly increasing computational cost. ChunkBERT splits the input into chunks, encodes them with BERT, and processes the resulting token representations with Convolutional Neural Network (CNN) layers. The experiments show that ChunkBERT performs well on multiple benchmark datasets, particularly on complex multi-label classification tasks, while using only about 6.25% of the memory footprint of the original BERT model.
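The chunk-then-convolve idea can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names, the chunk size of 128, the quadratic attention-memory model, and the per-dimension convolution over chunk vectors are all assumptions made for illustration. Under those assumptions, a 128-token chunk needs (128/512)^2 = 6.25% of the attention entries of a 512-token window, which happens to match the reported figure.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], chunk_size: int = 128) -> List[List[int]]:
    """Split a token-id sequence into consecutive chunks of at most chunk_size.

    chunk_size=128 is an illustrative assumption, not a value from the source.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def attention_memory_ratio(chunk_size: int, full_length: int = 512) -> float:
    """Ratio of attention-matrix entries for one chunk versus one full window.

    Assumes attention memory grows quadratically with sequence length.
    """
    return (chunk_size / full_length) ** 2


def conv1d_max_pool(chunk_vectors: List[List[float]], kernel: List[float]) -> List[float]:
    """Slide a 1D kernel across the chunk axis for each feature dimension,
    then max-pool over positions to get one fixed-size vector.

    chunk_vectors stands in for per-chunk encoder outputs (hypothetical here).
    """
    width = len(kernel)
    n_chunks = len(chunk_vectors)
    if n_chunks < width:
        raise ValueError("need at least as many chunks as the kernel width")
    dim = len(chunk_vectors[0])
    pooled = []
    for d in range(dim):
        responses = [
            sum(kernel[j] * chunk_vectors[pos + j][d] for j in range(width))
            for pos in range(n_chunks - width + 1)
        ]
        pooled.append(max(responses))
    return pooled


if __name__ == "__main__":
    chunks = chunk_token_ids(list(range(300)), chunk_size=128)
    print([len(c) for c in chunks])
    print(attention_memory_ratio(128))
    print(conv1d_max_pool([[1.0, 0.0], [2.0, 1.0], [3.0, 5.0]], [1.0, 1.0]))
```

Because the pooled vector has a fixed size regardless of how many chunks the document produces, a classifier head on top of it can accept arbitrarily long inputs. That is the property the summary attributes to ChunkBERT.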

