Long Text Classification with Segmentation

Ekaterina Mayatskaya
DOI: https://doi.org/10.1109/USBEREIT61901.2024.10583985
2024-05-13
Abstract: In text document classification, researchers often deal with long documents, which are difficult to process efficiently. BERT and similar transformers are limited in the length of input they can accept, so long sequences such as full documents cannot be fed to them directly. This paper addresses the problem of handling long texts that arises when working with transformer models. Most often, researchers truncate the text when dealing with long documents. In this paper, we instead segment the input text into several fragments and feed each of them into the base model. We then merge the resulting vector representations of the text chunks. To handle long vectors efficiently, we group the texts into batches by sequence length and create a mask before feeding the data into the recurrent neural network. In a performance analysis, our method achieved competitive results on two medical datasets while significantly reducing model training time compared to baseline methods.
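The segmentation, length-based batching, and masking steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the chunk length of 512 tokens, and the batch size are all assumptions. The base-model encoding and the recurrent network are omitted, so the sketch only shows how documents might be split into fragments, grouped by fragment count, and masked.

```python
from typing import List


def segment(tokens: List[int], chunk_len: int) -> List[List[int]]:
    """Split a token sequence into consecutive fragments of at most chunk_len tokens."""
    return [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]


def bucket_by_length(docs: List[List[List[int]]], batch_size: int) -> List[List[int]]:
    """Group document indices into batches of documents with similar chunk counts."""
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def chunk_mask(chunk_counts: List[int]) -> List[List[int]]:
    """Build a mask per document: 1 for a real chunk position, 0 for padding."""
    longest = max(chunk_counts)
    return [[1] * n + [0] * (longest - n) for n in chunk_counts]


# Hypothetical token-id sequences of different lengths (stand-ins for tokenized documents).
docs_tokens = [list(range(1200)), list(range(300)), list(range(700)), list(range(512))]

# Assumed chunk length of 512 tokens; the abstract does not state the value used.
chunked = [segment(tokens, 512) for tokens in docs_tokens]

# Documents with similar numbers of chunks end up in the same batch,
# which keeps the amount of padding per batch small.
batches = bucket_by_length(chunked, batch_size=2)

# One mask per batch, marking which chunk positions hold real data.
masks = [chunk_mask([len(chunked[i]) for i in batch]) for batch in batches]
```

In a full pipeline, each fragment would be encoded by the base model, and the per-document sequence of fragment vectors would be padded to the batch maximum and passed to the recurrent network together with the mask.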
Computer Science