Abstract: Fine-tuning large pre-trained language models (LLMs) on particular datasets is a commonly employed strategy in Natural Language Processing (NLP) classification tasks. However, this approach usually results in a loss of the model's generalizability. In this paper, we present a framework that maintains generalizability and enhances performance on the downstream task by utilizing task-specific context attribution. We show that a linear transformation of the text representation from any transformer model using the task-specific concept operator results in a projection onto the latent concept space, referred to as context attribution in this paper. The task-specific concept operator is optimized during the supervised learning stage via novel loss functions. The proposed framework demonstrates that context attribution of the text representation for each task objective can improve the capacity of the discriminator function and thus achieve better performance on the classification task. Experimental results on three datasets, namely HateXplain, IMDB reviews, and Social Media Attributions, illustrate that the proposed model attains superior accuracy and generalizability. Specifically, for the non-fine-tuned BERT on the HateXplain dataset, we observe an 8% improvement in accuracy and a 10% improvement in F1-score. For the IMDB dataset, the fine-tuned state-of-the-art XLNet is outperformed by 1% in both accuracy and F1-score. Furthermore, in an out-of-domain cross-dataset test, DistilBERT fine-tuned on the IMDB dataset in conjunction with the proposed model improves the F1-score on the HateXplain dataset by 7%. For the Social Media Attributions dataset of YouTube comments, we observe a 5.2% increase in F1-score. The proposed framework is implemented with PyTorch and provided open-source on GitHub.
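
The core idea described in the abstract, a learnable linear map applied to text representations from a frozen transformer, can be illustrated with a minimal PyTorch sketch. This is not the authors' released implementation: the class name `ConceptProjectionClassifier`, the dimensions, the random tensors standing in for transformer embeddings, and the plain cross-entropy loss (used here in place of the paper's novel loss functions, which the abstract does not specify) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ConceptProjectionClassifier(nn.Module):
    """Hypothetical sketch: a learnable linear "concept operator" applied to
    frozen text embeddings, followed by a linear classifier."""

    def __init__(self, embed_dim: int, concept_dim: int, num_classes: int):
        super().__init__()
        # Assumed form of the task-specific concept operator: one linear map.
        self.concept_operator = nn.Linear(embed_dim, concept_dim, bias=False)
        self.classifier = nn.Linear(concept_dim, num_classes)

    def forward(self, embeddings: torch.Tensor):
        # Projection onto the latent concept space ("context attribution").
        attribution = self.concept_operator(embeddings)
        return self.classifier(attribution), attribution


torch.manual_seed(0)

# Stand-in for sentence embeddings from a frozen, non-fine-tuned transformer.
embeddings = torch.randn(8, 768)
labels = torch.randint(0, 2, (8,))

model = ConceptProjectionClassifier(embed_dim=768, concept_dim=64, num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Placeholder objective; the paper's own loss functions are not reproduced here.
loss_fn = nn.CrossEntropyLoss()

# One supervised step: only the operator and classifier receive gradients.
logits, attribution = model(embeddings)
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In this sketch the transformer itself is never updated, which mirrors the abstract's claim that the pre-trained model can stay non-fine-tuned while only the operator and the classifier are trained.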

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.

Transformer-xl: Language modeling with longer-term dependency

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning

Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation

Transformers are Universal In-context Learners

Dependency Transformer Grammars: Integrating Dependency Structures into Transformer Language Models

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers

Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation

X-Transformer: A Machine Translation Model Enhanced by the Self-Attention Mechanism

Empower Your Model with Longer and Better Context Comprehension

On The Adaptation of Unlimiformer for Decoder-Only Transformers

Domain-specific Chinese Transformer-XL Language Model with Part-of-speech Information

Blockwise Parallel Transformer for Large Context Models

Segatron: Segment-Aware Transformer for Language Modeling and Understanding

Longformer: The Long-Document Transformer

Length Generalization of Causal Transformers without Position Encoding

Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers

Breaking Free Transformer Models: Task-specific Context Attribution Promises Improved Generalizability Without Fine-tuning Pre-trained LLMs

Lite Transformer with Long-Short Range Attention