HILL: Hierarchy-aware Information Lossless Contrastive Learning for Hierarchical Text Classification

He Zhu, Junran Wu, Ruomei Liu, Yue Hou, Ze Yuan, Shangzhe Li, Yicheng Pan, Ke Xu
2024-03-26
Abstract: Existing self-supervised methods in natural language processing (NLP), especially for hierarchical text classification (HTC), mainly focus on self-supervised contrastive learning and rely heavily on human-designed augmentation rules to generate contrastive samples, which can corrupt or distort the original information. In this paper, we investigate the feasibility of a contrastive learning scheme in which the semantic and syntactic information inherent in the input sample is adequately preserved in the contrastive samples and fused during the learning process. Specifically, we propose an information lossless contrastive learning strategy for HTC, namely Hierarchy-aware Information Lossless contrastive Learning (HILL), which consists of a text encoder representing the input document and a structure encoder directly generating the positive sample. The structure encoder takes the document embedding as input, extracts the essential syntactic information inherent in the label hierarchy with the principle of structural entropy minimization, and injects the syntactic information into the text representation via hierarchical representation learning. Experiments on three common datasets are conducted to verify the superiority of HILL.
Computation and Language, Information Theory
What problem does this paper attempt to address?
This paper proposes HILL (Hierarchy-aware Information Lossless Contrastive Learning), a method for hierarchical text classification. Existing self-supervised methods, especially contrastive learning methods, rely mainly on human-designed data augmentation rules to generate contrastive samples, and these rules can corrupt or distort the original information. HILL aims to preserve the semantic and syntactic information of the input sample and to fuse that information during learning.

HILL consists of a text encoder that represents the input document and a structure encoder that directly generates the positive sample. The structure encoder takes the document embedding as input, extracts the essential syntactic information in the label hierarchy using the principle of structural entropy minimization, and injects this syntactic information into the text representation through hierarchical representation learning. Experiments on three common datasets are reported to validate the superiority of HILL.

The main contributions of the paper are:
1. An algorithm based on structural entropy that decodes the essential information of the label hierarchy in a lossless manner, supporting semantic analysis for hierarchical text classification.
2. The HILL framework, which incorporates syntactic information from the label hierarchy into document embeddings while preserving as much of the semantic information in the input documents as possible.
3. A definition of information lossless learning, together with a proof that HILL retains more information than any data augmentation-based method.
4. Significant performance improvements over other contrastive learning and supervised learning methods on three datasets.

In summary, HILL improves contrastive learning for hierarchical text classification by making better use of structural information while preserving the semantic information of the text.
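To make the pairing of text encoder and structure encoder more concrete, the following is a minimal sketch of a contrastive objective in which each document embedding is pulled toward its own structure-encoder output and pushed away from the outputs for other documents in the batch. It is an illustration under stated assumptions, not the paper's implementation: the InfoNCE form, the cosine similarity, the `temperature` value, and the names `cosine` and `info_nce` are all assumptions, and the embeddings are toy vectors rather than outputs of real encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_embs, struct_embs, temperature=0.1):
    """Mean InfoNCE loss over a batch (hypothetical helper, not from the paper).

    text_embs[i] is the text-encoder embedding of document i.
    struct_embs[i] stands in for the structure-encoder output for the same
    document and is treated as its positive sample; struct_embs[j], j != i,
    are treated as negatives.
    """
    n = len(text_embs)
    total = 0.0
    for i in range(n):
        logits = [
            cosine(text_embs[i], struct_embs[j]) / temperature for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true positive.
        total += log_denom - logits[i]
    return total / n


# Toy batch: two documents with 2-dimensional embeddings.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]     # positives close to their documents
misaligned = [[0.1, 1.0], [1.0, 0.1]]  # positives swapped between documents

loss_aligned = info_nce(text, aligned)
loss_misaligned = info_nce(text, misaligned)
```

In this toy batch the loss is lower when each positive sample is close to its own document than when the positives are swapped, which is the behavior a contrastive objective is meant to reward. Because the positive here comes from a second encoder applied to the same document rather than from a perturbed copy of the text, no augmentation rule is needed to build the pair.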