Abstract

Purpose: Many science, technology and innovation (STI) resources carry several different labels. To assign such labels automatically to an instance of interest, the literature has proposed many multi-label classification approaches that perform well on benchmark datasets, and several open-source tools implementing these approaches have been developed. However, the characteristics of real-world multi-label patent and publication datasets do not fully match those of the benchmark datasets. The main purpose of this paper is therefore to evaluate seven multi-label classification methods comprehensively on real-world datasets.

Design/methodology/approach: Three real-world datasets (Biological-Sciences, Health-Sciences, and USPTO) are constructed from SciGraph and the USPTO database. Seven multi-label classification methods with tuned parameters (dependency-LDA, MLkNN, LabelPowerset, RAkEL, TextCNN, TextRNN, and TextRCNN) are compared on these three datasets. Performance is evaluated with three classification-based metrics: Macro-F1, Micro-F1, and Hamming Loss.

Findings: The TextCNN and TextRCNN models show clear superiority in terms of Macro-F1, Micro-F1 and Hamming Loss on small-scale datasets with a more complex hierarchical label structure and a more balanced document-label distribution. The MLkNN method works better on the larger-scale dataset with a more unbalanced document-label distribution.

Research limitations: The three real-world datasets differ in the following aspects: statement, data quality, and purposes. In addition, open-source tools for multi-label classification differ intrinsically in how they handle data processing and feature selection, which in turn affects the performance of a multi-label classification approach. In the near future, we will improve experimental precision and strengthen the validity of the conclusions by controlling variables more rigorously through expanded parameter settings.

Practical implications: The Macro-F1 and Micro-F1 scores observed on real-world datasets typically fall short of those achieved on benchmark datasets, underscoring the complexity of real-world multi-label classification tasks. Deep learning approaches offer promising solutions by accommodating the hierarchical relationships and interdependencies among labels. With ongoing improvements in deep learning algorithms and large-scale models, the efficacy of multi-label classification is expected to improve significantly and reach a level of practical utility in the foreseeable future.

Originality/value: (1) Seven multi-label classification methods are comprehensively compared on three real-world datasets. (2) The TextCNN and TextRCNN models perform better on small-scale datasets with a more complex hierarchical label structure and a more balanced document-label distribution. (3) The MLkNN method works better on the larger-scale dataset with a more unbalanced document-label distribution.
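
The three evaluation metrics named in the abstract (Macro-F1, Micro-F1, and Hamming Loss) can be illustrated with a minimal Python sketch. The toy label matrices and the helper names `f1` and `multilabel_metrics` are assumptions made for illustration; they are not the paper's data or code. The sketch also assumes the common convention that a label with no true or predicted positives contributes an F1 of 0.

```python
# Illustrative sketch only: toy labels, not data or code from the paper.

def f1(tp, fp, fn):
    """F1 from raw counts; assumed to be 0.0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true, y_pred):
    """Return (macro_f1, micro_f1, hamming_loss).

    y_true, y_pred: lists of 0/1 rows, one row per document,
    one column per label.
    """
    n_docs, n_labels = len(y_true), len(y_true[0])
    counts = []  # per-label (tp, fp, fn)
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Macro-F1: average the per-label F1 scores, so rare labels weigh
    # as much as frequent ones.
    macro_f1 = sum(f1(*c) for c in counts) / n_labels

    # Micro-F1: pool the counts over all labels first, so frequent
    # labels dominate the score.
    tp_all = sum(c[0] for c in counts)
    fp_all = sum(c[1] for c in counts)
    fn_all = sum(c[2] for c in counts)
    micro_f1 = f1(tp_all, fp_all, fn_all)

    # Hamming Loss: fraction of document-label cells predicted wrongly
    # (lower is better).
    hamming_loss = (fp_all + fn_all) / (n_docs * n_labels)
    return macro_f1, micro_f1, hamming_loss


# Hypothetical example: 4 documents, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

macro, micro, hamming = multilabel_metrics(y_true, y_pred)
print(f"Macro-F1={macro:.4f}  Micro-F1={micro:.4f}  Hamming Loss={hamming:.4f}")
```

Because Macro-F1 weights every label equally while Micro-F1 pools counts across labels, the two scores tend to diverge on datasets with an unbalanced document-label distribution, which is one reason to report both.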

Hierarchical multi-instance multi-label learning for Chinese patent text classification

Hierarchical and Bidirectional Joint Multi-Task Classifiers for Natural Language Understanding

Multi-Label Patent Categorization with Non-Local Attention-Based Graph Convolutional Network

A Multi-task Approach to Neural Multi-label Hierarchical Patent Classification Using Transformers

Multi-label classification of legal text based on label embedding and capsule network

Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network Approach

Multi-label Text Classification Model Based on Multi-level Constraint Augmentation and Label Association Attention

MFLSCI: Multi-granularity fusion and label semantic correlation information for multi-label legal text classification

Adaptive Taxonomy Learning and Historical Patterns Modelling for Patent Classification

Reliable Multi-View Deep Patent Classification

Recent Advances in Hierarchical Multi-label Text Classification: A Survey

Hierarchical Multi-Granularity Attention-Based Hybrid Neural Network for Text Classification.

BERT-CNN: a Hierarchical Patent Classifier Based on a Pre-Trained Language Model

Performance evaluation of seven multi-label classification methods on real-world patent and publication datasets

Hierarchical Taxonomy-Aware and Attentional Graph Capsule RCNNs for Large-Scale Multi-Label Text Classification

HAIN: Multi-label Classification with Hierarchical Attention-based Interaction Network for Multi-turn Dialogue Texts

Hierarchical Multilabel Text Classification Via Multitask Learning.

Multi-label Classification and Interactive NLP-based Visualization of Electric Vehicle Patent Data

Solution for the EPO CodeFest on Green Plastics: Hierarchical multi-label classification of patents relating to green plastics using deep learning

Hierarchical Inter-Attention Network for Document Classification with Multi-Task Learning.