Abstract: Patent classification is a necessary step in processing patent data efficiently and giving users convenient access to information. To address the present inefficiency of patent classification, many algorithms and deep learning-based techniques have been developed. However, few studies have examined the impacts of preprocessing, word embedding, and data fields on patent classification. In this study, we examined three different scenarios to evaluate and analyze how generalizing words via stemming affects classification performance, considering the characteristics of patent data. We conducted comparative experiments between pre-trained word embedding models and embedding models trained on a newly created patent dataset. Detailed descriptions of the preprocessing and word embedding techniques are provided. We found that the continuous bag-of-words (CBoW) embedding model trained on the patent dataset best reflected the words contained in the patent documents, and that the hierarchical International Patent Classification (IPC), which is used in more than 100 countries, had the biggest impact on the classification performance. Furthermore, we investigated the relationship between the number of embedded words and the classification performance. Finally, we performed classification experiments using different data fields and classification models. Incorporating the IPC substantially enhanced the classification performance, and a classification model that considered the relationship between labels and words achieved high classification accuracy. We compared all models using the most commonly used metrics, P@N and NDCG@N (precision and normalized discounted cumulative gain at the top N predictions). With the best-performing configuration identified in these experiments, two simple ensembles of LAHA models achieved P@1 = 71.896%, P@3 = 36.697%, and P@5 = 24.301%.
We provide an in-depth investigation into patent classification methods that elucidates the effects of various parameters on the patent classification process. The results of this study will serve to improve the efficiency of patent research and classification tasks.
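For readers unfamiliar with the reported metrics, the following is a minimal sketch of how P@N and NDCG@N are commonly computed for a single document in multi-label classification. It assumes binary relevance and the usual log2 rank discount; the function names, the example label codes, and the scores are illustrative assumptions and are not taken from the study.

```python
import math


def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scoring labels that are true labels."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for label in top_k if label in relevant) / k


def ndcg_at_k(scores, relevant, k):
    """DCG of the top-k ranking divided by the best achievable DCG."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    # Rank positions start at 1, so the discount is 1 / log2(position + 1).
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, label in enumerate(top_k)
        if label in relevant
    )
    # An ideal ranking places all true labels (at most k of them) first.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical model scores for one patent and its (assumed) true labels.
scores = {"G06F": 0.9, "H04L": 0.6, "G06N": 0.3, "A61K": 0.1}
relevant = {"G06F", "G06N"}

p_at_1 = precision_at_k(scores, relevant, 1)
p_at_3 = precision_at_k(scores, relevant, 3)
ndcg_at_3 = ndcg_at_k(scores, relevant, 3)
print(p_at_1, p_at_3, ndcg_at_3)
```

Under this definition, a document with a single true label can reach at most P@3 = 1/3 and P@5 = 1/5, so P@N values that fall as N grows do not by themselves indicate worse rankings. Corpus-level figures are typically the mean of these per-document values.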
