A GAN-BO-XGBoost model for high-quality patents identification

Zengyuan Wu, Jiali Zhao, Ying Li, Zelin Wang, Bin He, Liang Chen
DOI: https://doi.org/10.1038/s41598-024-60173-9
IF: 4.6
2024-04-27
Scientific Reports
Abstract: The number of patents is increasing quickly, and more and more low-quality patents are emerging. Identifying high-quality patents from massive data quickly and accurately is important for organizational R&D decision-making and patent layout. However, because high-quality patents make up only a small percentage of all patents, identifying them efficiently is challenging. To solve this problem, we reconstruct the existing index system for identifying high-quality patents by adding four features describing the technological strength of patentees. Furthermore, we propose an improved model that integrates a resampling technique with an ensemble learning algorithm. First, generative adversarial networks (GAN) are used to expand the minority samples. Second, the Extreme Gradient Boosting algorithm (XGBoost) with Bayesian optimization (BO) is used to identify high-quality patents. For clarity, this model is called the GAN-BO-XGBoost model. To test the effectiveness of the model, we use patent data from the field of lithography technology. Tenfold cross-validation is carried out to compare the performance of our proposed model with that of other models. The results show that the GAN-BO-XGBoost model performs better and is more stable than the other models.
multidisciplinary sciences
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to quickly and accurately identify high-quality patents in a large amount of patent data. With the rapid increase in the number of patents, the proportion of low-quality patents is also rising, which poses a challenge to an organization's R&D decisions and patent layout. Because the proportion of high-quality patents is low, existing methods have difficulty identifying these patents efficiently. For this reason, the paper proposes a model that combines Generative Adversarial Networks (GAN), Bayesian Optimization (BO), and the eXtreme Gradient Boosting algorithm (XGBoost), namely the GAN-BO-XGBoost model, to solve the data imbalance problem and improve the performance of identifying high-quality patents.

Specifically, the paper addresses the problem through the following steps:

1. **Data sample augmentation**: Use GAN to generate minority-class samples to balance the class distribution in the dataset and reduce the classification bias caused by data imbalance.
2. **Model construction and optimization**: Use the XGBoost algorithm to identify high-quality patents, and tune the parameters of XGBoost through Bayesian optimization to obtain the best model performance.
3. **Performance evaluation**: Through ten-fold cross-validation, evaluate the performance differences between the proposed GAN-BO-XGBoost model and other machine-learning models. The results show that this model performs better in terms of accuracy, precision, recall, F1-score, and AUC, and is more stable.

The main contribution of the paper lies in proposing an effective solution that overcomes the shortcomings of existing methods in dealing with imbalanced data, and that provides new ideas and technical support for the rapid and accurate identification of high-quality patents.
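
The evaluation step described above can be illustrated with a minimal, standard-library-only Python sketch of stratified ten-fold cross-validation on imbalanced binary labels. It is not the authors' code. The function names (`stratified_folds`, `binary_metrics`, `cross_validate`) and the `fit_predict` callback are hypothetical, introduced here only for illustration. The sketch assumes that label 1 marks a high-quality patent, and it omits AUC because AUC needs predicted scores rather than hard labels. The `fit_predict` callback is where a resampler (such as a GAN) and a classifier (such as a tuned XGBoost model) would be plugged in. Augmenting only the training fold is a common precaution against leakage, but the summary does not state how the paper orders these steps.

```python
import random
from statistics import mean, stdev


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds that keep roughly the same class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def cross_validate(X, y, fit_predict, k=10, seed=0):
    """Run stratified k-fold CV and return {metric: (mean, std)} over folds.

    fit_predict(X_train, y_train, X_test) must return predicted labels for
    X_test. Any oversampling should happen inside it, on X_train only.
    """
    scores = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        scores.append(binary_metrics([y[i] for i in test_idx], preds))
    return {m: (mean(s[m] for s in scores), stdev(s[m] for s in scores))
            for m in scores[0]}


if __name__ == "__main__":
    # Toy imbalanced data: 10% positives, one made-up feature per sample.
    y = [1] * 10 + [0] * 90
    X = [[float(label)] for label in y]

    def always_negative(X_train, y_train, X_test):
        # Trivial baseline that ignores the minority class entirely.
        return [0 for _ in X_test]

    for metric, (avg, spread) in cross_validate(X, y, always_negative).items():
        print(f"{metric}: mean={avg:.3f} std={spread:.3f}")
```

The all-negative baseline in the usage example shows why accuracy alone can mislead on imbalanced data: it is correct on every majority-class sample yet never finds a high-quality patent, so its recall and F1 are zero. Reporting the per-fold standard deviation alongside the mean is one simple way to quantify the stability that the summary mentions.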