Abstract: Medicinal chemistry patents contain rich information about chemical compounds. Although much effort has been devoted to extracting chemical entities from scientific literature, few patent mining systems are publicly available, probably because large manually annotated corpora are lacking. To accelerate the development of information extraction systems for medicinal chemistry patents, the 2015 BioCreative V challenge organized a track on Chemical and Drug Named Entity Recognition from patent text (CHEMDNER patents). This track included three individual subtasks: (i) Chemical Entity Mention Recognition in Patents (CEMP), (ii) Chemical Passage Detection (CPD) and (iii) Gene and Protein Related Object task (GPRO). We participated in the CEMP and CPD subtasks using machine learning-based systems. These systems employed conditional random fields (CRFs) and structured support vector machines (SSVMs). To improve the performance of the NER systems, two strategies were proposed for feature engineering: (i) domain knowledge features drawn from dictionaries, chemical structural patterns and semantic type information present in the context of the candidate chemical and (ii) unsupervised feature learning algorithms that generate word representation features by Brown clustering and a novel binarized word embedding to enhance the generalizability of the system. Further, the system output for the CPD task was generated from the patent titles and abstracts using the chemicals recognized in the CEMP task. The effects of the proposed feature strategies on both machine learning-based systems were investigated.
Our best system achieved the second-best performance among 21 participating teams in CEMP, with a precision of 87.18%, a recall of 90.78% and an F-measure of 88.94%, and was the top-performing system among nine participating teams in CPD, with a sensitivity of 98.60%, a specificity of 87.21%, an accuracy of 94.75%, a Matthews correlation coefficient (MCC) of 88.24%, a precision at full recall (P_full_R) of 66.57% and an area under the precision-recall curve (AUC_PR) of 0.9347. The SSVM-based CEMP systems outperformed the CRF-based CEMP systems when using the same features. Features generated from both the domain knowledge and unsupervised learning algorithms significantly improved the chemical NER task on patents. Database URL: http://database.oxfordjournals.org/content/2016/baw049
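
As a quick consistency check on the reported CEMP scores, the F-measure can be recomputed from the stated precision and recall. The sketch below assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function and variable names are illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is the balanced F1 score.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# CEMP precision and recall as reported in the abstract, in percent.
cemp_precision = 87.18
cemp_recall = 90.78

cemp_f = f_measure(cemp_precision, cemp_recall)
print(f"CEMP F-measure: {cemp_f:.2f}%")  # CEMP F-measure: 88.94%
```

Under that assumption, the recomputed value agrees with the reported F-measure of 88.94%.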

A co-training based method for Chinese patent semantic annotation.

Incremental Patent Semantic Annotation Based On Keyword Extraction And List Extraction

An Ontology-Based Automatic Semantic Annotation Approach for Patent Document Retrieval in Product Innovation Design

A Deep Learning Based Method for Extracting Semantic Information from Patent Documents

Exploiting Semantic Knowledge Base for Patent Retrieval

Automatic Abstraction of Long Chinese Patent Texts Based on P-Bertsum Model

The patent mining analysis method based on Chinese word segmentation

A Semantic Query Expansion-Based Patent Retrieval Approach

A Deep Learning Based Method Benefiting from Characteristics of Patents for Semantic Relation Classification

An Automatic Generation Method of Patent Specification Abstract Based on "Extraction-Abstraction" Model

Towards Accurate Word Segmentation for Chinese Patents

Knowledge Powered Cooperative Semantic Fusion for Patent Classification

Sentence-Ranking-Enhanced Keywords Extraction from Chinese Patents

A Patent Keyword Extraction Method Based on Corpus Classification

Chinese Patent Mining Based on Sememe Statistics and Key-Phrase Extraction

An Automatic Information Extraction Method Based on the Characteristics of Patent

Hierarchical multi-instance multi-label learning for Chinese patent text classification

Automatic summarization of long text of Chinese patents based on PatBertsum model

PatSTEG: Modeling Formation Dynamics of Patent Citation Networks via The Semantic-Topological Evolutionary Graph

Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning

A Neural Network Approach to Chemical and Gene/protein Entity Recognition in Patents