Abstract:

Background: Gene named entity classification and recognition are crucial preliminary steps of text mining in biomedical literature. Machine learning based methods have been used in this area with great success. In most state-of-the-art systems, elaborately designed lexical features, such as words, n-grams, and morphology patterns, play a central part. However, this type of feature tends to cause extreme sparseness in the feature space. As a result, terms that are out-of-vocabulary (OOV) with respect to the training data are not modeled well, because too little information about them is available.

Results: We propose a general framework for gene named entity representation, called feature coupling generalization (FCG). The basic idea is to generate higher-level features from the term frequency and co-occurrence information of highly indicative features in a huge amount of unlabeled data. We examine its performance in a named entity classification task designed to remove non-gene entries from a large dictionary derived from online resources. The results show that the new features generated by FCG outperform lexical features by 5.97 F-score points overall and by 10.85 for OOV terms. In this framework, each extension also yields significant improvements, and the sparse lexical features can be transformed into a representation that is both lower dimensional and more informative. A forward maximum match method based on the refined dictionary produces an F-score of 86.2 on the BioCreative 2 GM test set. We then combine the dictionary with a conditional random field (CRF) based gene mention tagger, achieving an F-score of 89.05, which improves the performance of the CRF-based tagger by 4.46 with little impact on the efficiency of the recognition system. A demo of the NER system is available at http://202.118.75.18:8080/bioner .
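The forward maximum match step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary entries, the whitespace tokenization, the lowercasing, and the `max_len` window are all assumptions chosen for illustration. The idea is to scan left to right and, at each position, accept the longest token sequence found in the dictionary.

```python
def forward_maximum_match(tokens, dictionary, max_len=5):
    """Return (start, end) token spans (end exclusive) of dictionary matches.

    At each position the longest candidate is tried first, so a longer
    entry wins over a shorter entry that is its prefix.
    """
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Longest window first, shrinking down to a single token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in dictionary:
                spans.append((i, i + n))
                i += n  # continue after the matched span
                matched = True
                break
        if not matched:
            i += 1  # no entry starts here; move one token forward
    return spans


# Hypothetical toy dictionary; the paper's refined dictionary is far larger.
gene_dict = {"brca1", "p53", "tumor necrosis factor",
             "tumor necrosis factor alpha"}

tokens = "Expression of tumor necrosis factor alpha and BRCA1 was reduced".split()
spans = forward_maximum_match(tokens, gene_dict)
mentions = [" ".join(tokens[s:e]) for s, e in spans]
print(mentions)
```

Because the longest window is tried first, "tumor necrosis factor alpha" is preferred over its prefix "tumor necrosis factor". A set lookup keeps the sketch short; a trie would avoid rebuilding candidate strings for large dictionaries.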

Combining Multi-Models For Gene Mention Tagging

Integrating divergent models for gene mention tagging

Boosting performance of gene mention tagging system by hybrid methods.

Boosting performance of gene mention tagging system by classifiers ensemble

Combining multiple disambiguation methods for gene mention normalization

Overview of BioCreative II gene mention recognition

Incorporating rich background knowledge for gene named entity classification and recognition

A Multistage Gene Normalization System Integrating Multiple Effective Methods

Gene Mention Normalization Based on Semantic Featured Machine Learning Disambiguation

Improve Image Annotation by Combining Multiple Models

Using Machine Learning to Measure Relatedness Between Genes: A Multi-Features Model.

Modeling Voting for System Combination in Machine Translation

GeneSUM: Large Language Model-based Gene Summary Extraction

CGI-MRE: A Comprehensive Genetic-Inspired Model For Multimodal Relation Extraction

Leveraging a Joint learning Model to Extract Mixture Symptom Mentions from Traditional Chinese Medicine Clinical Notes

Geneverse: A collection of Open-source Multimodal Large Language Models for Genomic and Proteomic Research

A tag based joint extraction model for Chinese medical text

Location-Guided Token Pair Tagger for Joint Biomedical Entity and Relation Extraction.

MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data

Scme: a Dual-Modality Factor Model for Single-Cell Multiomics Embedding

It's Morphing Time: Unleashing the Potential of Multiple LLMs via Multi-objective Optimization