Rich Feature Set, Unification of Bidirectional Parsing and Dictionary Filtering for High F-Score Gene Mention Tagging.
Cheng-Ju Kuo, Yu-Ming Chang, Han-Shen Huang, Kuan-Ting Lin, Bo-Hou Yang, Yu-Shi Lin, Chun-Nan Hsu, I. Chung
Abstract: In the first BioCreative (2004) [3], conditional random fields (CRFs) [5] were employed to tag gene and protein mentions in biomedical text with high performance [8]. Therefore, we chose CRFs as our starting point and carefully selected a rich set of 5,059,368 predicates as the features. To further improve performance, we combined the tagging results of forward and backward parsing [4]. We tried different combination methods, including set operations and Co-Training [1]. However, we found that Co-Training performed poorly. Instead, we selected the best solutions from the "adjacent" ten candidates of bidirectional parsing and then applied dictionary filtering to obtain the best F-score result. Details are given as follows.

We applied MALLET [7] to take advantage of its feature induction capability [6]. Due to the special characteristics of named entities of genes and gene products [10], a rich set of features is required. Not all features proposed in previous work are useful. After hundreds of trials, we carefully selected the predicates shown in Table 1 as our feature set, which includes commonly used orthographic predicates and character n-gram predicates for 2 ≤ n ≤ 4 [8]. We used {−2, −1, 0, 1, 2} as the offsets and evaluated predicates such as word, stemmed word, part-of-speech tag, and word morphology as the contextual features at each position. Our domain-specific features include nucleotides (i.e., types of DNA or RNA), amino acid residues, etc. We excluded the prefix and suffix predicates used in previous work because we found that they usually increase false positives. To extract features, the Genia Tagger [9] was applied for stemming, tokenization, and part-of-speech tagging. We modified the Genia Tagger slightly to tokenize words at a finer granularity. For example, punctuation symbols within words were segmented.
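The feature extraction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, which relied on MALLET and a modified Genia Tagger. The function names, the predicate string format, and the word-shape encoding are all hypothetical, and the sketch assumes that character n-grams are generated only for the current token. Stemmed-word and domain-specific predicates are omitted for brevity.

```python
import re
from typing import List

OFFSETS = (-2, -1, 0, 1, 2)  # context window stated in the text


def split_inner_punctuation(text: str) -> List[str]:
    """Hypothetical fine-grained tokenizer: every punctuation symbol,
    including one inside a word, becomes its own token."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text)


def char_ngrams(token: str, n_min: int = 2, n_max: int = 4) -> List[str]:
    """All character n-grams of `token` for n_min <= n <= n_max."""
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def word_shape(token: str) -> str:
    """Hypothetical morphology predicate: A = upper, a = lower,
    0 = digit, - = anything else."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("A")
        elif ch.islower():
            shape.append("a")
        elif ch.isdigit():
            shape.append("0")
        else:
            shape.append("-")
    return "".join(shape)


def token_predicates(tokens: List[str], pos_tags: List[str], i: int) -> List[str]:
    """Predicates for token i: its character n-grams plus word, POS tag,
    and shape at every in-range offset of the context window."""
    preds = [f"NGRAM={g}" for g in char_ngrams(tokens[i])]
    for off in OFFSETS:
        j = i + off
        if 0 <= j < len(tokens):  # skip offsets that fall outside the sentence
            preds.append(f"W[{off}]={tokens[j].lower()}")
            preds.append(f"POS[{off}]={pos_tags[j]}")
            preds.append(f"SHAPE[{off}]={word_shape(tokens[j])}")
    return preds


if __name__ == "__main__":
    toks = split_inner_punctuation("the p53 gene")
    tags = ["DT", "NN", "NN"]  # POS tags supplied by hand for the example
    for pred in token_predicates(toks, tags, 1):
        print(pred)
```

In this sketch each predicate is a plain string, so every distinct string becomes one binary feature. Counting features this way suggests how a rich predicate set can grow to millions of entries on a large corpus.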
We also applied a rule-based filter to clean up some easily fixed mistakes, such as entities with unpaired parentheses or square brackets. The performance of the CRF models with this feature set and the rule-based filter is given in the first row of Table 2 and is already slightly better than previously reported figures. These inside test results were obtained by randomly selecting 10,000 sentences for training and using the rest for testing, all drawn from the training data set provided by the organizers. To further improve performance, we combined the tagging results of forward and backward parsing. In forward parsing, the tagger reads and tags the input sentences from left to right, while …