Unsupervised Traditional Chinese Herb Mention Normalization Via Robustness-Promotion Oriented Self-supervised Training.

Wei Li, Zheng Yang, Yanqiu Shao
DOI: https://doi.org/10.1007/978-981-99-8850-1_42
2024-01-01
Abstract: Herbal prescriptions are a vital aspect of Traditional Chinese Medicine (TCM) treatment. The textual representations of herbs can vary significantly across TCM documents and records. To make this valuable knowledge easier to use in contemporary settings, we propose the task of Traditional Chinese Herb Mention Normalization, which associates herb mentions with standardized modern names. Supervised approaches face the challenge of data sparsity, as they require a substantial amount of labeled data, which is particularly expensive to acquire in the context of TCM. Previous self-alignment methods focus solely on the mentions and names in the gazetteer, overlooking crucial contextual information. Drawing on the observation that mentions often share characters with canonical names and appear in similar contexts with respect to the targeted symptoms and co-occurring herbs, we propose an unsupervised method that focuses on promoting robustness. The model is trained with a self-supervised objective of recovering the original standard herb mentions from perturbed ones, while a pretrained language model captures context information. We argue that the model can develop the alignment ability by making its representations immune to the possible perturbations. To evaluate the effectiveness of the proposed method, we construct a dataset of ancient TCM records and enlist TCM professionals to manually annotate the most prevalent aliases of the herbs. Our method achieves an accuracy of 89.79, which is practicable in real-life scenarios. Extensive analysis further validates the efficacy of the proposed unsupervised method.
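The self-supervised objective described in the abstract relies on pairs of perturbed and original standard herb names. The abstract does not say which perturbations the authors use, so the following is only a minimal sketch of how such pairs might be generated, assuming simple character-level edits (delete, substitute, swap). The function names `perturb` and `make_pairs`, the choice of edits, and the three example herb names are illustrative assumptions, not the paper's implementation; the context-aware model itself is not shown.

```python
import random


def perturb(name: str, rng: random.Random, alphabet: str) -> str:
    """Apply one random character-level edit (assumed perturbation set)."""
    chars = list(name)
    if not chars:
        return name
    # Deleting or swapping only makes sense for names with 2+ characters.
    ops = ["substitute"]
    if len(chars) > 1:
        ops += ["delete", "swap"]
    op = rng.choice(ops)
    i = rng.randrange(len(chars))
    if op == "delete":
        del chars[i]
    elif op == "substitute":
        # Replace with a character drawn from the names' own alphabet.
        chars[i] = rng.choice(alphabet)
    else:
        # Swap with a neighbouring character.
        j = i + 1 if i + 1 < len(chars) else i - 1
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def make_pairs(names, n_per_name=3, seed=0):
    """Build (perturbed mention, standard name) training pairs."""
    rng = random.Random(seed)
    alphabet = "".join(sorted(set("".join(names))))
    return [(perturb(n, rng, alphabet), n)
            for n in names
            for _ in range(n_per_name)]


if __name__ == "__main__":
    # Illustrative standard names only; not taken from the paper's dataset.
    standard_names = ["甘草", "当归", "黄芪"]
    for noisy, target in make_pairs(standard_names, n_per_name=2):
        print(noisy, "->", target)
```

Under these assumptions, a model trained to map each perturbed string back to its target is pushed toward representations that tolerate small surface variations, which is the intuition the abstract gives for how alignment ability emerges.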