Soumyadeep Roy, Sudip Chakraborty, Aishik Mandal, Gunjan Balde, Prakhar Sharma, Anandhavelu Natarajan, Megha Khosla, Shamik Sural, Niloy Ganguly
Abstract: Online medical forums have become a predominant platform for answering health-related information needs of consumers. However, with a significant rise in the number of queries and the limited availability of experts, it is necessary to automatically classify medical queries based on a consumer's intention, so that these questions may be directed to the right set of medical experts. Here, we develop a novel medical knowledge-aware BERT-based model (MedBERT) that explicitly gives more weightage to medical concept-bearing words, and utilize domain-specific side information obtained from a popular medical knowledge base. We also contribute a multi-label dataset for the Medical Forum Question Classification (MFQC) task. MedBERT achieves state-of-the-art performance on two benchmark datasets and performs very well in low resource settings.
What problem does this paper attempt to address?
The main problem this paper attempts to solve is the automatic classification of user questions in medical forums, so that they can be directed to the appropriate medical experts. Specifically, the authors develop a new BERT-based model (MedBERT) that identifies the intention behind each forum question more accurately and assigns the question to the correct category.
### Background and Problem Description of the Paper
With the rise of online medical forums, more and more consumers obtain health-related information through these platforms. However, the limited number of medical professionals cannot keep up with the volume of inquiries, so an automated system is required to help classify and route them. This paper focuses on the **Medical Forum Question Classification (MFQC)** task, that is, classifying questions in medical forums according to the intention behind users' posts.
### Main Challenges
1. **Large and Complex Data**: There are a large number of questions in medical forums, and they involve multiple categories of health information needs.
2. **Differences between Professional Terms and Everyday Language**: The vocabulary used by consumers differs from professional medical terminology, which makes it difficult for traditional methods to classify questions accurately.
3. **Limitations of Existing Methods**: Existing medical question classification methods usually rely on hand-designed features or pre-trained word vectors, which leads to a loss of context information and poor generalization on test data.
### Solutions
To solve the above problems, the authors propose a new model, MedBERT, based on a dual-encoder architecture. The main features of this model are as follows:
1. **Using Pre-trained Models to Extract Context Representations**: Use pre-trained language models such as BERT to extract the global context representation of the input text, thereby retaining more context information.
2. **Introducing Medical Domain Knowledge as Auxiliary Information**: Extract medical concept words from the medical knowledge base and assign higher weights to these words, so that the model can better understand the specific terms in the medical field.
3. **Combining Global and Local Context Representations**: Improve the accuracy of classification by fusing the global context representation (considering the context of the entire sentence) and the local context representation (especially emphasizing medical concept words).
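
The three points above can be illustrated with a small sketch. It is not the authors' implementation: the toy 2-d vectors, the `weighted_pool` helper, the `concept_words` set, and the weight value `3.0` are all assumptions made for illustration, and a plain average stands in for the BERT-based global encoder. The sketch only shows the idea of giving medical concept words a higher weight in a local representation and then concatenating it with a global one.

```python
from typing import Dict, List, Set


def weighted_pool(
    tokens: List[str],
    embeddings: Dict[str, List[float]],
    concept_words: Set[str],
    concept_weight: float,
) -> List[float]:
    """Weighted average of token vectors.

    Tokens found in `concept_words` get `concept_weight`; all others get 1.0.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    dim = len(embeddings[tokens[0]])
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        w = concept_weight if tok in concept_words else 1.0
        for i, value in enumerate(embeddings[tok]):
            total[i] += w * value
        weight_sum += w
    return [x / weight_sum for x in total]


# Toy 2-d vectors (made up for illustration, not real BERT outputs).
embeddings = {
    "my": [0.0, 1.0],
    "migraine": [1.0, 0.0],
    "hurts": [0.0, 1.0],
}
tokens = ["my", "migraine", "hurts"]
concept_words = {"migraine"}  # hypothetical result of a knowledge-base lookup

# Global view: every token counts equally (stand-in for a sentence encoder).
v_global = weighted_pool(tokens, embeddings, concept_words, concept_weight=1.0)
# Local view: the medical concept word is up-weighted (assumed weight 3.0).
v_local = weighted_pool(tokens, embeddings, concept_words, concept_weight=3.0)
# Fusion by concatenation, i.e. [v_local; v_global].
fused = v_local + v_global

print(v_local, v_global, fused)
```

In this toy setting the first coordinate, which only "migraine" contributes to, is larger in `v_local` than in `v_global`, which is the intended effect of emphasizing concept-bearing words.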
### Experimental Results
The authors conducted experiments on two benchmark datasets, ICHI and CADEC. The results show that MedBERT achieves state-of-the-art performance on both single-label and multi-label classification tasks. It is particularly strong in the low-resource setting, where it significantly outperforms the baseline models.
### Conclusions
By introducing medical domain knowledge and combining it with the strong context representation ability of pre-trained language models, MedBERT performs very well on the medical forum question classification task. Future work could extend the model to other applications, for example structured prediction tasks such as entity and relation prediction.
### Formula Examples
In the paper, the authors first form a combined representation by concatenating the local and global context vectors:
\[
\mathbf{C} = [\mathbf{v}_{local}; \mathbf{v}_{global}]
\]
where $\mathbf{v}_{local}$ and $\mathbf{v}_{global}$ are the local and global context representation vectors, respectively. The combined representation is then projected into the space of target categories through a fully-connected layer to obtain the classification scores:
\[
\mathbf{C}_{pred} = W \mathbf{C} + b
\]
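Here $W$ and $b$ are the weight matrix and bias of the fully-connected layer. The softmax function is then applied to the scores to obtain the posterior probability distribution over the target categories: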
\[
P(y_i | \mathbf{C}) = \frac{\exp(C_{pred,i})}{\sum_j \exp(C_{pred,j})}
\]
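
As a minimal numeric sketch of the three formulas above (concatenation, linear projection, softmax), the following uses plain Python lists. The vectors, the matrix `W`, and the bias `b` are made-up values chosen only to make the arithmetic easy to follow; they are not taken from the paper.

```python
import math
from typing import List


def concat(v_local: List[float], v_global: List[float]) -> List[float]:
    """C = [v_local; v_global]."""
    return list(v_local) + list(v_global)


def linear(W: List[List[float]], x: List[float], b: List[float]) -> List[float]:
    """C_pred = W x + b, with one row of W per target category."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(scores: List[float]) -> List[float]:
    """P(y_i | C) = exp(s_i) / sum_j exp(s_j), shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative values only (assumed, not from the paper).
v_local = [1.0, 0.0]
v_global = [0.0, 1.0]
W = [
    [2.0, 0.0, 0.0, 0.0],  # category 0
    [0.0, 0.0, 0.0, 1.0],  # category 1
    [0.0, 0.0, 0.0, 0.0],  # category 2
]
b = [0.0, 0.0, 0.5]

C = concat(v_local, v_global)   # 4-d combined representation
C_pred = linear(W, C, b)        # one score per category
probs = softmax(C_pred)         # probabilities that sum to 1

print(C, C_pred, probs)
```

Subtracting the maximum score before exponentiating does not change the result of the softmax; it only avoids overflow for large scores.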