Abstract: The core of evidence-based medicine is to read and analyze numerous papers in the medical literature on a specific clinical problem and summarize the authoritative answers to that problem. Currently, to formulate a clear and focused clinical problem, the popular PICO framework is usually adopted, in which each clinical problem is considered to consist of four parts: patient/problem (P), intervention (I), comparison (C) and outcome (O). In this study, we compared several classification models that are commonly used in traditional machine learning. Next, we developed a multitask classification model based on a soft-margin SVM with a specialized feature engineering method that combines 1-2gram analysis with TF-IDF analysis. Finally, we trained and tested several generic models on an open-source data set from BioNLP 2018. The results show that the proposed multitask SVM classification model based on 1-2gram TF-IDF features exhibits the best performance among the tested models.
What problem does this paper attempt to address?
This paper addresses the problem of automatically extracting PICO elements from the abstracts of randomized controlled trials (RCTs). Specifically, the authors use machine learning to identify the four components, patient/problem (P), intervention (I), comparison (C), and outcome (O), in structured abstracts more efficiently, in order to support the research and application of evidence-based medicine (EBM).
### Problem Background
In evidence-based medicine, researchers need to read and analyze a large body of literature to summarize authoritative answers to specific clinical questions. To formulate a clear and focused clinical question, the PICO framework is usually adopted, in which each clinical question is decomposed into four parts: patient/problem (P), intervention (I), comparison (C), and outcome (O). However, because these PICO elements are not explicitly labeled in the structured abstracts of most medical papers, literature retrieval and screening are very time-consuming. Automatically extracting PICO elements from structured abstracts would therefore make evidence-based medicine work considerably more efficient.
### Main Contributions of the Paper
1. **Feature Engineering Method**: Proposes a feature engineering method that combines TF-IDF (term frequency-inverse document frequency) with 1-2gram analysis.
2. **Multitask Classification Model**: Develops a multitask classification model based on a soft-margin SVM (support vector machine) that automatically extracts PICO elements at the sentence level.
3. **Experimental Verification**: Verifies the effectiveness of the 1-2gram features through six groups of controlled experiments and compares their performance with word2vec word embeddings.
4. **Comparison with Other Classic Methods**: Compares the proposed model with random forest (RF), XGBoost, naive Bayes (NB), and long short-term memory (LSTM) networks; the reported results show that the proposed model performs best.
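To make the 1-2gram idea concrete, here is a minimal sketch of how unigram and bigram features might be extracted from one sentence. The function name `extract_1_2grams`, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the paper's actual preprocessing is not described here.

```python
def extract_1_2grams(sentence: str) -> list[str]:
    """Return the unigrams, then the bigrams, of a sentence.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace, which is a simplification chosen for this sketch.
    """
    tokens = sentence.lower().split()
    unigrams = list(tokens)
    # Each bigram joins two adjacent tokens with a single space.
    bigrams = [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]
    return unigrams + bigrams


features = extract_1_2grams("Patients received aspirin daily")
print(features)
```

Each resulting string can then be treated as one term when computing TF-IDF weights, so that short phrases such as "received aspirin" contribute features alongside single words.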
### Key Formulas
- **Term Frequency (TF) Formula**:
\[
tf_{i,j}=\frac{n_{i,j}}{\sum_{k}n_{k,j}}
\]
where \(n_{i,j}\) is the number of times the word \(w_i\) appears in the sentence \(s_j\), and the denominator is the sum of the number of times all words appear in the sentence \(s_j\).
- **Inverse Document Frequency (IDF) Formula**:
\[
idf_i = \log\frac{|D|}{|\{j:w_i\in s_j\}| + 1}
\]
where \(|D|\) is the total number of sentences in the data set, \(|\{j:w_i\in s_j\}|\) is the number of sentences containing the word \(w_i\), and the added 1 prevents the denominator from being zero.
- **TF-IDF Formula**:
\[
tfidf_{i,j}=tf_{i,j}\times idf_i
\]
- **Soft-margin SVM Constraint Conditions**:
\[
y_i(w^T x_i + b)\geq1-\xi_i,\quad i = 1,\ldots,n
\]
where \(x_i\) is the vector representation of sentence \(i\), \(y_i\) is the label of sentence \(i\), and \(\xi_i\geq0\) is a slack variable.
- **Soft-margin SVM Objective Function**:
\[
\min\frac{1}{2}\|w\|^2 + C\sum_{i = 1}^n\xi_i
\]
where \(C>0\) is a penalty parameter that controls the relative weight between the two terms in the objective function.
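The formulas above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names `tf_idf` and `soft_margin_objective`, the toy corpus, and the hand-picked hyperplane are all assumptions. The second function does not train an SVM; it only evaluates the objective for a fixed \(w\) and \(b\), using the smallest slack that satisfies each constraint, \(\xi_i=\max(0,\,1-y_i(w^T x_i+b))\).

```python
import math
from collections import Counter


def tf_idf(sentences):
    """TF-IDF weight of every term in every tokenized sentence."""
    n_sentences = len(sentences)  # |D|
    # Number of sentences that contain each term.
    sentence_freq = Counter()
    for tokens in sentences:
        sentence_freq.update(set(tokens))

    weights = []
    for tokens in sentences:
        counts = Counter(tokens)      # n_{i,j}
        total = sum(counts.values())  # sum over k of n_{k,j}
        weights.append({
            term: (n / total) * math.log(n_sentences / (sentence_freq[term] + 1))
            for term, n in counts.items()
        })
    return weights


def soft_margin_objective(w, b, X, y, C):
    """Objective value and slack variables for a fixed (w, b)."""
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_k * x_k for w_k, x_k in zip(w, x_i)) + b)
        # Smallest xi_i >= 0 with y_i (w^T x_i + b) >= 1 - xi_i.
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(w_k * w_k for w_k in w) + C * sum(slacks)
    return objective, slacks


# Toy corpus of three tokenized sentences (illustrative only).
corpus = [
    ["aspirin", "reduced", "pain"],
    ["placebo", "reduced", "pain"],
    ["patients", "were", "adults"],
]
weights = tf_idf(corpus)
print(weights[0])

# Toy 2-D points with labels in {+1, -1} and a hand-picked hyperplane.
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
y = [1, 1, -1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, X, y, C=1.0)
print(objective, slacks)
```

Two details are worth noticing. With the +1 in the IDF denominator, a term that appears in two of the three sentences ("reduced") gets weight \(\log(3/3)=0\), and a term appearing in every sentence would get a slightly negative weight. In the SVM example, only the second point lies inside the margin, so it alone has a nonzero slack; raising \(C\) makes that violation more costly relative to \(\frac{1}{2}\|w\|^2\).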
According to the reported results, the proposed multitask SVM with 1-2gram TF-IDF features outperforms the other tested models at automatically extracting PICO elements from medical literature, which in turn supports evidence-based medicine research.