Abstract: Distance metrics and their nonlinear variants play a crucial role in machine-learning-based solutions to real-world problems. We demonstrated how the Euclidean and cosine distance measures differ not only theoretically but also in a real-world medical application, namely outcome prediction for drug prescriptions. Euclidean distance exhibits favorable properties in local geometry problems, so it can be applied to short-term diseases whose observed outcomes show low variation. For chronic diseases with highly variable outcomes, cosine distance is preferable. These different geometric properties lead to different submanifolds in the original embedded space and hence to different frameworks for optimizing the nonlinear kernel embedding. We first established the geometric properties needed in these frameworks and then used those properties to interpret the differences between the two measures from several perspectives. Our evaluation on large-scale, real-world electronic health records, together with a visualization of the embedding space, empirically validated our approach.
### Problems the paper attempts to solve
This paper aims to solve the problem of distance metric selection in drug prescription efficacy prediction. Specifically, it explores the impact of Euclidean distance and cosine distance on drug prescription efficacy prediction under different disease types (short-term and chronic diseases) and data distribution situations (balanced and unbalanced data).
#### Main problems
1. **Selection of different distance metrics**: In drug prescription efficacy prediction, which distance metric (such as Euclidean distance or cosine distance) should be selected to improve prediction accuracy?
2. **Differences in disease types**: How do the differences in time span and treatment regimens between short-term and chronic diseases affect the selection of distance metrics?
3. **Impact of data distribution**: How does the balance of data (for example, common diseases vs. rare diseases) affect the performance of different distance metrics?
#### Specific objectives
- Propose a unified framework for comparing the performance of Euclidean distance and cosine distance in drug prescription efficacy prediction.
- Evaluate the performance of these distance metrics on large-scale, real-world electronic health record (EHR) datasets, covering common and rare diseases, short-term and chronic diseases.
- Geometrically explain the differences between Euclidean distance and cosine distance and explore their applicability in different application scenarios.
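
The geometric difference between the two metrics can be illustrated with a small sketch. The vectors below are made-up toy values, not EHR embeddings from the paper; they only show that Euclidean distance reacts to magnitude while cosine distance reacts to direction alone.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance; sensitive to both magnitude and direction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; depends only on the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy vectors (illustrative assumptions, not data from the paper).
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]  # same direction as x, ten times the magnitude
z = [3.0, 2.0, 1.0]     # same norm as x, different direction

# Under Euclidean distance, x is far from y and close to z.
print(euclidean_distance(x, y), euclidean_distance(x, z))
# Under cosine distance, x is (numerically) identical to y and farther from z.
print(cosine_distance(x, y), cosine_distance(x, z))
```

The two metrics therefore rank the neighbors of `x` in opposite order, which is one way to see why the choice of metric can matter when records differ strongly in scale.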
### Method overview
To achieve the above objectives, the authors propose a learning framework based on graph kernels, which combines multiple graph kernel methods (such as Weisfeiler-Lehman subtree kernels, temporal topological kernels, vertex histogram kernels) and deep neural networks, and performs distance regularization through contrastive loss. This framework can systematically evaluate and compare the performance of Euclidean distance and cosine distance under different disease types and data distribution situations.
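
As a rough illustration of distance regularization through contrastive loss, the sketch below uses the standard pairwise contrastive formulation with a pluggable distance function. The summary does not give the authors' exact loss, so the formula, the `margin` parameter, and the function names here are assumptions chosen for illustration only.

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def contrastive_loss(u, v, same_outcome, distance, margin=1.0):
    """Standard pairwise contrastive loss (assumed form, not the paper's).

    Pairs with the same outcome are pulled together (loss = d^2);
    pairs with different outcomes are pushed apart until their
    distance reaches `margin` (loss = max(0, margin - d)^2).
    """
    d = distance(u, v)
    if same_outcome:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Hypothetical 2-D embeddings of two patient records.
a = [1.0, 0.0]
b = [0.0, 1.0]

# The same pair is scored under either metric by swapping `distance`.
loss_euclid = contrastive_loss(a, b, same_outcome=False, distance=euclidean_distance)
loss_cosine = contrastive_loss(a, b, same_outcome=False, distance=cosine_distance)
print(loss_euclid, loss_cosine)
```

Passing the distance as an argument keeps the rest of the training objective unchanged, which is the kind of design that makes a like-for-like comparison of the two metrics possible.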
### Experimental results
The experimental results show that for short-term diseases with balanced data, Euclidean distance and cosine distance perform comparably. For short-term diseases with unbalanced data, Euclidean distance performs better. For chronic diseases, especially when the data is unbalanced, cosine distance shows significant advantages.
### Conclusion
This study emphasizes the importance of selecting appropriate distance metrics according to different disease types and data distribution, and provides a systematic evaluation framework for drug prescription efficacy prediction.