Directional anomaly detection

Oliver Urs Lenz, Matthijs van Leeuwen
2024-10-31
Abstract: Semi-supervised anomaly detection is based on the principle that potential anomalies are those records that look different from normal training data. However, in some cases we are specifically interested in anomalies that correspond to high attribute values (or low, but not both). We present two asymmetrical distance measures that take this directionality into account: ramp distance and signed distance. Through experiments on synthetic and real-life datasets we show that ramp distance performs as well or better than the absolute distance traditionally used in anomaly detection. While signed distance also performs well on synthetic data, it performs substantially poorer on real-life datasets. We argue that this reflects the fact that in practice, good scores on some attributes should not be allowed to compensate for bad scores on others.
Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to solve is: in semi-supervised anomaly detection, how can domain knowledge be used to handle attributes with directionality (that is, attributes for which only relatively high values, or only relatively low values, should be regarded as anomalous)? Specifically, the authors propose and study two asymmetric distance measures, ramp distance and signed distance, to capture this directionality.

### Problem Background

In traditional semi-supervised anomaly detection, the model is trained only on normal data and attempts to distinguish between normal and abnormal data. However, in some application scenarios, we are only interested in anomalies in a specific direction. For example, in machine fault detection, we may only care about excessive workload, and in medical diagnosis, we may only be interested in high-risk factors rather than low-risk factors or abnormally healthy patients.

### Proposed Solutions

To meet this challenge, the authors propose two new distance measures:

1. **Ramp Distance**:
   \[ d(y, x)=\sum_{j \leq m} d_j(y_j - x_j) \]
   where \(y\) is the test record, \(x\) is a training record, and for each attribute \(j\) the per-attribute measure \(d_j(y_j - x_j)\) is defined as:
   \[ d_j(y_j - x_j)=\max(0, y_j - x_j) \]
   This means that an attribute contributes to the distance only when the test record's value is higher than the training record's value; lower values contribute nothing.
2. **Signed Distance**:
   \[ d_j(y_j - x_j)=y_j - x_j \]
   The signed distance uses the same sum over attributes but takes the raw difference in attribute values, so individual terms (and the total) can be negative. This can be interpreted as low values providing negative evidence and high values providing positive evidence.

### Experimental Results

Through experiments on synthetic and real-life data sets, the authors found that:

- On the synthetic data sets, the signed distance performs slightly better than the ramp distance in some cases.
- On the real-life data sets, the ramp distance performs significantly better than the signed distance, especially when there are multiple risk factors, where unexpectedly low values on some attributes should not compensate for high values on others.

### Conclusions

The authors suggest that in practical applications, if some attributes are known to be risk factors (that is, only high values are meaningful), the ramp distance should be used. In addition, if on a specific data set the absolute distance performs better than the ramp distance, it may be because the directional assumptions do not hold for some attributes, or there is no clear causal relationship linking low values to high risks.

In summary, by introducing directional anomaly detection, this paper provides a more flexible and effective approach to anomaly detection, especially in application scenarios involving risk factors.
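
To make the difference between the measures concrete, the following is a minimal Python sketch of the absolute, ramp and signed distances as defined above. The function names `directional_distance` and `nn_score`, the `kind` parameter and the toy data are hypothetical. Scoring a test record by its distance to the nearest training record is an assumption made here for illustration; the summary does not say which detector the authors pair the distances with.

```python
import numpy as np


def directional_distance(y, x, kind="ramp"):
    """Sum over attributes of d_j(y_j - x_j) for test record y and training record x."""
    diff = np.asarray(y, dtype=float) - np.asarray(x, dtype=float)
    if kind == "absolute":
        per_attribute = np.abs(diff)           # traditional symmetric distance
    elif kind == "ramp":
        per_attribute = np.maximum(0.0, diff)  # only higher test values count
    elif kind == "signed":
        per_attribute = diff                   # lower test values count negatively
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    return float(per_attribute.sum())


def nn_score(y, X_train, kind="ramp"):
    """Anomaly score: distance to the nearest training record (illustrative assumption)."""
    return min(directional_distance(y, x, kind) for x in X_train)


# Toy data (hypothetical): two normal training records, one test record that is
# high on the first attribute and low on the second.
X_train = [[1.0, 1.0], [2.0, 2.0]]
y = [3.0, 0.0]

for kind in ("absolute", "ramp", "signed"):
    print(kind, nn_score(y, X_train, kind))
```

In this toy case the low second attribute pulls the signed score below zero even though the first attribute exceeds every training value, whereas the ramp score stays positive. This is the compensation effect that the summary identifies as a weakness of signed distance on real-life data.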