A system for exploring big data: an iterative k-means searchlight for outlier detection on open health data

A. Ravishankar Rao, Daniel Clarke, Subrata Garai, Soumyabrata Dey
DOI: https://doi.org/10.1109/IJCNN.2018.8489448
2023-04-05
Abstract: The interactive exploration of large and evolving datasets is challenging as relationships between underlying variables may not be fully understood. There may be hidden trends and patterns in the data that are worthy of further exploration and analysis. We present a system that methodically explores multiple combinations of variables using a searchlight technique and identifies outliers. An iterative k-means clustering algorithm is applied to features derived through a split-apply-combine paradigm used in the database literature. Outliers are identified as singleton or small clusters. This algorithm is swept across the dataset in a searchlight manner. The dimensions that contain outliers are combined in pairs with other dimensions using a subset scan technique to gain further insight into the outliers. We illustrate this system by analyzing open health care data released by New York State. We apply our iterative k-means searchlight followed by subset scanning. Several anomalous trends in the data are identified, including cost overruns at specific hospitals, and increases in diagnoses such as suicides. These constitute novel findings in the literature, and are of potential use to regulatory agencies, policy makers and concerned citizens.
Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper presents a system for detecting outliers while exploring large datasets. Its contribution can be summarized in four parts:

1. **Data Exploration Challenges**:
   - In large and constantly evolving datasets, users may not fully understand the relationships between the underlying variables, so hidden trends and patterns that merit further analysis can go unnoticed.
2. **System Introduction**:
   - The proposed system methodically explores multiple combinations of variables using a searchlight technique and identifies outliers. An iterative k-means clustering algorithm is applied to features derived through the split-apply-combine paradigm used in the database literature. Outliers are identified as singleton or small clusters.
3. **Algorithm Application**:
   - The algorithm is swept across the dataset in a searchlight manner. Dimensions that contain outliers are then combined in pairs with other dimensions using a subset scan technique to gain further insight into those outliers.
4. **Practical Application**:
   - The system is demonstrated on open health care data released by New York State. The anomalous trends it identifies include cost overruns at specific hospitals and increases in diagnoses such as suicides. These findings are of potential use to regulatory agencies, policymakers, and concerned citizens.

In summary, the goal of this paper is to help users quickly and effectively identify meaningful trends and anomalies in large health care datasets through a new outlier detection system, thereby supporting better policy-making and social oversight.
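
The core loop described above (derive per-group features, cluster them, and treat singleton or small clusters as outliers) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `hospital` and `cost` fields, the toy numbers, the farthest-point initialization, and the stopping rule are all assumptions made for illustration. The sketch also covers only a single variable, whereas the searchlight would repeat it across many variable combinations.

```python
import math
from collections import defaultdict


def split_apply_combine(records, key, value):
    """Split records by `key`, apply a mean to `value`, combine into a dict."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key]].append(row[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def iterative_kmeans_outliers(features, k=2, max_outlier_size=1, max_rounds=10):
    """Repeatedly cluster, peel off small clusters as outliers, and re-cluster.

    `features` maps an entity name to a tuple feature vector. The loop stops
    when a round yields no small cluster (an assumed stopping rule).
    """
    remaining = dict(features)
    outliers = []
    for _ in range(max_rounds):
        if len(remaining) <= k:
            break
        names = list(remaining)
        labels = kmeans([remaining[n] for n in names], k)
        sizes = defaultdict(int)
        for lab in labels:
            sizes[lab] += 1
        flagged = [n for n, lab in zip(names, labels) if sizes[lab] <= max_outlier_size]
        if not flagged:
            break
        outliers.extend(flagged)
        for n in flagged:
            del remaining[n]
    return outliers


# Hypothetical toy data: two cost records per hospital, with "Z" far above the rest.
raw = {"A": [9, 11], "B": [10, 11], "C": [11, 11],
       "D": [13, 15], "E": [14, 15], "F": [15, 15], "Z": [90, 110]}
records = [{"hospital": h, "cost": c} for h, costs in raw.items() for c in costs]

mean_cost = split_apply_combine(records, key="hospital", value="cost")
features = {h: (m,) for h, m in mean_cost.items()}
outliers = iterative_kmeans_outliers(features, k=2)
print(outliers)
```

In this toy run the first round isolates hospital `Z` in its own cluster, and the second round splits the remaining hospitals into two clusters of three, so the loop stops. Results of this kind of peeling are sensitive to `k` and `max_outlier_size`: with a larger `k`, ordinary points can end up in singleton clusters and be flagged too, so these parameters would need tuning on real data.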