A system for exploring big data: an iterative k-means searchlight for outlier detection on open health data

A. Ravishankar Rao, Daniel Clarke, Subrata Garai, Soumyabrata Dey

DOI: https://doi.org/10.1109/IJCNN.2018.8489448

2023-04-05

Abstract: The interactive exploration of large and evolving datasets is challenging as relationships between underlying variables may not be fully understood. There may be hidden trends and patterns in the data that are worthy of further exploration and analysis. We present a system that methodically explores multiple combinations of variables using a searchlight technique and identifies outliers. An iterative k-means clustering algorithm is applied to features derived through a split-apply-combine paradigm used in the database literature. Outliers are identified as singleton or small clusters. This algorithm is swept across the dataset in a searchlight manner. The dimensions that contain outliers are combined in pairs with other dimensions using a subset scan technique to gain further insight into the outliers. We illustrate this system by analyzing open health care data released by New York State. We apply our iterative k-means searchlight followed by subset scanning. Several anomalous trends in the data are identified, including cost overruns at specific hospitals, and increases in diagnoses such as suicides. These constitute novel findings in the literature, and are of potential use to regulatory agencies, policy makers and concerned citizens.
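The abstract's idea of treating singleton or small clusters as outliers, and repeating the clustering after removing them, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the farthest-point seeding, the `small` size threshold, the stopping rule, and the sample values are all assumptions made for illustration.

```python
from math import dist

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point seeding (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def iterative_kmeans_outliers(points, k=3, small=1, max_rounds=10):
    """Repeatedly cluster, flag members of clusters with at most `small` points, and remove them."""
    remaining = list(range(len(points)))
    outliers = []
    for _ in range(max_rounds):
        if len(remaining) <= k:
            break
        labels = kmeans([points[i] for i in remaining], k)
        sizes = {j: labels.count(j) for j in set(labels)}
        flagged = [i for i, l in zip(remaining, labels) if sizes[l] <= small]
        if not flagged:
            break  # no singleton or small clusters left
        outliers.extend(flagged)
        remaining = [i for i in remaining if i not in flagged]
    return sorted(outliers)

# Hypothetical per-group feature values (for example, a mean cost per hospital).
points = [(10.0,), (11.0,), (12.0,), (10.5,), (11.5,), (50.0,), (9.5,), (12.5,)]
print(iterative_kmeans_outliers(points))  # prints [5]
```

The choice of `k` and `small` strongly affects what gets flagged, so in practice both would need tuning for the data at hand.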

Machine Learning

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to develop a system for exploring large datasets and detecting outliers in them. It addresses the problem in the following ways:

1. **Data Exploration Challenges**:
   - When faced with large and constantly evolving datasets, users may not fully understand the relationships between the underlying variables, so hidden trends and patterns that are worth further analysis can go unnoticed.
2. **System Introduction**:
   - The proposed system systematically explores multiple combinations of variables using a searchlight technique and identifies outliers. It applies an iterative k-means clustering algorithm to features derived through the split-apply-combine paradigm used in the database literature. Outliers are identified as singleton or small clusters.
3. **Algorithm Application**:
   - The algorithm is swept across the dataset in a searchlight manner. Dimensions that contain outliers are then combined in pairs with other dimensions using a subset scan technique to gain further insight into the outliers.
4. **Practical Application**:
   - The system is demonstrated on open health care data released by New York State. The anomalous trends identified include cost overruns at specific hospitals and increases in diagnoses such as suicides. These findings are of potential use to regulatory agencies, policy makers, and concerned citizens.

In summary, the paper's goal is to help users identify meaningful trends and anomalies in large health care datasets through a new outlier detection system, thereby supporting better policy-making and public oversight.
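The workflow summarized above (split-apply-combine features, a searchlight sweep over dimensions, then a closer look at dimensions with outliers) can be sketched roughly as follows. Everything here is an illustrative assumption rather than the paper's code: the record fields and values are made up, the function names are hypothetical, and a simple median-based rule stands in for the iterative k-means detector to keep the example short.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical discharge records; field names and values are illustrative only.
records = [
    {"hospital": "A", "diagnosis": "X", "cost": 10.0},
    {"hospital": "A", "diagnosis": "Y", "cost": 12.0},
    {"hospital": "B", "diagnosis": "X", "cost": 11.0},
    {"hospital": "B", "diagnosis": "Y", "cost": 13.0},
    {"hospital": "C", "diagnosis": "X", "cost": 10.0},
    {"hospital": "C", "diagnosis": "Y", "cost": 90.0},
    {"hospital": "D", "diagnosis": "X", "cost": 9.0},
    {"hospital": "D", "diagnosis": "Y", "cost": 11.0},
]

def split_apply_combine(rows, key, value):
    """Split rows by `key`, apply the mean to `value`, combine into a dict."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r[value])
    return {g: mean(v) for g, v in groups.items()}

def flag_outliers(features, factor=3.0):
    """Stand-in detector (not k-means): flag groups above factor x the median feature."""
    med = median(features.values())
    return sorted(g for g, x in features.items() if x > factor * med)

def searchlight(rows, dimensions, value):
    """Sweep the detector over each grouping dimension in turn."""
    return {d: flag_outliers(split_apply_combine(rows, d, value)) for d in dimensions}

def subset_scan(rows, hit_dim, hit_value, other_dims, value):
    """Restrict to one flagged group and break it down by each other dimension."""
    subset = [r for r in rows if r[hit_dim] == hit_value]
    return {d: split_apply_combine(subset, d, value) for d in other_dims}

hits = searchlight(records, ["hospital", "diagnosis"], "cost")
print(hits)  # {'hospital': ['C'], 'diagnosis': []}
print(subset_scan(records, "hospital", "C", ["diagnosis"], "cost"))
# {'diagnosis': {'X': 10.0, 'Y': 90.0}}
```

In this toy data the sweep flags hospital C, and pairing that hospital with the diagnosis dimension shows that the excess cost comes from diagnosis Y alone, which is the kind of follow-up insight the pairing step is meant to provide.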
