Abstract: The application of machine learning (ML) techniques to digitized images of biopsied cells for breast cancer diagnosis is an active area of research. We hypothesized that reducing noise in the data would increase classification accuracy. To test this hypothesis, we first compared several classification techniques in their ability to discriminate between malignant and benign breast tumors using the Wisconsin Breast Cancer Data Set, and subsequently evaluated the effect of noise-reduction techniques on model accuracy. We applied two noise-reduction techniques based on Principal Component Analysis (dimensionality reduction and outlier removal) to a comprehensive list of ML algorithms with different learning paradigms, including Decision Trees (fine, medium, coarse), dimensionality reduction techniques (Linear Discriminant Analysis, Quadratic Discriminant Analysis, Partial Least Squares-Discriminant Analysis), Logistic Regression, Bayesian techniques (Gaussian Naive, Kernel Naive), Support Vector Machines (Linear, Quadratic, Cubic, Gaussian), instance-based techniques (fine, medium, coarse, cosine, cubic, and weighted K-Nearest Neighbors), and Artificial Neural Networks. Results showed that noise removal through dimensionality reduction is most effective when the number of principal components is chosen by cross-validation, and that accuracies surpassing 99% are obtained across all ML models when both noise-reduction techniques are applied sequentially. Although such high accuracy has been demonstrated in only a few instances for specific algorithms, this is the first published report demonstrating that a single technique can be applied to a wide range of ML models to achieve high accuracies. We show that dimensionality reduction and outlier analysis are effective approaches for improving discrimination accuracy, and that dimensionality reduction with a cross-validated number of principal components provides an effective framework for reducing noise in the data prior to applying an ML algorithm.
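The two noise-reduction steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the classifier settings, the outlier criterion, or the cross-validation scheme, so the code below assumes a nearest-centroid classifier as a stand-in, 5-fold cross-validation, PCA reconstruction error as the outlier score, a 95% quantile cutoff, and synthetic two-class data in place of the Wisconsin data set. All function names are hypothetical.

```python
import numpy as np

def fit_pca(X, k):
    """Return (mean, components) for the top-k principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def project(X, mean, comps):
    """Scores of X on the retained principal components."""
    return (X - mean) @ comps.T

def nearest_centroid_predict(Z_train, y_train, Z_test):
    """Stand-in classifier (assumption): assign each sample to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    dist = ((Z_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dist, axis=1)]

def cv_accuracy(X, y, k, n_folds=5, seed=0):
    """Mean k-fold accuracy when the classifier sees only the first k PC scores."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    accs = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        # PCA is fitted on the training fold only, to avoid leaking test data.
        mean, comps = fit_pca(X[train], k)
        pred = nearest_centroid_predict(
            project(X[train], mean, comps), y[train], project(X[test], mean, comps)
        )
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs))

def select_n_components(X, y, max_k):
    """Pick the number of PCs with the best cross-validated accuracy."""
    scores = {k: cv_accuracy(X, y, k) for k in range(1, max_k + 1)}
    best = max(scores, key=scores.get)  # on ties, the smallest k wins
    return best, scores

def remove_outliers(X, y, k, quantile=0.95):
    """Drop samples whose PCA reconstruction error exceeds the given quantile
    (assumed criterion; the abstract does not specify one)."""
    mean, comps = fit_pca(X, k)
    residual = (X - mean) - project(X, mean, comps) @ comps
    err = np.sum(residual ** 2, axis=1)
    keep = err <= np.quantile(err, quantile)
    return X[keep], y[keep], keep

# Synthetic stand-in data: two classes in a 2-D latent space, embedded in 10 noisy features.
rng = np.random.default_rng(42)
n, p = 200, 10
y = np.repeat([0, 1], n // 2)
latent = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 3.0, 0.0)
X = latent @ rng.normal(size=(2, p)) + 0.3 * rng.normal(size=(n, p))

# Step 1: choose the number of principal components by cross-validation.
best_k, scores = select_n_components(X, y, max_k=5)
# Step 2: remove outliers in the PCA model built with that many components.
X_clean, y_clean, keep = remove_outliers(X, y, best_k)
```

Any classifier from the abstract's list could replace the nearest-centroid stand-in; the point of the sketch is only the order of operations, with the component count chosen by cross-validation first and outliers removed afterwards.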

An Evaluation of Classification and Outlier Detection Algorithms

Outlier Detection Using Machine Learning Algorithms Integrated with Bayesian Optimization

The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances

Outlier Detection as Instance Selection Method for Feature Selection in Time Series Classification

Experimental Comparison and Survey of Twelve Time Series Anomaly Detection Algorithms

A method for outlier detection based on cluster analysis and visual expert criteria

Ordinal Outlier Detection Based On Recursive Uniform Partitioning

FAST-ODT: A Lightweight Outlier Detection Scheme for Categorical Data Sets.

Outlier Detection Method for Time Series Based on the Rate of Signal Change

Identification of Outlier Patterns in Multivariate Time Series

Understanding Time Series Anomaly State Detection through One-Class Classification

General value functions for fault detection in multivariate time series data

Variance Clustering Based Outlier Identification Algorithm for Time Series Data

An Evaluation of Anomaly Detection and Diagnosis in Multivariate Time Series

A Fast Greedy Algorithm for Outlier Mining

A Benchmark to Select Data Mining Based Classification Algorithms For Business Intelligence And Decision Support Systems

A Gradient-Boosted Decision-Tree Algorithm for the Prediction of Short-Term Mortality in Acute Heart Failure Patients

A Parametric and Non-Parametric Approach for High-Accurate Outlier Detection.

A Genetic Algorithm Based Technique for Outlier Detection with Fast Convergence.

Is it worth it? Comparing six deep and classical methods for unsupervised anomaly detection in time series

The Outlier Interval Detection Algorithms on Astronautical Time Series Data