Clustering Indices based Automatic Classification Model Selection

Sudarsun Santhiappan, Nitin Shravan, Balaraman Ravindran

DOI: https://doi.org/10.48550/arXiv.2305.13926

2023-05-23

Abstract: Classification model selection is a process of identifying a suitable model class for a given classification task on a dataset. Traditionally, model selection is based on cross-validation, meta-learning, and user preferences, which are often time-consuming and resource-intensive. The performance of any machine learning classification task depends on the choice of the model class, the learning algorithm, and the dataset's characteristics. Our work proposes a novel method for automatic classification model selection from a set of candidate model classes by determining the empirical model-fitness for a dataset based only on its clustering indices. Clustering indices measure the ability of a clustering algorithm to induce good quality neighborhoods with similar data characteristics. We propose a regression task for a given model class, where the clustering indices of a given dataset form the features and the dependent variable represents the expected classification performance. We compute the dataset clustering indices and directly predict the expected classification performance using the learned regressor for each candidate model class to recommend a suitable model class for dataset classification. We evaluate our model selection method through cross-validation with 60 publicly available binary class datasets and show that our top-3 model recommendation is accurate for over 45 of 60 datasets. We also propose an end-to-end Automated ML system for data classification based on our model selection method. We evaluate our end-to-end system against popular commercial and noncommercial Automated ML systems using a different collection of 25 public domain binary class datasets. We show that the proposed system outperforms other methods with an excellent average rank of 1.68.
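
The regression formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The candidate model classes, the three internal clustering indices, k-means as the clustering algorithm, F1 as the performance measure, the random-forest regressor, and the synthetic datasets are all assumptions chosen for illustration with scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical candidate model classes; the paper's candidate set may differ.
CANDIDATES = {
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "decision_tree": lambda: DecisionTreeClassifier(random_state=0),
    "knn": lambda: KNeighborsClassifier(),
    "naive_bayes": lambda: GaussianNB(),
    "svm": lambda: SVC(),
}


def clustering_indices(X, n_clusters=2, seed=0):
    """Cluster X and return three internal clustering indices as meta-features."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    return np.array([
        silhouette_score(X, labels),
        davies_bouldin_score(X, labels),
        calinski_harabasz_score(X, labels),
    ])


def make_toy_dataset(seed):
    """Synthetic binary dataset standing in for a real public dataset."""
    rng = np.random.default_rng(seed)
    return make_classification(n_samples=200, n_features=8, n_informative=4,
                               class_sep=rng.uniform(0.5, 2.0),
                               random_state=seed)


def fit_performance_regressors(datasets):
    """Train one regressor per model class: clustering indices -> measured F1."""
    features = np.array([clustering_indices(X) for X, _ in datasets])
    regressors = {}
    for name, make_model in CANDIDATES.items():
        # Offline step: the real score is measured once per training dataset.
        targets = [cross_val_score(make_model(), X, y, cv=3, scoring="f1").mean()
                   for X, y in datasets]
        regressors[name] = RandomForestRegressor(
            n_estimators=50, random_state=0).fit(features, targets)
    return regressors


def recommend(X_new, regressors, top_k=3):
    """Rank model classes by predicted score without training any classifier."""
    feats = clustering_indices(X_new).reshape(1, -1)
    predicted = {name: float(reg.predict(feats)[0])
                 for name, reg in regressors.items()}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_k]


train_sets = [make_toy_dataset(seed) for seed in range(12)]
regressors = fit_performance_regressors(train_sets)
X_new, _ = make_toy_dataset(seed=99)
top3 = recommend(X_new, regressors)
print(top3)
```

Classifiers are trained only in the offline step that builds the regressors. For a new dataset, `recommend` computes the clustering indices and queries the regressors, which is where the time saving over cross-validating every candidate would come from.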

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

### The Problem Addressed by This Paper

This paper addresses automatic model selection for machine learning classification tasks. Traditional model selection relies on cross-validation, meta-learning, and user preferences, which are often time-consuming and resource-intensive. The paper proposes CIAMS (Clustering Indices based Automatic Model Selection), which estimates the empirical model fitness of each candidate model class using only the clustering indices of the dataset.

**Specifically, the goals of the paper include:**

1. **Proposing a new hypothesis**: The classification performance on a dataset depends on the clustering indices of that dataset.
2. **Developing a new method**: Predicting the expected classification performance of a model class on a given dataset without actually building the classification model.
3. **Applying clustering indices for automatic model selection**: Using clustering indices as meta-features to automatically select an appropriate model class from a set of candidates.
4. **Building an automated machine learning platform**: Developing an end-to-end automated machine learning platform based on this model selection method to provide classification modeling services.

The paper evaluates its model selection method through cross-validation on 60 publicly available binary classification datasets, where the top-3 recommendation is accurate for over 45 of the 60 datasets. It also compares the end-to-end system with existing commercial and non-commercial automated machine learning systems on a separate collection of 25 public binary classification datasets, where the proposed system achieves an average rank of 1.68, which supports its practicality.
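
The two reported figures, top-3 accuracy and average rank, can be computed as in the sketch below. It rests on assumptions: a top-3 recommendation counts as accurate when the best-performing model class appears in the recommended list, and tied systems share the average of their rank positions. The paper may define either differently. All names and scores are invented for illustration and are not results from the paper.

```python
def ranks_desc(scores):
    """Rank systems on one dataset: 1 = best; ties share the average position."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of systems tied with the one at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2
        for name in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


def average_rank(per_dataset_scores):
    """Mean rank of each system across datasets (lower is better)."""
    totals = {}
    for scores in per_dataset_scores:
        for name, rank in ranks_desc(scores).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(per_dataset_scores)
            for name, total in totals.items()}


def top_k_hit_rate(recommendations, best_models, k=3):
    """Fraction of datasets whose best model class is in the top-k list."""
    hits = sum(best in recs[:k]
               for recs, best in zip(recommendations, best_models))
    return hits / len(best_models)


# Invented scores for three hypothetical systems on three datasets.
toy_scores = [
    {"system_a": 0.91, "system_b": 0.88, "system_c": 0.85},
    {"system_a": 0.80, "system_b": 0.84, "system_c": 0.80},
    {"system_a": 0.77, "system_b": 0.70, "system_c": 0.75},
]
avg = average_rank(toy_scores)
print(avg)  # {'system_a': 1.5, 'system_b': 2.0, 'system_c': 2.5}

# Invented recommendations and best model classes for two datasets.
hit_rate = top_k_hit_rate(
    [["svm", "knn", "tree"], ["knn", "tree", "bayes"]],
    ["tree", "svm"],
)
print(hit_rate)  # 0.5
```

Under these definitions, an average rank of 1.68 over 25 datasets would mean the system placed first or second on most of them, and a top-3 hit rate above 45/60 corresponds to more than 0.75.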
