Automated Problem Identification: Regression vs Classification via Evolutionary Deep Networks

Emmanuel Dufourq, Bruce A. Bassett
DOI: https://doi.org/10.48550/arXiv.1707.00703
2017-07-04
Abstract: Regression or classification? This is perhaps the most basic question faced when tackling a new supervised learning problem. We present an Evolutionary Deep Learning (EDL) algorithm that automatically solves this by identifying the question type with high accuracy, along with a proposed deep architecture. Typically, a significant amount of human insight and preparation is required prior to executing machine learning algorithms. For example, when creating deep neural networks, the number of parameters must be selected in advance and furthermore, a lot of these choices are made based upon pre-existing knowledge of the data such as the use of a categorical cross entropy loss function. Humans are able to study a dataset and decide whether it represents a classification or a regression problem, and consequently make decisions which will be applied to the execution of the neural network. We propose the Automated Problem Identification (API) algorithm, which uses an evolutionary algorithm interface to TensorFlow to manipulate a deep neural network to decide if a dataset represents a classification or a regression problem. We test API on 16 different classification, regression and sentiment analysis datasets with up to 10,000 features and up to 17,000 unique target values. API achieves an average accuracy of $96.3\%$ in identifying the problem type without hardcoding any insights about the general characteristics of regression or classification problems. For example, API successfully identifies classification problems even with 1000 target values. Furthermore, the algorithm recommends which loss function to use and also recommends a neural network architecture. Our work is therefore a step towards fully automated machine learning.
Neural and Evolutionary Computing, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to determine automatically whether a new supervised learning dataset poses a classification problem or a regression problem. Making that call usually requires expert knowledge and preparatory work, such as choosing an appropriate loss function and neural network architecture. The paper proposes a deep-learning method based on evolutionary algorithms (Evolutionary Deep Learning, EDL), namely the Automated Problem Identification (API) algorithm, which identifies the problem type with high accuracy and recommends a corresponding loss function and neural network architecture.

### Background and Motivation of the Paper

As the performance of machine-learning algorithms has improved, the relationship between human data scientists and these algorithms has been evolving as well. Many studies are dedicated to optimizing individual aspects of machine-learning algorithms, such as hyper-parameter optimization and network architecture design. The current trend is to develop algorithms that require less human intervention, both to address the shortage of data scientists and to move towards the broader goal of artificial general intelligence (AGI).

However, most existing machine-learning algorithms still require a large amount of human intervention before they can be run, such as setting the number of parameters, pre-processing data, choosing loss functions, and interpreting results. Problem identification is the first step in this data-science process: deciding whether a supervised-learning dataset represents a classification problem or a regression problem. Knowing which type of problem a given dataset belongs to is therefore an important step towards fully automated machine learning.
### Overview of the Solution

This paper proposes a method that combines a genetic algorithm (GA) with a dynamic and flexible deep-learning framework to identify problem types automatically. Specifically, the API algorithm manipulates deep neural networks to determine whether a dataset is a classification problem or a regression problem. It also recommends which loss function to use (such as categorical cross-entropy or mean squared error) and which neural network architecture to use (such as convolutional layers or fully-connected layers).

### Experiments and Results

The researchers tested the API algorithm on 16 different datasets covering classification, regression, and sentiment-analysis tasks, with up to 10,000 features and up to 17,000 unique target values. The API algorithm achieved an average accuracy of 96.3% in identifying problem types without hard-coding any general characteristics of regression or classification problems. It also correctly identified classification problems with as many as 1,000 target values and recommended appropriate loss functions and neural network architectures.

### Conclusions and Future Work

The API algorithm proposed in this paper is a step towards fully automated machine learning. Future work could improve the accuracy and efficiency of the algorithm, explore more complex datasets and problem types, and optimize the parameter settings of the genetic algorithm.
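The evolutionary search outlined in the Overview of the Solution can be illustrated with a minimal sketch. This is not the authors' implementation. The `Genome` fields, the mutation scheme, the selection rule, and in particular `toy_fitness` are hypothetical stand-ins; in a real system the fitness would come from training each candidate network (for example in TensorFlow) and scoring it on validation data. The sketch only shows how a loss-function gene can double as the problem-type decision.

```python
import random
from dataclasses import dataclass

# Hypothetical gene values; the loss gene encodes the problem type.
LOSSES = ("mse", "categorical_crossentropy")
LAYERS = ("dense", "conv")
UNITS = (16, 32, 64, 128)


@dataclass(frozen=True)
class Genome:
    loss: str   # "mse" -> regression, "categorical_crossentropy" -> classification
    layer: str  # recommended layer type
    units: int  # recommended layer width


def random_genome(rng):
    return Genome(rng.choice(LOSSES), rng.choice(LAYERS), rng.choice(UNITS))


def mutate(genome, rng):
    """Resample one randomly chosen gene."""
    field = rng.choice(("loss", "layer", "units"))
    if field == "loss":
        return Genome(rng.choice(LOSSES), genome.layer, genome.units)
    if field == "layer":
        return Genome(genome.loss, rng.choice(LAYERS), genome.units)
    return Genome(genome.loss, genome.layer, rng.choice(UNITS))


def evolve(fitness, generations=10, pop_size=12, seed=0):
    """Keep the fitter half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    # Seed the population with every loss/layer combination, then pad randomly.
    population = [Genome(loss, layer, 32) for loss in LOSSES for layer in LAYERS]
    while len(population) < pop_size:
        population.append(random_genome(rng))
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)


def identify_problem(best):
    """Read the problem type off the winning genome's loss gene."""
    if best.loss == "categorical_crossentropy":
        return "classification"
    return "regression"


def toy_fitness(genome):
    # Placeholder (assumption): pretends the dataset is a classification task
    # that dense layers fit slightly better. Replace with real train/validate.
    score = 1.0 if genome.loss == "categorical_crossentropy" else 0.0
    score += 0.1 if genome.layer == "dense" else 0.0
    return score


best = evolve(toy_fitness)
print(identify_problem(best), best)
```

The design choice worth noting is that the problem type is never decided by a separate rule: it falls out of whichever loss gene survives selection, which mirrors the summary's claim that no characteristics of regression or classification are hard-coded.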
### Formula Display

- **Mean Squared Error (MSE)**: \[ \text{MSE}=\frac{1}{N}\sum_{i=1}^{N}(y_i-\bar{y}_i)^2 \]
- **Categorical Cross Entropy (CCE)**: \[ \text{CCE}=-\sum_{i=1}^{N}y_i\ln(\bar{y}_i) \]
- **Activation Functions**:
  - **Linear Activation Function**: \[ f(x)=x \]
  - **ReLU Activation Function**: \[ f(x)=\max(x,0) \]
  - **Sigmoid Activation Function**: \[ f(x)=\frac{1}{1+e^{-x}} \]
  - **Softmax Activation Function**: \[ f(x_j)=\frac{e^{x_j}}{\sum_{i=1}^{D}e^{x_i}} \]
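The formulas above translate directly into a few lines of plain Python. The sketch below assumes that \(y_i\) is the target and \(\bar{y}_i\) the network's prediction, which is how the notation reads but is not stated explicitly here. The function names, the `eps` clamp inside the logarithm, and the max-subtraction in `softmax` are illustrative additions for numerical safety, not part of the formulas themselves.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """CCE for one-hot targets; eps (an assumption) avoids log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def linear(x):
    return x


def relu(x):
    return max(x, 0.0)


def sigmoid(x):
    # Two algebraically equal branches so exp() never overflows.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(xs):
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(categorical_cross_entropy([1.0, 0.0, 0.0], probs))
```

MSE paired with a linear output is the usual regression setup, while CCE paired with a softmax output is the usual classification setup, which is why recommending a loss function amounts to naming the problem type.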