Fitting Multiple Machine Learning Models with Performance Based Clustering

Mehmet Efe Lorasdagi, Ahmet Berker Koc, Ali Taha Koc, Suleyman Serdar Kozat
2024-11-11
Abstract: Traditional machine learning approaches assume that data comes from a single generating mechanism, which may not hold for most real-life data. In these cases, the single-mechanism assumption can result in suboptimal performance. We introduce a clustering framework that eliminates this assumption by grouping the data according to the relations between the features and the target values, and we obtain multiple separate models to learn different parts of the data. We further extend our framework to applications with streaming data, where we produce outcomes using an ensemble of models. For this, the ensemble weights are updated based on the incoming data batches. We demonstrate the performance of our approach on widely studied real-life datasets, showing significant improvements over traditional single-model approaches.
Machine Learning, Signal Processing
What problem does this paper attempt to address?
The paper addresses a limitation of traditional machine learning methods: they assume that data comes from a single generating mechanism, which may not hold in real-life settings, especially with non-stationary data or data drawn from different sources. Because this assumption ignores the differences between parts of the data, it can lead to suboptimal model performance. Specifically, the paper points out the following problems:

1. **Limitations of the single-generating-mechanism assumption**: Traditional machine learning methods assume that all data points come from the same generating mechanism, which often does not hold in practical applications. For example, the energy generated by a wind power plant at the same wind speed and angle may vary due to differences in turbine dynamics; in crime prediction data, the relationship between the features and the crime rate may change depending on location and time.
2. **Suboptimal performance**: Under this assumption, a single model can only learn an averaged relationship across the different parts of the data, which degrades performance. Using multiple independent machine learning models to learn different parts of the data can improve modeling performance.

To solve these problems, the authors propose a performance-based clustering framework. By clustering the data according to the relationship between feature vectors and target values, the framework removes the single-generating-mechanism assumption and builds a separate model for each part of the data. The framework is also extended to streaming data: an ensemble of the models produces outputs in an online setting, and the ensemble weights are updated as new data batches arrive.

### Specific contributions

1. **Performance-based clustering framework**: Clustering is driven by the relationship between feature vectors and target values, rather than by the feature vectors alone.
2. **Ensemble method for online or sequential application scenarios**: For streaming data, a weighted ensemble is used, and the model weights are updated online.
3. **Experimental verification**: Experiments on multiple well-known competition datasets show a significant performance improvement over the traditional single-model approach.
4. **Code release**: For reproducibility and further research, the authors release the source code.

Through these contributions, the paper aims to improve the performance of machine learning models in complex, multi-mechanism scenarios, especially with non-stationary data or data drawn from different sources.
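The two ideas summarized above can be illustrated with a small sketch. It is not the authors' released code, and the summary does not specify their exact procedure, so everything here is an assumption made for illustration: the models are plain least-squares linear fits, clustering alternates between refitting one model per cluster and reassigning each sample to the model with the lowest squared error, and the streaming ensemble uses an exponential weight update with a hypothetical learning rate `eta`. All function names are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear model with an intercept; returns the coefficients."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


def squared_errors(models, X, y):
    """Per-sample squared error of every model, shape (n_samples, n_models)."""
    return np.stack([(predict_linear(m, X) - y) ** 2 for m in models], axis=1)


def performance_clustering(X, y, k=2, n_iter=50, seed=0):
    """Assumed alternating scheme: refit per cluster, then reassign by error."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))       # random initial clusters
    models = [fit_linear(X, y) for _ in range(k)]  # fallback: global fit
    min_points = X.shape[1] + 1                    # samples needed to refit
    for _ in range(n_iter):
        for j in range(k):
            mask = labels == j
            if mask.sum() >= min_points:
                models[j] = fit_linear(X[mask], y[mask])
        # Each sample moves to the model that currently predicts it best.
        new_labels = squared_errors(models, X, y).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return models, labels


def update_weights(weights, models, X_batch, y_batch, eta=1.0):
    """Assumed exponential-weights update from the mean batch loss per model."""
    losses = squared_errors(models, X_batch, y_batch).mean(axis=0)
    w = weights * np.exp(-eta * (losses - losses.min()))  # shift for stability
    return w / w.sum()


def ensemble_predict(weights, models, X):
    preds = np.stack([predict_linear(m, X) for m in models], axis=1)
    return preds @ weights


# Synthetic data with two generating mechanisms sharing the same feature range.
rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(400, 1))
regime = rng.integers(0, 2, size=400)
y = np.where(regime == 0, 2.0 * X[:, 0] + 1.0, -3.0 * X[:, 0] + 4.0)
y = y + 0.05 * rng.normal(size=400)

single_mse = np.mean((predict_linear(fit_linear(X, y), X) - y) ** 2)
models, labels = performance_clustering(X, y, k=2)
clustered_mse = np.mean(squared_errors(models, X, y)[np.arange(len(X)), labels])

# A new batch arrives from the first mechanism; weights start uniform.
X_new = rng.uniform(-1.0, 1.0, size=(32, 1))
y_new = 2.0 * X_new[:, 0] + 1.0 + 0.05 * rng.normal(size=32)
weights = np.full(len(models), 1.0 / len(models))
weights = update_weights(weights, models, X_new, y_new)
y_hat = ensemble_predict(weights, models, X_new)
```

Because cluster assignment in this sketch depends on the target values, it cannot be applied directly to unlabeled samples; that is the role the weighted ensemble plays here, shifting weight toward whichever model has recently fit the incoming batches best. Comparing `clustered_mse` with `single_mse` on the training data shows how much of the error a single averaged model leaves on the table in this toy setting.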