Sensitivity Assessing to Data Volume for forecasting: introducing similarity methods as a suitable one in Feature selection methods

Mahdi Goldani, Soraya Asadi Tirvan
2024-06-07
Abstract: In predictive modeling, overfitting poses a significant risk, particularly when the feature count surpasses the number of observations, a common scenario in high-dimensional data sets. To mitigate this risk, feature selection is employed to enhance model generalizability by reducing the dimensionality of the data. This study focuses on evaluating the stability of feature selection techniques with respect to varying data volumes, particularly employing time series similarity methods. Utilizing a comprehensive dataset that includes the closing, opening, high, and low prices of stocks from 100 high-income companies listed in the Fortune Global 500, this research compares several feature selection methods including variance thresholds, edit distance, and Hausdorff distance metrics. The aim is to identify methods that show minimal sensitivity to the quantity of data, ensuring robustness and reliability in predictions, which is crucial for financial forecasting. Results indicate that among the tested feature selection strategies, the variance method, edit distance, and Hausdorff methods exhibit the least sensitivity to changes in data volume. These methods therefore provide a dependable approach to reducing feature space without significantly compromising the predictive accuracy. This study not only highlights the effectiveness of time series similarity methods in feature selection but also underlines their potential in applications involving fluctuating datasets, such as financial markets or dynamic economic conditions. The findings advocate for their use as principal methods for robust feature selection in predictive analytics frameworks.
General Economics
What problem does this paper attempt to address?
This paper asks how to select features so as to reduce the risk of overfitting and improve a model's generalization in predictive modeling when the number of features exceeds the number of observations. Specifically, it examines how stable feature selection methods are as the data volume changes, and which methods give the most reliable results when only a small amount of data is available. Time-series similarity methods are used as feature selection methods, and all methods are evaluated on financial data sets.

### Background and Problem of the Paper

In predictive modeling, overfitting is a significant risk, especially in high-dimensional data sets where the number of features often far exceeds the number of observations. To reduce this risk, feature selection is widely used to lower the dimensionality of the data and thereby improve the model's generalization. However, different feature selection methods may perform differently at different data volumes. The paper therefore evaluates the stability of feature selection techniques across data volumes, with particular attention to time-series similarity methods.

### Research Objectives

1. **Evaluate the Stability of Feature Selection Methods**: Examine how stable the performance of different feature selection methods is as the data volume changes.
2. **Identify Effective Methods under Low Data Volumes**: Determine which feature selection methods give more reliable results when the data volume is small.
3. **Apply the Time-Series Similarity Method**: Explore the use of time-series similarity methods for feature selection and evaluate their effectiveness in financial prediction.
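As a rough illustration of the similarity-based selection named in the third objective, the sketch below ranks candidate series by their Hausdorff distance to a target series and keeps the closest ones. It is a minimal sketch under stated assumptions, not the authors' implementation: treating each series as a set of `(time index, value)` points, the function names `hausdorff_distance` and `rank_features_by_similarity`, and the toy data are all hypothetical choices made for illustration.

```python
import math


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two series.

    Assumption: each series is viewed as a set of (time index, value) points.
    """
    pts_a = list(enumerate(a))
    pts_b = list(enumerate(b))

    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(pts_a, pts_b), directed(pts_b, pts_a))


def rank_features_by_similarity(features, target, k):
    """Return the names of the k series closest to the target series."""
    ordered = sorted(
        features, key=lambda name: hausdorff_distance(features[name], target)
    )
    return ordered[:k]


# Toy data (hypothetical): three candidate series and one target series.
target = [1.0, 2.0, 3.0, 4.0]
features = {
    "far": [10.0, 11.0, 12.0, 13.0],
    "shifted": [1.5, 2.5, 3.5, 4.5],
    "identical": [1.0, 2.0, 3.0, 4.0],
}
selected = rank_features_by_similarity(features, target, k=2)
print(selected)
```

In practice the series would likely need to be put on a common scale first, since a raw distance mixes the time axis with price levels; that preprocessing is left out here.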
### Methods and Data

- **Data Set**: The study uses stock data for 100 high-income companies in the Fortune Global 500, including opening price, closing price, highest price, lowest price and trading volume.
- **Methods**: Multiple feature selection methods are compared, including variance threshold, edit distance and Hausdorff distance. For each method, the data set is shrunk in 80 steps of 1% each, until only 20% of the original data set remains.
- **Model Training and Evaluation**: A linear regression model is trained on the selected features, and its performance is evaluated with 10-fold cross-validation.

### Main Findings

- **Sensitivity of Methods**: The variance method, edit distance and Hausdorff distance show low sensitivity to changes in data volume, and maintain high prediction accuracy while reducing the feature space.
- **Effectiveness of the Time-Series Similarity Method**: Time-series similarity methods perform well in feature selection and are especially suited to data sets with large fluctuations, such as financial markets or dynamic economic conditions.

### Conclusion

The paper not only confirms the effectiveness of time-series similarity methods in feature selection, but also emphasizes their advantages in handling small-sample data. The results support using these methods as robust feature selection strategies in predictive analytics frameworks.
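The shrink-and-rescore protocol described under Methods and Data can be sketched roughly as follows. This is a minimal sketch under several assumptions that the summary does not confirm: that the most recent observations are the ones kept at each step, that cross-validation folds are contiguous and unshuffled, that the error metric is mean squared error, and that a simple top-`k` variance ranking stands in for the selector. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def cv_mse(X, y, n_folds=10):
    """Mean squared error of least-squares regression under k-fold CV.

    Assumption: folds are contiguous blocks, with no shuffling.
    """
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        # Prepend a column of ones so the fit includes an intercept.
        A_train = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        errors.append(np.mean((A_test @ coef - y[test_idx]) ** 2))
    return float(np.mean(errors))


def select_by_variance(X, k):
    """Indices of the k highest-variance columns (stand-in selector)."""
    return np.argsort(X.var(axis=0))[::-1][:k]


def volume_sensitivity(X, y, k, steps=80):
    """Re-select features and re-score as the sample shrinks by 1% per step."""
    n = len(y)
    scores = []
    for step in range(steps + 1):  # step 0 is the full data set
        m = int(round(n * (100 - step) / 100))
        # Assumption: keep the most recent m observations.
        X_sub, y_sub = X[-m:], y[-m:]
        cols = select_by_variance(X_sub, k)
        scores.append(cv_mse(X_sub[:, cols], y_sub))
    return np.array(scores)


# Synthetic data (hypothetical): column 0 has high variance and drives y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
X[:, 0] *= 3.0
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

scores = volume_sensitivity(X, y, k=3)
print(len(scores), scores.std())
```

The spread of `scores` across the steps gives one simple summary of how sensitive a selector is to data volume: a flatter curve indicates a more stable method. Swapping `select_by_variance` for a similarity-based ranking would allow the methods to be compared on the same footing.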