Kaggle forecasting competitions: An overlooked learning opportunity

Casper Solheim Bojer, Jens Peder Meldgaard

DOI: https://doi.org/10.1016/j.ijforecast.2020.07.007

2020-09-16

Abstract: Competitions play an invaluable role in the field of forecasting, as exemplified by the recent M4 competition. The competition received attention from both academics and practitioners and sparked discussions around the representativeness of the data for business forecasting. Several competitions featuring real-life business forecasting tasks on the Kaggle platform have, however, been largely ignored by the academic community. We believe the learnings from these competitions have much to offer to the forecasting community and provide a review of the results from six Kaggle competitions. We find that most of the Kaggle datasets are characterized by higher intermittence and entropy than the M-competitions and that global ensemble models tend to outperform local single models. Furthermore, we find strong performance of gradient boosted decision trees, increasing success of neural networks for forecasting, and a variety of techniques for adapting machine learning models to the forecasting task.

Machine Learning, Applications

What problem does this paper attempt to address?

This paper evaluates the practical value and representativeness of forecasting competitions held on the Kaggle platform for business forecasting tasks. By analyzing the datasets and solutions of six Kaggle forecasting competitions, it examines how these competitions differ from and resemble the M-competitions (especially the M4 competition), which are widely recognized in academia. The paper focuses on three aspects:

1. **Dataset characteristics**: The paper analyzes features of the time series in the Kaggle competitions: intermittency, entropy (a measure of forecastability), trend strength, seasonal strength, length of the seasonal cycle, first-order autocorrelation coefficient, and optimal Box-Cox transformation parameter. These features help assess whether the Kaggle datasets are representative of real-world business forecasting tasks.
2. **Model performance**: The paper compares the top-ranked solutions in the Kaggle competitions with simple time series benchmarks, such as the naive method and the seasonal naive method, to check whether these solutions add practical value.
3. **Method diversity**: The paper examines the methods used by the top 25 contestants in each competition and finds that machine learning methods, especially gradient boosted decision trees and neural networks, have become increasingly prominent over time.

Through these analyses, the paper draws lessons from the Kaggle competitions for the forecasting community and formulates hypotheses for the then-upcoming M5 competition. It also aims to encourage academia to pay attention to and make use of Kaggle competition results.
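The naive and seasonal naive benchmarks named in the summary can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the function names and the `season_length` parameter are hypothetical and introduced here only for the example.

```python
from typing import List, Sequence


def naive_forecast(history: Sequence[float], horizon: int) -> List[float]:
    """Naive benchmark: repeat the last observed value for every future step."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return [history[-1]] * horizon


def seasonal_naive_forecast(
    history: Sequence[float], horizon: int, season_length: int
) -> List[float]:
    """Seasonal naive benchmark: repeat the last full seasonal cycle.

    `season_length` is an assumed parameter (e.g. 7 for daily data with a
    weekly pattern); it is not taken from the paper.
    """
    if season_length <= 0:
        raise ValueError("season_length must be positive")
    if len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    last_cycle = list(history[-season_length:])
    # Step h ahead reuses the value from the same position in the last cycle.
    return [last_cycle[h % season_length] for h in range(horizon)]


if __name__ == "__main__":
    sales = [1, 2, 3, 4, 5, 6]  # toy series with an assumed cycle of length 3
    print(naive_forecast(sales, horizon=2))
    print(seasonal_naive_forecast(sales, horizon=4, season_length=3))
```

A competition solution is only of practical interest if it beats forecasts like these on the same holdout period, which is why such benchmarks serve as the reference point in this kind of comparison.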
