The Performance of the LSTM-based Code Generated by Large Language Models (LLMs) in Forecasting Time Series Data

Saroj Gopali, Sima Siami-Namini, Faranak Abri, Akbar Siami Namin
2024-11-28
Abstract: An intriguing case is the goodness of the machine and deep learning models generated by LLMs for automated scientific data analysis, where a data analyst may lack the expertise to manually code and optimize complex deep learning models and may therefore opt to use LLMs to generate the required models. This paper investigates and compares the performance of mainstream LLMs, such as ChatGPT, PaLM, LLama, and Falcon, in generating deep learning models for analyzing time series data, an important and popular data type with applications in many domains, including finance and the stock market. The research conducts a set of controlled experiments in which the prompts for generating deep learning-based models are varied across sensitivity levels of four criteria: 1) Clarity and Specificity, 2) Objective and Intent, 3) Contextual Information, and 4) Format and Style. While the results are relatively mixed, we observe some distinct patterns. Using LLMs, we are able to generate deep learning-based models with executable code for each dataset separately whose performance is comparable to that of manually crafted and optimized LSTM models for predicting the whole time series dataset. We also observe that ChatGPT outperforms the other LLMs in generating more accurate models. Furthermore, the goodness of the generated models varies with the "temperature" parameter used in configuring the LLMs. The results can be beneficial for data analysts and practitioners who would like to leverage generative AI to produce prediction models with acceptable goodness.
Artificial Intelligence, Software Engineering
What problem does this paper attempt to address?
This paper evaluates how well large language models (LLMs) generate deep-learning model code for time-series data analysis. Through a series of controlled experiments, the researchers compare mainstream LLMs (ChatGPT, PaLM, LLama, and Falcon) on this task. The generated code is mainly used to analyze time-series data, an important data type widely used in fields such as finance and the stock market. The main objectives of the study are:

1. **Explore the capabilities of LLMs**: Investigate whether LLMs can automatically generate reasonably good models (such as Long Short-Term Memory, or LSTM, models) and the corresponding executable code (such as Python code) for data analysts who lack experience developing complex deep-learning models, without requiring them to learn additional syntax and semantics.
2. **Evaluate the quality of the generated models**: Through a sensitivity analysis of controlled prompts, study the quality of the deep-learning model code that LLMs generate at each sensitivity level of four criteria: clarity and specificity, objective and intent, contextual information, and format and style.
3. **Analyze the performance differences of different LLMs**: Report how the LLMs differ in the accuracy of the models they generate for predicting time-series data (especially financial and stock data). The research shows that ChatGPT performs best in most cases.
4. **The influence of the temperature parameter**: Observe how the temperature parameter used to configure an LLM affects the performance of the generated deep-learning prediction model.
5. **The influence of prompt complexity**: Explore whether more complex and detailed prompts produce better and more accurate models.

The results show that the performance of the generated models appears to depend on the temperature setting and that, in some cases, a model generated from a simple prompt outperforms one generated from a complex prompt. By exploring these questions, the paper aims to give data analysts and practitioners useful insight into using generative AI to produce prediction models with acceptable goodness.
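For readers unfamiliar with the temperature parameter: in typical LLM decoders it divides the model's raw token scores before they are converted into sampling probabilities, so lower values concentrate probability on the top candidate and higher values flatten the distribution. The following is a minimal sketch of that general mechanism, not the paper's experimental code; the function name `softmax_with_temperature` and the example scores are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature.

    Illustrative helper (an assumption, not taken from the paper).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

low = softmax_with_temperature(logits, temperature=0.2)   # sharply peaked
high = softmax_with_temperature(logits, temperature=2.0)  # much flatter

print("T=0.2:", [round(p, 3) for p in low])
print("T=2.0:", [round(p, 3) for p in high])
```

At a low temperature nearly all probability mass falls on the highest-scoring token, which makes generated code more repeatable; at a high temperature the alternatives become more likely, which is one plausible reason the generated models could differ in quality across temperature settings.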