FinML-Chain: A Blockchain-Integrated Dataset for Enhanced Financial Machine Learning

Jingfeng Chen, Wanlin Deng, Dangxing Chen, Luyao Zhang
2024-11-25
Abstract: Machine learning is critical for innovation and efficiency in financial markets, offering predictive models and data-driven decision-making. However, challenges such as missing data, lack of transparency, untimely updates, insecurity, and incompatible data sources limit its effectiveness. Blockchain technology, with its transparency, immutability, and real-time updates, addresses these challenges. We present a framework for integrating high-frequency on-chain data with low-frequency off-chain data, providing a benchmark for addressing novel research questions in economic mechanism design. This framework generates modular, extensible datasets for analyzing economic mechanisms such as the Transaction Fee Mechanism, enabling multi-modal insights and fairness-driven evaluations. Using four machine learning techniques, including linear regression, deep neural networks, XGBoost, and LSTM models, we demonstrate the framework's ability to produce datasets that advance financial research and improve understanding of blockchain-driven systems. Our contributions include: (1) proposing a research scenario for the Transaction Fee Mechanism and demonstrating how the framework addresses previously unexplored questions in economic mechanism design; (2) providing a benchmark for financial machine learning by open-sourcing a sample dataset generated by the framework and the code for the pipeline, enabling continuous dataset expansion; and (3) promoting reproducibility, transparency, and collaboration by fully open-sourcing the framework and its outputs. This initiative supports researchers in extending our work and developing innovative financial machine-learning models, fostering advancements at the intersection of machine learning, blockchain, and economics.
General Economics
What problem does this paper attempt to address?
This paper addresses several key problems in financial machine learning:

1. **Missing and Opaque Data**: Financial market data is often incomplete and opaque, which reduces the accuracy and reliability of prediction models.
2. **Untimely Data Updates**: Traditional data sources update slowly and cannot reflect market changes in real time, so model predictions lag behind the market.
3. **Data Security and Compatibility Issues**: Different data sources are poorly compatible and insufficiently secure, which increases the complexity and risk of data processing.

To address these problems, the authors turn to blockchain technology, whose transparency, immutability, and real-time updates help meet these challenges. Specifically, the paper proposes a new framework, FinML-Chain, which integrates high-frequency on-chain data with low-frequency off-chain data to generate modular and extensible datasets for financial machine learning research.

### Main Research Questions

1. **Can the dataset support different machine learning models for research on novel financial problems?**
   - The authors test the dataset by training four machine learning models to predict the gas usage of future blocks: linear regression, a deep neural network (DNN), XGBoost, and a long short-term memory (LSTM) network.
2. **How can a blockchain-based Transaction Fee Mechanism (TFM), such as Ethereum's EIP-1559, be optimized?**
   - The authors propose a novel research scenario: use machine learning to predict the gas price of upcoming transactions and adjust the base fee according to these predictions. This changes the TFM from post-hoc adjustment to ex-ante adjustment, improving its flexibility and efficiency.
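
To make the post-hoc versus ex-ante distinction concrete, the sketch below implements the base fee update rule from the EIP-1559 specification, which reacts to the gas the parent block actually used, and then a hypothetical ex-ante variant that feeds a forecast into the same rule. The function names, the clipping of the forecast, and the choice to forecast gas usage are assumptions made for illustration; this is not the authors' implementation.

```python
# EIP-1559 constants (from the EIP specification).
BASE_FEE_MAX_CHANGE_DENOMINATOR = 8  # caps the per-block change at 12.5%
ELASTICITY_MULTIPLIER = 2            # gas target = gas limit / 2


def next_base_fee(parent_base_fee: int, parent_gas_used: int,
                  parent_gas_limit: int) -> int:
    """Reactive (post-hoc) update: depends on the parent block's realized gas usage.

    All fee values are integers in wei, matching the integer arithmetic of the spec.
    """
    gas_target = parent_gas_limit // ELASTICITY_MULTIPLIER
    if parent_gas_used == gas_target:
        return parent_base_fee
    if parent_gas_used > gas_target:
        # Block was fuller than the target: raise the base fee by at least 1 wei.
        delta = (parent_base_fee * (parent_gas_used - gas_target)
                 // gas_target // BASE_FEE_MAX_CHANGE_DENOMINATOR)
        return parent_base_fee + max(delta, 1)
    # Block was emptier than the target: lower the base fee.
    delta = (parent_base_fee * (gas_target - parent_gas_used)
             // gas_target // BASE_FEE_MAX_CHANGE_DENOMINATOR)
    return parent_base_fee - delta


def ex_ante_base_fee(parent_base_fee: int, predicted_gas_used: float,
                     parent_gas_limit: int) -> int:
    """Hypothetical ex-ante variant (an assumption, not part of EIP-1559).

    Substitutes a model's forecast of the upcoming block's gas usage for the
    realized usage, after clipping it to the valid range [0, gas limit].
    """
    clipped = min(max(int(predicted_gas_used), 0), parent_gas_limit)
    return next_base_fee(parent_base_fee, clipped, parent_gas_limit)


if __name__ == "__main__":
    base_fee = 100_000_000_000  # 100 gwei, expressed in wei
    gas_limit = 30_000_000      # so the gas target is 15,000,000
    print(next_base_fee(base_fee, 30_000_000, gas_limit))       # full block
    print(next_base_fee(base_fee, 0, gas_limit))                # empty block
    print(ex_ante_base_fee(base_fee, 22_500_000.0, gas_limit))  # forecast-driven
```

Reusing the same update rule for the ex-ante variant keeps the 12.5% per-block cap in place, so a poor forecast cannot move the base fee further than a full or empty block would under the reactive rule.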
### Specific Contributions

1. **A New Framework and Benchmark Dataset**:
   - The framework integrates high-frequency on-chain data with low-frequency off-chain data, addressing the transparency, reliability, and timeliness problems of traditional datasets. It generates modular and extensible datasets and opens a new research direction for economic mechanism design.
2. **A Novel Research Scenario**:
   - Machine learning is used to optimize the blockchain's transaction fee mechanism, going beyond traditional reactive methods to achieve more flexible and efficient transaction fee management.
3. **Open-Source Data and Code Pipeline**:
   - To ensure reproducibility and promote collaboration, the authors open-sourced both the sample dataset generated by the framework and the complete code pipeline. Researchers can continuously expand the dataset, adapt it to new situations, and explore broader financial and blockchain-related challenges.

Through these contributions, the paper aims to establish a basis for interdisciplinary research and promote innovation in machine learning, blockchain, and economics.