Abstract: In Reinforcement Learning (RL), Multi-Armed Bandit (MAB) problems have found applications across diverse domains such as recommender systems, healthcare, and finance. Traditional MAB algorithms typically assume stationary reward distributions, which limits their effectiveness in real-world scenarios characterized by non-stationary dynamics. This paper addresses this limitation by introducing and evaluating novel bandit algorithms designed for non-stationary environments. First, we present the Adaptive Discounted Thompson Sampling (ADTS) algorithm, which enhances adaptability through relaxed discounting and sliding window mechanisms to better respond to changes in reward distributions. We then extend this approach to the portfolio optimization problem by introducing the Combinatorial Adaptive Discounted Thompson Sampling (CADTS) algorithm, which addresses computational challenges within combinatorial bandits and improves dynamic asset allocation. Additionally, we propose a novel architecture called Bandit Networks, which integrates the outputs of ADTS and CADTS, thereby mitigating computational limitations in stock selection. Through extensive experiments using real financial market data, we demonstrate the potential of these algorithms and architectures in adapting to dynamic environments and optimizing decision-making processes. For instance, the proposed Bandit Network instances outperform classic portfolio optimization approaches, such as the capital asset pricing model, equal weights, risk parity, and Markowitz, with the best network achieving an out-of-sample Sharpe ratio 20% higher than that of the best-performing classical model.
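The headline comparison in the abstract is stated in terms of the out-of-sample Sharpe ratio. As a reminder of what that metric measures, the following is a minimal Python sketch, not the paper's evaluation code. The function names `sharpe_ratio` and `equal_weight_returns`, the zero risk-free rate, and the 252-period annualization factor are all assumptions made for illustration; the abstract does not state which conventions the authors used.

```python
import math


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of per-period simple returns.

    Assumptions (not from the paper): a constant per-period risk-free
    rate and sample standard deviation with the n - 1 denominator.
    """
    excess = [r - risk_free for r in returns]
    n = len(excess)
    if n < 2:
        raise ValueError("need at least two returns")
    mean = sum(excess) / n
    variance = sum((x - mean) ** 2 for x in excess) / (n - 1)
    std = math.sqrt(variance)
    if std == 0.0:
        raise ValueError("returns have zero volatility")
    return mean / std * math.sqrt(periods_per_year)


def equal_weight_returns(asset_returns):
    """Per-period return of an equal-weight portfolio.

    `asset_returns` is a list of periods; each period is a list holding
    one simple return per asset.
    """
    return [sum(period) / len(period) for period in asset_returns]
```

Under these assumptions, a statement such as "20% higher Sharpe ratio" means that `sharpe_ratio` evaluated on the best network's out-of-sample returns is 1.2 times the value obtained for the best classical baseline over the same period.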

What problem does this paper attempt to address?

The main problem that this paper attempts to solve is the Multi-Armed Bandit (MAB) problem in non-stationary environments, and in particular how to improve portfolio optimization results so that they adapt to dynamic changes in financial markets. Specifically, the paper focuses on the following points:

1. **Non-stationary reward distribution**: Traditional MAB algorithms usually assume that the reward distribution is stationary, that is, it remains unchanged over time. In the real world, and especially in financial markets, the reward distribution is often non-stationary. This non-stationarity greatly reduces the effectiveness of traditional algorithms in practical applications.
2. **Portfolio optimization**: In finance, portfolio optimization is a key problem that aims to achieve the best balance between risk and return by selecting and allocating assets. Traditional portfolio optimization methods usually rely on static allocation strategies, which may not respond effectively to changes in market conditions. A method that can dynamically adjust asset weights is therefore needed.
3. **Computational challenges**: In combinatorial optimization problems, especially when many assets are involved, computational complexity is a major challenge. Handling these problems efficiently while still finding good solutions remains an open difficulty.

To address these problems, the paper proposes the following methods:

- **Adaptive Discounted Thompson Sampling (ADTS)**: By introducing relaxed discounting and sliding window mechanisms, the ADTS algorithm improves adaptability to changes in non-stationary reward distributions.
- **Combinatorial Adaptive Discounted Thompson Sampling (CADTS)**: For combinatorial optimization problems, the CADTS algorithm addresses the computational challenges in combinatorial bandits and improves dynamic asset allocation.
- **Bandit Networks**: A new architecture that integrates the outputs of ADTS and CADTS, thereby mitigating the computational limitations in stock selection and improving the robustness and accuracy of the model.

Through extensive experiments using real-world financial market data, the paper demonstrates the potential of these new algorithms and architectures in adapting to dynamic environments and optimizing decision-making processes. For example, the proposed Bandit Network instances outperform classical portfolio optimization methods such as the Capital Asset Pricing Model (CAPM), equal-weight, risk-parity, and the Markowitz model, and the out-of-sample Sharpe ratio of the best-performing network instance is 20% higher than that of the best-performing classical model.

In summary, this paper aims to address the challenges of portfolio optimization in non-stationary environments by proposing new MAB algorithms and architectures, thereby improving the efficiency and accuracy of financial decision-making.
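The discounting idea that ADTS builds on can be illustrated with a generic discounted Thompson Sampling agent for Bernoulli rewards. The sketch below is the standard textbook variant, not the paper's ADTS: it omits the relaxed discounting and sliding window mechanisms, whose details are not given here. The class name `DiscountedThompsonSampling`, the discount factor `gamma`, and the uniform Beta(1, 1) prior are assumptions chosen for illustration.

```python
import random


class DiscountedThompsonSampling:
    """Generic discounted Thompson Sampling for rewards in [0, 1].

    Illustrative only: this is not the ADTS algorithm from the paper.
    Past evidence is multiplied by `gamma` on every update, so old
    observations fade and the agent can track a drifting best arm.
    """

    def __init__(self, n_arms, gamma=0.95, seed=None):
        if not 0.0 < gamma <= 1.0:
            raise ValueError("gamma must be in (0, 1]")
        self.gamma = gamma
        self.successes = [0.0] * n_arms  # discounted success counts
        self.failures = [0.0] * n_arms   # discounted failure counts
        self.rng = random.Random(seed)

    def select_arm(self):
        # Draw one sample per arm from its Beta posterior (Beta(1, 1) prior)
        # and play the arm with the largest sample.
        samples = [
            self.rng.betavariate(1.0 + s, 1.0 + f)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        if not 0.0 <= reward <= 1.0:
            raise ValueError("reward must be in [0, 1]")
        # Discount the evidence of every arm, then add the new observation.
        for i in range(len(self.successes)):
            self.successes[i] *= self.gamma
            self.failures[i] *= self.gamma
        self.successes[arm] += reward
        self.failures[arm] += 1.0 - reward


if __name__ == "__main__":
    agent = DiscountedThompsonSampling(n_arms=2, gamma=0.9, seed=0)
    env = random.Random(1)
    for t in range(400):
        # Arm 0 is better for the first 200 rounds, then arm 1 takes over.
        probs = (0.8, 0.2) if t < 200 else (0.2, 0.8)
        arm = agent.select_arm()
        reward = 1.0 if env.random() < probs[arm] else 0.0
        agent.update(arm, reward)
    print(agent.successes, agent.failures)
```

With `gamma = 1.0` the agent reduces to ordinary Thompson Sampling, which never forgets and therefore adapts slowly after a change in the reward distribution. Smaller values of `gamma` forget faster at the cost of noisier estimates, which is the trade-off that adaptive schemes such as ADTS aim to manage.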

Improving Portfolio Optimization Results with Bandit Networks

Adaptive Portfolio by Solving Multi-armed Bandit Via Thompson Sampling

Bayesian Optimization -- Multi-Armed Bandit Problem

Optimizing Sharpe Ratio: Risk-Adjusted Decision-Making in Multi-Armed Bandits

Portfolio Choices with Orthogonal Bandit Learning

A Risk-Averse Framework for Non-Stationary Stochastic Multi-Armed Bandits

Contextual combinatorial bandit on portfolio management

Multi-Armed Bandit Strategies for Non-Stationary Reward Distributions and Delayed Feedback Processes

Multi-armed Bandit Models for the Optimal Design of Clinical Trials: Benefits and Challenges

Design Principles of Robust Multi-Armed Bandit Framework in Video Recommendations

Non-stationary Bandits with Habituation and Recovery Dynamics and Knapsack Constraints

Optimising Individual-Treatment-Effect Using Bandits

Risk-Aware Multi-Armed Bandit Problem with Application to Portfolio Selection

Finite-time Analysis of Globally Nonstationary Multi-Armed Bandits

Algorithms for multi-armed bandit problems

Autoregressive Bandits

Multi-Armed Bandit with Budget Constraint and Variable Costs

Multi-Task Combinatorial Bandits for Budget Allocation

Efficient Algorithms for Finite Horizon and Streaming Restless Multi-Armed Bandit Problems

Auction-Based Combinatorial Multi-Armed Bandit Mechanisms with Strategic Arms

Kolmogorov-Smirnov Test-Based Actively-Adaptive Thompson Sampling for Non-Stationary Bandits