Optimal Cycling of a Heterogenous Battery Bank via Reinforcement Learning

Vivek Deulkar, Jayakrishnan Nair
DOI: https://doi.org/10.48550/arXiv.2109.07137
2021-09-15
Abstract: We consider the problem of optimal charging/discharging of a bank of heterogenous battery units, driven by stochastic electricity generation and demand processes. The batteries in the battery bank may differ with respect to their capacities, ramp constraints, losses, as well as cycling costs. The goal is to minimize the degradation costs associated with battery cycling in the long run; this is posed formally as a Markov decision process. We propose a linear function approximation based Q-learning algorithm for learning the optimal solution, using a specially designed class of kernel functions that approximate the structure of the value functions associated with the MDP. The proposed algorithm is validated via an extensive case study.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to schedule the charging and discharging of a battery bank composed of heterogeneous battery units, using reinforcement learning, so as to minimize the long-term degradation cost caused by battery cycling. The batteries may differ in capacity, ramp constraints, losses, and cycling costs. The paper formalizes the problem as a Markov decision process (MDP) and proposes a Q-learning algorithm based on linear function approximation to learn the optimal solution. To this end, the authors design a special class of kernel functions that approximate the structure of the value functions associated with the MDP, and use these kernel functions in the learning process. The proposed algorithm is validated through an extensive case study.

Expressed as a formula, the problem is to maximize the infinite-horizon discounted reward:
\[ \max_{\pi} E\left[\sum_{k = 0}^{\infty} \gamma^k R(S_k, A_k)\right] \]
where \(S_k\) is the state at time step \(k\), \(A_k\) is the action taken, \(R(S_k, A_k)\) is the immediate reward for taking that action, and \(\gamma \in (0, 1)\) is the discount factor. Since the stated goal is cost minimization, the reward here can be read as the negative of the cycling cost. The state \(S_k\) consists of the state \(X_k\) of the background Markov chain and the stored energy \(B_k = (B_k^{(i)},\ 1 \leq i \leq N)\) of each of the \(N\) batteries, and the action \(A_k\) determines the change in stored energy of each battery.

In this way, the authors aim to find a policy that manages different types of battery units effectively under random generation and demand, thereby reducing the battery wear caused by frequent charging and discharging.
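To make the discounted objective and the linear-function-approximation Q-learning step concrete, the following is a minimal sketch. It is not the authors' implementation: the feature vectors are generic placeholders rather than the paper's specially designed kernel functions, the reward is assumed to be the negative cycling cost, and the names `discounted_return`, `q_value`, and `q_learning_update` are hypothetical.

```python
from typing import List, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Finite truncation of sum_k gamma^k * R_k, accumulated backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def q_value(theta: Sequence[float], features: Sequence[float]) -> float:
    """Linear approximation Q(s, a) ~ theta . phi(s, a)."""
    return sum(t * f for t, f in zip(theta, features))


def q_learning_update(
    theta: Sequence[float],
    features: Sequence[float],
    reward: float,
    next_features_per_action: Sequence[Sequence[float]],
    alpha: float,
    gamma: float,
) -> List[float]:
    """One Q-learning step on the weights of a linear approximator.

    `features` is phi(s, a) for the visited state-action pair (placeholder
    features, an assumption), and `next_features_per_action` holds
    phi(s', a') for every action a' that is feasible in the next state.
    """
    best_next = max(
        (q_value(theta, f) for f in next_features_per_action), default=0.0
    )
    td_error = reward + gamma * best_next - q_value(theta, features)
    return [t + alpha * td_error * f for t, f in zip(theta, features)]


# Usage: a cycling cost of 2 is fed in as reward -2 (assumed sign convention).
theta = q_learning_update(
    theta=[0.0, 0.0],
    features=[1.0, 0.0],
    reward=-2.0,
    next_features_per_action=[[0.0, 1.0], [1.0, 1.0]],
    alpha=0.5,
    gamma=0.9,
)
```

In a battery-bank setting, the feasible actions in the next state would be restricted by each unit's capacity and ramp constraints; that restriction is left to whatever code builds `next_features_per_action`.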