Abstract: The success of deep reinforcement learning (DRL) lies in its ability to learn a representation that is well-suited for the exploration and exploitation task. To understand how the choice of representation can improve the efficiency of reinforcement learning (RL), we study representation selection for a class of low-rank Markov Decision Processes (MDPs) where the transition kernel can be represented in a bilinear form. We propose an efficient algorithm, called ReLEX, for representation learning in both online and offline RL. Specifically, we show that the online version of ReLEX, called ReLEX-UCB, always performs no worse than the state-of-the-art algorithm without representation selection, and achieves a strictly better constant regret if the representation function class has a "coverage" property over the entire state-action space. For the offline counterpart, ReLEX-LCB, we show that the algorithm can find the optimal policy if the representation class can cover the state-action space and achieves gap-dependent sample complexity. This is the first result with constant sample complexity for representation learning in offline RL.
What problem does this paper attempt to address?
This paper asks how, in low-rank Markov decision processes (MDPs), the sample efficiency of reinforcement learning (RL) can be improved by choosing an appropriate representation. Specifically, it studies how to select the best representation from a class of candidate representations in both online and offline RL.
### Problem Background
In deep reinforcement learning (DRL), success depends on finding a representation suited to the exploration and exploitation task. To understand how representation selection affects the efficiency of reinforcement learning, the authors study a special class of low-rank MDPs in which the transition kernel can be represented in a bilinear form. In this type of MDP, the transition probability factors into a known feature mapping of the state-action pair, an unknown matrix, and a known feature mapping of the next state.
### Main Research Questions
1. **The impact of representation selection on sample efficiency**: Can the sample efficiency of online and offline reinforcement learning be improved by choosing an appropriate representation?
2. **Algorithm design**: How can an algorithm be designed to select the best representation efficiently?
### Main Contributions of the Paper
1. **Online reinforcement learning**:
- A new algorithm, ReLEX-UCB, is proposed, which improves learning efficiency by selecting the best representation.
- It is proved that ReLEX-UCB always performs no worse than the state-of-the-art algorithm without representation selection, and that it achieves a strictly better, constant regret when the representation function class has a coverage property over the entire state-action space.
2. **Offline reinforcement learning**:
- The ReLEX-LCB algorithm is proposed for representation selection in the offline setting.
- It is proved that ReLEX-LCB finds the optimal policy when the representation class can cover the state-action space, with a gap-dependent sample complexity on the data generated by the behavior policy. This is the first result to achieve a constant sample complexity for representation learning in offline RL.
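For reference, the terms "regret" and "gap-dependent" can be made concrete under the standard episodic convention. This is an assumption for illustration: the summary does not state the paper's exact definitions, and the symbols \(K\) (number of episodes), \(\pi_k\) (policy used in episode \(k\)), \(s_1^k\) (initial state of episode \(k\)), and \(V_h^*\), \(Q_h^*\) (optimal value functions at stage \(h\)) are introduced here only for that purpose.
\[
\mathrm{Regret}(K)=\sum_{k=1}^{K}\left[V_1^{*}(s_1^{k})-V_1^{\pi_k}(s_1^{k})\right],
\qquad
\mathrm{gap}_h(s,a)=V_h^{*}(s)-Q_h^{*}(s,a)
\]
Under this reading, a constant regret is one bounded by a quantity that does not grow with \(K\), and a gap-dependent bound is one stated in terms of the smallest positive value of \(\mathrm{gap}_h(s,a)\).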
### Technical Details
- **Bilinear MDP**: The paper considers a special class of low-rank MDPs, called bilinear MDPs, in which the transition kernel \(P_h(s'|s, a)\) at each stage \(h\) is bilinear in a known feature mapping \(\phi(s, a)\) and a known feature mapping \(\psi(s')\), coupled through an unknown matrix \(M^*_h\):
\[
P_h(s'|s, a)=\phi^{\top}(s, a)M^*_h\psi(s')
\]
- **Representation selection**: The core idea of the algorithm is to form a Q-value estimate for each representation in a finite representation function class \(\Phi\) and to select the representation whose estimate gives the tightest bound. For online RL (ReLEX-UCB), the estimates are optimistic, so the representation that minimizes the optimistic Q-value is selected; for offline RL (ReLEX-LCB), the estimates are pessimistic, so the representation that maximizes the pessimistic Q-value is selected.
- **Theoretical guarantees**: The paper provides a rigorous theoretical analysis of the proposed algorithms. The online algorithm ReLEX-UCB performs no worse than existing algorithms without representation selection and obtains a constant regret when the representation class covers the entire state-action space; the offline algorithm ReLEX-LCB achieves a constant, gap-dependent sample complexity when the representation class covers the state-action space.
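The bilinear structure and the selection rule described above can be illustrated with a small NumPy sketch. This is not the paper's implementation: the sizes `S`, `A`, `d`, the random features, the stand-in matrix `M`, the confidence widths `bonus`, and the helper `select_representation` are all hypothetical. The sketch also assumes that selection picks the tightest valid bound (minimum of optimistic estimates online, maximum of pessimistic estimates offline).

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d = 4, 2, 3  # hypothetical sizes: states, actions, feature dimension

# phi(s, a): each feature vector lies on the probability simplex over d factors.
phi = rng.dirichlet(np.ones(d), size=(S, A))   # shape (S, A, d)
# M: each row is a distribution over next states (stand-in for the unknown M*).
M = rng.dirichlet(np.ones(S), size=d)          # shape (d, S)
# psi(s'): one-hot features, so row s' of the identity matrix is psi(s').
psi = np.eye(S)                                # shape (S, S)

# Bilinear kernel: P[s, a, s'] = phi(s, a)^T M psi(s').
P = np.einsum("sad,de,te->sat", phi, M, psi)


def select_representation(q_per_rep, mode):
    """Combine per-representation Q estimates of shape (num_reps, S, A).

    mode="ucb": estimates are upper bounds, keep the smallest (tightest).
    mode="lcb": estimates are lower bounds, keep the largest (tightest).
    Returns the combined Q table and the chosen representation index per (s, a).
    """
    if mode == "ucb":
        idx = np.argmin(q_per_rep, axis=0)
    elif mode == "lcb":
        idx = np.argmax(q_per_rep, axis=0)
    else:
        raise ValueError(f"unknown mode: {mode}")
    q = np.take_along_axis(q_per_rep, idx[None], axis=0)[0]
    return q, idx


# Toy check: two representations with different (made-up) confidence widths.
q_true = rng.uniform(0.0, 1.0, size=(S, A))
bonus = rng.uniform(0.0, 0.5, size=(2, S, A))
q_ucb, rep_ucb = select_representation(q_true + bonus, "ucb")
q_lcb, rep_lcb = select_representation(q_true - bonus, "lcb")
```

In this toy setup each row `P[s, a, :]` is a valid distribution because `phi` and the rows of `M` are themselves probability vectors. The selected estimates remain valid bounds on `q_true` while being at least as tight as those of any single representation.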
### Experimental Verification
The paper verifies the effectiveness of the proposed algorithms through experiments on different MDPs. The results show that, in both the online and offline settings, ReLEX-UCB and ReLEX-LCB outperform their counterparts that use a single representation, confirming the advantage of representation selection.
In conclusion, this paper shows how representation selection can improve the sample efficiency of reinforcement learning in low-rank MDPs, introducing new algorithms and providing rigorous theoretical guarantees and experimental evidence.