Abstract: Learning-to-rank has been intensively studied and has shown significantly increasing value in a wide range of domains, such as web search, recommender systems, dialogue systems, machine translation, and even computational biology, to name a few. In light of recent advances in neural networks, there has been a strong and continuing interest in exploring how to deploy popular techniques, such as reinforcement learning and adversarial learning, to solve ranking problems. However, studies armed with these techniques tend to focus on showing how effective a new method is. A comprehensive comparison between techniques and an in-depth analysis of their deficiencies have been largely overlooked. This paper is motivated by the observation that recent ranking methods based on either reinforcement learning or adversarial learning boil down to policy-gradient-based optimization. Based on widely used benchmark collections with complete information (where relevance labels are known for all items), such as MSLR-WEB30K and Yahoo-Set1, we thoroughly investigate the extent to which policy-gradient-based ranking methods are effective. On the one hand, we analytically identify the pitfalls of policy-gradient-based ranking. On the other hand, we experimentally compare a wide range of representative methods. The experimental results echo our analysis and show that policy-gradient-based ranking methods are, by a large margin, inferior to many conventional ranking methods. Regardless of whether reinforcement learning or adversarial learning is used, the failures are largely attributable to gradient estimation based on sampled rankings, which diverge significantly from ideal rankings. In particular, the larger the number of documents per query and the more fine-grained the ground-truth labels, the more severely policy-gradient-based ranking suffers. Careful examination of this weakness is highly recommended when developing enhanced methods based on policy gradient.
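The abstract's point about gradient estimation from sampled rankings can be illustrated with a minimal sketch. It assumes a Plackett-Luce sampling policy over per-document scores and NDCG as the reward. The abstract does not confirm either choice, and all function names here are hypothetical. The sketch draws rankings from the policy, weights each ranking's log-probability gradient by its reward (a plain REINFORCE-style estimator with no baseline), and records the rewards so one can see how far the sampled rankings fall from the ideal one.

```python
import math
import random


def sample_ranking(scores, rng):
    """Draw one ranking from a Plackett-Luce distribution over the scores."""
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


def grad_log_prob(scores, ranking):
    """Gradient of log P(ranking) with respect to each document score."""
    grad = [0.0] * len(scores)
    remaining = list(ranking)
    for doc in ranking:
        total = sum(math.exp(scores[i]) for i in remaining)
        for i in remaining:
            grad[i] -= math.exp(scores[i]) / total  # softmax term at this step
        grad[doc] += 1.0  # indicator for the document actually placed
        remaining.remove(doc)
    return grad


def ndcg(ranking, labels):
    """NDCG of a full ranking, using exponential gain and log2 discount."""
    def dcg(order):
        return sum((2 ** labels[d] - 1) / math.log2(pos + 2)
                   for pos, d in enumerate(order))
    ideal = dcg(sorted(range(len(labels)), key=lambda d: -labels[d]))
    return dcg(ranking) / ideal if ideal > 0 else 0.0


def policy_gradient_estimate(scores, labels, num_samples, rng):
    """Average reward-weighted log-probability gradients over sampled rankings."""
    grad = [0.0] * len(scores)
    rewards = []
    for _ in range(num_samples):
        ranking = sample_ranking(scores, rng)
        reward = ndcg(ranking, labels)
        rewards.append(reward)
        for i, g in enumerate(grad_log_prob(scores, ranking)):
            grad[i] += reward * g / num_samples
    return grad, rewards


# Illustrative query: 8 documents with graded labels (made-up values).
rng = random.Random(0)
labels = [4, 3, 2, 1, 0, 0, 0, 0]
scores = [0.0] * len(labels)  # untrained policy: every ranking equally likely
grad, rewards = policy_gradient_estimate(scores, labels, 500, rng)
print("mean sampled NDCG:", sum(rewards) / len(rewards))
print("best sampled NDCG:", max(rewards))
```

Increasing the length of `labels` or the number of distinct grades in this toy setup gives a rough feel for the abstract's claim that more documents per query and finer-grained labels make the ideal ranking harder to reach by sampling. It is not a reproduction of the paper's experiments.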

PolicyBoost: Functional Policy Gradient with Ranking-based Reward Objective

Boosting Nonparametric Policies.

Beyond Reward: Offline Preference-guided Policy Optimization

Model-free Policy Learning with Reward Gradients

Direct Preference-based Policy Optimization without Reward Modeling

Latent-Conditioned Policy Gradient for Multi-Objective Deep Reinforcement Learning

Stochastic Cubic-Regularized Policy Gradient Method

Diagnostic Evaluation of Policy-Gradient-Based Ranking

Policy ensemble gradient for continuous control problems in deep reinforcement learning

Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations

Deterministic Policy Optimization by Combining Pathwise and Score Function Estimators for Discrete Action Spaces

A nearly Blackwell-optimal policy gradient method

Napping for Functional Representation of Policy.

OMPO: A Unified Framework for RL under Policy and Dynamics Shifts

Boosting Weak-to-Strong Agents in Multiagent Reinforcement Learning via Balanced PPO

Reparameterized Policy Learning for Multimodal Trajectory Optimization

Policy Gradient for Reinforcement Learning with General Utilities

Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL

Policy Gradient Converges to the Globally Optimal Policy for Nearly Linear-Quadratic Regulators

Quantile-Based Policy Optimization for Reinforcement Learning

Trajectory-Oriented Policy Optimization with Sparse Rewards