Mitigating Exposure Bias in Online Learning to Rank Recommendation: A Novel Reward Model for Cascading Bandits

Masoud Mansoury, Bamshad Mobasher, Herke van Hoof

2024-08-08

Abstract: Exposure bias is a well-known issue in recommender systems where items and suppliers are not equally represented in the recommendation results. This bias becomes particularly problematic over time as a few items are repeatedly over-represented in recommendation lists, leading to a feedback loop that further amplifies this bias. Although extensive research has addressed this issue in model-based or neighborhood-based recommendation algorithms, less attention has been paid to online recommendation models, such as those based on top-K contextual bandits, where recommendation models are dynamically updated with ongoing user feedback. In this paper, we study exposure bias in a class of well-known contextual bandit algorithms known as Linear Cascading Bandits. We analyze these algorithms in their ability to handle exposure bias and provide a fair representation of items in the recommendation results. Our analysis reveals that these algorithms fail to mitigate exposure bias in the long run during the course of ongoing user interactions. We propose an Exposure-Aware reward model that updates the model parameters based on two factors: 1) implicit user feedback and 2) the position of the item in the recommendation list. The proposed model mitigates exposure bias by controlling the utility assigned to the items based on their exposure in the recommendation list. Our experiments with two real-world datasets show that our proposed reward model improves the exposure fairness of the linear cascading bandits over time while maintaining the recommendation accuracy. It also outperforms the current baselines. Finally, we prove a high probability upper regret bound for our proposed model, providing theoretical guarantees for its performance.

Information Retrieval

What problem does this paper attempt to address?

The paper addresses exposure bias in online learning-to-rank recommendation. Items and suppliers in a recommender system rarely get equal opportunity to appear in the recommendation results: a few items are over-exposed, while most rarely appear in the list. Over time this forms a feedback loop that further amplifies the bias. Exposure bias has been studied extensively for model-based and neighborhood-based recommendation algorithms, but much less for online recommendation models, such as those based on top-K contextual bandits, which are updated dynamically from ongoing user feedback.

The paper studies exposure bias in linear cascading bandit algorithms and finds that they fail to mitigate it in the long run. It proposes an Exposure-Aware reward model that updates the model parameters based on two factors: 1) implicit user feedback and 2) the position of the item in the recommendation list. The model mitigates exposure bias by controlling the utility assigned to items according to their exposure in the recommendation list. Experiments on two real-world datasets show that the reward model improves the exposure fairness of linear cascading bandits over time while maintaining recommendation accuracy, and that it outperforms the current baselines. The authors also prove a high-probability upper bound on regret, which gives a theoretical guarantee for the model's performance.
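To make the two update factors (click feedback and list position) concrete, here is a minimal Python sketch of one round of a linear cascading bandit update with a position-dependent reward. It is an illustration under stated assumptions, not the paper's reward model: the summary above does not give the formula, so the geometric exposure decay `gamma ** position`, the inverse-exposure weighting (in the style of inverse propensity scoring), and the helper names `examined_prefix`, `position_weighted_reward`, and `update` are all hypothetical. Only the cascade assumption itself (the user scans the list top-down and stops at the first click) and the ridge-regression statistics are standard.

```python
import numpy as np

def examined_prefix(ranked_items, click_position):
    """Cascade model: the user scans top-down and stops at the first click.

    Returns (item, position, clicked) for every examined item. Items below
    the click are unobserved; click_position=None means no click, so the
    whole list was examined.
    """
    stop = len(ranked_items) if click_position is None else click_position + 1
    return [(item, pos, pos == click_position)
            for pos, item in enumerate(ranked_items[:stop])]

def position_weighted_reward(clicked, position, gamma=0.8):
    """Hypothetical reward, NOT the paper's formula.

    Exposure is assumed to decay geometrically with position, and a click
    is weighted by the inverse of that exposure, so a click on a
    low-exposure slot counts for more than a click at the top.
    """
    exposure = gamma ** position  # 1.0 at the top slot, smaller further down
    return 1.0 / exposure if clicked else 0.0

def update(M, b, features, ranked_items, click_position, gamma=0.8):
    """Ridge-regression statistics update over the examined items only."""
    for item, pos, clicked in examined_prefix(ranked_items, click_position):
        x = features[item]
        M += np.outer(x, x)
        b += position_weighted_reward(clicked, pos, gamma) * x
    return M, b

# Toy round: three items with one-hot features, click on the second slot.
features = {
    "a": np.array([1.0, 0.0, 0.0]),
    "b": np.array([0.0, 1.0, 0.0]),
    "c": np.array([0.0, 0.0, 1.0]),
}
M, b = np.eye(3), np.zeros(3)
M, b = update(M, b, features, ["a", "b", "c"], click_position=1)
theta_hat = np.linalg.solve(M, b)  # estimated attraction weights
```

In this toy round item "c" sits below the click, so it contributes nothing to the update, while the click on "b" at the second slot is scaled by 1 / 0.8. A full bandit would also add an upper-confidence term when ranking items, which is omitted here.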

Unbiased Cascade Bandits: Mitigating Exposure Bias in Online Learning to Rank Recommendation

Exposure-Aware Recommendation using Contextual Bandits

Fairness of Exposure in Dynamic Recommendation

Learning with Exposure Constraints in Recommendation Systems

Cascading Bandits for Large-Scale Recommendation Problems

Clinical Online Recommendation with Subgroup Rank Feedback

Bandit Learning to Rank with Position-Based Click Models: Personalized and Equal Treatments

Misalignment, Learning, and Ranking: Harnessing Users Limited Attention

Cascading Bandits: Optimizing Recommendation Frequency in Delayed Feedback Environments

Counterfactual contextual bandit for recommendation under delayed feedback

Achieving User-Side Fairness in Contextual Bandits

Ranking with Popularity Bias: User Welfare under Self-Amplification Dynamics

Fairness-aware Bandit-based Recommendation

Design Principles of Robust Multi-Armed Bandit Framework in Video Recommendations

Modeling item exposure and user satisfaction for debiased recommendation with causal inference

Achieving Counterfactual Fairness for Causal Bandit

The Nah Bandit: Modeling User Non-compliance in Recommendation Systems

Adaptively Learning to Select-Rank in Online Platforms

Modeling and Counteracting Exposure Bias in Recommender Systems

Low-rank Bandits with Latent Mixtures