Leveraging Granularity: Hierarchical Reinforcement Learning for Pedagogical Policy Induction

Guojing Zhou, Hamoon Azizsoltani, Markel Sanz Ausin, Tiffany Barnes, Min Chi
DOI: https://doi.org/10.1007/s40593-021-00269-9
2021-08-16
International Journal of Artificial Intelligence in Education
Abstract: In interactive e-learning environments such as Intelligent Tutoring Systems, pedagogical decisions can be made at different levels of granularity. In this work, we focus on making decisions at two levels, whole problems vs. single steps, and explore three types of granularity: problem-level only (Prob-Only), step-level only (Step-Only), and both problem and step levels (Both). More specifically, for Prob-Only, our pedagogical agent decides whether the next problem should be a worked example (WE) or a problem-solving task (PS). In WEs, students observe how the tutor solves a problem, while in PSs students solve the problem themselves. For Step-Only, the agent decides whether to elicit the student's next solution step or to tell the step directly. Here the student and the tutor co-construct the solution, and we refer to this type of task as collaborative problem-solving (CPS). For Both, the agent first decides whether the next problem should be a WE, a PS, or a CPS and then, based on the problem-level decision, makes step-level decisions on whether to elicit or tell each step. In a series of classroom studies, we compare the three types of granularity under random yet reasonable pedagogical decisions. Results showed that Prob-Only may be less effective for High students and Step-Only may be less effective for Low ones, whereas Both can be effective for both High and Low students. Motivated by these findings, we propose and apply an offline, off-policy, Gaussian Processes based Hierarchical Reinforcement Learning (HRL) framework to induce a hierarchical pedagogical policy that makes adaptive, effective decisions at both the problem and step levels. In an empirical classroom study, our results showed that the HRL policy is significantly more effective than a Deep Q-Network (DQN) induced step-level policy and a random yet reasonable step-level baseline policy.
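The two-level decision structure described for Both can be illustrated with a minimal Python sketch. It is not the paper's Gaussian Processes based HRL method: both levels use uniform random choices as a stand-in, and the constraints that make the studied policies "random yet reasonable" are not given in the abstract and are not modeled here. The function names (`choose_problem_type`, `choose_step_actions`, `plan_problem`) are hypothetical, and the sketch assumes that a WE tells every step and a PS elicits every step, so that genuine step-level choices arise only in CPS.

```python
import random

# Action sets taken from the abstract.
PROBLEM_ACTIONS = ("WE", "PS", "CPS")   # problem-level choices
STEP_ACTIONS = ("elicit", "tell")       # step-level choices


def choose_problem_type(rng):
    """Problem-level decision. Uniform random is a placeholder policy."""
    return rng.choice(PROBLEM_ACTIONS)


def choose_step_actions(problem_type, n_steps, rng):
    """Step-level decisions, conditioned on the problem-level decision.

    Assumption: WE tells every step, PS elicits every step, and only
    CPS leaves a real elicit/tell choice at each step.
    """
    if problem_type == "WE":
        return ["tell"] * n_steps
    if problem_type == "PS":
        return ["elicit"] * n_steps
    return [rng.choice(STEP_ACTIONS) for _ in range(n_steps)]


def plan_problem(n_steps, rng):
    """Make one problem-level decision, then the step-level decisions."""
    problem_type = choose_problem_type(rng)
    return problem_type, choose_step_actions(problem_type, n_steps, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only for reproducibility
    for _ in range(3):
        print(plan_problem(4, rng))
```

In an induced hierarchical policy, the two random choices would be replaced by learned problem-level and step-level policies that condition on the student's state.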