Yuxiang Qiu, Karim Djemili, Denis Elezi, Aaneel Shalman, María Pérez-Ortiz, Emine Yilmaz, John Shawe-Taylor, Sahan Bulathwela
Abstract: With the advancement and utility of Artificial Intelligence (AI), personalising education to a global population could be a cornerstone of new educational systems in the future. This work presents the PEEKC dataset and the TrueLearn Python library, which contains a dataset and a series of online learner state models that are essential to facilitate research on learner engagement modelling. The TrueLearn family of models was designed following the "open learner" concept, using humanly-intuitive user representations. This family of scalable, online models also helps end-users visualise the learner models, which may in the future facilitate user interaction with their models/recommenders. The extensive documentation and coding examples make the library highly accessible to both machine learning developers and educational data mining and learning analytics practitioners. The experiments show the utility of both the dataset and the library, with predictive performance significantly exceeding comparative baseline models. The dataset contains a large number of AI-related educational videos, which are of interest for building and validating AI-specific educational recommenders.
What problem does this paper attempt to address?
The paper attempts to solve three problems:
1. **Lack of publicly available educational video engagement datasets**: Most publicly available datasets focus on question-answering and testing scenarios and lack detailed data on how learners watch educational videos. This restricts researchers' ability to develop and validate personalized educational recommendation systems.
2. **Deficiencies of existing models in large-scale, continuous-learning environments**: Traditional Knowledge Tracing (KT) and Item Response Theory (IRT) models mainly target limited sets of learning materials and testing scenarios, so they cannot effectively support personalized learning needs in large-scale, continuous-learning environments.
3. **Improving the prediction performance of educational video recommendation systems**: Existing educational recommendation systems fail to fully utilize the implicit signals of user-video interactions (such as clicks and viewing duration), making it difficult to achieve efficient and personalized learning support.
For this purpose, the paper makes two main contributions:
- **Creation and release of the PEEKC dataset**: This is a publicly available dataset covering more than 20,000 informal learners watching AI-related educational videos. Each video segment is annotated with relevant Knowledge Components (KCs). The data were collected in a real-world environment and can therefore better reflect learners' natural learning behaviors.
- **Development of the TrueLearn library**: This is an open-source Python library that contains the latest Bayesian online learning models and visualization tools for modeling learners' interests, knowledge, and novelty. The design of this library follows the "open learner" concept, uses an intuitive user representation, and provides multiple visualization methods to help users understand and manage their own learning states.
Through these two contributions, the paper aims to promote the research and development of personalized educational recommendation systems, especially in large-scale, continuous-learning environments, by using implicit interaction signals to enhance learners' engagement and learning effectiveness.
### Formula Summary
The main formulas involved in the paper are as follows:
1. **Knowledge component coverage of resources**:
\[
SR(c, c')=\log\left(\frac{\max(|L_c|, |L_{c'}|)}{|L_c\cap L_{c'}|}\right)-\log\left(\frac{\min(|L_c|, |L_{c'}|)}{|W|}\right)
\]
where \( L_c \) represents the set of concepts linked to the Wikipedia concept \( c \), and \( W \) represents the set of all Wikipedia topics.
2. **Cosine similarity calculation**:
\[
\cos(\text{str}, c)=\frac{\text{TFIDF}(\text{str})\cdot\text{TFIDF}(c)}{\|\text{TFIDF}(\text{str})\|\times\|\text{TFIDF}(c)\|}
\]
where \( \text{TFIDF}(s) \) returns the TF-IDF vector of the string \( s \), and \( \|\cdot\| \) represents the norm of the vector.
3. **Normalized viewing time**:
\[
e_t^{\ell, r_i}=\frac{W(\ell, r_i)}{D(r_i)}
\]
where \( W(\cdot) \) is a function that returns the viewing time of the learner \( \ell \) for the resource \( r_i \), and \( D(\cdot) \) is a function that returns the duration of the lecture segment \( r_i \).
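As a rough illustration of the normalized viewing time, the ratio can be sketched in a few lines of Python. The function name `normalised_watch_time` and its arguments are hypothetical and are not taken from the TrueLearn library; the sketch only assumes that both the viewing time and the segment duration are measured in seconds.

```python
def normalised_watch_time(watch_seconds: float, duration_seconds: float) -> float:
    """Return W(l, r_i) / D(r_i) for one learner and one lecture segment.

    Hypothetical helper for illustration only; not part of TrueLearn.
    """
    if duration_seconds <= 0:
        raise ValueError("segment duration must be positive")
    if watch_seconds < 0:
        raise ValueError("viewing time cannot be negative")
    # The formula is a plain ratio; no clipping is applied here, so the
    # value can exceed 1 if the logged viewing time exceeds the duration.
    return watch_seconds / duration_seconds


# A learner who watched 150 s of a 300 s segment.
engagement = normalised_watch_time(150.0, 300.0)
```

Whether values above 1 (for example, from re-watching) should be clipped is a design choice that the formula itself does not settle.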
These formulas are used to process and analyze educational video data to model learners' engagement and learning states.
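The cosine similarity between TF-IDF vectors (formula 2 above) can likewise be sketched in plain Python. This is a minimal sketch under stated assumptions, not the pipeline used to build the dataset: whitespace tokenization, the smoothed IDF variant, and the helper names `tfidf_vectors` and `cosine` are all choices made here for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts of term -> weight) for a small corpus.

    Assumptions: lowercase whitespace tokenization and a smoothed IDF,
    log((1 + N) / (1 + df)) + 1. Other TF-IDF variants would work equally well.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


# Compare a transcript fragment with two hypothetical concept descriptions.
vecs = tfidf_vectors([
    "machine learning basics",
    "machine learning basics",
    "roman history",
])
same_topic = cosine(vecs[0], vecs[1])
different_topic = cosine(vecs[0], vecs[2])
```

Identical texts score close to 1, while texts with no shared terms score 0, which is the behavior the formula implies.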