Abstract:

Background: Current research on the understandability of health information uses medical readability formulas to assess the cognitive difficulty of health education resources. This rests on an implicit assumption that medical domain knowledge, represented by uncommon words or jargon, forms the sole barrier to health information access among the public. Our study challenged this assumption by showing that, for readers from non-English-speaking backgrounds with higher educational attainment, the semantic features that underpin the knowledge structure of English health texts, rather than medical jargon, can explain the cognitive accessibility of health materials. These readers have a better understanding of English health terms yet limited exposure to English-based health education environments and traditions.

Objective: Our study explores multidimensional semantic features for developing machine learning algorithms to predict the perceived level of cognitive accessibility of English health materials on health risks and diseases for young adults enrolled in Australian tertiary institutes. We compared algorithms to evaluate the cognitive accessibility of health information for nonnative English speakers with advanced education levels yet limited exposure to English health education environments.

Methods: We used 113 semantic features to measure the content complexity and accessibility of original English resources. Using 1000 English health texts collected from Australian and international health organization websites and rated by overseas tertiary students, we compared machine learning algorithms (decision tree, support vector machine [SVM], ensemble tree, and logistic regression) after hyperparameter optimization (grid search for the hyperparameter combination with the minimal classification error). We applied 10-fold cross-validation on the whole data set for model training and testing, and calculated the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy as measures of model performance.

Results: We developed and compared 4 machine learning algorithms using multidimensional semantic features as predictors. The ensemble tree (LogitBoost) outperformed the other algorithms in terms of AUC (0.97), sensitivity (0.966), specificity (0.972), and accuracy (0.969). Decision tree (AUC 0.924, sensitivity 0.912, specificity 0.9358, and accuracy 0.924) and SVM (AUC 0.8946, sensitivity 0.8952, specificity 0.894, and accuracy 0.8946) followed closely. Decision tree, ensemble tree, and SVM achieved statistically significant improvement over logistic regression in AUC, specificity, and accuracy. As the best performing algorithm, ensemble tree reached statistically significant improvement over SVM in AUC, specificity, and accuracy, and statistically significant improvement over decision tree in sensitivity.

Conclusions: Our study shows that the cognitive accessibility of English health texts is not limited to word length and sentence length, as conventionally measured by medical readability formulas. We compared machine learning algorithms based on semantic features to explore the cognitive accessibility of health information for nonnative English speakers. The new models reached statistically significantly higher AUC, sensitivity, and accuracy in predicting health resource accessibility for the target readership. Our study illustrated that semantic features such as cognitive ability-related semantic features, communicative actions and processes, power relationships in health care settings, and lexical familiarity and diversity of health texts are major contributors to the comprehension of health information; for readers such as international students, semantic features of health texts outweigh syntax and domain knowledge.
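The evaluation protocol named in the Methods (grid search, 10-fold cross-validation, then AUC, sensitivity, specificity, and accuracy) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn, uses synthetic data in place of the 1000 rated texts and 113 semantic features, and the hyperparameter grid, the 5-fold inner search, the 0.5 decision threshold, and the helper name `binary_metrics` are hypothetical choices. Only the decision tree is shown; the other algorithms would be swapped in the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def binary_metrics(y_true, y_pred, y_score):
    """Return AUC, sensitivity, specificity, and accuracy for binary labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_true, y_score),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Synthetic stand-in (assumption): 1000 samples x 113 features, binary label.
X, y = make_classification(
    n_samples=1000, n_features=113, n_informative=20, random_state=0
)

# Grid search for the hyperparameter combination with the best accuracy,
# i.e. the minimal classification error. The grid itself is hypothetical.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 20]},
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

# 10-fold cross-validation on the whole data set with the tuned model:
# every sample is scored by a model that did not see it during fitting.
cv10 = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
proba = cross_val_predict(
    search.best_estimator_, X, y, cv=cv10, method="predict_proba"
)[:, 1]
pred = (proba >= 0.5).astype(int)

metrics = binary_metrics(y, pred, proba)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Tuning on the full data set before cross-validating, as done here for brevity, can make the reported scores optimistic; nesting the grid search inside each outer fold is a common way to avoid that.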

A machine learning approach to reading level assessment

Text Readability Assessment for Second Language Learners

A Framework for Learning Assessment through Multimodal Analysis of Reading Behaviour and Language Comprehension

A Readable Read: Automatic Assessment of Language Learning Materials based on Linguistic Complexity

Linguistic Features Distinguishing Students' Writing Ability Aligned with CEFR Levels

Machine-Learned Computational Models Can Enhance the Study of Text and Discourse: A Case Study Using Eye Tracking to Model Reading Comprehension

Decoding Contextual Factors Differentiating Adolescents’ High, Average, and Low Digital Reading Performance Through Machine-Learning Methods

Linguistic Features for Readability Assessment

Assessment of Optimal Pedagogical Factors for Canadian ESL Learners’ Reading Literacy Through Artificial Intelligence Algorithms

Predicting Health Material Accessibility: Development of Machine Learning Algorithms

Diverse Linguistic Features for Assessing Reading Difficulty of Educational Filipino Texts

Feeding What You Need by Understanding What You Learned

Using Machine Learning and Natural Language Processing Techniques to Analyze and Support Moderation of Student Book Discussions

Measure Children’s Mindreading Ability with Machine Reading

Identifying key factors of reading achievement: A machine learning approach

Identifying Key Contextual Factors of Digital Reading Literacy Through a Machine Learning Approach

Efficient Measuring of Readability to Improve Documents Accessibility for Arabic Language Learners

Assessing Language Proficiency from Eye Movements in Reading

Synergistic effects of instruction and affect factors on high- and low-ability disparities in elementary students’ reading literacy

An Exploration of Impact Factors Influencing Students’ Reading Literacy in Singapore with Machine Learning Approaches