Abstract:
BACKGROUND
Objective measures such as vital signs and lab values provide only a partial view of a patient’s condition. Patient reported outcome measures (PROMs)1 and patient reported experience measures (PREMs)1 are subjective reports shared by patients that can help complete this view by filling in gaps that other methods cannot assess, such as pain levels, patient experience, motivation, human factors, patient related outcomes, and health priorities. Machine learning, the use of computer algorithms that improve automatically through experience, is a powerful tool in healthcare that often does not utilize subjective information shared by patients.2 Furthermore, earlier implementations of machine learning in medicine were developed without patient or public input and may be missing priorities and measures that matter to patients. Public and patient involvement can bring these measures together by defining end-user experience, meaning, patient priorities, and implementation, thus providing enriched data for machine learning and more functional PROMs and PREMs.
PROMs are questionnaires measuring patients’ views of their health status. PREMs refer to data collected from patients on their experience within the health care system. These questionnaires can help researchers and clinicians understand the patient perspective, identify goals for care, and evaluate the impact of care. Machine learning is an application of Artificial Intelligence (AI) that trains systems to automatically learn and improve from experience. In the past decade, machine learning has given us practical speech and speech-to-text recognition, algorithms for medical diagnosis, improvements in predictive epidemiology and public health, and prognostic treatment models.
While machine learning is a powerful tool, its algorithms are only as reliable and free from bias as the data used to build and train them.3 This review of reviews looks at ways to integrate machine learning with patient reported outcomes for the development of improved public and patient partnership in research and health care.
OBJECTIVE
What can we learn from existing systematic reviews about the best methods for combining machine learning and patient reported outcomes?
1. How are the public engaged as involved partners in the development of Artificial Intelligence (AI) in medicine?
2. What examples of good practice can we identify for the integration of Patient Reported Outcome Measures (PROMs) into machine learning algorithms?
3. How has value-based healthcare influenced the development of artificial intelligence in healthcare?
METHODS
Searches
This review covers a broad range of interrelated topics, which we will assess by conducting three separate scoping reviews. The first review will focus on the intersection of AI and PROMs. The second scoping review will focus on AI and public involvement. The third will focus on AI and value-based healthcare. We have chosen to do three separate scoping reviews instead of one or multiple systematic reviews in order to more efficiently identify knowledge gaps and investigate the way the research was conducted.4,5 Preliminary searches have indicated that large bodies of knowledge have been published concerning the integration of PROMs into statistical methods6,7,8, but few publications have described frameworks for public and patient involvement in the development of artificial intelligence. Search strategies for each review were developed with the team and reviewed by our information specialist (CS). Our search strategies utilize controlled terms and a range of techniques to optimise sensitivity. No language restrictions will be applied.
Each review will include relevant date restrictions to further isolate informative and innovative research. The MEDLINE database will be used to identify initial search results. Initial search results will be reviewed to confirm there are no significant exclusions. Once the final search strategy has been identified, we will expand our search to the following information sources: Ovid MEDLINE(R), EMBASE, PsycINFO, Science Citation Index, Cochrane Library, Database of Abstracts of Reviews of Effects, PROSPERO. For search strategies, see Appendix 1.
Types of Study to be Included
We will include systematic reviews and overviews published in any language. Reviews will be included if they searched a minimum of two databases, appraised the included studies, and provided a synthesis of the data and information retrieved. All findings will be reviewed and discussed by members of the author team until consensus is reached. Once a preliminary set of eligible studies has been identified for each review based on outcome measures and broad inclusion criteria, we will progress to the next stage of evaluation. Each eligible study will be further evaluated against narrower inclusion criteria in order to select the most relevant and informative research for each review.
Types of Study to be Excluded
Upon initial screening of title and abstract, we will exclude articles meeting any of the following criteria:
• Papers not dealing with any form of or related forms of Artificial Intelligence
• Papers in which no relevant outcomes are reported
• Papers describing protocols for future studies
• Papers dealing with animal models
• Papers in which the full paper is not accessible
Condition or Domain being studied
We are investigating three domains.
In the first scoping review, we will study examples of how the general public has been involved in Artificial Intelligence development where the outcomes include aspects of the trial and the experiences and perspectives of the public, participants, or researchers. The second review will focus on machine learning algorithms that have utilized Patient Reported Outcome Measures (PROMs) to improve their performance on a healthcare related task. This will include any research study that investigates the use of PROMs to improve diagnostic or treatment approaches. The third review will investigate artificial intelligence research that has focused on value-based care. Studies that have utilized artificial intelligence to investigate, evaluate, or design value-based care systems will be included.
Public and Patient Involvement
Patients and members of the public will be involved in the review and will be trained to screen titles and abstracts as well as to carry out risk of bias and quality assessment. They will be named as authors at that time if they have met the standards for authorship. Funding constraints and COVID-19 restrictions prevented us from involving them more actively in protocol building.
Dissemination
The research will be disseminated via social media and presented by the authors at conferences and convenings. The lessons learned and the findings will be used to teach our teen and young adult learners at the Stanford Anesthesia Summer Institute.
RESULTS
Main Outcomes
The following outcomes will be considered:
• Public involvement in artificial intelligence research planning, conduct, or management
• Public involvement in research analysis
• Research recruitment, enrolment, and retention
• Factors that affect cooperation and participation
• Patient reported outcome measures (PROMs)
• Patient reported experience measures (PREMs)
• Ethics related to the inclusion of patient reported information in AI
• Factors relating to participant interaction with AI
• Barriers to acquiring PROMs and PREMs for use in AI research
• Cost-effectiveness outcomes relating to inclusion of PROMs and PREMs in AI research
Measures of effect
Quantitative, qualitative, and mixed methods studies will be included in our reviews. If there are sufficient quantitative studies relating to the inclusion of PROMs in AI to warrant a meta-analysis, we will perform one and calculate a weighted effect across the studies using a random effects model. After fitting a random effects model, it may still be desirable to identify sources of heterogeneity. If this is the case, we will use subgroup analysis to investigate the reasons for heterogeneity.
Data Extraction
The flow of information through the different stages of our review will be guided by the PRISMA flow chart.9 First, we will identify records through database searching and other sources as described in Appendix 1. Relevant results from each database and source of information will then be downloaded into Zotero, a reference management tool for research materials. Results will then be uploaded into Covidence, an online tool for screening references, for screening and analysis. After uploading into Covidence, we will remove duplicate records. Titles and abstracts of potentially relevant articles will then be screened independently by at least two reviewers against the relevant inclusion criteria. Discrepancies will be resolved through discussion with the entire group when necessary.
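The random effects pooling mentioned under Measures of effect can be illustrated with a short sketch. The protocol does not say which estimator will be used; the code below assumes the DerSimonian-Laird method, and the function name `random_effects_pool` and its inputs (one effect estimate and one within-study variance per study) are hypothetical choices made for illustration only.

```python
import math


def random_effects_pool(effects, variances):
    """Pool study effects with a DerSimonian-Laird random effects model.

    Assumption: the protocol only says "random effects model"; the
    DerSimonian-Laird estimator is one common choice, not the authors'
    stated method.
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q measures observed heterogeneity around the fixed estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # Between-study variance tau^2, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))

    # I^2: share of total variation attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0

    return {"pooled": pooled, "se": se, "tau2": tau2, "q": q, "i2": i2}
```

A large `i2` or `tau2` from such a calculation would be the kind of signal that motivates the subgroup analysis described above; running the same function separately on each subgroup is one simple way to compare them.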
Individuals recruited from the Cochrane Task Exchange, Stanford Medicine X, and Stanford Science Technology and Medicine Summer Internships will co-produce the study design and will be active in screening, data extraction, analysis, prioritizing what to report, and editing and authoring tasks. After excluding initial search results that do not meet our inclusion criteria, we will review the full text of the remaining records. Full text review will be conducted by at least two authors, with an additional author reserved to mediate where agreement is uncertain. Authors will then come to agreement through discussion. The full paper review will result in the final set of included records. The authors will provide a table showing the characteristics of the included studies, as seen in Table 1, and an additional table showing the author, year, and exclusion reason for studies excluded at full text review, as seen in Table 2.
Table 1: Table of Included Studies
Study Name | Intervention | Enablers | Barriers | Outcomes | Results
Example Study | Intervention used | Enabler to intervention, if any | Barrier to intervention, if any | Measures of outcome used | Result of intervention
Table 2: Table of Excluded Studies
Study Name and Year | Exclusion Reason
Example Study | Reason for exclusion
Risk of bias (quality) assessment
For quantitative studies, we will utilize the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach.10 This approach provides a structured and transparent evaluation for summarizing evidence for reviews. The GRADE approach classifies the quality of evidence of quantitative studies into one of four levels: high, moderate, low, and very low. The ratings of the quality of evidence attempt to describe how much confidence there is that the true effect lies close to the estimated effect. Confidence in the Evidence from Reviews of Qualitative Research (CERQual)11,12 will be used to summarize confidence in the findings of the qualitative reviews.
This is based on four components: limitations of methodology, relevance to the research question, coherence, and the adequacy of the data presented. CERQual enables ratings of “high”, “moderate”, “low” and “very low”. The starting point of “high confidence” reflects the assumption that each review finding is a reasonable representation of the question of interest; the rating is downgraded if there are factors that would weaken this assumption. After assessing all four components independently, authors will agree on overall confidence for each review finding and its relevance to the review of reviews.
Strategy for data synthesis
For the study investigating public involvement, we will utilize a relational analysis to present our results. Broadly, a relational analysis is a type of content analysis in which concepts found in our review are further analysed by how they relate to each other. We are most interested in approaches to public involvement in AI research, as well as enablers of and barriers to those approaches. With this technique we will be able to use data from eligible sources to identify examples of strategies, enablers, barriers, and outcomes. Once we have identified these examples in our eligible sources of information, we will present the data visually in a flow chart and discuss these observations within the discussion. The template for how this chart will look can be seen in Figure 1. For the review focusing on PROMs, we will chart the different evidence used and outcomes collected for different algorithms that utilize PROMs, as seen in Figure 2. In this review we are most interested in how PROMs are integrated into AI tools and what outcomes result from their use. Finally, for the study focusing on Value Based Care, we will also utilize a table similar to Figure 2.
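As a rough illustration of the relational analysis described above, the links between concepts can be treated as co-occurrence counts across coded sources. This is only a sketch under stated assumptions: the protocol does not specify a coding scheme or tooling, the function name `relate` is hypothetical, and the entries in `coded_sources` are invented placeholders rather than findings.

```python
from collections import Counter
from itertools import product

# Hypothetical coded extractions: one dict per eligible source, listing the
# concepts coded under each category. These values are placeholders only.
coded_sources = [
    {"strategy": ["advisory panel"], "enabler": ["training"],
     "barrier": ["funding"], "outcome": ["retention"]},
    {"strategy": ["advisory panel", "co-design workshop"], "enabler": ["training"],
     "barrier": ["funding", "time"], "outcome": ["recruitment"]},
    {"strategy": ["co-design workshop"], "enabler": [],
     "barrier": ["time"], "outcome": ["retention"]},
]


def relate(sources, from_kind, to_kind):
    """Count how many sources mention each (from_kind, to_kind) concept pair."""
    links = Counter()
    for source in sources:
        left = set(source.get(from_kind, []))
        right = set(source.get(to_kind, []))
        # Every pairing within one source counts once for that source.
        for pair in product(left, right):
            links[pair] += 1
    return links


# Which barriers are reported alongside which involvement strategies?
strategy_to_barrier = relate(coded_sources, "strategy", "barrier")
for (strategy, barrier), count in sorted(strategy_to_barrier.items()):
    print(f"{strategy} -> {barrier}: {count} source(s)")
```

Pair counts of this kind could serve as the edges of a flow chart linking strategies to enablers, barriers, and outcomes, with the count indicating how many sources support each link.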
Figure 1:
Figure 2:
Limitations
We have chosen to do three scoping reviews instead of full systematic reviews because research has indicated that scoping reviews will help us answer our research questions more efficiently.4 In a scoping review, the goal is to determine what evidence is available, not to synthesize evidence from multiple study designs and provide concrete guidance.13 Scoping reviews are therefore limited in their ability to provide concrete guidance. However, we are only aiming to examine the types of available evidence in this field and to identify key factors related to our topics. We are attempting to identify methods for including patient involvement in machine learning and to explore how value-based care has impacted machine learning. We are not aiming to produce a specific answer to a specific clinical or policy making question, so this limitation is acceptable. Furthermore, scoping reviews generally provide an overview of existing evidence, regardless of quality.13 In our scoping reviews, we will use the GRADE approach and CERQual to assess the quality of our sources. In this approach, authors will discuss the quality of each review and determine whether it should be included. Thus, we are directly addressing this limitation, and we still believe a scoping review is the right choice for each topic.
CONCLUSIONS
By executing this proposed protocol, we hope to identify examples of good practice for including public involvement in the development of machine learning systems. We hope to identify enablers of and barriers to public involvement, as well as interventions and outcomes that have utilized PROMs and PREMs. Lastly, we hope to identify examples of how value-based healthcare has influenced the development of AI systems in healthcare.

Combining Machine Learning, Patient Reported Outcomes and Value Based Healthcare: A Protocol for Scoping Reviews (Preprint)
