Abstract:

Background and rationale: Knee osteoarthritis (OA) is a common disease characterized by reduced function, stiffness, and pain. The clinical diagnosis is commonly supported by weight-bearing radiography of the knee. Radiographic measures such as the Kellgren-Lawrence (KL) grade are used as eligibility criteria for clinical studies, while others, such as the OARSI grades and minimal joint space width, are used as endpoints for structural OA progression. A higher preoperative KL grade has been correlated with better pain and functional outcomes after knee arthroplasty. Consequently, the KL grade is a common requirement for approving knee arthroplasty among American health insurance providers, and orthopedic surgeons commonly use it when determining knee arthroplasty candidacy. Historically, a radiologist had to annotate and grade knee radiographs manually to extract these features. With increasing computational power and the wider use of deep convolutional neural networks, off-the-shelf artificial intelligence (AI) tools have become available for automatic extraction of these features. These tools have received regulatory approval for commercialization, but more rigorous external validation is required. Finally, as AI tools mature, new versions are released, and it is important to assess how each release changes the performance of the tool.

Objectives: The aim of this analysis is to evaluate the performance of a commercially available AI tool and of readers with different experience levels in orthopedic surgery and radiology at clinically relevant Kellgren-Lawrence grading system thresholds.

Methods: This study is a secondary analysis of the data from the AutoRayValid-RBknee study, a retrospective observer performance study. The data consist of non-fixed-flexion radiographs retrieved from the production picture archiving and communication systems (PACS) of three European centers.
The primary outcome will be the difference in area under the receiver operating characteristic curve (AUC) between the readers and the AI tool at the prior authorization clinical criteria threshold (KL ≥ 3). Key secondary outcomes will use the thresholds for radiographic knee osteoarthritis (KL ≥ 2), osteoarthritis clinical trial inclusion (2 ≤ KL ≤ 3), and weight-loss trial inclusion (1 ≤ KL ≤ 3). The AUC of the readers will be computed using the summary ROC (SROC) approach proposed by Oakden-Rayner et al. In addition, the AI tool's performance on ordinal OARSI grades will be evaluated using the ordinal ROC approach proposed by Obuchowski et al., and the AUC will be used to estimate performance for binary OARSI-grade and patellar osteophyte classification.

Population: Patients with knee pain referred for radiography on suspicion of knee osteoarthritis.

Index test: Readers: Each center will recruit four readers from radiology and orthopedic surgery: one in-training and one board-certified reader per specialty. AI tool: RBknee-2.2.0 (CE version, KL grading, OARSI grading, patellar osteophytes) and RBknee-2.1.0 (CE version, KL grading, OARSI grading, patellar osteophytes) will be used for the change impact analysis across product versions.

Reference test: The reference standard will be determined by the majority vote of three readers, one from each participating hospital, each a board-certified musculoskeletal radiology consultant with expertise in the clinical and research evaluation of knee OA, including extensive experience with the KL grade.

Further statistical details

Sample size: Not applicable, as this is a secondary analysis.

Framework: This is a diagnostic test accuracy study assessing the performance of a commercially available AI tool for radiographic evaluation of knee osteoarthritis according to established grading systems. Additionally, a change impact analysis will be performed where multiple versions of the AI tool are available.
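The threshold definitions and the majority-vote reference standard described above can be sketched in a few lines. This is a minimal illustration in Python, not the study's analysis code (the plan itself specifies R). The reader grades, the AI grades, the helper names, and the use of the AI tool's ordinal KL grade as the ranking score are all hypothetical. The median fallback for three-way disagreement is likewise an assumption, since the plan does not state how such cases are resolved.

```python
from collections import Counter
from statistics import median

# KL thresholds named in the plan, as inclusive (low, high) ranges on the 0-4 scale.
THRESHOLDS = {
    "prior_authorization": (3, 4),          # KL >= 3
    "radiographic_oa": (2, 4),              # KL >= 2
    "oa_trial_inclusion": (2, 3),           # 2 <= KL <= 3
    "weight_loss_trial_inclusion": (1, 3),  # 1 <= KL <= 3
}


def majority_vote(grades):
    """Reference KL grade from three reference readers.

    ASSUMPTION: when all three readers disagree, fall back to the median grade.
    """
    grade, count = Counter(grades).most_common(1)[0]
    return grade if count >= 2 else int(median(grades))


def binarize(grade, low, high):
    """Return 1 if the KL grade lies in the inclusive range [low, high], else 0."""
    return int(low <= grade <= high)


def auc(scores, labels):
    """Empirical AUC via the Mann-Whitney statistic; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: three reference readers per knee, plus the AI tool's KL grade.
reference_reads = [(0, 0, 1), (1, 2, 2), (3, 3, 4), (4, 4, 4), (2, 3, 4), (1, 1, 0)]
ai_grades = [0, 2, 2, 4, 3, 1]

reference = [majority_vote(r) for r in reference_reads]
low, high = THRESHOLDS["prior_authorization"]
labels = [binarize(g, low, high) for g in reference]
ai_auc = auc(ai_grades, labels)
print(f"AI AUC at KL >= 3: {ai_auc:.3f}")
```

For the two range-based thresholds (2 ≤ KL ≤ 3 and 1 ≤ KL ≤ 3) the ordinal grade is no longer monotonically related to the positive class, so a different score would be needed to rank cases; the abstract does not specify one, and the sketch therefore only computes the AUC at KL ≥ 3.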
Confidence intervals and P values: Confidence intervals will be reported at the 95% level, and P values will be evaluated at an alpha of 5%.

Multiplicity: No explicit multiplicity correction will be performed. Instead, hypotheses will be tested hierarchically in the order in which they are listed in Table 3.

Statistical software: R version 4.2.2 (or newer).
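One common reading of a hierarchical approach is a fixed-sequence procedure: hypotheses are tested in their prespecified order at the full alpha, and formal testing stops at the first non-significant result. The Python sketch below illustrates that reading only. The abstract does not spell out the stopping rule, so the procedure, the strict `p < alpha` comparison, the hypothesis labels, and the P values are all assumptions made for illustration.

```python
def fixed_sequence(ordered_p_values, alpha=0.05):
    """Fixed-sequence testing over (name, p_value) pairs in prespecified order.

    A hypothesis is rejected only if its own p-value is below alpha AND every
    hypothesis before it in the sequence was also rejected.
    ASSUMPTION: this is one plausible interpretation of "hierarchical".
    """
    results = []
    gate_open = True  # stays True until the first non-significant hypothesis
    for name, p in ordered_p_values:
        rejected = gate_open and p < alpha
        results.append((name, rejected))
        if not rejected:
            gate_open = False
    return results


# Hypothetical P values, listed in an assumed tabular order.
ordered = [
    ("KL >= 3", 0.01),
    ("KL >= 2", 0.03),
    ("2 <= KL <= 3", 0.20),
    ("1 <= KL <= 3", 0.001),
]

results = fixed_sequence(ordered)
for name, rejected in results:
    print(f"{name}: {'rejected' if rejected else 'not rejected'}")
```

Under this reading, the last hypothesis is not rejected despite its small P value, because the hypothesis before it in the sequence failed; that gatekeeping is what lets each test use the full alpha without a separate correction.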
