Motor Symptom Machine Rating System for Complete MDS-UPDRS III in Parkinson's Disease: A Proof-of-concept Pilot Study.
Xue Zhu, Zhonglue Chen, Yun Ling, Ningdi Luo, Qianyi Yin, Yichi Zhang, Aonan Zhao, Guanyu Ye, Haiyan Zhou, Jing Pan, Liche Zhou, Linghao Cao, Pei Huang, Pingchen Zhang, Cheng Chen, Weikun Shi, Shinuan Lin, Haimei Zhuang, Jin Zhao, Kang Ren, Yuyan Tan, Jun Liu
DOI: https://doi.org/10.1097/cm9.0000000000003044
IF: 6.133
2024-01-01
Chinese Medical Journal
Abstract: To the Editor: Parkinson's disease (PD) is one of the most common neurodegenerative movement disorders.[1] The severity of PD-related motor symptoms is usually evaluated semiquantitatively ("normal", "slight", "mild", "moderate", and "severe") by expert physicians according to the Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale Part III (MDS-UPDRS III).[2] However, the MDS-UPDRS III is semiquantitative and subjective, which might mask mild treatment effects or even produce false-positive results. Various artificial intelligence (AI) technologies and telemedicine approaches[3,4] have been explored in patient evaluation to address these problems. However, the items that assess rigidity according to the instructions of the MDS-UPDRS III were not included in previous studies. Five ratings for "Rigidity" (including upper extremity [UE], lower extremity [LE], and neck) and one rating for "Postural Stability" of the MDS-UPDRS III require physical contact between the physician and the patient and therefore cannot be performed directly by machine vision. Our team previously constructed models for rigidity evaluation based on features collected from patients' motions through machine vision,[5] which makes machine rating of the entire MDS-UPDRS III possible. A good machine-rating model of the MDS-UPDRS III can improve the accuracy, objectivity, and consistency of the evaluation of clinical symptoms and treatment efficacy. Previous studies have made remarkable achievements, but no unified method for rating the entire MDS-UPDRS III from video has been achieved, which largely limits the application of machine rating in clinical use.
To our knowledge, the highest coverage rate of the 33 subitems of the MDS-UPDRS III in previous single studies was 45.45%.[4] Our study introduced a method based on machine vision and machine learning to achieve a machine rating of the entire MDS-UPDRS III scale for PD patients (details of study design in Supplementary Method 1, https://links.lww.com/CM9/B939). The study was approved by the Ruijin Hospital Ethics Committee, Shanghai Jiao Tong University School of Medicine (No. 2015-99). Written informed consent was obtained from each participant. Our study included 2610 videos from 149 PD patients (Supplementary Table 1, https://links.lww.com/CM9/B939; inclusion and exclusion criteria are detailed in Supplementary Method 2, https://links.lww.com/CM9/B939). The distributions of each MDS-UPDRS III subitem, total score, subscale score, and the Hoehn and Yahr (H-Y) stage among the 149 subjects are shown in Supplementary Figure 1, https://links.lww.com/CM9/B939. See Supplementary Method 3, https://links.lww.com/CM9/B939 for details of technical methods (including equipment, filming process, feature engineering, modelling, and statistical analysis).
Eighteen direct rating models and four indirect rating models were constructed, and the results are shown in Table 1. The 22 models covered the entire MDS-UPDRS III scale. In total, 77.8% of the rating models achieved an ACCStandard (listed as ACCAcceptable in Table 1; the machine rating score equaled either score-by-rater-I or score-by-rater-II, above clinical level) of 80% or higher. All rating models (100%) had an ACCAbsolute (machine rating score equaled score-by-rater-final, equivalent to clinical level) of 70% or higher; the lower threshold reflects the more stringent criterion. The intraclass correlation coefficient (ICC) of 90.9% of the rating models was higher than 0.40 (fair or better). Ten models achieved ICCs greater than 0.75 (excellent).
They were "3.1 Speech", "3.3a Rigidity-Neck", "3.3c Rigidity-LE", "3.4 Finger Tapping", "3.7 Toe Tapping", "3.8 Leg Agility", "3.10 Gait", "3.12 Postural Stability", "3.14 Global Spontaneity of Movement", and "3.17c Rest Tremor Amplitude LE". Seven models had ICCs between 0.60 and 0.74 (good). They were "3.2 Facial Expression", "3.3b Rigidity-UE", "3.5 Hand Movements", "3.6 Pronation-Supination Movements of Hands", "3.11 Freezing of Gait", "3.15 Postural Tremor", and "3.18 Constancy of Rest Tremor". The ICCs of three models were at the "fair" level (0.40–0.59): "3.13 Posture", "3.16 Kinetic Tremor", and "3.17a Rest Tremor Amplitude Lip/Jaw". The ICCs of the remaining two models ("3.9 Arising from Chair" and "3.17b Rest Tremor Amplitude UE") were at the "poor" level. Our models covered 100% of the MDS-UPDRS III subitems. Performance varied across models but overall demonstrated the feasibility of our model construction.
Table 1 - Performance comparison of different evaluation indicators.
MDS-UPDRS III item | ACCAbsolute | ACCAcceptable | ICC (95% CI, level)
3.1 Speech | 0.93 | / | 0.93* (0.90–0.95, Excellent)
3.2 Facial Expression | 0.71 | 0.71 | 0.63* (0.52–0.72, Good)
3.3a Rigidity-Neck | 0.85 | / | 0.76* (0.68–0.83, Excellent)
3.3b Rigidity-UE | 0.73 | / | 0.62* (0.54–0.69, Good)
3.3c Rigidity-LE | 0.71 | / | 0.78* (0.73–0.82, Excellent)
3.4 Finger Tapping | 0.73 | 0.71 | 0.78* (0.73–0.82, Excellent)
3.5 Hand Movements | 0.70 | 0.75 | 0.69* (0.62–0.74, Good)
3.6 Pronation-Supination Movements of Hands | 0.76 | 0.80 | 0.73* (0.68–0.78, Good)
3.7 Toe Tapping | 0.78 | 0.80 | 0.79* (0.75–0.83, Excellent)
3.8 Leg Agility | 0.70 | 0.82 | 0.78* (0.73–0.82, Excellent)
3.9 Arising from Chair | 0.75 | 0.82 | 0.28* (0.13–0.43, Poor)
3.10 Gait | 0.88 | 0.90 | 0.80* (0.74–0.85, Excellent)
3.11 Freezing of Gait | 0.94 | 0.96 | 0.74* (0.65–0.81, Good)
3.12 Postural Stability | 0.87 | 0.85 | 0.87* (0.82–0.90, Excellent)
3.13 Posture | 0.70 | 0.74 | 0.54* (0.42–0.65, Fair)
3.14 Global Spontaneity of Movement | 0.91 | 0.91 | 0.83* (0.77–0.88, Excellent)
3.15 Postural Tremor | 0.81 | 0.86 | 0.60* (0.51–0.67, Good)
3.16 Kinetic Tremor | 0.79 | 0.85 | 0.51* (0.39–0.60, Fair)
3.17a Rest Tremor Amplitude Lip/Jaw | 0.88 | 0.89 | 0.41* (0.29–0.52, Fair)
3.17b Rest Tremor Amplitude UE | 0.91 | 0.91 | 0.38* (0.27–0.47, Poor)
3.17c Rest Tremor Amplitude LE | 0.98 | 0.97 | 0.80* (0.73–0.85, Excellent)
3.18 Constancy of Rest Tremor | 0.84 | 0.86 | 0.72* (0.63–0.79, Good)
*P <0.001. ACCAbsolute: Machine rating score equaled score-by-rater-final, equivalent to clinical level; ACCAcceptable: Machine rating score equaled either score-by-rater-I or score-by-rater-II, above clinical level; CI: Confidence interval; ICC: Intraclass correlation coefficient; LE: Lower extremity; MDS-UPDRS III: Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale Part III; UE: Upper extremity.
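The two accuracy criteria and the ICC levels in Table 1 can be illustrated with a minimal Python sketch. It assumes one integer score per video from the machine, rater I, rater II, and the final rater score; the function names and the five toy ratings are hypothetical and are not taken from the study's code. The ICC cut-offs follow the bands quoted in the text (0.40, 0.60, 0.75), with the handling of values exactly on a boundary being an assumption.

```python
from typing import Sequence


def acc_absolute(machine: Sequence[int], rater_final: Sequence[int]) -> float:
    """Share of videos where the machine score equals score-by-rater-final."""
    hits = sum(m == f for m, f in zip(machine, rater_final))
    return hits / len(machine)


def acc_acceptable(machine: Sequence[int], rater_1: Sequence[int],
                   rater_2: Sequence[int]) -> float:
    """Share of videos where the machine score equals rater I or rater II."""
    hits = sum(m in (a, b) for m, a, b in zip(machine, rater_1, rater_2))
    return hits / len(machine)


def icc_level(icc: float) -> str:
    """Map an ICC value to the level names used in Table 1 (assumed bands)."""
    if icc >= 0.75:
        return "Excellent"
    if icc >= 0.60:
        return "Good"
    if icc >= 0.40:
        return "Fair"
    return "Poor"


# Hypothetical scores for five videos (illustration only, not study data).
machine = [0, 1, 2, 1, 3]
rater_1 = [0, 1, 1, 1, 3]
rater_2 = [0, 2, 2, 1, 2]
rater_final = [0, 1, 2, 1, 2]

toy_absolute = acc_absolute(machine, rater_final)
toy_acceptable = acc_acceptable(machine, rater_1, rater_2)

# ACCAcceptable values of the 18 models with an entry in Table 1 (3.2 to 3.18).
table1_acc_acceptable = [
    0.71, 0.71, 0.75, 0.80, 0.80, 0.82, 0.82, 0.90, 0.96,
    0.85, 0.74, 0.91, 0.86, 0.85, 0.89, 0.91, 0.97, 0.86,
]
share_at_least_80 = (
    sum(v >= 0.80 for v in table1_acc_acceptable) / len(table1_acc_acceptable)
)

print(toy_absolute, toy_acceptable)
print(round(share_at_least_80 * 100, 1))  # share of models at 80% or higher
print(icc_level(0.93), icc_level(0.28))
```

On the Table 1 values, 14 of the 18 models with an ACCAcceptable entry reach 0.80 or higher, which corresponds to the 77.8% quoted in the text.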
Four indirect rating models of the subitems "Speech" and "Rigidity" were constructed by using different model inputs [Supplementary Table 2, https://links.lww.com/CM9/B939]. The feature-based model used motion and signal features as inputs, while the item-based model used the score-by-rater-final of all subitems except "Speech" and "Rigidity" as inputs. The ACCAbsolute of the feature-based model had significant advantages over that of the item-based model. The better performance of the indirect rating models that used motion and signal features indicates that these features carried more abundant and accurate information than the scores of the semiquantitative scale. Thus, motion and signal features have the potential to describe the severity of motor symptoms more accurately than the original scale of the MDS-UPDRS III.
The performances of the adder model of the MDS-UPDRS III total score and subscale scores are shown in Supplementary Table 3, https://links.lww.com/CM9/B939. Regarding the error of the models, the mean absolute errors (MAEs) of the total score and the tremor, rigidity, bradykinesia, and axial subscale scores were 3.98, 1.74, 1.3, 2.18, and 0.94, respectively. The root mean square errors (RMSEs) of the corresponding models were 5.34, 2.93, 1.95, 2.81, and 1.57, respectively. Regarding the consistency of the models, four models (the total score and the rigidity, bradykinesia, and axial subscale scores) achieved ICCs greater than 0.75 (excellent), and only the ICC of the tremor subscale score was at the level of "good" (0.69, confidence interval [CI]: 0.58–0.76). Regarding correlations, all the models exhibited at least a "strong" correlation (greater than 0.80) between the expert ratings and the model predictions. The models with a "very strong" correlation were the model of the total score (0.93, 95% CI: 0.89–0.95) and the model of the bradykinesia subscale score (0.94, 95% CI: 0.91–0.96).
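The error metrics reported for the adder model can be sketched as follows. The sketch assumes that the adder model sums the per-item machine scores into a total, which the letter does not spell out, and every score below is invented for illustration; the helper names are hypothetical.

```python
import math
from typing import Sequence


def adder_total(item_scores: Sequence[int]) -> int:
    """Assumed adder model: total score is the sum of per-item machine scores."""
    return sum(item_scores)


def mae(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean absolute error between predicted and expert totals."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def rmse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Root mean square error between predicted and expert totals."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


# Hypothetical per-item machine scores for three patients (not study data).
machine_items = [
    [1, 0, 2, 1],
    [2, 2, 1, 3],
    [0, 1, 0, 0],
]
# Hypothetical expert totals for the same three patients.
expert_totals = [5, 8, 3]

machine_totals = [adder_total(items) for items in machine_items]
total_mae = mae(machine_totals, expert_totals)
total_rmse = rmse(machine_totals, expert_totals)

print(machine_totals, total_mae, round(total_rmse, 3))
```

RMSE weights large errors more heavily and is never smaller than MAE on the same data, which is consistent with the reported pairs (for example, 3.98 and 5.34 for the total score).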
The other three models for the subscale score (tremor, rigidity, and axial) had a correlation level of "strong".
Our study explored video-based machine vision and machine learning technology for the complete machine rating of the MDS-UPDRS III and proved its feasibility. Our study had the following advantages: (1) machine rating of all 33 subitems, 4 subscales, and the total score of the MDS-UPDRS III scale based on machine vision was realized; (2) the selected features and models had reasonable clinical interpretability; (3) 2610 videos from 149 subjects were included to ensure an adequate and representative sample; and (4) two different levels of evaluation criteria, "above clinical level" (ACCStandard) and "equivalent to clinical level" (ACCAbsolute), were designed to evaluate the performance of all rating models. The machine vision-based model has the potential to provide additional objective and quantitative information for clinical observations. For example, for "Pronation-Supination Movements of Hands", a neurologist using the MDS-UPDRS III can only rate the patient's performance subjectively by counting the rhythm and roughly estimating the speed and amplitude of the patient's motion. Our model can precisely detect the halt time, speed, and amplitude (kinematic features). The minimum value of the cross-sectional area (CSA) signal (signal feature) also provides integrated information on the rhythm, speed, and amplitude that is undetectable by the naked eye [Supplementary Result 1, https://links.lww.com/CM9/B939].
Clinical interpretability is an important basis for the clinical application of machine learning technology. Few studies have discussed feature selection and model selection from the perspective of clinical interpretability. For direct rating models, we extracted features based on the description of the scale to make them more consistent with clinical observation.
Indirect rating model features were selected from the evaluation motions of other relevant subitems. Features of "Rigidity" of limbs were extracted from other active motions involving the relevant muscle group. "Speech" and "Rigidity-Neck" did not have relevant active motion. Their severity was hypothesized to be positively correlated with disease severity, and features were extracted from all other subitems [Supplementary Result 2, https://links.lww.com/CM9/B939]. To explore the optimal classifier for the study, we compared three models, Extreme Gradient Boosting (XGBoost), ordered logistic regression (OLR), and support vector machine (SVM), and found that XGBoost exhibited excellent performance for multiclass classification of items on the MDS-UPDRS III [Supplementary Table 4, https://links.lww.com/CM9/B939].
Certain limitations existed in this study and warrant further attention: (1) The samples were collected from a movement disorders clinic, and the outpatient population mainly consisted of patients with mild-to-moderate stages of PD. A more balanced subject pool covering all H-Y stages could increase the generalizability of this model (Supplementary Results 3, 4, https://links.lww.com/CM9/B939). (2) Our study was a single-center study. Multicenter verification is needed to expand the adaptability of the research results. (3) The ICCs of the rating models of "Arising from Chair" and "Rest Tremor Amplitude UE" were lower than 0.4. The inadequate consistency of these two models may be attributed to an insufficient number of subjects with ratings of 1, 2, 3, and 4. The extracted features cannot fully represent the movement performance of subjects with more advanced symptoms. However, their ACCAbsolute values were 0.75 and 0.91, which demonstrated the feasibility of our model.
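How a high ACCAbsolute can coexist with a poor ICC when most subjects are rated 0 can be shown with a small Python sketch. The letter does not state here which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form ICC(2,1) of Shrout and Fleiss is assumed; the 100 ratings are invented, and the function name is hypothetical.

```python
from typing import List, Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) from a subjects-by-raters table via two-way ANOVA mean squares."""
    n = len(ratings)        # number of subjects (rows)
    k = len(ratings[0])     # number of raters (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical imbalanced item: 92 subjects rated 0 and 8 rated 1 by the expert.
expert: List[int] = [0] * 92 + [1] * 8
# Hypothetical machine that outputs the majority rating for everyone.
machine: List[int] = [0] * 100

accuracy = sum(m == e for m, e in zip(machine, expert)) / len(expert)
icc = icc_2_1([[e, m] for e, m in zip(expert, machine)])

# Accuracy is 0.92, yet the ICC is approximately 0 because the machine
# never separates the few higher-rated subjects from the rest.
print(accuracy, round(icc, 3))
```

In this toy case the accuracy is driven by the dominant rating, whereas the ICC depends on how well subjects with different ratings are told apart, so a shortage of subjects with ratings of 1 to 4 affects the two metrics very differently.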
In future research, we will further optimize the performance and generalization ability of the rating models through multicenter research that collects a larger and more balanced dataset. At the same time, owing to the high sensitivity of machine vision-based motion perception technology and its ability to capture spatial motion comprehensively, the technical system derived from this research is expected to support a new clinical evaluation system for motor symptoms with more dimensions, higher precision, and finer granularity than the current scale.
Acknowledgments
The authors thank all the study participants and their families for their participation in this study.
Conflicts of interest
None.