Cardiac manifestations of ulcerative colitis.

A. Blum,Rafea Shalabi,Tamar Brofman,I. Shajrawi

2009-12-01

Abstract:

A Robust Phenotype-Driven Likelihood Ratio Analysis Approach Assisting Interpretable Clinical Diagnosis of Rare Diseases.

Jian Yang,Liqi Shu,Huilong Duan,Haomin Li

DOI: https://doi.org/10.1016/j.jbi.2023.104372

IF: 8

2023-01-01

Journal of Biomedical Informatics

Abstract:Phenotype-based prioritization of candidate genes and diseases has become a well-established approach for multi-omics diagnostics of rare diseases. Most current algorithms exploit semantic analysis and probabilistic statistics based on Human Phenotype Ontology and are commonly superior to naive search methods. However, these algorithms are mostly less interpretable and do not perform well in real clinical scenarios due to noise and imprecision of query terms, and the fact that individuals may not display all phenotypes of the disease they belong to. We present a Phenotype-driven Likelihood Ratio analysis approach (PheLR) assisting interpretable clinical diagnosis of rare diseases. With a likelihood ratio paradigm, PheLR estimates the posterior probability of candidate diseases and how much a phenotypic feature contributes to the prioritization result. Benchmarked using simulated and realistic patients, PheLR shows significant advantages over current approaches and is robust to noise and inaccuracy. To facilitate clinical practice and visualized differential diagnosis, PheLR is implemented as an online web tool (https://phelr.nbscn.org).
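To make the likelihood ratio paradigm in the abstract above concrete, here is a minimal sketch (not PheLR's actual implementation). It multiplies per-phenotype likelihood ratios into the pretest odds of one candidate disease and converts the result back to a probability. The function name, the phenotype frequencies, and the 1% pretest probability are hypothetical, and the phenotypes are assumed to be conditionally independent.

```python
import math

def posterior_probability(pretest_prob, likelihood_ratios):
    """Combine a pretest probability with independent likelihood ratios."""
    pretest_odds = pretest_prob / (1.0 - pretest_prob)
    # Work in log space so that many small or large ratios stay numerically stable.
    log_odds = math.log(pretest_odds) + sum(math.log(lr) for lr in likelihood_ratios)
    posttest_odds = math.exp(log_odds)
    return posttest_odds / (1.0 + posttest_odds)

# Hypothetical example: each ratio is P(phenotype | disease) / P(phenotype | no disease).
lrs = [0.8 / 0.1, 0.6 / 0.3]
p = posterior_probability(0.01, lrs)
print(round(p, 4))
```

Because the log likelihood ratios simply add, each phenotype's term shows how much it moved the ranking, which is the kind of per-feature contribution the abstract describes.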
Toward Cross-Platform Electronic Health Record-Driven Phenotyping Using Clinical Quality Language

Pascal S Brandt,Richard C Kiefer,Jennifer A Pacheco,Prakash Adekkanattu,Evan T Sholle,Faraz S Ahmad,Jie Xu,Zhenxing Xu,Jessica S Ancker,Fei Wang,Yuan Luo,Guoqian Jiang,Jyotishman Pathak,Luke V Rasmussen

DOI: https://doi.org/10.1002/lrh2.10233

2020-01-01

Learning Health Systems

Abstract:Introduction Electronic health record (EHR)-driven phenotyping is a critical first step in generating biomedical knowledge from EHR data. Despite recent progress, current phenotyping approaches are manual, time-consuming, error-prone, and platform-specific. This results in duplication of effort and highly variable results across systems and institutions, and is not scalable or portable. In this work, we investigate how the nascent Clinical Quality Language (CQL) can address these issues and enable high-throughput, cross-platform phenotyping. Methods We selected a clinically validated heart failure (HF) phenotype definition and translated it into CQL, then developed a CQL execution engine to integrate with the Observational Health Data Sciences and Informatics (OHDSI) platform. We executed the phenotype definition at two large academic medical centers, Northwestern Medicine and Weill Cornell Medicine, and conducted results verification (n = 100) to determine precision and recall. We additionally executed the same phenotype definition against two different data platforms, OHDSI and Fast Healthcare Interoperability Resources (FHIR), using the same underlying dataset and compared the results. Results CQL is expressive enough to represent the HF phenotype definition, including Boolean and aggregate operators, and temporal relationships between data elements. The language design also enabled the implementation of a custom execution engine with relative ease, and results verification at both sites revealed that precision and recall were both 100%. Cross-platform execution resulted in identical patient cohorts generated by both data platforms. Conclusions CQL supports the representation of arbitrarily complex phenotype definitions, and our execution engine implementation demonstrated cross-platform execution against two widely used clinical data platforms. 
The language thus has the potential to help address current limitations with portability in EHR-driven phenotyping and scale in learning health systems.
Quantitative disease risk scores from EHR with applications to clinical risk stratification and genetic studies

Danqing Xu,Chen Wang,Atlas Khan,Ning Shang,Zihuai He,Adam Gordon,Iftikhar J. Kullo,Shawn Murphy,Yizhao Ni,Wei-Qi Wei,Ali Gharavi,Krzysztof Kiryluk,Chunhua Weng,Iuliana Ionita-Laza

DOI: https://doi.org/10.1038/s41746-021-00488-3

IF: 15.2

2021-07-23

npj Digital Medicine

Abstract:Labeling clinical data from electronic health records (EHR) in health systems requires extensive knowledge from human experts and painstaking review by clinicians. Furthermore, existing phenotyping algorithms are not uniformly applied across large datasets and can suffer from inconsistencies in case definitions across different algorithms. We describe here quantitative disease risk scores based on almost unsupervised methods that require minimal input from clinicians, can be applied to large datasets, and alleviate some of the main weaknesses of existing phenotyping algorithms. We show applications to phenotypic data on approximately 100,000 individuals in eMERGE, and focus on several complex diseases, including Chronic Kidney Disease, Coronary Artery Disease, Type 2 Diabetes, Heart Failure, and a few others. We demonstrate that relative to existing approaches, the proposed methods have higher prediction accuracy, can better identify phenotypic features relevant to the disease under consideration, can perform better at clinical risk stratification, and can identify undiagnosed cases based on phenotypic features available in the EHR. Using genetic data from the eMERGE-seq panel that includes sequencing data for 109 genes on 21,363 individuals from multiple ethnicities, we also show how the new quantitative disease risk scores help improve the power of genetic association studies relative to the standard use of disease phenotypes. The results demonstrate the effectiveness of quantitative disease risk scores derived from rich phenotypic EHR databases to provide a more meaningful characterization of clinical risk for diseases of interest beyond the prevalent binary (case-control) classification.

health care sciences & services,medical informatics
PheProb: probabilistic phenotyping using diagnosis codes to improve power for genetic association studies.

Jennifer A. Sinnott,Fiona Cai,Sheng Yu,Boris P. Hejblum,Chuan Hong,Isaac S. Kohane,Katherine P. Liao

DOI: https://doi.org/10.1093/jamia/ocy056

2018-01-01

Journal of the American Medical Informatics Association

Abstract:Objective: Standard approaches for large scale phenotypic screens using electronic health record (EHR) data apply thresholds, such as >= 2 diagnosis codes, to define subjects as having a phenotype. However, the variation in the accuracy of diagnosis codes can impair the power of such screens. Our objective was to develop and evaluate an approach which converts diagnosis codes into a probability of a phenotype (PheProb). We hypothesized that this alternate approach for defining phenotypes would improve power for genetic association studies. Methods: The PheProb approach employs unsupervised clustering to separate patients into 2 groups based on diagnosis codes. Subjects are assigned a probability of having the phenotype based on the number of diagnosis codes. This approach was developed using simulated EHR data and tested in a real world EHR cohort. In the latter, we tested the association between low density lipoprotein cholesterol (LDL-C) genetic risk alleles known for association with hyperlipidemia and hyperlipidemia codes (ICD-9 272.x). PheProb and thresholding approaches were compared. Results: Among n = 1462 subjects in the real world EHR cohort, the threshold-based p-values for association between the genetic risk score (GRS) and hyperlipidemia were 0.126 (>= 1 code), 0.123 (>= 2 codes), and 0.142 (>= 3 codes). The PheProb approach produced the expected significant association between the GRS and hyperlipidemia: p = .001. Conclusions: PheProb improves statistical power for association studies relative to standard thresholding approaches by leveraging information about the phenotype in the billing code counts. The PheProb approach has direct applications where efficient approaches are required, such as in Phenome-Wide Association Studies.
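The idea of replacing a code-count threshold with a probability can be sketched as follows. This is an illustrative stand-in, not the published PheProb model: it assumes a plain two-component Poisson mixture over diagnosis code counts, fitted by EM, and the function names, starting values, and counts are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing k events with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def phenotype_probabilities(code_counts, n_iter=100):
    """Fit a two-component Poisson mixture by EM.

    Returns, for each patient, the probability of belonging to the
    high-count component. Assumes the counts are not all identical.
    """
    n = len(code_counts)
    lam_lo, lam_hi, pi_hi = 0.5, float(max(code_counts)), 0.5  # crude starting values
    resp = [0.5] * n
    for _ in range(n_iter):
        # E-step: responsibility of the high-count component for each patient.
        resp = []
        for k in code_counts:
            hi = pi_hi * poisson_pmf(k, lam_hi)
            lo = (1.0 - pi_hi) * poisson_pmf(k, lam_lo)
            resp.append(hi / (hi + lo))
        # M-step: update the mixing weight and the two Poisson means.
        total_hi = sum(resp)
        pi_hi = total_hi / n
        lam_hi = sum(r * k for r, k in zip(resp, code_counts)) / total_hi
        lam_lo = sum((1.0 - r) * k for r, k in zip(resp, code_counts)) / (n - total_hi)
    return resp

# Hypothetical diagnosis code counts for twelve patients.
counts = [0, 0, 1, 0, 1, 0, 8, 10, 9, 0, 1, 12]
probs = phenotype_probabilities(counts)
for k, p in zip(counts, probs):
    print(k, round(p, 3))
```

The resulting probabilities can then serve as a continuous outcome in an association test in place of a binary case definition such as ">= 2 codes".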
Impact of Diverse Data Sources on Computational Phenotyping

Liwei Wang,Janet E. Olson,Suzette J. Bielinski,Jennifer L. St. Sauver,Sunyang Fu,Huan He,Mine S. Cicek,Matthew A. Hathcock,James R. Cerhan,Hongfang Liu

DOI: https://doi.org/10.3389/fgene.2020.00556

IF: 3.7

2020-06-03

Frontiers in Genetics

Abstract:Electronic health records (EHRs) are widely adopted with a great potential to serve as a rich, integrated source of phenotype information. Computational phenotyping, which extracts phenotypes from EHR data automatically, can accelerate the adoption and utilization of phenotype-driven efforts to advance scientific discovery and improve healthcare delivery. A list of computational phenotyping algorithms has been published but data fragmentation, i.e., incomplete data within one single data source, has been raised as an inherent limitation of computational phenotyping. In this study, we investigated the impact of diverse data sources on two published computational phenotyping algorithms, rheumatoid arthritis (RA) and type 2 diabetes mellitus (T2DM), using Mayo EHRs and Rochester Epidemiology Project (REP) which links medical records from multiple health care systems. Results showed that both RA (less prevalent) and T2DM (more prevalent) case selections were markedly impacted by data fragmentation, with positive predictive value (PPV) of 91.4 and 92.4%, false-negative rate (FNR) of 26.6 and 14% in Mayo data, respectively, PPV of 97.2 and 98.3%, FNR of 5.2 and 3.3% in REP. T2DM controls also contain biases, with PPV of 91.2% and FNR of 1.2% for Mayo. We further elaborated underlying reasons impacting the performance.

genetics & heredity
A Semiparametric Approach for Robust and Efficient Learning with Biobank Data

Molei Liu,Xinyi Wang,Chuan Hong

2024-04-01

Abstract:With the increasing availability of electronic health records (EHR) linked with biobank data for translational research, a critical step in realizing its potential is to accurately classify phenotypes for patients. Existing approaches to achieve this goal are based on error-prone EHR surrogate outcomes, assisted and validated by a small set of labels obtained via medical chart review, which may also be subject to misclassification. Ignoring the noise in these outcomes can induce severe estimation and validation bias to both EHR phenotyping and risk modeling with biomarkers collected in the biobank. To overcome this challenge, we propose a novel unsupervised and semiparametric approach to jointly model multiple noisy EHR outcomes with their linked biobank features. Our approach primarily aims at disease risk modeling with the baseline biomarkers, and is also able to produce a predictive EHR phenotyping model and validate its performance without observations of the true disease outcome. It consists of composite and nonparametric regression steps free of any parametric model specification, followed by a parametric projection step to reduce the uncertainty and improve the estimation efficiency. We show that our method is robust to violations of the parametric assumptions while attaining the desirable root-$n$ convergence rates on risk modeling. Our developed method outperforms existing methods in extensive simulation studies, as well as a real-world application in phenotyping and genetic risk modeling of type II diabetes.

Methodology
A Quantitative Bias Analysis Approach to Informative Presence Bias in Electronic Health Records

Hanxi Zhang,Amy S. Clark,Rebecca A. Hubbard

DOI: https://doi.org/10.1097/ede.0000000000001714

2024-04-18

Epidemiology

Abstract:Accurate outcome and exposure ascertainment in electronic health record (EHR) data, referred to as EHR phenotyping, relies on the completeness and accuracy of EHR data for each individual. However, some individuals, such as those with a greater comorbidity burden, visit the health care system more frequently and thus have more complete data, compared with others. Ignoring such dependence of exposure and outcome misclassification on visit frequency can bias estimates of associations in EHR analysis. We developed a framework for describing the structure of outcome and exposure misclassification due to informative visit processes in EHR data and assessed the utility of a quantitative bias analysis approach to adjusting for bias induced by informative visit patterns. Using simulations, we found that this method produced unbiased estimates across all informative visit structures, if the phenotype sensitivity and specificity were correctly specified. We applied this method in an example where the association between diabetes and progression-free survival in metastatic breast cancer patients may be subject to informative presence bias. The quantitative bias analysis approach allowed us to evaluate robustness of results to informative presence bias and indicated that findings were unlikely to change across a range of plausible values for phenotype sensitivity and specificity. Researchers using EHR data should carefully consider the informative visit structure reflected in their data and use appropriate approaches such as the quantitative bias analysis approach described here to evaluate robustness of study findings.
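The role of phenotype sensitivity and specificity in a quantitative bias analysis can be illustrated with the classic back-correction of a misclassified count. This sketch assumes a single sensitivity and specificity for everyone, so it does not reproduce the visit-dependent misclassification structure studied in the paper; the function name and the numbers are hypothetical.

```python
def corrected_count(observed_positive, total, sensitivity, specificity):
    """Estimate the true number of positives from a misclassified count.

    Inverts: observed = Se * true + (1 - Sp) * (total - true).
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    return (observed_positive - total * (1.0 - specificity)) / denom

# Hypothetical cohort: 200 of 1000 patients flagged by the EHR phenotype.
observed, total = 200, 1000

# Sweep a range of plausible values to see how much the estimate moves.
for se in (0.7, 0.8, 0.9):
    for sp in (0.90, 0.95, 0.99):
        est = corrected_count(observed, total, se, sp)
        print(f"Se={se:.2f} Sp={sp:.2f} corrected prevalence={est / total:.3f}")
```

Repeating an analysis over such a grid is one simple way to check whether a finding is robust to the assumed sensitivity and specificity.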

public, environmental & occupational health
Desiderata for Computable Representations of Electronic Health Records-Driven Phenotype Algorithms

Huan Mo,William K. Thompson,Luke V. Rasmussen,Jennifer A. Pacheco,Guoqian Jiang,Richard Kiefer,Qian Zhu,Jie Xu,Enid Montague,David S. Carrell,Todd Lingren,Frank D. Mentch,Yizhao Ni,Firas H. Wehbe,Peggy L. Peissig,Gerard Tromp,Eric B. Larson,Christopher G. Chute,Jyotishman Pathak,Joshua C. Denny,Peter Speltz,Abel N. Kho,Gail P. Jarvik,Cosmin A. Bejan,Marc S. Williams,Kenneth Borthwick,Terrie E. Kitchner,Dan M. Roden,Paul A. Harris

DOI: https://doi.org/10.1093/jamia/ocv112

2015-01-01

Journal of the American Medical Informatics Association

Abstract:Background Electronic health records (EHRs) are increasingly used for clinical and translational research through the creation of phenotype algorithms. Currently, phenotype algorithms are most commonly represented as noncomputable descriptive documents and knowledge artifacts that detail the protocols for querying diagnoses, symptoms, procedures, medications, and/or text-driven medical concepts, and are primarily meant for human comprehension. We present desiderata for developing a computable phenotype representation model (PheRM). Methods A team of clinicians and informaticians reviewed common features for multisite phenotype algorithms published in PheKB.org and existing phenotype representation platforms. We also evaluated well-known diagnostic criteria and clinical decision-making guidelines to encompass a broader category of algorithms. Results We propose 10 desired characteristics for a flexible, computable PheRM: (1) structure clinical data into queryable forms; (2) recommend use of a common data model, but also support customization for the variability and availability of EHR data among sites; (3) support both human-readable and computable representations of phenotype algorithms; (4) implement set operations and relational algebra for modeling phenotype algorithms; (5) represent phenotype criteria with structured rules; (6) support defining temporal relations between events; (7) use standardized terminologies and ontologies, and facilitate reuse of value sets; (8) define representations for text searching and natural language processing; (9) provide interfaces for external software algorithms; and (10) maintain backward compatibility. Conclusion A computable PheRM is needed for true phenotype portability and reliability across different EHR products and healthcare systems. These desiderata are a guide to inform the establishment and evolution of EHR phenotype algorithm authoring platforms and languages.
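Desideratum (4), modeling phenotype algorithms with set operations, can be pictured with a toy example. The patient IDs, set names, and inclusion logic below are hypothetical and stand in for the results of queries against a common data model; this is a sketch of the idea, not a PheRM implementation.

```python
# Hypothetical patient ID sets, as if returned by three structured queries.
all_patients = set(range(101, 109))
has_dx_code = {101, 102, 103, 104, 105}
on_medication = {102, 103, 105, 106}
has_exclusion_dx = {105}

# Cases: diagnosis code AND medication, minus anyone with an exclusion diagnosis.
cases = (has_dx_code & on_medication) - has_exclusion_dx

# Controls: patients with neither a diagnosis code nor the medication.
controls = all_patients - (has_dx_code | on_medication)

print(sorted(cases))
print(sorted(controls))
```

Expressing the criteria as intersections, unions, and differences keeps the definition both readable and directly executable.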
Reducing Information and Selection Bias in EHR-Linked Biobanks via Genetics-Informed Multiple Imputation and Sample Weighting

Maxwell Salvatore,Ritoban Kundu,Jiacong Du,Christopher R. Friese,Alison M Mondul,David A Hanauer,Haidong Lu,Celeste Leigh Pearce,Bhramar Mukherjee

DOI: https://doi.org/10.1101/2024.10.28.24316286

2024-10-29

Abstract:Electronic health records (EHRs) are valuable for public health and clinical research but are prone to many sources of bias, including missing data and non-probability selection. Missing data in EHRs is complex due to potential non-recording, fragmentation, or clinically informative absences. This study explores whether polygenic risk score (PRS)-informed multiple imputation for missing traits, combined with sample weighting, can mitigate missing data and selection biases in estimating disease-exposure associations. Simulations were conducted for missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) conditions under different sampling mechanisms. PRS-informed multiple imputation showed generally lower bias, particularly when combined with sample weighting. For example, in biased samples of 10,000 with exposure and outcome MAR data, PRS-informed imputation had lower percent bias (3.8%) and better coverage rate (0.883) compared to PRS-uninformed (4.5%; 0.877) and complete case analyses (10.3%; 0.784) in covariate-adjusted, weighted, multiple imputation scenarios. In a case study using Michigan Genomics Initiative (n=50,026) data, PRS-informed imputation aligned more closely with a sample-weighted All of Us-derived benchmark than analyses ignoring missing data and selection bias. Researchers should consider leveraging genetic data and sample weighting to address biases from missing data and non-probability sampling in biobanks.

Epidemiology
Automated feature selection of predictors in electronic medical records data

Jessica Gronsbell,Jessica Minnier,Sheng Yu,Katherine Liao,Tianxi Cai

DOI: https://doi.org/10.1111/biom.12987

IF: 1.701

Biometrics

Abstract:The use of Electronic Health Records (EHR) for translational research can be challenging due to difficulty in extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold standard labels curated via labor intensive medical chart review. These simplistic algorithms tend to have unpredictable portability across institutions and low accuracy for many disease phenotypes due to imprecise billing codes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping algorithms. These algorithms are typically trained via supervised learning, relating gold standard labels to a wide range of candidate features including billing codes, procedure codes, medication prescriptions and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, due to the time intensiveness of gold standard labeling, the size of the training set is often insufficient to build a generalizable algorithm with the large number of candidate features extracted from EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype with an unsupervised clustering of disease status based on several highly predictive features such as diagnosis codes and mentions of the disease in text fields available in the entire set of EHR data. A sparse regression model is then built with the estimated outcomes and remaining covariates to identify those features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model. 
We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping thereby harnessing the automated power of EHR data and improving efficiency.
Electronic Health Record Phenotyping with Internally Assessable Performance (PhIAP) using Anchor-Positive and Unlabeled Patients

Lingjiao Zhang,Xiruo Ding,Yanyuan Ma,Naveen Muthu,Imran Ajmal,Jason H. Moore,Daniel S. Herman,Jinbo Chen

DOI: https://doi.org/10.48550/arXiv.1902.10060

2019-01-30

Applications

Abstract:Building phenotype models using electronic health record (EHR) data conventionally requires manually labeled cases and controls. Assigning labels is labor intensive and, for some phenotypes, identifying gold-standard controls is prohibitive. To facilitate comprehensive clinical decision support and research, we sought to develop an accurate EHR phenotyping approach that assesses its performance without a validation set. Our framework relies on specifying a random subset of cases, potentially using an anchor variable that has excellent positive predictive value and sensitivity that is independent of predictors. We developed a novel maximum likelihood approach that efficiently leverages data from anchor-positive and unlabeled patients to develop logistic regression phenotyping models. Additionally, we described novel statistical methods for estimating phenotyping prevalence and assessing model calibration and predictive performance measures. Theoretical and simulation studies indicated our method generates accurate predicted probabilities, leading to excellent discrimination and calibration, and consistent estimates of phenotype prevalence and anchor sensitivity. The method appeared robust to minor lack-of-fit and the proposed calibration assessment detected major lack-of-fit. We applied our method to EHR data to develop a preliminary model for identifying patients with primary aldosteronism, which achieved an AUC of 0.99 and PPV of 0.8. We developed novel statistical methods for accurate model development and validation with minimal manual labeling, facilitating development of scalable, transferable, semi-automated case labeling and practice-specific models. Our EHR phenotyping approach decreases labor-intensive manual phenotyping and annotation, which should enable broader model development and dissemination for EHR clinical decision support and research.
Generative Programming: A Model Driven Approach

R. Azimi,Farid Hosseini

Abstract:ions to implementation components. In GP techniques, programmers define a model of the system that just defines the data contents and logic of the applications. Then programs similar to compilers translate such models to the actual programs or machine codes automatically. As with traditional compilers, there can be several model compilers that translate a single model to various platforms, depending on the needs of different users (Figure 1). [Figure 1: Trend in raising the level of abstraction in computer programming.] In order to achieve portability, a model must be abstract. That means it should not include any platform-dependent information. Instead, it defines a specification of the system that is inherent to the system, independent of the way it can be implemented. Also, a model has to be both complete and precise. It may not ignore any important details of the application data content and flow and events logic. Moreover, it should leave no ambiguity in the definitions of the data items and system behavior. Finally, translation of a model to the conventional software development platforms must be possible most of the time. In some possibly rare cases in which such translation is not computable, application programmers have to write the programs in the current programming languages, and link such components with the rest of the system that is automatically generated. In this report, we provide an overview of techniques, tools, and methodologies that are used in building systems using GP. The focus of the report is more on model-driven GP. The rest of this report is organized as follows. We first provide a closer look at the current GP techniques. Then, we describe the desired characteristics of a model for building general-purpose applications. Then, we review Executable UML as an example of such a model and analyze its capabilities and weaknesses. Next, we describe the requirements for building model compilers, and analyze the existing tools. 
Finally, we describe techniques that use GP in building middleware software in particular. 2. GP: A CLOSER LOOK A widely accepted definition for GP is that it is the automatic selection, customization, and assembly of prebuilt templates and components on demand. This process is in contrast with the current practice of manually searching, adopting, and integrating components [3]. There is some skepticism that GP will be widely used, since it requires skills that are not mainstream, which could be true in one sense. Indeed, most developers prefer not to use GP in their design and implementation since they want control over the development process. We probably need to wait a few years to see whether many vendors move to GP software engineering. However, there are clear incentives for using techniques such as GP. As hardware and network technologies become faster and cheaper at a predictable pace, software, particularly distributed software, is becoming slower, buggier, and more expensive, and it is hard to predict how long it takes to build. A key reason for these different trends is that hardware and networks are now heavily built from components off the shelf (COTS). For software, by contrast, current practice is to build custom components for each product specifically. The interfaces of hardware COTS are usually standard, whereas distributed software components are usually built in a custom manner, or follow standards that are specific to a company. Applying COTS standards to distributed software is not easy, and many tough R&D issues must be resolved [4]. GP [Footnote 1: Due to our time limits, such an overview is neither complete nor accurate. However, we tried to cover some of the techniques and technologies that are around.] [Figure 1 labels: Too hard to program and debug; Too machine-specific; Still hard to program; Still platform-specific; Hard to reuse; Easy to use; Platform independent; Compiler; Model Compiler]
Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping

Yichi Zhang,Molei Liu,Matey Neykov,Tianxi Cai

Abstract:Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite its potential, EHR is currently underutilized for discovery research due to its major limitation in the lack of precise phenotype information. To overcome such difficulties, recent efforts have been devoted to developing supervised algorithms to accurately predict phenotypes based on relatively small training datasets with gold standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features, p, is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small, labeled dataset (where both the label Y and the feature set X are observed) and a much larger, weakly-labeled dataset in which the feature set X is accompanied only by a surrogate label S that is available to all patients. Under a working prior assumption that S is related to X only through Y and allowing it to hold approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital.
Surrogate-assisted Feature Extraction for High-Throughput Phenotyping.

Sheng Yu,Abhishek Chakrabortty,Katherine P. Liao,Tianrun Cai,Ashwin N. Ananthakrishnan,Vivian S. Gainer,Susanne E. Churchill,Peter Szolovits,Shawn N. Murphy,Isaac S. Kohane,Tianxi Cai

DOI: https://doi.org/10.1093/jamia/ocw135

2016-01-01

Journal of the American Medical Informatics Association

Abstract:OBJECTIVE: Phenotyping algorithms are capable of accurately identifying patients with specific phenotypes from within electronic medical records systems. However, developing phenotyping algorithms in a scalable way remains a challenge due to the extensive human resources required. This paper introduces a high-throughput unsupervised feature selection method, which improves the robustness and scalability of electronic medical record phenotyping without compromising its accuracy. METHODS: The proposed Surrogate-Assisted Feature Extraction (SAFE) method selects candidate features from a pool of comprehensive medical concepts found in publicly available knowledge sources. The target phenotype's International Classification of Diseases, Ninth Revision and natural language processing counts, acting as noisy surrogates to the gold-standard labels, are used to create silver-standard labels. Candidate features highly predictive of the silver-standard labels are selected as the final features. RESULTS: Algorithms were trained to identify patients with coronary artery disease, rheumatoid arthritis, Crohn's disease, and ulcerative colitis using various numbers of labels to compare the performance of features selected by SAFE, a previously published automated feature extraction for phenotyping procedure, and domain experts. The out-of-sample area under the receiver operating characteristic curve and F-score from SAFE algorithms were remarkably higher than those from the other two, especially at small label sizes. CONCLUSION: SAFE advances high-throughput phenotyping methods by automatically selecting a succinct set of informative features for algorithm training, which in turn reduces overfitting and the needed number of gold-standard labels. SAFE also potentially identifies important features missed by automated feature extraction for phenotyping or experts.
Enabling scalable clinical interpretation of ML-based phenotypes using real world data

Owen Parsons,Nathan E Barlow,Janie Baxter,Karen Paraschin,Andrea Derix,Peter Hein,Robert Dürichen

DOI: https://doi.org/10.48550/arXiv.2208.01607

2022-08-03

Abstract:The availability of large and deep electronic healthcare records (EHR) datasets has the potential to enable a better understanding of real-world patient journeys, and to identify novel subgroups of patients. ML-based aggregation of EHR data is mostly tool-driven, i.e., building on available or newly developed methods. However, these methods, their input requirements, and, importantly, resulting output are frequently difficult to interpret, especially without in-depth data science or statistical training. This endangers the final step of analysis where an actionable and clinically meaningful interpretation is needed. This study investigates approaches to perform patient stratification analysis at scale using large EHR datasets and multiple clustering methods for clinical research. We have developed several tools to facilitate the clinical evaluation and interpretation of unsupervised patient stratification results, namely pattern screening, meta clustering, surrogate modeling, and curation. These tools can be used at different stages within the analysis. As compared to a standard analysis approach, we demonstrate the ability to condense results and optimize analysis time. In the case of meta clustering, we demonstrate that the number of patient clusters can be reduced from 72 to 3 in one example. In another stratification result, by using surrogate models, we could quickly identify that heart failure patients were stratified if blood sodium measurements were available. As this is a routine measurement performed for all patients with heart failure, this indicated a data bias. By using further cohort and feature curation, these patients and other irrelevant features could be removed to increase the clinical meaningfulness. These examples show the effectiveness of the proposed methods and we hope to encourage further research in this field.

Machine Learning,Information Retrieval
Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance

Wei-Qi Wei,Pedro L Teixeira,Huan Mo,Robert M Cronin,Jeremy L Warner,Joshua C Denny

DOI: https://doi.org/10.1093/jamia/ocv130

2015-09-02

Journal of the American Medical Informatics Association

Abstract:Objective: To evaluate the phenotyping performance of three major electronic health record (EHR) components: International Classification of Disease (ICD) diagnosis codes, primary notes, and specific medications. Materials and Methods: We conducted the evaluation using de-identified Vanderbilt EHR data. We preselected ten diseases: atrial fibrillation, Alzheimer’s disease, breast cancer, gout, human immunodeficiency virus infection, multiple sclerosis, Parkinson’s disease, rheumatoid arthritis, and types 1 and 2 diabetes mellitus. For each disease, patients were classified into seven categories based on the presence of evidence in diagnosis codes, primary notes, and specific medications. Twenty-five patients per disease category (a total of 175 patients for each disease, 1750 patients for all ten diseases) were randomly selected for manual chart review. Review results were used to estimate the positive predictive value (PPV), sensitivity, and F-score for each EHR component alone and in combination. Results: The PPVs of single components were inconsistent and inadequate for accurate phenotyping (0.06–0.71). Using two or more ICD codes improved the average PPV to 0.84. We observed a more stable and higher accuracy when using at least two components (mean ± standard deviation: 0.91 ± 0.08). Primary notes offered the best sensitivity (0.77). The sensitivity of ICD codes was 0.67. Again, two or more components provided a reasonably high and stable sensitivity (0.59 ± 0.16). Overall, the best performance (F-score: 0.70 ± 0.12) was achieved by using two or more components. Although the overall performance of using ICD codes (0.67 ± 0.14) was only slightly lower than that of using two or more components, its PPV (0.71 ± 0.13) was substantially worse than theirs (0.91 ± 0.08). Conclusion: Multiple EHR components provide a more consistent and higher performance than a single one for the selected phenotypes. We suggest considering multiple EHR components for future phenotyping design in order to obtain an ideal result.
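The three metrics named in this abstract can be illustrated with a minimal sketch. The counts below are hypothetical, the function names are illustrative, and the balanced (F1) form of the F-score is an assumption, since the abstract does not state which variant was used. The sketch also ignores any reweighting that sampling a fixed number of patients per category might require.

```python
def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: share of flagged patients confirmed by chart review."""
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity: share of true cases that the EHR component flagged."""
    return tp / (tp + fn)


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (assumed F1): harmonic mean of PPV and sensitivity."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical chart-review counts for one disease and one EHR component.
tp, fp, fn = 80, 20, 20
p = ppv(tp, fp)
s = sensitivity(tp, fn)
print(f"PPV={p:.2f} sensitivity={s:.2f} F={f_score(p, s):.2f}")
```

Because the F-score is a harmonic mean, a component with high sensitivity but low PPV (or the reverse) scores lower than one that is moderate on both.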

information science & library science,computer science, information systems, interdisciplinary applications,health care sciences & services,medical informatics
A framework for employing longitudinally collected multicenter electronic health records to stratify heterogeneous patient populations on disease history

Marc P Maurits,Ilya Korsunsky,Soumya Raychaudhuri,Shawn N Murphy,Jordan W Smoller,Scott T Weiss,Thomas W J Huizinga,Marcel J T Reinders,Elizabeth W Karlson,Erik B van den Akker,Rachel Knevel

DOI: https://doi.org/10.1093/jamia/ocac008

2022-04-13

Abstract:Objective: To facilitate patient disease subset and risk factor identification by constructing a pipeline which is generalizable, provides easily interpretable results, and allows replication by overcoming electronic health records (EHRs) batch effects. Material and methods: We used 1872 billing codes in EHRs of 102 880 patients from 12 healthcare systems. Using tools borrowed from single-cell omics, we mitigated center-specific batch effects and performed clustering to identify patients with highly similar medical history patterns across the various centers. Our visualization method (PheSpec) depicts the phenotypic profile of clusters, applies a novel filtering of noninformative codes (Ranked Scope Pervasion), and indicates the most distinguishing features. Results: We observed 114 clinically meaningful profiles, for example, linking prostate hyperplasia with cancer and diabetes with cardiovascular problems and grouping pediatric developmental disorders. Our framework identified disease subsets, exemplified by 6 "other headache" clusters, where phenotypic profiles suggested different underlying mechanisms: migraine, convulsion, injury, eye problems, joint pain, and pituitary gland disorders. Phenotypic patterns replicated well, with high correlations of ≥0.75 to an average of 6 (2-8) of the 12 different cohorts, demonstrating the consistency with which our method discovers disease history profiles. Discussion: Costly clinical research ventures should be based on solid hypotheses. We repurpose methods from single-cell omics to build these hypotheses from observational EHR data, distilling useful information from complex data. Conclusion: We establish a generalizable pipeline for the identification and replication of clinically meaningful (sub)phenotypes from widely available high-dimensional billing codes. This approach overcomes datatype problems and produces comprehensive visualizations of validation-ready phenotypes.
Lossless integration of multiple electronic health records for identifying pleiotropy using summary statistics

Ruowang Li,Rui Duan,Xinyuan Zhang,Thomas Lumley,Sarah Pendergrass,Christopher Bauer,Hakon Hakonarson,David S. Carrell,Jordan W. Smoller,Wei-Qi Wei,Robert Carroll,Digna R. Velez Edwards,Georgia Wiesner,Patrick Sleiman,Josh C. Denny,Jonathan D. Mosley,Marylyn D. Ritchie,Yong Chen,Jason H. Moore

DOI: https://doi.org/10.1038/s41467-020-20211-2

IF: 16.6

2021-01-08

Nature Communications

Abstract:Increasingly, clinical phenotypes with matched genetic data from bio-bank linked electronic health records (EHRs) have been used for pleiotropy analyses. Thus far, pleiotropy analysis using individual-level EHR data has been limited to data from one site. However, it is desirable to integrate EHR data from multiple sites to improve the detection power and generalizability of the results. Due to privacy concerns, individual-level patients’ data are not easily shared across institutions. As a result, we introduce Sum-Share, a method designed to efficiently integrate EHR and genetic data from multiple sites to perform pleiotropy analysis. Sum-Share requires only summary-level data and one round of communication from each site, yet it produces identical test statistics compared with that of pooled individual-level data. Consequently, Sum-Share can achieve lossless integration of multiple datasets. Using real EHR data from eMERGE, Sum-Share is able to identify 1734 potential pleiotropic SNPs for five cardiovascular diseases.
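The general idea of lossless integration from summary-level data can be sketched with a much simpler model than the one the paper addresses. The example below is not Sum-Share; it is a toy one-predictor least-squares fit, with hypothetical names and data, showing that sites sharing only a few sums in a single round yield the same estimate as pooling the individual records.

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Summary-level data one site would share (hypothetical structure)."""
    n: int
    sx: float
    sy: float
    sxx: float
    sxy: float


def summarize(xs, ys):
    """Reduce individual-level records to sufficient statistics."""
    return SiteSummary(
        n=len(xs),
        sx=sum(xs),
        sy=sum(ys),
        sxx=sum(x * x for x in xs),
        sxy=sum(x * y for x, y in zip(xs, ys)),
    )


def combine(summaries):
    """Add up the per-site summaries; no individual records are needed."""
    return SiteSummary(
        n=sum(s.n for s in summaries),
        sx=sum(s.sx for s in summaries),
        sy=sum(s.sy for s in summaries),
        sxx=sum(s.sxx for s in summaries),
        sxy=sum(s.sxy for s in summaries),
    )


def fit(s):
    """Ordinary least-squares slope and intercept from the summary alone."""
    denom = s.n * s.sxx - s.sx ** 2
    slope = (s.n * s.sxy - s.sx * s.sy) / denom
    intercept = (s.sy - slope * s.sx) / s.n
    return slope, intercept


# Hypothetical data held at two sites.
site_a = summarize([1.0, 2.0, 3.0], [2.0, 4.1, 5.9])
site_b = summarize([4.0, 5.0], [8.2, 9.8])
print(fit(combine([site_a, site_b])))
```

The combined fit matches a fit on the pooled records up to floating-point rounding, because the sums are sufficient statistics for this model. Whether and how a comparable property holds for the pleiotropy tests in the paper is described there, not here.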

multidisciplinary sciences
Feature Extraction for Phenotyping from Semantic and Knowledge Resources

Wenxin Ning,Stephanie Chan,Andrew Beam,Ming Yu,Alon Geva,Katherine Liao,Mary Mullen,Kenneth D. Mandl,Isaac Kohane,Tianxi Cai,Sheng Yu

DOI: https://doi.org/10.1016/j.jbi.2019.103122

IF: 8

2019-01-01

Journal of Biomedical Informatics

Abstract:OBJECTIVE: Phenotyping algorithms can efficiently and accurately identify patients with a specific disease phenotype and construct electronic health records (EHR)-based cohorts for subsequent clinical or genomic studies. Previous studies have introduced unsupervised EHR-based feature selection methods that yielded algorithms with high accuracy. However, those selection methods still require expert intervention to tweak the parameter settings according to the EHR data distribution for each phenotype. To further accelerate the development of phenotyping algorithms, we propose a fully automated and robust unsupervised feature selection method that leverages only publicly available medical knowledge sources, instead of EHR data. METHODS: SEmantics-Driven Feature Extraction (SEDFE) collects medical concepts from online knowledge sources as candidate features and gives them vector-form distributional semantic representations derived with neural word embedding and the Unified Medical Language System Metathesaurus. A number of features that are semantically closest and that sufficiently characterize the target phenotype are determined by a linear decomposition criterion and are selected for the final classification algorithm. RESULTS: SEDFE was compared with the EHR-based SAFE algorithm and domain experts on feature selection for the classification of five phenotypes including coronary artery disease, rheumatoid arthritis, Crohn's disease, ulcerative colitis, and pediatric pulmonary arterial hypertension using both supervised and unsupervised approaches. Algorithms yielded by SEDFE achieved comparable accuracy to those yielded by SAFE and expert-curated features. SEDFE is also robust to the input semantic vectors. CONCLUSION: SEDFE attains satisfying performance in unsupervised feature selection for EHR phenotyping. Both fully automated and EHR-independent, this method promises efficiency and accuracy in developing algorithms for high-throughput phenotyping.
Oxidative Stress Genes, Antioxidants and Coronary Artery Disease in Type 2 Diabetes Mellitus

Miha Tibaut,D. Petrovič

DOI: https://doi.org/10.2174/1871525714666160407143416

2016-03-31

Cardiovascular & Hematological Agents in Medicinal Chemistry

Abstract:The increasing worldwide prevalence of obesity and sedentary lifestyle is the main cause of the rising incidence of T2DM. Due to chronic macrovascular and microvascular complications, T2DM represents a huge socioeconomic burden worldwide. Oxidative stress is a key pathogenic mechanism implicated in diabetic coronary artery disease (CAD). Polymorphisms of oxidative stress genes are known to influence oxidative stress levels and are therefore thought to impact CAD pathogenesis. Identifying higher risk groups would be rational, since it would allow better sample selection and thus better results in antioxidant trials. In this review, we summarize the evidence of oxidative stress gene polymorphisms related to the pathogenesis of CAD. Moreover, we provide a review of antioxidants tested in subjects with CAD.
