Predicting Who Dies Depends on How Severity Is Measured: Implications for Evaluating Patient Outcomes
L. Iezzoni, A. Ash, M. Shwartz, J. Daley, J. Hughes, Y. Mackiernan
DOI: https://doi.org/10.7326/0003-4819-123-10-199511150-00004
IF: 39.2
1995-11-15
Annals of Internal Medicine
Abstract: Hospital and physician performance is increasingly scrutinized by organizations ranging from state governments to managed care payers to local business coalitions [1-7]. Hospitals and medical practices also monitor their own results to identify areas in which they can produce improvement and savings. Performance profiles of health care providers often compare patient outcomes, such as death rates; comparing such outcomes across hospitals or physicians generally requires adjustment for patient risk. Risk adjustment recognizes that the underlying nature of some patients' diseases makes those patients more likely than others to have poor outcomes [8, 9].

More than a dozen risk-adjustment tools, often called severity measures, have been created specifically to address health care administration and policy concerns [1-7, 10-12]. Unlike clinical measures of risk, which can incorporate such factors as disease-specific clinical findings, complexity of comorbid illness, and functional status [13], severity measures rate patients on the basis of limited data, either computerized hospital discharge abstracts [14, 15] or information gathered from medical records by using abstraction protocols independent of specific diseases [16-18]. These methods generally focus on predicting hospital resource consumption or in-patient death. They are frequently proprietary, and their complete logic is often unavailable for scrutiny.

Severity measures are now marketed widely to hospitals, payers, business leaders, and governments. Some states (Pennsylvania, Iowa, Colorado, and Florida, for example), regions (such as Cleveland and Orlando), and payers produce comparative performance reports of health care providers by using particular severity measures [1-5]. Important decisions are increasingly made on the basis of severity-adjusted patient outcomes. For example, since 1986, Pennsylvania has required hospitals to produce severity information using MedisGroups.
Payers have used MedisGroups-based reports to select health care providers for managed care networks [5]. Pennsylvania's consumer guide [19], which compares hospital death rates and average charges for coronary artery bypass graft surgery, was quoted by President Clinton in his 22 September 1993 health care reform address to the United States Congress [20]: "We have evidence that more efficient delivery of health care doesn't decrease quality. Pennsylvania discovered that patients who were charged $21,000 for [coronary bypass] surgery received as good or better care [based on MedisGroups severity-adjusted death rates] as patients who were charged $84,000 for the same procedure in the same state. High prices simply don't always equal good quality."

Despite the potential effects of severity measures, relatively little independent information is available about them [21]. Because they are used to evaluate hospitals and physicians, physicians must assess them, especially with respect to their clinical credibility. In this article, we focus on predicting in-hospital death using four severity measures, and we ask three major questions: 1) How well do severity measures predict in-hospital death? 2) Do different severity measures predict different likelihoods of death for the same patients? and 3) If so, what are the clinical characteristics of patients for whom very different likelihoods of death are predicted by different severity measures?

Methods

Severity Measures

We considered four severity measures (Table 1): the admission MedisGroups score [18]; a physiology score patterned after the acute physiology score of the Acute Physiology and Chronic Health Evaluation, third version (APACHE III) [22, 23]; Disease Staging's scale predicting probability of in-hospital death [24-26]; and All Patient Refined Diagnosis Related Groups (APR-DRGs) [27].
These systems are among the most prominent approaches used to adjust outcomes data for severity so that they can be used for state or regional comparisons across hospitals [1-5] and for hospital activities such as internal monitoring, negotiation of managed care contracts, and physician profiling.

Table 1. Four Severity Measures*

Each measure defines severity in ways that reflect that measure's goals, assigning either numerical severity scores or values on a continuous scale (Table 1). Disease Staging and APR-DRGs use data from standard hospital-discharge abstracts [14, 15], including patient age, patient sex, and diagnoses and procedures coded using the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM). A discharge abstract contains codes for all diagnoses treated during a particular hospitalization, regardless of when the diagnoses were made. MedisGroups and the physiology score use clinical data abstracted from medical records for only the first 2 days of a hospitalization.

Although the APR-DRGs measure was not initially developed to predict mortality, it is used for such analyses. For example, Iowa once required larger hospitals to produce MedisGroups data for severity-adjusted performance reports, but it switched in 1994 to using APR-DRGs, a less expensive, discharge-abstract-based measure. This change was partially motivated by the perceived high cost of MedisGroups medical record reviews. Other states, such as Florida, also use APR-DRGs to evaluate health care provider performance.

Database

To assign severity scores to patients, computerized algorithms were applied to data extracted from the 1992 MedisGroups Comparative Database. Briefly, this database contains the clinical information collected using the MedisGroups data-gathering protocol and submitted to MedisGroups' vendor, MediQual Systems, Inc.
The 1992 MedisGroups Comparative Database contains information on all discharges made in 1991 from 108 acute care hospitals, which were chosen by MediQual Systems because of good data quality and in order to encompass a range of hospital characteristics. To ensure adequate sample sizes for hospital-level analyses in another study [28], we eliminated eight low-volume institutions (83 patients total). The American Hospital Association annual survey provided information on hospital characteristics. Admission MedisGroups scores were provided by MediQual Systems; scores for other measures had to be assigned.

The MedisGroups database contains standard discharge-abstract information, including ICD-9-CM codes for as many as 20 diagnoses and 50 procedures, listed by hospital. It also includes values for key clinical findings from the admission period (generally the first 2 days of hospitalization), abstracted from medical records during MedisGroups reviews [16-18]. We used these clinical findings to create physiology scores patterned after the APACHE III acute physiology score, summing weights specified by APACHE III for each finding (for example, a pulse of 145 beats/min had a weight of 13 points) [22]. We could not replicate exact APACHE III acute physiology scores because complete values for the required 17 physiologic variables were unavailable: MedisGroups truncates data collection in broadly defined normal ranges [29]. Previous research [29] showed that a similarly derived physiology score did well compared with the exact acute physiology scores of the second APACHE version.

Vendors scored the data for the two discharge-abstract-based severity measures (Table 1). On the basis of their specifications, vendors were sent computer files containing the required discharge-abstract data extracted from the MedisGroups database. We merged the scored data into a single analytic file with 100% success.
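
The physiology score described in this section is a sum of per-finding weights. The following is a minimal Python sketch of that summation, not the authors' scoring program. Only the mapping "pulse of 145 beats/min gives 13 points" comes from the text; the band limits (140 to 154), the table name `HYPOTHETICAL_WEIGHTS`, and the rule that a missing finding contributes 0 points (on the reasoning that MedisGroups does not record values in broadly normal ranges) are assumptions made for illustration.

```python
from typing import Mapping, Optional, Sequence, Tuple

# One band is (low, high, weight), inclusive on both ends.
Band = Tuple[float, float, int]

# Hypothetical weight table. Only "pulse 145 -> 13 points" is from the text;
# the 140-154 band limits are an assumed placeholder, and a real table would
# need every band for every physiologic variable from the APACHE III source.
HYPOTHETICAL_WEIGHTS: Mapping[str, Sequence[Band]] = {
    "pulse": [(140.0, 154.0, 13)],
}


def finding_weight(
    name: str,
    value: Optional[float],
    table: Mapping[str, Sequence[Band]] = HYPOTHETICAL_WEIGHTS,
) -> int:
    """Return the weight for one clinical finding.

    Assumption: a missing value (None) scores 0, as if it were in the
    normal range and therefore not abstracted.
    """
    if value is None:
        return 0
    for low, high, weight in table.get(name, ()):
        if low <= value <= high:
            return weight
    return 0  # no band matched: treated as normal in this sketch


def physiology_score(
    findings: Mapping[str, Optional[float]],
    table: Mapping[str, Sequence[Band]] = HYPOTHETICAL_WEIGHTS,
) -> int:
    """Sum the weights of all recorded admission-period findings."""
    return sum(finding_weight(n, v, table) for n, v in findings.items())
```

For instance, `physiology_score({"pulse": 145, "temperature": None})` uses only the pulse band in this placeholder table. Scoring missing findings as 0 is one way to handle truncated data collection; it means the sketch cannot distinguish a normal value from one that was never measured.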

Study Sample and Outcome Measure

Many internal hospital monitoring programs and external evaluations, such as Pennsylvania's MedisGroups initiative [19], sample patients by diagnosis-related groups. To parallel this approach, we selected all patients in the database who had been hospitalized for medical treatment of a new acute myocardial infarction defined by diagnosis-related groups. We chose acute myocardial infarction because it is a common condition, is treated at most hospitals, and has a relatively high mortality rate. We included patients in diagnosis-related groups 121 (circulatory disorders with acute myocardial infarction and cardiovascular complication, discharged alive), 122 (circulatory disorders with acute myocardial infarction without cardiovascular complication, discharged alive), and 123 (circulatory disorders with acute myocardial infarction, expired). Patients had either a principal or secondary 5-digit ICD-9-CM discharge diagnosis code beginning with 410 and ending with 1 (initial treatment). Our outcome measure was in-hospital death. The MedisGroups data did not contain information on deaths after discharge.

Analysis

Each severity measure was used to calculate a predicted probability of death for each patient from a multivariable logistic regression model that included the severity score and dummy variables representing a cross-classification of patients by sex and eight age categories (18-44, 45-54, 55-64, 65-69, 70-74, 75-79, 80-84, and ≥85 years of age). Severity scores were entered as either continuous or categorical variables (Table 1). For Disease Staging and MedisGroups, we used the logit of the probability as the independent variable in the logistic regression. All analyses were done using the Statistical Analysis System, release 6.08 (SAS Institute, Cary, North Carolina).
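
The sample definition and the model covariates described above can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the SAS code used in the study. The function names, the "F"/"M" sex coding, the choice of women aged 18-44 as the reference cell, and the handling of codes written with or without a decimal point are all hypothetical; the diagnosis-related groups, the code pattern, and the age categories are taken from the text.

```python
import math
from typing import Iterable, List

AMI_DRGS = {121, 122, 123}  # diagnosis-related groups named in the text

# Eight age categories from the text; None marks the open-ended top band.
AGE_BANDS = [(18, 44), (45, 54), (55, 64), (65, 69),
             (70, 74), (75, 79), (80, 84), (85, None)]


def is_new_ami_code(code: str) -> bool:
    """True for a 5-digit ICD-9-CM code beginning with 410 and ending with 1.

    Assumption: codes may arrive as "410.71" or "41071".
    """
    digits = code.replace(".", "")
    return len(digits) == 5 and digits.isdigit() \
        and digits.startswith("410") and digits.endswith("1")


def in_study_sample(drg: int, diagnosis_codes: Iterable[str]) -> bool:
    """Apply both criteria: DRG 121-123 and a qualifying code in any position."""
    return drg in AMI_DRGS and any(is_new_ami_code(c) for c in diagnosis_codes)


def age_band_index(age: int) -> int:
    """Return the 0-based index of the age category containing `age`."""
    for i, (low, high) in enumerate(AGE_BANDS):
        if age >= low and (high is None or age <= high):
            return i
    raise ValueError("age below 18 is outside the modelled categories")


def age_sex_dummies(age: int, sex: str) -> List[int]:
    """Indicator variables for the sex-by-age cross-classification.

    2 sexes x 8 bands gives 16 cells; cell 0 (assumed reference: "F",
    18-44) is dropped, leaving 15 indicators.
    """
    if sex not in ("F", "M"):
        raise ValueError("sex must be 'F' or 'M' in this sketch")
    cell = age_band_index(age) + (0 if sex == "F" else len(AGE_BANDS))
    n_cells = 2 * len(AGE_BANDS)
    return [1 if cell == k else 0 for k in range(1, n_cells)]


def logit(p: float) -> float:
    """Log-odds transform applied to a vendor-supplied probability of death."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))
```

A patient in DRG 122 with a secondary diagnosis of 410.71 would satisfy `in_study_sample`, whereas the same DRG with only 410.72 would not. Entering the logit of a supplied probability, rather than the probability itself, puts the predictor on the same log-odds scale as the logistic model's linear predictor.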

Severity Measure Performance

We used c and R² statistics as overall assessments of each severity measure's ability to predict individual patient death. The c statistic assesses this ability as follows: When a person who has died and a person who has lived are each chosen at random, c equals the probability that the severity measure predicts a higher likelihood of death for the one who has died [30]. Higher c values indicate better specificity and sensitivity [31, 32]. A c value of 0.5 indicates that the model does