Quantitative Synthesis in Systematic Reviews
J. Lau, John Ioannidis, C. Schmid
DOI: https://doi.org/10.7326/0003-4819-127-9-199711010-00008
IF: 39.2
1997-11-01
Annals of Internal Medicine
Abstract: A quantitative systematic review, or meta-analysis, uses statistical methods to combine the results of multiple studies. Meta-analyses have been done for systematic reviews of therapeutic trials, diagnostic test evaluations, and epidemiologic studies. Although the statistical methods involved may at first appear to be mathematically complex, their purpose is simple: They are trying to answer four basic questions. Are the results of the different studies similar? To the extent that they are similar, what is the best overall estimate? How precise and robust is this estimate? Finally, can dissimilarities be explained? This article provides some guidance in understanding the key technical aspects of the quantitative approach to these questions. We have avoided using equations and statistical notations; interested readers will find implementations of the described methods in the listed references. We focus here on the quantitative synthesis of reports of randomized, controlled, therapeutic trials because far more meta-analyses on therapeutic studies than on other types of studies have been published. For practical reasons, we present a stepwise description of the tasks that are performed when statistical methods are used to combine data. These tasks are 1) deciding whether to combine data and defining what to combine, 2) evaluating the statistical heterogeneity of the data, 3) estimating a common effect, 4) exploring and explaining heterogeneity, 5) assessing the potential for bias, and 6) presenting the results.

Deciding Whether To Combine Data and Defining What To Combine

By the time one performs a quantitative synthesis, certain decisions should already have been made about the formulation of the question and the selection of included studies. These topics were discussed in two previous articles in this series [1, 2].
Statistical tests cannot compensate for lack of common sense, clinical acumen, and biological plausibility in the design of the protocol of a meta-analysis. Thus, a reader of a systematic review should always address these issues before evaluating the statistical methods that have been used and the results that have been generated. Combining poor-quality data, overly biased data, or data that do not make sense can easily produce unreliable results. The data to be combined in a meta-analysis are usually either binary or continuous. Binary data involve a yes/no categorization (for example, death or survival). Continuous data take a range of values (for example, change in diastolic blood pressure after antihypertensive treatment, measured in mm Hg). When one is comparing groups of patients, binary data can be summarized by using several measures of treatment effect that were discussed earlier in this series [3]. These measures include the risk ratio; the odds ratio; the risk difference; and, when study duration is important, the incidence rate. Another useful clinical measure, the number needed to treat (NNT), is derived from the inverse of the risk difference [3]. Treatment effect measures, such as the risk ratio and the odds ratio, provide an estimate of the relative efficacy of an intervention, whereas the risk difference describes the intervention's absolute benefit. The various measures of treatment effect offer complementary information, and all should be examined [4]. Continuous data can be summarized by the raw mean difference between the treatment and control groups when the treatment effect is measured on the same scale (for example, diastolic blood pressure in mm Hg), by the standardized mean difference when different scales are used to measure the same treatment effect (for example, different pain scales being combined), or by the correlation coefficients between two continuous variables [5]. 
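As a minimal sketch of the binary measures described above, the following Python fragment computes the risk ratio, odds ratio, risk difference, and NNT from the event counts of a single two-arm trial. The `TwoByTwo` container, the function name, and the trial counts are hypothetical and chosen only for illustration; they are not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Event counts for one two-arm trial (hypothetical container)."""
    events_treat: int
    n_treat: int
    events_ctrl: int
    n_ctrl: int


def effect_measures(t: TwoByTwo) -> dict:
    """Return relative and absolute treatment effect measures for one trial."""
    risk_treat = t.events_treat / t.n_treat
    risk_ctrl = t.events_ctrl / t.n_ctrl
    odds_treat = t.events_treat / (t.n_treat - t.events_treat)
    odds_ctrl = t.events_ctrl / (t.n_ctrl - t.events_ctrl)
    risk_diff = risk_treat - risk_ctrl
    return {
        "risk_ratio": risk_treat / risk_ctrl,    # relative efficacy
        "odds_ratio": odds_treat / odds_ctrl,    # relative efficacy
        "risk_difference": risk_diff,            # absolute benefit
        # NNT is the inverse of the absolute risk difference.
        "nnt": 1 / abs(risk_diff) if risk_diff != 0 else float("inf"),
    }


# Made-up counts: 10/100 events with treatment, 20/100 with control.
trial = TwoByTwo(events_treat=10, n_treat=100, events_ctrl=20, n_ctrl=100)
measures = effect_measures(trial)
print(measures)
```

With these invented counts the relative measures and the absolute measure tell different parts of the story, which is why the text recommends examining all of them. The sketch does not handle zero cells, which would need a continuity correction or a different method.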
The standardized mean difference, also called the effect size, is obtained by dividing the difference between the mean in the treatment group and the mean in the control group by the SD in the control group.

Evaluating the Statistical Heterogeneity of the Data

This step is intended to answer the question, Are the results of the different studies similar (homogeneous)? It is important to answer this question before combining any data. To do this, one must calculate the magnitude of the statistical diversity (heterogeneity) of the treatment effect that exists among the different sets of data. Statistical diversity can be thought of as attributable to one or both of two causes. First, study results can differ because of random sampling error. Even if the true effect is the same in each study, the results of different studies would be expected to vary randomly around the true common fixed effect. This diversity is called the within-study variance. Second, each study may have been drawn from a different population, depending on the particular patients chosen and the interventions and conditions unique to the study. Therefore, even if each study enrolled a large patient sample, the treatment effect would be expected to differ. These differences, called random effects, describe the between-study variation with regard to an overall mean of the effects of all of the studies that could be undertaken. The test most commonly used to assess the statistical significance of between-study heterogeneity is based on the chi-square distribution [6]. It provides a measure of the sum of the squared differences between the results observed and the results expected in each study, under the assumption that each study estimates the same common treatment effect. A large total deviation indicates that a single common treatment effect is unlikely. Any pooled estimate calculated must account for the between-study heterogeneity.
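One common form of the chi-square heterogeneity test is Cochran's Q statistic, sketched below under the assumption that each study supplies an effect estimate (for example, a log odds ratio) and its within-study variance. The function names and the three-study data set are hypothetical. The chi-square tail probability is computed with a small standard-library helper; in practice a statistics package would normally supply it.

```python
import math


def chi2_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x), standard library only."""
    half = x / 2.0
    # Closed-form starting values for df = 2 (even) or df = 1 (odd).
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-half)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(half))
    # Step up with the recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / gamma(a + 1).
    while a < df / 2.0:
        tail += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return tail


def heterogeneity_test(effects, variances):
    """Return Cochran's Q, its degrees of freedom, and the p-value."""
    weights = [1.0 / v for v in variances]
    # Effect expected in every study if all share one common treatment effect.
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Weighted sum of squared deviations of observed from expected effects.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    return q, df, chi2_upper_tail(q, df)


# Made-up log odds ratios and within-study variances for three trials.
effects = [-0.5, -0.3, -0.1]
variances = [0.04, 0.04, 0.04]
q, df, p_value = heterogeneity_test(effects, variances)
print(f"Q = {q:.2f} on {df} df, p = {p_value:.3f}")
```

A large Q relative to its degrees of freedom (a small p-value) suggests that a single common treatment effect is unlikely for the studies being combined.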
In practice, this test has low sensitivity for detecting heterogeneity, and it has been suggested that a liberal significance level, such as 0.1, should be used [6].

Estimating a Common Effect

The questions that this step tries to answer are 1) to the extent that data are similar, what is their best common point estimate of a therapeutic effect, and 2) how precise is this estimate? The mathematical process involved in this step generally involves combining (pooling) the results of different studies into an overall estimate. Compared with the results of individual studies, pooled results can increase statistical power and lead to more precise estimates of treatment effect. Each study is given a weight according to the precision of its results. The rationale is that studies with narrow CIs should be weighted more heavily than studies with greater uncertainty. The precision is generally expressed by the inverse of the variance of the estimate of each study. The variance has two components: the variance of the individual study and the variance between different studies. When the between-study variance is found to be or assumed to be zero, each study is simply weighted by the inverse of its own variance, which is a function of the study size and the number of events in the study. This approach characterizes a fixed-effects model, as exemplified by the Mantel-Haenszel method [7, 8] or the Peto method [9] for dichotomous data. The Peto method has been particularly popular in the past. It has the advantage of simple calculation; however, although it is appropriate in most cases, it may introduce large biases if the data are unbalanced [10, 11]. On the other hand, random-effects models also add the between-study variance to the within-study variance of each individual study when the pooled mean of the random effects is calculated. The random-effects model most commonly used for dichotomous data is the DerSimonian and Laird estimate of the between-study variance [12].
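The weighting scheme described above can be sketched as follows. This is a generic inverse-variance illustration, not the Mantel-Haenszel or Peto method: it assumes each study has already been reduced to an effect estimate and a within-study variance, and it uses the usual moment-based form of the DerSimonian and Laird between-study variance. Function names and data are hypothetical.

```python
import math


def inverse_variance_pool(effects, variances, tau2=0.0):
    """Pooled estimate and 95% CI; tau2 = 0 gives the fixed-effects result."""
    # Each study's weight is the inverse of its total variance
    # (within-study variance plus between-study variance tau2).
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)
    return estimate, (estimate - 1.96 * se, estimate + 1.96 * se)


def dersimonian_laird_tau2(effects, variances):
    """Moment estimate of the between-study variance, truncated at zero."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fixed, _ = inverse_variance_pool(effects, variances)
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    c = total - sum(w * w for w in weights) / total
    return max(0.0, (q - (len(effects) - 1)) / c)


# Made-up log odds ratios and within-study variances for four trials.
effects = [-0.8, -0.2, 0.1, -0.5]
variances = [0.04, 0.09, 0.05, 0.02]

tau2 = dersimonian_laird_tau2(effects, variances)
fixed_estimate, fixed_ci = inverse_variance_pool(effects, variances)
random_estimate, random_ci = inverse_variance_pool(effects, variances, tau2)
print("fixed: ", fixed_estimate, fixed_ci)
print("random:", random_estimate, random_ci)
```

Passing the estimated `tau2` back into the same pooling function makes the relationship between the two models explicit: the random-effects weights differ from the fixed-effects weights only by the added between-study variance.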
Fixed- and random-effects models for continuous data have also been described [13]. Pooled results are generally reported as a point estimate and CI, typically a 95% CI. Other quantitative techniques for combining data, such as the Confidence Profile Method [14], use Bayesian methods to calculate posterior probability distributions for effects of interest. Bayesian statistics are based on the principle that each observation or set of observations should be viewed in conjunction with a prior probability describing the prior knowledge about the phenomenon of interest [15]. The new observations alter this prior probability to generate a posterior probability. Traditional meta-analysis assumes that nothing is known about the magnitude of the treatment effect before randomized trials are performed. In Bayesian terms, the prior probability distribution is noninformative. Bayesian approaches may also allow the incorporation of indirect evidence in generating prior distributions [14] and may be particularly helpful in situations in which few data from randomized studies exist [16]. Bayesian analyses may also be used to account for the uncertainty introduced by estimating the between-study variance in the random-effects model, leading to more appropriate estimates and predictions of treatment efficacy [17].

Exploring and Explaining Heterogeneity

The next important issue is whether the common estimate obtained in the previous step is robust. Sensitivity analyses determine whether the common estimate is influenced by changes in the assumptions and in the protocol for combining the data. A comparison of the results of fixed- and random-effects models is one such sensitivity analysis [18]. Generally, the random-effects model produces wider CIs than does the fixed-effects model, and the level of statistical significance may therefore be different depending on the model used. The pooled point estimate per se is less likely to be affected, although exceptions are possible [19].
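The comparison of fixed- and random-effects results can be illustrated with a small sensitivity table. The sketch below does not estimate the between-study variance; it simply assumes a few values of it (here called `tau2`, with 0 corresponding to the fixed-effects model) and reports how the pooled estimate and the CI half-width respond. The data are invented so that one large trial disagrees with several small ones, a configuration in which the point estimate can shift as well.

```python
import math


def pooled_summary(effects, variances, tau2):
    """Inverse-variance pooled estimate and 95% CI half-width for a given tau2."""
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, effects)) / total
    half_width = 1.96 * math.sqrt(1.0 / total)
    return estimate, half_width


# Hypothetical data: one large, precise trial followed by three small trials.
effects = [-0.10, -0.60, -0.55, -0.70]
variances = [0.01, 0.09, 0.10, 0.12]

# tau2 = 0 is the fixed-effects model; the positive values are assumed, not estimated.
sensitivity = {
    tau2: pooled_summary(effects, variances, tau2) for tau2 in (0.0, 0.05, 0.20)
}

for tau2, (estimate, half_width) in sensitivity.items():
    print(f"tau2={tau2:.2f}  estimate={estimate:+.3f}  95% CI half-width={half_width:.3f}")
```

As the assumed between-study variance grows, the weights become more equal across studies, so the CI widens and the pooled estimate moves toward the results of the smaller trials.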
Other sensitivity analyses may include the examination of the residuals and the chi-square components [13] and assessment of the effect of deleting each study in turn. Statistically significant results that depend on a single study may require further exploration.

Cumulative Meta-Analysis

Cu