Logistic regression for stratified case-control studies.

N. Breslow, L. P. Zhao
DOI: https://doi.org/10.2307/2531601
IF: 1.701
1988-09-01
Biometrics
Abstract: Fears and Brown (1986) developed a procedure for logistic regression analysis of stratified case-control data where the sampling fractions for cases and controls, and thus their population frequencies, were assumed known. They fitted the usual "prospective" model to the case-control data, treating case-control status as a binary outcome variable. In order to adjust for the biased sampling, they included the logarithm of the odds ratio relating the actual sample sizes and the population frequencies in each stratum as an "offset" in the regression equation. However, no adjustments were made to the estimated variances of the regression coefficients of variables associated with the strata to account for the information about them available in the population frequencies. Furthermore, Fears and Brown incorrectly claimed that their procedure gave restricted maximum likelihood (RML) estimates (Aitchison and Silvey, 1958) based on the likelihood of the retrospectively sampled data. Breslow and Cain (1988) show that the Fears and Brown procedure does yield consistent and asymptotically normal estimates of the regression parameters in a logistic regression model for the probability of disease development. In fact, it is equivalent to the "conditional maximum likelihood" (CML) estimate developed by Manski and McFadden (1981) for estimation of quantal response functions from stratified data. (See also Hsieh, Manski, and McFadden, 1985.) Breslow and Cain extended the work of Manski and McFadden for use in the more realistic situation where the distribution of cases and controls in each stratum is estimated from a "first-stage" sample rather than being assumed known. They developed variance estimators for the regression coefficients that accurately reflect the additional information available in the first-stage sample and that are easily modified to accommodate an "infinite" population at that stage.
We first present a reanalysis of the Fears and Brown data that contrasts the correct variances, computed under the assumptions that the first-stage sample is finite and infinite, respectively, with the incorrect variances obtained from the standard logistic analysis. Then, using a subset of the data with only three strata, we demonstrate the differences between the CML estimates of Manski and McFadden and RML estimates calculated from the retrospective probabilities. A small-scale simulation study investigates the properties of CML and RML estimators in samples of moderate size.
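The offset adjustment described in the abstract can be illustrated with a minimal Python sketch. It assumes the offset for a stratum is log[(n1/N1)/(n0/N0)], where n1 and n0 are the sampled cases and controls and N1 and N0 are the known population counts; this is one reading of "the logarithm of the odds ratio relating the actual sample sizes and the population frequencies", not a formula taken from the paper. The function names and all counts are hypothetical. The sketch fits an intercept-only logistic model for a single stratum and shows only that the offset removes the sampling bias from the point estimate; it does not implement the variance corrections that the paper is about.

```python
import math


def stratum_offset(n_cases, n_controls, pop_cases, pop_controls):
    """Assumed offset: log odds ratio of sampling fractions (cases vs controls)."""
    return math.log((n_cases / pop_cases) / (n_controls / pop_controls))


def fit_intercept_with_offset(n_cases, n_controls, offset, iters=25):
    """Newton-Raphson fit of logit P(case) = alpha + offset for one stratum."""
    n = n_cases + n_controls
    alpha = -offset  # start where the linear predictor is zero, for stability
    for _ in range(iters):
        p = 1.0 / (1.0 + math.exp(-(alpha + offset)))
        score = n_cases - n * p        # derivative of the log-likelihood
        info = n * p * (1.0 - p)       # Fisher information for alpha
        alpha += score / info
    return alpha


# Hypothetical stratum: 500 cases and 99,500 non-cases in the population,
# of which 200 cases and 400 controls were sampled.
offset = stratum_offset(200, 400, 500, 99500)
alpha_hat = fit_intercept_with_offset(200, 400, offset)
population_log_odds = math.log(500 / 99500)
print(alpha_hat, population_log_odds)
```

In this toy case the fitted intercept agrees with the population log odds of disease, whereas a fit without the offset would return the sample log odds, log(200/400). The standard error reported by an ordinary logistic fit of this kind is the quantity the abstract identifies as incorrect when the population frequencies are known or estimated from a first-stage sample.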