A comparison of imputation techniques in the Third National Health and Nutrition Examination Survey

Trena M. Ezzati-Rice, Meena Khare, Donald B. Rubin, Roderick J. A. Little, Joseph L. Schafer
2002-01-01
Abstract: Introduction. The National Health and Nutrition Examination Survey (NHANES) is a periodic survey conducted by the National Center for Health Statistics (NCHS), Centers for Disease Control and Prevention. The NHANES is designed to provide national estimates of the health and nutritional status of the civilian noninstitutionalized population. Sociodemographic and medical history information is obtained through household interviews, while physical measurements, physiological tests, and biochemical measurements are collected through standardized physical examinations in mobile examination centers (MECs). The ongoing third NHANES, or NHANES III, is the seventh in an extensive series of periodic health and nutrition surveys that NCHS has conducted since 1960. The current NHANES III, with a sample of approximately 40,000 persons 2 months of age and older, has been divided into two 3-year national samples. Phase 1 was conducted from October 1988 to October 1991, while Phase 2 will continue until October 1994. NHANES III is based on a complex, multistage area probability sample design and includes an oversample of children under 5 years of age, older Americans aged 60+ years, and both black and Mexican-American persons. Details of the sample design of NHANES III have been previously published (1). NHANES III, like most sample surveys, experiences both total (unit) nonresponse and item nonresponse. The missing data problem for NHANES III is somewhat unusual because sample persons can refuse to participate at three different stages of the data collection. Unit nonresponse rates for NHANES III Phase 1 ranged from 0% for the screening interview (with about 7% of the screening data obtained from neighbors) to 14% for the household interview to 22% for the physical examination. It is common survey practice to compensate for unit nonresponse through weighting class adjustments (2-5).
The adjustments to reduce potential nonresponse bias for NHANES III Phase 1 have been previously described (6). In addition to unit nonresponse, various levels of item nonresponse occur in NHANES III. In Phase 1, item nonresponse of 1-5% occurred for the household interview questions. In addition, some components of the physical examination were not successfully completed for all sample persons. Furthermore, some examination components include a number of individual measurements (e.g., body measurements), some of which may be missing. Item nonresponse rates for the individual components ranged from 5% to 8%. Generally, item nonresponse is handled by some type of imputation. Imputation methods fill in missing items with values from similar units in the dataset or with predicted values obtained from a model, thus making it possible to analyze the data as if they were complete. Some common methods of imputation used in surveys include deductive imputation, mean imputation, hot deck imputation, cold deck imputation, regression imputation, stochastic regression, multiple imputation, and composite imputation methods (7). Each of these imputation methods has relative advantages and disadvantages. The method of choice for a survey may depend upon particular circumstances, including the type of survey data and the availability of computer hardware and software. In addition to allowing complete-data methods of analysis, multiple imputation allows one to assess the impact of missing data uncertainty on the variances and to revise estimates of variance to reflect the additional uncertainty (8). In previous NHANES surveys, imputation for item nonresponse was done on an ad hoc basis. The purpose of this paper is to describe research conducted to compare alternative missing data adjustment methods for selected survey components in NHANES III Phase 1 based on single and multiple imputation methodology.
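The remark above that multiple imputation lets one revise variance estimates can be illustrated with the standard combining rules for multiple imputation (often called Rubin's rules). The abstract does not state these formulas or say which rules the study applied, so the following is only a minimal sketch under that assumption. The function name `combine_multiple_imputations` and all input values are hypothetical and are not NHANES III results.

```python
from statistics import mean, variance


def combine_multiple_imputations(estimates, within_variances):
    """Combine m completed-data results using the standard combining rules.

    estimates        -- point estimate from each of the m imputed datasets
    within_variances -- estimated variance of that point estimate in each dataset

    Hypothetical helper for illustration; not taken from the paper.
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)          # combined point estimate
    u_bar = mean(within_variances)   # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor m - 1)

    # Total variance: the (1 + 1/m) * b term is the extra uncertainty
    # attributable to the missing data.
    total = u_bar + (1 + 1 / m) * b
    return q_bar, u_bar, b, total
```

Calling `combine_multiple_imputations([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])` with these made-up inputs returns a total variance larger than the average within-imputation variance. A single imputation analyzed as if it were complete data would report only the within-imputation part.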
The information contained in this paper is based, in part, on a special project carried out during 1992 and contained in a final report by Datametrics Research, Inc. (9).