Variable selection for regression models with missing data

Ramon I. Garcia, Joseph G. Ibrahim, Hongtu Zhu

Research output: Contribution to journal › Article › peer-review

50 Scopus citations

Abstract

We consider the variable selection problem for a class of statistical models with missing data, including missing covariate and/or response data. We investigate the smoothly clipped absolute deviation (SCAD) and adaptive LASSO penalties and propose a unified model selection and estimation procedure for use in the presence of missing data. We develop a computationally attractive algorithm for simultaneously optimizing the penalized likelihood function and estimating the penalty parameters. In particular, we propose to use a model selection criterion, called the ICQ statistic, for selecting the penalty parameters. We show that the variable selection procedure based on ICQ automatically and consistently selects the important covariates and leads to efficient estimates with oracle properties. The methodology is very general and can be applied to numerous situations involving missing data, from covariates missing at random in arbitrary regression models to nonignorably missing longitudinal responses and/or covariates. Simulations are given to demonstrate the methodology and examine the finite sample performance of the variable selection procedures. Melanoma data from a cancer clinical trial are presented to illustrate the proposed methodology.
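For readers unfamiliar with the two penalties named in the abstract, the following is a minimal Python sketch of their standard textbook forms. It is not taken from the paper: the function names, the default `a = 3.7` for SCAD, the default `gamma = 1.0` for the adaptive LASSO weights, and the use of an initial estimate `beta_init` are illustrative assumptions. It does not implement the paper's EM-based algorithm or the ICQ criterion.

```python
def scad_penalty(theta, lam, a=3.7):
    """Standard SCAD penalty for one coefficient (assumed form; requires a > 2, lam > 0)."""
    t = abs(theta)
    if t <= lam:
        # Small coefficients: same linear penalty as the LASSO.
        return lam * t
    if t <= a * lam:
        # Intermediate coefficients: quadratic transition region.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return lam * lam * (a + 1) / 2


def adaptive_lasso_penalty(beta, beta_init, lam, gamma=1.0):
    """Adaptive LASSO penalty: lam * sum_j |beta_j| / |beta_init_j|**gamma.

    beta_init is a hypothetical initial estimate; its entries must be nonzero.
    """
    return lam * sum(abs(b) / abs(b0) ** gamma for b, b0 in zip(beta, beta_init))
```

The SCAD penalty is continuous at both breakpoints (`lam` and `a * lam`), which is what lets it penalize small coefficients like the LASSO while leaving large ones nearly unpenalized.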

Original language: English (US)
Pages (from-to): 149-165
Number of pages: 17
Journal: Statistica Sinica
Volume: 20
Issue number: 1
State: Published - Jan 2010

Keywords

  • EM algorithm
  • ICQ
  • Missing data
  • Penalized likelihood
  • Variable selection

ASJC Scopus subject areas

  • Statistics and Probability
  • Statistics, Probability and Uncertainty
