aregImpute               package:Hmisc               R Documentation

_M_u_l_t_i_p_l_e _I_m_p_u_t_a_t_i_o_n _u_s_i_n_g _A_d_d_i_t_i_v_e _R_e_g_r_e_s_s_i_o_n, _B_o_o_t_s_t_r_a_p_p_i_n_g, _a_n_d
_P_r_e_d_i_c_t_i_v_e _M_e_a_n _M_a_t_c_h_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     The 'transcan' function creates flexible additive imputation
     models but provides only an approximation to true multiple
     imputation as the imputation models are fixed before all multiple
     imputations are drawn.  This ignores variability caused by having
     to fit the imputation models.  'aregImpute' takes all aspects of
     uncertainty in the imputations into account by using the bootstrap
     to approximate the process of drawing predicted values from a full
     Bayesian predictive distribution.  Different bootstrap resamples
     are used for each of the multiple imputations, i.e., for the 'i'th
     imputation of a sometimes missing variable, 'i=1,2,... n.impute',
     a flexible additive model is fitted on a sample with replacement
     from the original data and this model is used to predict all of
     the original missing and non-missing values for the target
     variable.

     Two methods are used to fit the imputation models, 'ace' and
     'avas'.  Unless the identity transformation is specified, these
     methods simultaneously find transformations of the target variable
     and of all of the predictors, to get a good fit assuming
     additivity. 'ace' maximizes R-squared, and 'avas' attempts to
     maximize R-squared while stabilizing the variance of residuals. 
     When a categorical variable is being predicted, only 'ace' is
     used.  Like 'transcan's use of canonical regression, this is
     Fisher's optimum scoring method for categorical variables.  For
     continuous variables, monotonic transformations of the target
     variable are assumed when 'avas' is used.  For 'ace', the default
     allows nonmonotonic transformations of target variables.  When
     variables are used as predictors, the nonparametric
     transformations derived by 'ace' or 'avas' can be restricted by
     the user to be monotonic.

     Instead of taking random draws from fitted imputation models using
     random residuals as is done by 'transcan', 'aregImpute' uses
     predictive mean matching with optional weighted probability
     sampling of donors rather than using only the closest match. 
     Predictive mean matching works for binary, categorical, and
     continuous variables without the need for iterative maximum
     likelihood fitting for binary and categorical variables, and
     without the need for computing residuals or for curtailing imputed
     values to be in the range of actual data. Predictive mean matching
     is especially attractive when the variable being imputed is also
     being transformed automatically.  See Details below for more
     information about the algorithm.

     A 'print' method summarizes the results, and a 'plot' method plots
     distributions of imputed values. Typically, 'fit.mult.impute' will
     be called after 'aregImpute'.

_U_s_a_g_e:

     aregImpute(formula, data, subset, n.impute=5, group=NULL,
                method=c('ace','avas'), type=c('pmm','regression'),
                match=c('weighted','closest'), fweighted=0.2,
                defaultLinear=FALSE, x=FALSE, pr=TRUE, plotTrans=FALSE)
     ## S3 method for class 'aregImpute':
     print(x, ...)
     ## S3 method for class 'aregImpute':
     plot(x, nclass=NULL, type=c('ecdf','hist'), 
          diagnostics=FALSE, maxn=10, ...)

_A_r_g_u_m_e_n_t_s:

 formula: an S model formula.  You can specify restrictions for
          transformations of variables.  The function automatically
          determines which variables are categorical (i.e., 'factor',
          'category', or character vectors). Binary variables are
          automatically restricted to be linear.  Force linear
          transformations of continuous variables by enclosing
          variables by the identify function ('I()'), and specify
          monotonicity by using 'monotone(variable)'. 

       x: an object created by 'aregImpute'.  For 'aregImpute', set 'x'
          to 'TRUE' to save the data matrix containing the final
          (number 'n.impute') imputations in the result.  This is
          needed if you want to later do out-of-sample imputation.
          Categorical variables are coded as integers in this matrix. 

    data: 

  subset: These may be also be specified.  You may not specify
          'na.action' as 'na.retain' is always used. 

n.impute: number of multiple imputations.  'n.impute=5' is frequently
          recommended but 10 or more doesn't hurt. 

   group: a character or factor variable the same length as the number
          of observations in 'data' and containing no 'NA's. When
          'group' is present, causes a bootstrap sample of the
          observations corresponding to non-'NA's of a target variable
          to have the same frequency distribution of 'group' as the
          that in the non-'NA's of the original sample.  This can
          handle k-sample problems as well as lower the chance that a
          bootstrap sample will have a missing cell when the original
          cell frequency was low. 

  method: method ('"ace"', the default, or '"avas"') for modeling a
          variable to be imputed.  As 'avas' does not allow the
          response variable to be categorical, '"ace"' is always used
          for such variables. 

    type: The default is '"pmn"' for predictive mean matching, which is
          a more nonparametric approach that will work for categorical
          as well as continuous predictors.  Alternatively, use
          '"regression"' when all variables that are sometimes missing
          are continuous and the missingness mechanism is such that
          entire intervals of population values are unobserved.  See
          the Details section for more information.  For the 'plot'
          method,  specify 'type="hist"' to draw histograms of imputed
          values with rug plots at the top, or 'type="ecdf"' (the
          default) to draw empirical CDFs with spike histograms at the
          bottom. 

   match: Defaults to 'match="weighted"' to do weighted multinomial
          probability sampling using the tricube function (similar to
          lowess) as the weights.  The argument of the tricube function
          is the absolute difference in transformed predicted values of
          all the donors and of the target predicted value, divided by
          a scaling factor. The scaling factor in the tricube function
          is 'fweighted' times the mean absolute difference between the
          target predicted value and all the possible donor predicted
          values.  Set 'match="closest"' to find as the donor the
          observation having the closest predicted transformed value,
          even if that same donor is found repeatedly.

fweighted: Smoothing parameter (multiple of mean absolute difference)
          used when 'match="weighted"', with a default value of 0.2. 
          Set 'fweighted' to a number between 0.02 and 0.2 to force the
          donor to have a predicted value closer to the target, and set
          'fweighted' to larger values (but seldom larger than 1.0) to
          allow donor values to be less tightly matched.  See the
          examples below to learn how to study the relationship between
          'fweighted' and the standard deviation of multiple
          imputations within individuals.

defaultLinear: set to 'TRUE' to force all continuous variables to be
          linear in any model.  This is recommended when the sample
          size is small. 

      pr: set to 'FALSE' to suppress printing of iteration messages 

plotTrans: set to 'TRUE' to plot 'ace' or 'avas' transformations for
          each variable for each of the multiple imputations.  This is
          useful for determining whether transformations are
          reasonable.  If transformations are too noisy or have long
          flat sections (resulting in "lumps" in the distribution of
          imputed values), it may be advisable to place restrictions on
          the transformations (monotonicity or linearity). 

  nclass: number of bins to use in drawing histogram 

diagnostics: Specify 'diagnostics=TRUE' to draw plots of imputed values
          against sequential imputation numbers, separately for each
          missing observations and variable.  

    maxn: Maximum number of observations shown for diagnostics. 
          Default is 'maxn=10', which limits the number of observations
          plotted to at most the first 10. 

     ...: other arguments that are ignored 

_D_e_t_a_i_l_s:

     The sequence of steps used by the 'aregImpute' algorithm is the
     following. 
      (1) For each variable containing m 'NA's where m > 0, initialize
     the 'NA's to values from a random sample (without replacement if a
     sufficient number of non-missing values exist) of size m from the
     non-missing values. 
      (2) For '3+n.impute' iterations do the following steps.  The
     first 3 iterations provide a burn-in, and imputations are saved
     only from the last 'n.impute' iterations. 
      (3) For each variable containing any 'NA's, draw a sample with
     replacement from the observations in the entire dataset in which
     the current variable being imputed is non-missing.  Fit a flexible
     additive model to predict this target variable while finding the
     optimum transformation of it (unless the identity transformation
     is forced).  Use this fitted semiparametric model to predict the
     target variable in all of the original observations. Impute each
     missing value of the target variable with the observed value whose
     predicted transformed value is closest to the predicted
     transformed value of the missing value (if 'match="closest"' and
     'type="pmm"'),  or use a draw from a multinomial distribution with
     probabilities derived from distance weights, if 'match="weighted"'
     (the default). 
      (4) After these imputations are computed, use these random draw
     imputations the next time the curent target variable is used as a
     predictor of other sometimes-missing variables.

     When 'match="closest"', predictive mean matching does not work
     well when fewer than 3 variables are used to predict the target
     variable, because many of the multiple imputations for an
     observation will be identical.  In the extreme case of one
     right-hand-side variable and assuming that only monotonic
     transformations of left and right-side variables are allowed,
     every bootstrap resample will give predicted values of the target
     variable that are monotonically related to predicted values from
     every other bootstrap resample.  The same is true for Bayesian
     predicted values.  This causes predictive mean matching to always
     match on the same donor observation.

     When the missingness mechanism for a variable is so systematic
     that the distribution of observed values is truncated, predictive
     mean matching does not work.  It will only yield imputed values
     that are near observed values, so intervals in which no values are
     observed will not be populated by imputed values.  For this case,
     the only hope is to make regression assumptions and use
     extrapolation.  With 'type="regression"', 'aregImpute' will use
     linear extrapolation to obtain a (hopefully) reasonable
     distribution of imputed values.  The '"regression"' option causes
     'aregImpute' to impute missing values by adding a random sample of
     residuals (with replacement if there are more 'NA's than measured
     values) on the scale of the 'ace' or 'avas' transformed target
     variable. After random residuals are added, predicted random draws
     are obtained on the original untransformed scale using reverse
     linear interpolation on the table of original and 'ace' or 'avas'
     transformed target values (linear extrapolation when a random
     residual is large enough to put the random draw prediction outside
     the range of observed values). The bootstrap is used as with
     'type="pmm"' to factor in the uncertainty of the imputation model.

_V_a_l_u_e:

     a list of class '"aregImpute"' containing the following elements:

    call: the function call expression 

 formula: the formula specified to 'aregImpute' 

  method: the 'method' argument 

       n: total number of observations in input dataset 

       p: number of variables 

      na: list of subscripts of observations for which values were
          originally missing 

     nna: named vector containing the numbers of missing values in the
          data 

  linear: vector of names of variables restricted to be linear 

categorical: vector of names of categorical variables 

monotone: vector of names of variables restricted to be monotonic 

cat.levels: list containing character vectors specifying the 'levels'
          of categorical variables 

n.impute: number of multiple imputations per missing value 

 imputed: a list containing matrices of imputed values in the same
          format as those created by 'transcan'.  Categorical variables
          are coded using their integer codes.  Variables having no
          missing values will have 'NULL' matrices in the list. 

     rsq: for the last round of imputations, a vector containing the
          R-squares with which each sometimes-missing variable could be
          predicted from the others by 'ace' or 'avas'. 

_A_u_t_h_o_r(_s):

     Frank Harrell 
      Department of Biostatistics 
      Vanderbilt University 
      f.harrell@vanderbilt.edu

_S_e_e _A_l_s_o:

     'fit.mult.impute', 'transcan', 'ace', 'naclus', 'naplot', 'mice',
     'dotchart2', 'ecdf'

_E_x_a_m_p_l_e_s:

     # Multiple imputation and estimation of variances and covariances of
     # regression coefficient estimates accounting for imputation
     # Example 1: large sample size, much missing data, no overlap in
     # NAs across variables
     set.seed(3)
     x1 <- factor(sample(c('a','b','c'),1000,TRUE))
     x2 <- (x1=='b') + 3*(x1=='c') + rnorm(1000,0,2)
     x3 <- rnorm(1000)
     y  <- x2 + 1*(x1=='c') + .2*x3 + rnorm(1000,0,2)
     orig.x1 <- x1[1:250]
     orig.x2 <- x2[251:350]
     x1[1:250] <- NA
     x2[251:350] <- NA
     d <- data.frame(x1,x2,x3,y)

     # Use 100 imputations to better check against individual true values
     f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, data=d)
     f
     par(mfrow=c(2,1))
     plot(f)
     modecat <- function(u) {
      tab <- table(u)
      as.numeric(names(tab)[tab==max(tab)][1])
     }
     table(orig.x1,apply(f$imputed$x1, 1, modecat))
     par(mfrow=c(1,1))
     plot(orig.x2, apply(f$imputed$x2, 1, mean))
     fmi <- fit.mult.impute(y ~ x1 + x2 + x3, lm, f, 
                            data=d)
     sqrt(diag(Varcov(fmi)))
     fcc <- lm(y ~ x1 + x2 + x3)
     summary(fcc)   # SEs are larger than from mult. imputation

     # Example 2: Very discriminating imputation models,
     # x1 and x2 have some NAs on the same rows, smaller n
     set.seed(5)
     x1 <- factor(sample(c('a','b','c'),100,TRUE))
     x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100,0,.4)
     x3 <- rnorm(100)
     y  <- x2 + 1*(x1=='c') + .2*x3 + rnorm(100,0,.4)
     orig.x1 <- x1[1:20]
     orig.x2 <- x2[18:23]
     x1[1:20] <- NA
     x2[18:23] <- NA
     #x2[21:25] <- NA
     d <- data.frame(x1,x2,x3,y)
     n <- naclus(d)
     plot(n); naplot(n)  # Show patterns of NAs
     # 100 imputations to study them; normally use 5 or 10
     f  <- aregImpute(~y + x1 + x2 + x3, n.impute=100, defaultLinear=TRUE, data=d)
     par(mfrow=c(2,3))
     plot(f, diagnostics=TRUE, maxn=2)
     # Note: diagnostics=TRUE makes graphs similar to those made by:
     # r <- range(f$imputed$x2, orig.x2)
     # for(i in 1:6) {  # use 1:2 to mimic maxn=2
     #   plot(1:100, f$imputed$x2[i,], ylim=r,
     #        ylab=paste("Imputations for Obs.",i))
     #   abline(h=orig.x2[i],lty=2)
     # }

     table(orig.x1,apply(f$imputed$x1, 1, modecat))
     par(mfrow=c(1,1))
     plot(orig.x2, apply(f$imputed$x2, 1, mean))

     fmi <- fit.mult.impute(y ~ x1 + x2, lm, f, 
                            data=d)
     sqrt(diag(Varcov(fmi)))
     fcc <- lm(y ~ x1 + x2)
     summary(fcc)   # SEs are larger than from mult. imputation

     # Study relationship between smoothing parameter for weighting function
     # (multiplier of mean absolute distance of transformed predicted
     # values, used in tricube weighting function) and standard deviation
     # of multiple imputations.  SDs are computed from average variances
     # across subjects.  match="closest" same as match="weighted" with
     # small value of fweighted.
     # This example also shows problems with predicted mean
     # matching almost always giving the same imputed values when there is
     # only one predictor (regression coefficients change over multiple
     # imputations but predicted values are virtually 1-1 functions of each
     # other)

     set.seed(23)
     x <- runif(200)
     y <- x + runif(200, -.05, .05)
     r <- resid(lsfit(x,y))
     rmse <- sqrt(sum(r^2)/(200-2))   # sqrt of residual MSE

     y[1:20] <- NA
     d <- data.frame(x,y)
     f <- aregImpute(~ x + y, n.impute=10, match='closest', data=d)
     sd <- sqrt(mean(apply(f$imputed$y, 1, var)))

     ss <- c(0, .01, .02, seq(.05, 1, length=20))
     sds <- ss; sds[1] <- sd

     for(i in 2:length(ss)) {
       f <- aregImpute(~ x + y, n.impute=10, fweighted=ss[i])
       sds[i] <- sqrt(mean(apply(f$imputed$y, 1, var)))
     }

     plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
          type='b')
     abline(v=.2,  lty=2)  # default value of fweighted
     abline(h=rmse, lty=2)  # root MSE of residuals from linear regression

     ## Not run: 
     # Do a similar experiment for the Titanic dataset
     getHdata(titanic3)
     h <- lm(age ~ sex + pclass + survived, data=titanic3)
     rmse <- summary(h)$sigma
     set.seed(21)
     f <- aregImpute(~ age + sex + pclass + survived, n.impute=10,
                     data=titanic3, match='closest')
     sd <- sqrt(mean(apply(f$imputed$age, 1, var)))

     ss <- c(0, .01, .02, seq(.05, 1, length=20))
     sds <- ss; sds[1] <- sd

     for(i in 2:length(ss)) {
       f <- aregImpute(~ age + sex + pclass + survived, data=titanic3,
                       n.impute=10, fweighted=ss[i])
       sds[i] <- sqrt(mean(apply(f$imputed$age, 1, var)))
     }

     plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
          type='b')
     abline(v=.2,   lty=2)  # default value of fweighted
     abline(h=rmse, lty=2)  # root MSE of residuals from linear regression
     ## End(Not run)

