bootcov                package:Design                R Documentation

_B_o_o_t_s_t_r_a_p _C_o_v_a_r_i_a_n_c_e _a_n_d _D_i_s_t_r_i_b_u_t_i_o_n _f_o_r _R_e_g_r_e_s_s_i_o_n _C_o_e_f_f_i_c_i_e_n_t_s

_D_e_s_c_r_i_p_t_i_o_n:

     'bootcov' computes a bootstrap estimate of the covariance matrix
     for a set of regression coefficients from 'ols', 'lrm', 'cph',
     'psm' and any other fit where 'x=TRUE, y=TRUE' was used to store
     the data used in making the original regression fit and where an
     appropriate 'fitter' function is provided here.  The estimates
     obtained are not conditional on the design matrix, but are instead
     unconditional estimates.  For small sample sizes, this will make a
     difference as the unconditional variance estimates are larger. 
     This function will also obtain bootstrap estimates corrected for
     cluster sampling (intra-cluster correlations) when a "working
     independence" model was used to fit data which were correlated
     within clusters.  This is done by substituting cluster sampling
     with replacement for the usual simple sampling with replacement. 
     'bootcov' has an option ('coef.reps') that causes all of the
     regression coefficient estimates from all of the bootstrap
     re-samples to be saved, facilitating computation of nonparametric
     bootstrap confidence limits and plotting of the distributions of
     the coefficient estimates (using histograms and kernel smoothing
     estimates).

     The 'loglik' option facilitates the calculation of simultaneous
     confidence regions from quantities of interest that are functions
     of the regression coefficients, using the method of
     Tibshirani(1996). With Tibshirani's method, one computes the
     objective criterion (-2 log likelihood evaluated at the bootstrap
     estimate of beta but with respect to the original design matrix
     and response vector) for the original fit as well as for all of
     the bootstrap fits.  The confidence set of the regression
     coefficients is the set of all coefficients that are associated
     with objective function values that are less than or equal to say
     the 0.95 quantile of the vector of 'B + 1' objective function
     values.  For the coefficients satisfying this condition, predicted
     values are computed at a user-specified design matrix 'X', and
     minima and maxima of these predicted values (over the qualifying
     bootstrap repetitions) are computed to derive the final
     simultaneous confidence band.

     The 'bootplot' function takes the output of 'bootcov' and  either
     plots a histogram and kernel density estimate of specified
     regression coefficients (or linear combinations of them through
     the use of a specified design matrix 'X'), or a 'qqnorm' plot of
     the quantities of interest to check for normality of the maximum
     likelihood estimates.  'bootplot' draws vertical lines at
     specified quantiles of the bootstrap distribution, and returns
     these quantiles for possible printing by the user.  Bootstrap
     estimates may optionally be transformed by a user-specified
     function 'fun' before plotting.

     The 'confplot' function also uses the output of 'bootcov' but to
     compute and optionally plot nonparametric bootstrap pointwise
     confidence limits or (by default) Tibshirani (1996) simultaneous
     confidence sets. A design matrix must be specified to allow
     'confplot' to compute quantities of interest such as predicted
     values across a range of values or differences in predicted values
     (plots of effects of changing one or more predictor variable
     values).

     'bootplot' and 'confplot' are actually generic functions, with the
     particular functions 'bootplot.bootcov' and 'confplot.bootcov'
     automatically invoked for 'bootcov' objects.

     A service function called 'histdensity' is also provided (for use
     with 'bootplot').  It runs 'hist' and 'density' on the same plot,
     using twice the number of classes than the default for 'hist', and
     1.5 times the 'width' than the default used by 'density'.

     A comprehensive example demonstrates the use of all of the
     functions.

_U_s_a_g_e:

     bootcov(fit, cluster, B=200, fitter, 
             coef.reps=FALSE, loglik=coef.reps,
             pr=FALSE, maxit=15, group)

     bootplot(obj, which, X,
              conf.int=c(.9,.95,.99),
              what=c('density','qqnorm'),
              fun=function(x)x, labels., ...)

     confplot(obj, X, against, 
              method=c('simultaneous','pointwise'),
              conf.int=0.95, fun=function(x)x,
              add=FALSE, lty.conf=2, ...)

     histdensity(y, xlab, nclass, width, mult.width=1, ...)

_A_r_g_u_m_e_n_t_s:

     fit: a fit object containing components 'x' and 'y'.  For fits
          from 'cph', the '"strata"' attribute of the 'x' component is
          used to obtain the vector of stratum codes. 

     obj: an object created by 'bootcov' with 'coef.reps=TRUE'. 

       X: a design matrix specified to 'confplot'.  See
          'predict.Design' or 'contrast.Design'.  For 'bootplot', 'X'
          is optional. 

       y: a vector to pass to 'histdensity'.  'NA's are ignored. 

 cluster: a variable indicating groupings. 'cluster' may be any type of
          vector (factor, character, integer). Unique values of
          'cluster' indicate possibly correlated groupings of
          observations. Note the data used in the fit and stored in
          'fit$x' and 'fit$y' may have had observations containing
          missing values deleted.  It is assumed that if there were any
          NAs, an 'naresid' function exists for the class of 'fit'.
          This function restores NAs so that the rows of the design
          matrix coincide with 'cluster'. 

       B: number of bootstrap repetitions.  Default is 200. 

  fitter: the name of a function with arguments '(x,y)' that will fit
          bootstrap samples.  Default is taken from the class of 'fit'
          if it is 'ols', 'lrm', 'cph', 'psm'. 

coef.reps: set to 'TRUE' if you want to store a matrix of all bootstrap
          regression coefficient estimates in the returned component
          'boot.Coef'. 

  loglik: set to 'TRUE' to store -2 log likelihoods for each bootstrap
          model, evaluated against the original 'x' and 'y' data.  The
          default is to do this when 'coef.reps' is specified as
          'TRUE'.  The use of 'loglik=TRUE' assumes that an
          'oos.loglik' method exists for the type of model being
          analyzed, to calculate out-of-sample -2 log likelihoods (see
          'Design.Misc'). After the 'B' -2 log likelihoods (stored in
          the element named 'boot.loglik' in the returned fit object),
          the 'B+1' element is the -2 log likelihood for the original
          model fit. 

      pr: set to 'TRUE' to print the current sample number to monitor
          progress. 

   maxit: maximum number of iterations, to pass to 'fitter' 

   group: a grouping variable used to stratify the sample upon
          bootstrapping. This allows one to handle k-sample problems,
          i.e., each bootstrap sample will be forced to select the same
          number of observations from each level of group as the number
          appearing in the original dataset. You may specify both
          'group' and 'cluster'. 

   which: one or more integers specifying which regression coefficients
          to plot for 'bootplot' 

conf.int: a vector (for 'bootplot', default is 'c(.9,.95,.99)') or
          scalar  (for 'confplot', default is '.95') confidence level. 

    what: for 'bootplot', specifies whether a density or a q-q plot is
          made 

     fun: for 'bootplot' or 'confplot' specifies a function used to
          translate the quantities of interest before analysis.  A
          common choice is 'fun=exp' to compute anti-logs, e.g., odds
          ratios. 

 labels.: a vector of labels for labeling the axes in plots produced by
          'bootplot'. Default is row names of 'X' if there are any, or
          sequential integers. 

     ...: For 'bootplot' these are optional arguments passed to
          'histdensity'.  Also may be optional arguments passed to
          'plot' by 'confplot' or optional arguments passed to 'hist'
          from 'histdensity', such as 'xlim' and 'breaks'.  The
          argument 'probability=TRUE' is always passed to 'hist'. 

 against: For 'confplot', specifying 'against' causes a plot to be made
          (or added to). The 'against' variable is associated with rows
          of 'X' and is used as the x-coordinates. 

  method: specifies whether '"pointwise"' or '"simultaneous"'
          confidence regions are derived by 'confplot'.  The default is
          simultaneous. 

     add: set to 'TRUE' to add to an existing plot, for 'confplot' 

lty.conf: line type for plotting confidence bands in 'confplot'. 
          Default is 2 for dotted lines. 

    xlab: label for x-axis for 'histdensity'.  Default is 'label'
          attribute or argument name if there is no 'label'. 

  nclass: passed to 'hist' if present 

   width: passed to 'density' if present 

mult.width: multiplier by which to adjust the default 'width' passed to
          'density'. Default is 1. 

_D_e_t_a_i_l_s:

     If the fit has a scale parameter (e.g., a fit from 'psm'), the log
     of the individual bootstrap scale estimates are added to the
     vector of parameter estimates and and column and row for the log
     scale are added to the new covariance matrix (the old covariance
     matrix also has this row and column).

_V_a_l_u_e:

     a new fit object with class of the original object and with the
     element 'orig.var' added. 'orig.var' is the covariance matrix of
     the original fit.  Also, the original 'var' component is replaced
     with the new bootstrap estimates.  The component 'boot.coef' is
     also added.  This contains the mean bootstrap estimates of
     regression coefficients (with a log scale element added if
     applicable).  'boot.Coef' is added if 'coef.reps=TRUE'. 
     'boot.loglik' is added if 'loglik=TRUE'.

     'bootplot' returns a (possible matrix) of quantities of interest
     and the requested quantiles of them.  'confplot' returns three
     vectors: 'fitted', 'lower', and 'upper'.

_S_i_d_e _E_f_f_e_c_t_s:

     'bootcov' prints if 'pr=TRUE'

_A_u_t_h_o_r(_s):

     Frank Harrell
      Department of Biostatistics
      Vanderbilt University
      f.harrell@vanderbilt.edu

     Bill Pikounis
      Biometrics Research Department
      Merck Research Laboratories
      v_bill_pikounis@merck.com

_R_e_f_e_r_e_n_c_e_s:

     Feng Z, McLerran D, Grizzle J (1996): A comparison of statistical
     methods for clustered data analysis with Gaussian error.  Stat in
     Med 15:1793-1806.

     Tibshirani R, Knight K (1996): Model search and inference by
     bootstrap  "bumping". Department of Statistics, University of
     Toronto.  Technical report available from 
      http://www-stat.stanford.edu/~tibs/. Presented at the Joint
     Statistical Meetings, Chicago, August 1996.

_S_e_e _A_l_s_o:

     'robcov', 'sample', 'Design', 'lm.fit', 'lrm.fit', 'coxph.fit',
     'survreg.fit', 'predab.resample', 'Design.Misc', 'predict.Design',
     'gendata', 'contrast.Design'

_E_x_a_m_p_l_e_s:

     set.seed(191)
     x <- exp(rnorm(200))
     logit <- 1 + x/2
     y <- ifelse(runif(200) <= plogis(logit), 1, 0)
     f <- lrm(y ~ pol(x,2), x=TRUE, y=TRUE)
     g <- bootcov(f, B=50, pr=TRUE, coef.reps=TRUE)
     anova(g)    # using bootstrap covariance estimates
     fastbw(g)   # using bootstrap covariance estimates
     beta <- g$boot.Coef[,1]
     hist(beta, nclass=15)     #look at normality of parameter estimates
     qqnorm(beta)
     # bootplot would be better than these last two commands

     # A dataset contains a variable number of observations per subject,
     # and all observations are laid out in separate rows. The responses
     # represent whether or not a given segment of the coronary arteries
     # is occluded. Segments of arteries may not operate independently
     # in the same patient.  We assume a "working independence model" to
     # get estimates of the coefficients, i.e., that estimates assuming
     # independence are reasonably efficient.  The job is then to get
     # unbiased estimates of variances and covariances of these estimates.

     set.seed(1)
     n.subjects <- 30
     ages <- rnorm(n.subjects, 50, 15)
     sexes  <- factor(sample(c('female','male'), n.subjects, TRUE))
     logit <- (ages-50)/5
     prob <- plogis(logit)  # true prob not related to sex
     id <- sample(1:n.subjects, 300, TRUE) # subjects sampled multiple times
     table(table(id))  # frequencies of number of obs/subject
     age <- ages[id]
     sex <- sexes[id]
     # In truth, observations within subject are independent:
     y   <- ifelse(runif(300) <= prob[id], 1, 0)
     f <- lrm(y ~ lsp(age,50)*sex, x=TRUE, y=TRUE)
     g <- bootcov(f, id, B=50)  # usually do B=200 or more
     diag(g$var)/diag(f$var)
     # add ,group=w to re-sample from within each level of w
     anova(g)            # cluster-adjusted Wald statistics
     # fastbw(g)         # cluster-adjusted backward elimination
     plot(g, age=30:70, sex='female')  # cluster-adjusted confidence bands

     # Get design effects based on inflation of the variances when compared
     # with bootstrap estimates which ignore clustering
     g2 <- bootcov(f, B=50)
     diag(g$var)/diag(g2$var)

     # Get design effects based on pooled tests of factors in model
     anova(g2)[,1] / anova(g)[,1]

     # Simulate binary data where there is a strong 
     # age x sex interaction with linear age effects 
     # for both sexes, but where not knowing that
     # we fit a quadratic model.  Use the bootstrap
     # to get bootstrap distributions of various
     # effects, and to get pointwise and simultaneous
     # confidence limits

     set.seed(71)
     n   <- 500
     age <- rnorm(n, 50, 10)
     sex <- factor(sample(c('female','male'), n, rep=TRUE))
     L   <- ifelse(sex=='male', 0, .1*(age-50))
     y   <- ifelse(runif(n)<=plogis(L), 1, 0)

     f <- lrm(y ~ sex*pol(age,2), x=TRUE, y=TRUE)
     b <- bootcov(f, B=50, coef.reps=TRUE, pr=TRUE)   # better: B=500

     par(mfrow=c(2,3))
     # Assess normality of regression estimates
     bootplot(b, which=1:6, what='qq')
     # They appear somewhat non-normal

     # Plot histograms and estimated densities 
     # for 6 coefficients
     w <- bootplot(b, which=1:6)
     # Print bootstrap quantiles
     w$quantiles

     # Estimate regression function for females
     # for a sequence of ages
     ages <- seq(25, 75, length=100)
     label(ages) <- 'Age'

     # Plot fitted function and pointwise normal-
     # theory confidence bands
     par(mfrow=c(1,1))
     p <- plot(f, age=ages, sex='female')
     w <- p$x.xbeta
     # Save curve coordinates for later automatic
     # labeling using labcurve in the Hmisc library
     curves <- vector('list',8)
     curves[[1]] <- list(x=w[,1],y=w[,3])
     curves[[2]] <- list(x=w[,1],y=w[,4])

     # Add pointwise normal-distribution confidence 
     # bands using unconditional variance-covariance
     # matrix from the 500 bootstrap reps
     p <- plot(b, age=ages, sex='female', add=TRUE, lty=3)
     w <- p$x.xbeta
     curves[[3]] <- list(x=w[,1],y=w[,3])
     curves[[4]] <- list(x=w[,1],y=w[,4])

     dframe <- expand.grid(sex='female', age=ages)
     X <- predict(f, dframe, type='x')  # Full design matrix

     # Add pointwise bootstrap nonparametric 
     # confidence limits
     p <- confplot(b, X=X, against=ages, method='pointwise',
                   add=TRUE, lty.conf=4)
     curves[[5]] <- list(x=ages, y=p$lower)
     curves[[6]] <- list(x=ages, y=p$upper)

     # Add simultaneous bootstrap confidence band
     p <- confplot(b, X=X, against=ages, add=TRUE, lty.conf=5)
     curves[[7]] <- list(x=ages, y=p$lower)
     curves[[8]] <- list(x=ages, y=p$upper)
     lab <- c('a','a','b','b','c','c','d','d')
     labcurve(curves, lab)

     # Now get bootstrap simultaneous confidence set for
     # female:male odds ratios for a variety of ages

     dframe <- expand.grid(age=ages, sex=c('female','male'))
     X <- predict(f, dframe, type='x')  # design matrix
     f.minus.m <- X[1:100,] - X[101:200,]
     # First 100 rows are for females.  By subtracting
     # design matrices are able to get Xf*Beta - Xm*Beta
     # = (Xf - Xm)*Beta

     confplot(b, X=f.minus.m, against=ages,
              method='pointwise', ylab='F:M Log Odds Ratio')
     confplot(b, X=f.minus.m, against=ages,
              lty.conf=3, add=TRUE)

     # contrast.Design makes it easier to compute the design matrix for use
     # in bootstrapping contrasts:

     f.minus.m <- contrast(f, list(sex='female',age=ages),
                              list(sex='male',  age=ages))$X
     confplot(b, X=f.minus.m)

     # For a quadratic binary logistic regression model use bootstrap
     # bumping to estimate coefficients under a monotonicity constraint
     set.seed(177)
     n <- 400
     x <- runif(n)
     logit <- 3*(x^2-1)
     y <- rbinom(n, size=1, prob=plogis(logit))
     f <- lrm(y ~ pol(x,2), x=TRUE, y=TRUE)
     k <- coef(f)
     k
     vertex <- -k[2]/(2*k[3])
     vertex

     # Outside [0,1] so fit satisfies monotonicity constraint within
     # x in [0,1], i.e., original fit is the constrained MLE

     g <- bootcov(f, B=50, coef.reps=TRUE)
     bootcoef <- g$boot.Coef    # 100x3 matrix
     vertex <- -bootcoef[,2]/(2*bootcoef[,3])
     table(cut2(vertex, c(0,1)))
     mono <- !(vertex >= 0 & vertex <= 1)
     mean(mono)    # estimate of Prob{monotonicity in [0,1]}

     var(bootcoef)   # var-cov matrix for unconstrained estimates
     var(bootcoef[mono,])   # for constrained estimates

     # Find second-best vector of coefficient estimates, i.e., best
     # from among bootstrap estimates
     g$boot.Coef[order(g$boot.loglik[-length(g$boot.loglik)])[1],]
     # Note closeness to MLE

