ols                  package:Design                  R Documentation

_L_i_n_e_a_r _M_o_d_e_l _E_s_t_i_m_a_t_i_o_n _U_s_i_n_g _O_r_d_i_n_a_r_y _L_e_a_s_t _S_q_u_a_r_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     Fits the usual weighted or unweighted linear regression model
     using the same fitting routines used by 'lm', but also stores the
     variance-covariance matrix 'var' and uses traditional
     dummy-variable coding for categorical factors.  Also fits
     unweighted models using penalized least squares, with the same
     penalization options as in the 'lrm' function.  For penalized
     estimation, there is a fitter function called 'lm.pfit'.

_U_s_a_g_e:

     ols(formula, data, weights, subset, na.action=na.delete, 
         method="qr", model=FALSE,
         x=FALSE, y=FALSE, se.fit=FALSE, linear.predictors=TRUE,
         penalty=0, penalty.matrix, tol=1e-7, sigma,
         var.penalty=c('simple','sandwich'), ...)

_A_r_g_u_m_e_n_t_s:

 formula: an S formula object, e.g.  
           Y ~ rcs(x1,5)*lsp(x2,c(10,20)) 

    data: name of an S data frame containing all needed variables. 
          Omit this to use a data frame already in the S "search
          list". 

 weights: an optional vector of weights to be used in the fitting
          process. If specified, weighted least squares is used with
          weights 'weights' (that is, minimizing sum(w*e^2)); otherwise
          ordinary least squares is used.
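
     As a minimal base-R sketch (independent of 'Design'), the weighted
     least squares criterion sum(w*e^2) is minimized by the closed-form
     estimate (X'WX)^{-1} X'W Y, which matches what 'lm' returns when
     given a 'weights' argument:

```r
set.seed(1)
n <- 50
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)
w <- runif(n)                        # arbitrary positive weights

X <- unname(cbind(1, x))             # design matrix with intercept
W <- diag(w)
beta.wls <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)  # (X'WX)^{-1} X'W Y

fit <- lm(y ~ x, weights = w)        # minimizes sum(w * e^2)
all.equal(as.vector(beta.wls), unname(coef(fit)))      # TRUE
```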

  subset: an expression defining a subset of the observations to use in
          the fit.  The default is to use all observations.  Specify
          for example 'age>50 & sex=="male"' or 'c(1:100,200:300)'
          respectively to use the observations satisfying a logical
          expression or those having row numbers in the given vector. 

na.action: specifies an S function to handle missing data.  The default
          is the function 'na.delete', which causes observations with
          any variable missing to be deleted.  The main difference
          between 'na.delete' and the S-supplied function 'na.omit' is
          that  'na.delete' makes a list of the number of observations
          that are missing on each variable in the model. The
          'na.action' is usually specified by e.g.
          'options(na.action="na.delete")'. 

  method: specifies a particular fitting method, or '"model.frame"'
          instead to return the model frame of the predictor and
          response variables satisfying any subset or missing value
          checks. 

   model: default is 'FALSE'.  Set to 'TRUE' to return the model frame
          as element 'model' of the fit object. 

       x: default is 'FALSE'.  Set to 'TRUE' to return the expanded
          design matrix as element 'x' (without intercept indicators)
          of the returned fit object.  'x=TRUE' is required if you are
          going to use the 'residuals' function later to return
          anything other than ordinary residuals. 

       y: default is 'FALSE'.  Set to 'TRUE' to return the vector of
          response values  as element 'y' of the fit. 

  se.fit: default is 'FALSE'.  Set to 'TRUE' to compute the estimated
          standard errors of the estimate of X beta and store them in
          element 'se.fit' of the fit.  

linear.predictors: set to 'FALSE' to cause predicted values not to be
          stored 

 penalty: see 'lrm' 

penalty.matrix: see 'lrm' 

     tol: tolerance for information matrix singularity

   sigma: If 'sigma' is given, it is taken as the actual root mean
          squared error parameter for the model.  Otherwise 'sigma' is
          estimated from the data using the usual formulas (except for
          penalized models).  It is often convenient to specify
          'sigma=1' for models with no error, when using 'fastbw' to
          find an approximate model that predicts the predicted values
          from the full model with a given accuracy. 

var.penalty: the type of variance-covariance matrix to be stored in the
          'var' component of the fit when penalization is used.  The
          default is the inverse of the penalized information matrix. 
          Specify 'var.penalty="sandwich"' to use the sandwich
          estimator (see below under 'var'), which limited simulation
          studies have shown yields variance estimates that are too
          low. 

     ...: arguments to pass to 'lm.wfit' or 'lm.fit'

_D_e_t_a_i_l_s:

     For penalized estimation, the penalty factor on the log likelihood
     is -0.5 beta' P beta / sigma^2, where P is 'penalty factors *
     penalty.matrix' (see 'Value' below). The
     penalized maximum likelihood estimate (penalized least squares or
     ridge estimate) of beta is (X'X + P)^{-1} X'Y. The maximum
     likelihood estimate of sigma^2 is (sse + beta' P beta) / n, where
     'sse' is the sum of squared errors (residuals). The
     'effective.df.diagonal' vector is the diagonal of the matrix
     X'X/(sse/n) sigma^{2} (X'X + P)^{-1}.
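
     The penalized estimate (X'X + P)^{-1} X'Y above can be illustrated
     with a small base-R sketch.  This mirrors the formula only on toy
     data; it does not call 'Design''s 'lm.pfit', and uses an
     illustrative diagonal P that leaves the intercept unpenalized:

```r
set.seed(2)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)

X <- unname(cbind(1, x1, x2))      # design matrix with intercept
P <- diag(c(0, 5, 5))              # penalty matrix; intercept unpenalized

beta.pen <- solve(crossprod(X) + P, crossprod(X, y))  # (X'X + P)^{-1} X'Y
beta.ols <- solve(crossprod(X),     crossprod(X, y))  # (X'X)^{-1} X'Y

## beta.pen minimizes the penalized sum of squares sse + beta' P beta:
pen.obj <- function(b) sum((y - X %*% b)^2) + drop(t(b) %*% P %*% b)
pen.obj(beta.pen) <= pen.obj(beta.ols)   # TRUE
```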

_V_a_l_u_e:

     the same objects returned from 'lm' (unless 'penalty' or
     'penalty.matrix' are given - then an abbreviated list is returned
     since 'lm.pfit' is used as a fitter) plus the design attributes
     (see 'Design'). Predicted values are always returned, in the
     element 'linear.predictors'. The vectors or matrix stored if
     'y=TRUE' or 'x=TRUE' have rows deleted according to 'subset' and
     to missing data, and have names or row names that come from the
     data frame used as input data.  If 'penalty' or 'penalty.matrix'
     is given,  the 'var' matrix returned is an improved
     variance-covariance matrix for the penalized regression
     coefficient estimates.  If 'var.penalty="sandwich"' (not the
     default, as limited simulation studies have found it provides
     variance estimates that are too low) it is defined as  sigma^{2}
     (X'X + P)^{-1} X'X (X'X + P)^{-1}, where P is  'penalty factors *
     penalty.matrix', with a column and row of zeros added for the
     intercept.  When 'var.penalty="simple"' (the default), 'var' is
     sigma^{2} (X'X + P)^{-1}. The returned list has a vector 'stats'
     with named elements 'n, Model L.R., d.f., R2, Sigma'.  'Model
     L.R.' is the model likelihood ratio chi-square statistic, and 'R2'
     is R^2.  For penalized estimation, 'd.f.' is the  effective
     degrees of freedom, which is the sum of the elements of another
     vector returned, 'effective.df.diagonal', minus one for the
     intercept. 'Sigma' is the penalized maximum likelihood estimate
     (see 'Details' above).
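
     The two variance-covariance forms can be written out directly in
     base R.  This is an illustrative sketch on toy data (not a call
     into 'ols' itself); note that with P = 0 the sandwich form reduces
     to the simple form sigma^2 (X'X)^{-1}:

```r
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 1 + x + rnorm(n)

X   <- unname(cbind(1, x))
P   <- diag(c(0, 2))                   # penalize the slope only
XtX <- crossprod(X)                    # X'X
A   <- solve(XtX + P)                  # (X'X + P)^{-1}

beta <- A %*% crossprod(X, y)
sse  <- sum((y - X %*% beta)^2)
sig2 <- (sse + drop(t(beta) %*% P %*% beta)) / n   # penalized MLE of sigma^2

var.simple   <- sig2 * A                 # var.penalty="simple"
var.sandwich <- sig2 * A %*% XtX %*% A   # var.penalty="sandwich"

## With P = 0 the two forms coincide:
A0 <- solve(XtX)
all.equal(sig2 * A0, sig2 * A0 %*% XtX %*% A0)   # TRUE
```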

_A_u_t_h_o_r(_s):

     Frank Harrell
      Department of Biostatistics, Vanderbilt University
      f.harrell@vanderbilt.edu

_S_e_e _A_l_s_o:

     'Design', 'Design.trans', 'anova.Design', 'summary.Design',
     'predict.Design', 'fastbw', 'validate', 'calibrate',
     'plot.Design',  'specs.Design', 'cph', 'lrm', 'which.influence',
     'lm', 'summary.lm', 'print.ols', 'residuals.ols', 'latex.ols',
     'na.delete', 'na.detail.response', 'naresid', 'datadist',
     'pentrace', 'vif', 'abs.error.pred'

_E_x_a_m_p_l_e_s:

     set.seed(1)
     x1 <- runif(200)
     x2 <- sample(0:3, 200, TRUE)
     distance <- (x1 + x2/3 + rnorm(200))^2
     d <- datadist(x1,x2)
     options(datadist="d")   # No d -> no summary, plot without giving all details

     f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
     # could use d <- datadist(f); options(datadist="d") at this point,
     # but predictor summaries would not be stored in the fit object for
     # use with plot.Design, summary.Design.  In that case, the original
     # dataset or d would need to be accessed later, or all variable values
     # would have to be specified to summary, plot
     anova(f)
     which.influence(f)
     summary(f)
     summary.lm(f)    # will only work if penalty and penalty.matrix not used

     # Fit a complex model and approximate it with a simple one
     x1 <- runif(200)
     x2 <- runif(200)
     x3 <- runif(200)
     x4 <- runif(200)
     y <- x1 + x2 + rnorm(200)
     f    <- ols(y ~ rcs(x1,4) + x2 + x3 + x4)
     pred <- fitted(f)   # or predict(f) or f$linear.predictors
     f2   <- ols(pred ~ rcs(x1,4) + x2 + x3 + x4, sigma=1)
     # sigma=1 prevents numerical problems resulting from R2=1
     fastbw(f2, aics=100000)
     # This will find the best 1-variable model, best 2-variable model, etc.
     # in predicting the predicted values from the original model
     options(datadist=NULL)

