Overview               package:Design               R Documentation

_O_v_e_r_v_i_e_w _o_f _D_e_s_i_g_n _L_i_b_r_a_r_y

_D_e_s_c_r_i_p_t_i_o_n:

     Design does regression modeling, testing, estimation, validation,
     graphics, prediction, and typesetting by storing enhanced model
     design attributes in the fit.

     Design is a collection of about 180 functions that assist and
     streamline modeling, especially for biostatistical and
     epidemiologic applications.  It also contains new functions for
     binary and ordinal logistic regression models and the
     Buckley-James multiple regression model for right-censored
     responses, and implements penalized maximum likelihood estimation
     for logistic and ordinary linear models.  Design works with almost
     any regression model, but it was especially written to work with
     logistic regression, Cox regression, accelerated failure time
     models, ordinary linear models, and the Buckley-James model. You
     should install the Hmisc library before using Design, as a few of
     Design's options use Hmisc functions, and Hmisc has several
     functions useful for data analysis (especially data reduction and
     imputation).

_D_e_t_a_i_l_s:

     To make use of automatic typesetting features you must have LaTeX
     or one of its variants installed.

     Some aspects of Design (e.g., 'latex') will not work correctly if
     'options(contrasts=)' other than 'c("contr.treatment",
     "contr.poly")' are used.

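      A one-line sketch that sets the required contrasts explicitly
      (these are R's defaults; S-Plus defaults to 'contr.helmert' for
      unordered factors):

      'options(contrasts=c("contr.treatment", "contr.poly"))'
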
     Design relies on a wealth of survival analysis functions written
     by Terry Therneau of Mayo Clinic. Front-ends have been written for
     several of Therneau's functions, and other functions have been
     slightly modified.

_S_t_a_t_i_s_t_i_c_a_l _M_e_t_h_o_d_s _I_m_p_l_e_m_e_n_t_e_d:

        *  Ordinary linear regression models

        *  Binary and ordinal logistic models (proportional odds and
           continuation ratio models)

        *  Cox model

        *  Parametric survival models in the accelerated failure time
           class

        *  Buckley-James least-squares linear regression model with
           possibly right-censored responses

        *  Bootstrap model validation to obtain unbiased estimates of
           model performance without requiring a separate validation
           sample

        *  Automatic Wald tests of all effects in the model that are
           not parameterization-dependent (e.g., tests of nonlinearity
           of main effects when the variable does not interact with
           other variables, tests of nonlinearity of interaction
           effects, tests for whether a predictor is important, either
           as a main effect or as an effect modifier)

        *  Graphical depictions of model estimates (effect plots,
           odds/hazard ratio plots, nomograms that allow model
           predictions to be obtained manually even when there are
           nonlinear effects and interactions in the model)

        *  Various smoothed residual plots, including some new residual
           plots for verifying ordinal logistic model assumptions

        *  Composing S functions to evaluate the linear predictor
           (X*beta hat), hazard function, survival function, and
           quantile functions analytically from the fitted model

        *  Typesetting of fitted model using LaTeX

        *  Robust covariance matrix estimation (Huber or bootstrap)

        *  Cubic regression splines with linear tail restrictions
           (natural splines)

        *  Tensor splines

        *  Interactions restricted to not be doubly nonlinear

        *  Penalized maximum likelihood estimation for ordinary linear
           regression and logistic regression models. Different parts
           of the model may be penalized by different amounts, e.g.,
           you may want to penalize interaction or nonlinear effects
           more than main effects or linear effects (a sketch appears
           after this list)

        *  Estimation of hazard or odds ratios in presence of
           nonlinearity and interaction

        *  Sensitivity analysis for an unmeasured binary confounder in
           a binary logistic model

        *  Multiple imputation of repeated measures data with non-
           random dropout using propensity score matching
           (experimental, not yet functional)
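
      As a sketch of the differential penalization described above
      (variable names are hypothetical; 'x=TRUE, y=TRUE' retain the
      design matrix and response for later use by 'pentrace'):

      'f <- lrm(y ~ rcs(age,4)*sex, x=TRUE, y=TRUE,'
      '         penalty=list(simple=1, nonlinear=4, interaction=4))'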

_M_o_t_i_v_a_t_i_o_n:

     Design was motivated by the following needs:

        *  need to automatically print interesting Wald tests that can
           be constructed from the design

           *  tests of linearity with respect to each predictor

           *  tests of linearity of interactions

           *  pooled interaction tests (e.g., all interactions
              involving race)

            *  pooled tests of effects combined with their higher-order
               effects

               *  test of main effect alone not meaningful when the
                  variable is involved in an interaction

              *  pooled test of main effect + interaction effect is
                 meaningful

              *  test of 2nd-order interaction + any 3rd-order
                 interaction containing those factors is meaningful


        *  need to store transformation parameters with the fit

           *  example: knot locations for spline functions

           *  these are "remembered" when getting predictions, unlike
              standard S or R

           *  for categorical predictors, save levels so that same
              dummy variables will be generated for predictions; check
              that all levels in out-of-data predictions were present
              when model was fitted


        *  need for uniform re-insertion of observations deleted
           because of NAs when using 'predict' without 'newdata' or
           when using 'resid'

        *  need to easily plot the regression effect of any predictor

           *  example: age is represented by a linear spline with knots
              at 40 and 60y; plot effect of age on log odds of disease,
              adjusting interacting factors to easily specified
              constants

           *  vary 2 predictors: plot x1 on x-axis, separate curves for
              discrete x2 or 3d perspective plot for continuous x2

           *  if predictor is represented as a function in the model,
              plots should be with respect to the original variable:
                'f <- lrm(y ~ log(cholesterol)+age)'
                'plot(f, cholesterol=NA)  # cholesterol on x-axis,'
                '                         # default range'


        *  need to store summary of distribution of predictors with the
           fit

           *  plotting limits (default: 10th smallest, 10th largest
              values or %-tiles)

           *  effect limits   (default: .25 and .75 quantiles for
              continuous vars.)

           *  adjustment values for other predictors (default: median
              for continuous predictors, most frequent level for
              categorical ones)

           *  discrete numeric predictors: list of possible values
              example: x=0,1,2,3,5 -> by default don't plot prediction
              at x=4

           *  values are on the inner-most variable, e.g. cholesterol,
              not log(chol.)

           *  allows estimation/plotting long after original dataset
              has been deleted

           *  for Cox models, underlying survival also stored with fit,
              so original data not needed to obtain predicted survival
              curves


        *  need to automatically print estimates of effects in presence
           of nonlinearity and interaction

           *  example: age is quadratic, interacting with sex; default
              effect is inter-quartile-range hazard ratio (for Cox
              model), for sex=reference level

           *  user-controlled effects: 'summary(fit, age=c(30,50),
              sex="female")' -> odds ratios for logistic model,
              relative survival time for accelerated failure time
              survival models

           *  effects for all variables (e.g. odds ratios) may be
              plotted with multiple-confidence-level bars


        *  need for prettier and more concise effect names in
           printouts, especially for expanded nonlinear terms and
           interaction terms

           *  use inner-most variable name to identify predictors

           *  e.g. for 'pmin(x^2-3,10)' refer to the factor by the
              legal S-name 'x'


        *  need to recognize that an intercept is not always a simple
           concept

           *  some models (e.g., Cox) have no intercept

           *  some models (e.g., ordinal logistic) have multiple
              intercepts


        *  need for automatic high-quality printing of fitted
           mathematical model (with dummy variables defined, regression
           spline terms simplified, interactions "factored").  Focus is
           on regression splines instead of nonparametric smoothers or
           smoothing splines, so that explicit formulas for fit may be
           obtained for use outside S. Design can also compose S
           functions to evaluate X*Beta from the fitted model
           analytically, as well as compose SAS code to do this.

        *  need for automatic drawing of nomogram to represent the
           fitted model

        *  need for automatic bootstrap validation of a fitted model,
           with only one S command (with respect to calibration and
           discrimination)

        *  need for robust (Huber sandwich) estimator of covariance
           matrix, and be able to do all other analysis (e.g., plots,
           C.L.) using the adjusted covariances

        *  need for robust (bootstrap) estimator of covariance matrix,
           easily used in other analyses without change

        *  need for Huber sandwich and bootstrap covariance matrices
           adjusted for cluster sampling

        *  need for routine reporting of how many observations were
           deleted by missing values on each predictor (see 'na.delete'
           in Hmisc)

        *  need for optional reporting of descriptive statistics for Y
           stratified by missing status of each X (see
           'na.detail.response')

        *  need for pretty, annotated survival curves, using the same
           commands for parametric and Cox models

        *  need for ordinal logistic model (proportional odds model,
           continuation ratio model)

_F_i_t_t_i_n_g _F_u_n_c_t_i_o_n_s _C_o_m_p_a_t_i_b_l_e _w_i_t_h _D_e_s_i_g_n:

     Design will work with a wide variety of fitting functions, but it
     is meant especially for the following:

       *Function*  *Purpose*                               *Related S*
                                                           *Functions*
       *'ols'*     Ordinary least squares linear model     'lm'
       *'lrm'*     Binary and ordinal logistic regression  'glm'
                   model                                   'cr.setup'
       *'psm'*     Accelerated failure time parametric     'survreg'
                   survival model                          
       *'cph'*     Cox proportional hazards regression     'coxph'
       *'bj'*      Buckley-James censored least squares    'survreg'
                   linear model                            
       *'glmD'*    Version of glm for use with Design      
       *'glsD'*    Version of gls for use with Design      
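
      For instance, a binary logistic model that standard S would fit
      with 'glm(y ~ x, family=binomial)' is fitted with the Design
      analogue as follows (a minimal sketch with hypothetical
      variables):

      'f <- lrm(y ~ rcs(x,4))   # Design counterpart of glm/binomial'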

_M_e_t_h_o_d_s _i_n _D_e_s_i_g_n:

     The following generic functions work with model fits created with
     Design in effect:

       *Function*           *Purpose*                                 *Related*
                                                                      *Functions*
       *'print'*            Print parameters and statistics of fit    
       *'coef'*             Fitted regression coefficients            
       *'formula'*          Formula used in the fit                   
       *'specs'*            Detailed specifications of fit            
       *'robcov'*           Robust covariance matrix estimates        
       *'bootcov'*          Bootstrap covariance matrix estimates     
       *'summary'*          Summary of effects of predictors          
       *'plot.summary'*     Plot continuously shaded confidence       
                            bars for results of summary               
       *'anova'*            Wald tests of most meaningful hypotheses  
       *'contrast'*         General contrasts, C.L., tests            
       *'plot.anova'*       Depict results of anova graphically       'dotchart'
       *'plot'*             Plot effects of predictors                
       *'gendata'*          Generate data frame with predictor        'expand.grid'
                            combinations (optionally interactively)   
       *'predict'*          Obtain predicted values or design matrix  
       *'fastbw'*           Fast backward step-down variable          'step'
                            selection                                 
       *'residuals'*        Residuals, influence statistics from fit  
       (or *'resid'*)                                                 
       *'which.influence'*  Which observations are overly             'residuals'
                            influential                               
       *'sensuc'*           Sensitivity of one binary predictor in    
                            lrm and cph models to an unmeasured       
                            binary confounder                         
       *'latex'*            LaTeX representation of fitted            
                            model or 'anova' or 'summary' table       
       *'Function'*         S function analytic representation        'Function.transcan'
                            of a fitted regression model (X*Beta)     
       *'hazard'*           S function analytic representation        'rcspline.restate'
                            of a fitted hazard function (for 'psm')   
       *'Survival'*         S function analytic representation of     
                            fitted survival function (for 'psm,cph')  
       *'Quantile'*         S function analytic representation of     
                            fitted function for quantiles of          
                            survival time (for 'psm, cph')            
       *'nomogram'*         Draws a nomogram for the fitted model     'latex, plot'
       *'survest'*          Estimate survival probabilities           'survfit'
                            (for 'psm, cph')                          
       *'survplot'*         Plot survival curves (for 'psm, cph')     'plot.survfit'
       *'validate'*         Validate indexes of model fit using       'val.prob'
                            resampling                                
       *'calibrate'*        Estimate calibration curve for model      
                            using resampling                          
       *'vif'*              Variance inflation factors for a fit      
       *'naresid'*          Bring elements corresponding to missing   
                            data back into predictions and residuals  
       *'naprint'*          Print summary of missing values           
       *'pentrace'*         Find optimum penalty for penalized MLE    
       *'effective.df'*     Print effective d.f. for each type of     
                            variable in model, for penalized fit or   
                            pentrace result                           
       *'rm.impute'*        Impute repeated measures data with        'transcan',
                            non-random dropout                        'fit.mult.impute'
                            _experimental, non-functional_            

_B_a_c_k_g_r_o_u_n_d _f_o_r _E_x_a_m_p_l_e_s:

     The following programs demonstrate how the pieces of the Design
     package work together.  A (usually) one-time call to the function
     'datadist' requires a pass over the entire data frame to store
     distribution summaries for potential predictor variables.  These
     summaries contain (by default) the .25 and .75 quantiles of
     continuous variables (for estimating effects such as odds ratios),
     the 10th smallest and 10th largest values (or .1 and .9 quantiles
     for small n) for plotting ranges for estimated curves, and the
     total range.  For discrete numeric variables (those having <=10
     unique values), the list of unique values is also stored.  Such
     summaries are used by the 'summary.Design, plot.Design', and
     'nomogram.Design' functions.  You may save time and defer running
     'datadist'.  In that case, the distribution summary is not stored
     with the fit object, but it can be gathered before running
     'summary' or 'plot'.

      'd <- datadist(my.data.frame)  # or datadist(x1,x2)'
      'options(datadist="d")  # omit this, or use options(datadist=NULL),'
      '                       # if you have not yet run datadist'
      'f <- ols(y ~ x1 * x2)'
      'anova(f)'
      'fastbw(f)'
      'predict(f, newdata)'
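
      If running 'datadist' is deferred, its summaries may be supplied
      after the fit (a minimal sketch continuing the code above):

      'f <- ols(y ~ x1 * x2)         # no datadist in effect yet'
      'd <- datadist(my.data.frame)'
      'options(datadist="d")         # now summary, plot, etc. work'
      'summary(f)'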

     In the *Examples* section there are three detailed examples using
      a fitting function designed to be used with Design, 'lrm'
      (logistic regression model).  In *Detailed Example 1* we create 3
      predictor variables and two binary responses for 500 subjects. 
     For the first binary response, 'dz', the true model involves only
     'sex' and 'age', and there is a nonlinear interaction between the
     two because the log odds is a truncated linear relationship in
     'age' for females and a quadratic function for males.  For the
     second binary outcome, 'dz.bp', the true population model also
     involves systolic blood pressure ('sys.bp') through a truncated
     linear relationship.  First, nonparametric estimation of
     relationships is done using the Hmisc library's 'plsmo' function
     which uses 'lowess' with outlier detection turned off for binary
     responses.  Then parametric modeling is done using restricted
     cubic splines.  This modeling does not assume that we know the
     true transformations for 'age' or 'sys.bp' but that these
     transformations are smooth (which is not actually the case in the
     population).

      For *Detailed Example 2*, suppose that a categorical variable
      'treat' has values '"a", "b"', and '"c"', an ordinal variable
     'num.diseases' has values 0,1,2,3,4, and that there are two
     continuous variables, 'age' and 'cholesterol'.  'age' is fitted
     with a restricted cubic spline, while 'cholesterol' is transformed
     using the transformation 'log(cholesterol - 10)'.  Cholesterol is
     missing on three subjects, and we impute these using the overall
     median cholesterol.  We wish to allow for interaction between
     'treat' and 'cholesterol'.  The following S program will fit a
     logistic model, test all effects in the design, estimate effects,
     and plot estimated transformations. The fit for 'num.diseases'
     really considers the variable to be a 5-level categorical
     variable. The only difference is that a 3 d.f. test of linearity
     is done to assess whether the variable can be re-modeled "asis". 
     Here we also show statements to attach the Design library and
     store predictor characteristics from datadist.

     *Detailed Example 3* shows some of the survival analysis
     capabilities of Design related to the Cox proportional hazards
     model.  We simulate data for 2000 subjects with 2 predictors,
     'age' and 'sex'.  In the true population model, the log hazard
     function is linear in 'age' and there is no 'age' x 'sex'
      interaction.  In the analysis below we do not make use of the
      linearity in age.  Design makes use of many of Terry Therneau's
      survival functions that are built into S.

     The following is a typical sequence of steps that would be used
     with Design in conjunction with the Hmisc 'transcan' function to
     do single imputation of all NAs in the predictors (multiple
     imputation would be better but would be harder to do in the
     context of bootstrap model validation), fit a model, do backward
     stepdown to reduce the number of predictors in the model (with all
     the severe problems this can entail), and use the bootstrap to
     validate this stepwise model, repeating the variable selection for
     each re-sample.  Here we take a short cut as the imputation is not
     repeated within the bootstrap.

     In what follows we (atypically) have only 3 candidate predictors. 
     In practice be sure to have the validate and calibrate functions
     operate on a model fit that contains all predictors that were
     involved in previous analyses that used the response variable. 
     Here the imputation is necessary because backward stepdown would
     otherwise delete observations missing on any candidate variable.

     Note that you would have to define 'x1, x2, x3, y' to run the
     following code.

     'xt <- transcan(~ x1 + x2 + x3, imputed=TRUE)'
      'impute(xt)  # imputes any NAs in x1, x2, x3'
      '# Now fit original full model on filled-in data'
      'f <- lrm(y ~ x1 + rcs(x2,4) + x3, x=TRUE, y=TRUE)'
      '# x=TRUE, y=TRUE allow bootstrap validation'
      'fastbw(f)'
      '# derives stepdown model (using default stopping rule)'
      'validate(f, B=100, bw=TRUE) # repeats fastbw 100 times'
      'cal <- calibrate(f, B=100, bw=TRUE)  # also repeats fastbw'
      'plot(cal)'

_C_o_m_m_o_n _P_r_o_b_l_e_m_s _t_o _A_v_o_i_d:

        1.  Don't use a formula like 'y ~ age + age^2'. In S you need
           to connect related variables using a function which produces
           a matrix, such as 'pol' or 'rcs'. This allows effect
           estimates (e.g., hazard ratios) to be computed as well as
           multiple d.f. tests of association (see the sketch following
           this list).

        2.  Don't use 'poly' or 'strata' inside formulas used in
           Design.  Use 'pol' and 'strat' instead.

        3.  Almost never code your own dummy variables or interaction
           variables in S.  Let S do this automatically.  Otherwise,
           'anova' can't do its job.

        4.  Almost never transform predictors outside of the model
           formula, as then plots of predicted values vs. predictor
           values, and other displays, would not be made on the
           original scale.  Use instead something like 'y ~
           log(cell.count+1)', which will allow 'cell.count' to appear
           on x-axes.  You can get fancier, e.g., 'y ~
           rcs(log(cell.count+1),4)' to fit a restricted cubic spline
           with 4 knots in 'log(cell.count+1)'. For more complex
           transformations do something like 
             'f <- function(x) {'
             '  # ... various if statements, etc.'
             '  log(pmin(x,50000)+1)'
             '}'
             'fit1 <- lrm(death ~ f(cell.count))'
             'fit2 <- lrm(death ~ rcs(f(cell.count),4))'

        5.  Don't put '$' inside variable names used in formulas.
           Either attach data frames or use 'data='.

        6.  Don't forget to use 'datadist'.  Try to use it at the top
           of your program so that all model fits can automatically
           take advantage of its distributional summaries for the
           predictors.

        7.  Don't 'validate' or 'calibrate' models which were reduced
           by dropping "insignificant" predictors. Proper bootstrap or
           cross-validation must repeat any variable selection steps
           for each re-sample. Therefore, 'validate' or 'calibrate'
           models which contain all candidate predictors, and if you
           must reduce models, specify the option 'bw=TRUE' to
           'validate' or 'calibrate'.

        8.  Dropping of "insignificant" predictors ruins much of the
           usual statistical inference for regression models
           (confidence limits, standard errors, P-values, chi-squares,
           ordinary indexes of model performance) and it also results
           in models which will have worse predictive discrimination.
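
      A minimal sketch of problem 1 (with hypothetical 'y' and 'age'):

      '# Not y ~ age + I(age^2): the two terms would be unconnected'
      'f <- lrm(y ~ pol(age,2))  # quadratic in age as a single factor'
      'anova(f)   # pooled 2 d.f. test of association with age,'
      '           # plus a 1 d.f. test of linearity'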

_A_c_c_e_s_s_i_n_g _t_h_e _L_i_b_r_a_r_y:

     If you are using any of Design's survival analysis functions,
     create a file called '.Rprofile' in your working directory that
     contains the line 'library(survival)'.  That way, survival will
     move down the search list as Hmisc and Design are attached during
      your session.  This will allow Hmisc and Design to override some
      of the survival functions such as 'survfit'.

     Since the Design library has a '.First.lib' function, that
     function will be executed by the 'library' command, to dynamically
     load the '.o' or '.obj' files.  You may want to create a '.First'
     function such as

      '.First <- function() {'
      'options(na.action = "na.delete")'
      '# gives more info than na.omit'
      'library(Hmisc)'
      'library(Design)'
      'invisible()'
      '}'

_P_u_b_l_i_s_h_e_d _A_p_p_l_i_c_a_t_i_o_n_s _o_f _D_e_s_i_g_n _a_n_d _R_e_g_r_e_s_s_i_o_n _S_p_l_i_n_e_s:

        *  Spline fits

           1.  Spanos A, Harrell FE, Durack DT (1989): Differential
              diagnosis of acute meningitis: An analysis of the
               predictive value of initial observations.  _JAMA_
               262:2700-2707.

           2.  Ohman EM, Armstrong PW, Christenson RH, _et al_. (1996):
              Cardiac troponin T levels for risk stratification in
              acute myocardial ischemia.  _New Eng J Med_
              335:1333-1341.


        *  Bootstrap calibration curve for a parametric survival model:

           1.  Knaus WA, Harrell FE, Fisher CJ, Wagner DP, _et al_.
              (1993):  The clinical evaluation of new drugs for sepsis:
              A prospective study design based on survival analysis. 
              _JAMA_ 270:1233-1241.


        *  Splines, interactions with splines, algebraic form of fitted
           model from 'latex.Design'

           1.  Knaus WA, Harrell FE, Lynn J, et al. (1995): The SUPPORT
              prognostic model: Objective estimates of survival for
              seriously ill hospitalized adults.  _Annals of Internal
              Medicine_ 122:191-203.


        *  Splines, odds ratio chart from fitted model with nonlinear
           and interaction terms, use of 'transcan' for imputation

           1.  Lee KL, Woodlief LH, Topol EJ, Weaver WD, Betriu A, Col
               J, Simoons M, Aylward P, Van de Werf F, Califf RM.
              Predictors of 30-day mortality in the era of reperfusion
              for acute myocardial infarction: results from an
              international trial of 41,021 patients. _Circulation_
              1995;91:1659-1668.


        *  Splines, external validation of logistic models, prediction
           rules using point tables

           1.  Steyerberg EW, Hargrove YV, _et al_ (2001): Residual
              mass histology in testicular cancer: development and
              validation of a clinical prediction rule.  _Stat in Med_
              2001;20:3847-3859.

           2.  van Gorp MJ, Steyerberg EW, _et al_ (2003): Clinical
              prediction rule for 30-day mortality in Bjork-Shiley
              convexo-concave valve replacement.  _J Clinical
              Epidemiology_ 2003;56:1006-1012.


        *  Model fitting, bootstrap validation, missing value
           imputation

           1.  Krijnen P, van Jaarsveld BC, Steyerberg EW, Man in 't
               Veld AJ, Schalekamp MADH, Habbema JDF (1998): A clinical
               prediction rule for renal artery stenosis. _Annals of
              Internal Medicine_ 129:705-711.


        *  Model fitting, splines, bootstrap validation, nomograms

           1.  Kattan MW, Eastham JA, Stapleton AMF, Wheeler TM,
              Scardino PT.  A preoperative nomogram for disease
              recurrence following radical prostatectomy for prostate
              cancer.  _J Natl Ca Inst_ 1998; 90(10):766-771.

           2.  Kattan MW, Wheeler TM, Scardino PT.  A postoperative
              nomogram for disease recurrence following radical
              prostatectomy for prostate cancer. _J Clin Oncol_ 1999;
              17(5):1499-1507

           3.  Kattan MW, Zelefsky MJ, Kupelian PA, Scardino PT,  Fuks
              Z, Leibel SA.  A pretreatment nomogram for predicting the
              outcome of three-dimensional conformal radiotherapy in
              prostate cancer.   _J Clin Oncol_ 2000; 18(19):3252-3259.

           4.  Eastham JA, May R, Robertson JL, Sartor O, Kattan MW. 
              Development of a nomogram which predicts the probability
              of a positive prostate biopsy in men with an abnormal
              digital rectal examination and a prostate specific
              antigen between 0 and 4 ng/ml. _Urology_. (In press).

           5.  Kattan MW, Heller G, Brennan MF.  A competing-risk
               nomogram for sarcoma-specific death following local
               recurrence. _Stat in Med_ 2003; 22:3515-3525.


        *  Nomogram with 2- and 5-year survival probability and median
           survival time (but watch out for the use of univariable
           screening)

           1.  Clark TG, Stewart ME, Altman DG, Smyth JF.  A prognostic
              model for ovarian cancer.  _Br J Cancer_ 2001; 85:944-52.


        *  Comprehensive example of parametric survival modeling with
           an extensive nomogram, time ratio chart, anova chart,
           survival curves generated using survplot, bootstrap
           calibration curve

           1.  Teno JM, Harrell FE, Knaus WA, et al.  Prediction of
              survival for older hospitalized patients: The HELP
              survival model.  _J Am Geriatrics Soc_ 2000; 48: S16-S24.


        *  Model fitting, imputation, and several nomograms expressed
           in tabular form

           1.  Hasdai D, Holmes DR, et al.  Cardiogenic shock
              complicating acute myocardial infarction: Predictors of
              death. _Am Heart J_ 1999; 138:21-31.


        *  Ordinal logistic model with bootstrap calibration plot

           1.  Wu AW, Yasui U, Alzola CF _et al_.  Predicting
              functional status outcomes in hospitalized patients aged
              80 years and older.  _J Am Geriatric Society_ 2000;
              48:S6-S15.


        *  Propensity modeling in evaluating medical diagnosis, anova
           dot chart

           1.  Weiss JP, Gruver C, et al.  Ordering an echocardiogram 
              for evaluation of left ventricular function: Level of
              expertise necessary for efficient use. _J Am Soc 
              Echocardiography_ 2000; 13:124-130.


        *  Simulations using Design to study the properties of various
           modeling strategies

           1.  Steyerberg EW, Eijkemans MJC, Habbema JDF.  Stepwise
              selection in small data sets: A simulation study of bias
              in logistic regression analysis.  _J Clin Epi_ 1999;
              52:935-942.

           2.  Steyerberg EW, Eijkemans MJC, Harrell FE, Habbema JDF.
              Prognostic modeling with logistic regression analysis: In
              search of a sensible strategy in small data sets.  _Med
              Decision Making_ 2001; 21:45-56.


        *  Statistical methods and references related to Design, along
           with case studies which include the Design code that
           produced the analyses

           1.  Harrell FE, Lee KL, Mark DB (1996): Multivariable
              prognostic models: Issues in developing models,
              evaluating assumptions and adequacy, and measuring and
              reducing errors.  _Stat in Med_ 15:361-387.

           2.  Harrell FE, Margolis PA, Gove S, Mason KE, Mulholland EK
              et al. (1998): Development of a clinical prediction model
              for an ordinal outcome: The World Health Organization ARI
              Multicentre Study of clinical signs and etiologic agents
              of pneumonia, sepsis, and meningitis in young infants.
              _Stat in Med_ 17:909-944.

           3.  Bender R, Benner A (2000): Calculating ordinal
              regression models in SAS and S-Plus.  _Biometrical J_
              42:677-699.

_B_u_g _R_e_p_o_r_t_s:

     The author is willing to help with problems.  Send E-mail to
     f.harrell@vanderbilt.edu.  To report bugs, please do the
     following:

        1.  If the bug occurs when running a function on a fit object
           (e.g., 'anova'), attach a 'dump''d text version of the fit
           object to your note.  If you used 'datadist' but not until
           after the fit was created, also send the object created by
           'datadist'.  Example: 'dump("myfit","/tmp/dumpdata")' will
           create a text file called '"dumpdata"' that can be attached
           to the E-mail.  

        2.  If the bug occurs during a model fit (e.g., with 'lrm, ols,
           psm, cph'), send the statement causing the error with a
           'dump''d version of the data frame used in the fit.  If this
           data frame is very large, reduce it to a small subset which
           still causes the error.

_C_o_p_y_r_i_g_h_t _N_o_t_i_c_e:

     GENERAL DISCLAIMER  This program is free software; you can
     redistribute it and/or modify it under the terms of the GNU
     General Public License as published by the Free Software
     Foundation; either version 2, or (at your option) any later
     version.

     This program is distributed in the hope that it will be useful,
     but WITHOUT ANY WARRANTY; without even the implied warranty of
     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
     General Public License for more details.   In short: you may use
     this code any way you like, as long as you don't charge money for
     it, remove this notice, or hold anyone liable for its results. 
     Also, please acknowledge the source and communicate changes to the
     author.

      If this software is used in work presented for publication, kindly
      reference it using for example: Harrell FE (2003): Design: S
     functions for biostatistical/epidemiologic modeling, testing,
     estimation, validation, graphics, and prediction. Programs
     available from <URL: biostat.mc.vanderbilt.edu/s/Design.html>. Be
     sure to reference other libraries used as well as S-Plus or R
     itself.

_A_c_k_n_o_w_l_e_d_g_e_m_e_n_t_s:

     This work was supported by grants from the Agency for Health Care
     Policy and Research (US Public Health Service) and the Robert Wood
     Johnson Foundation.

_A_u_t_h_o_r(_s):

     Frank E Harrell Jr
      Professor of Biostatistics
      Chair, Department of Biostatistics
      Vanderbilt University School of Medicine
      Nashville, Tennessee
      f.harrell@vanderbilt.edu

_R_e_f_e_r_e_n_c_e_s:

     The primary resource for the Design library is _Regression
     Modeling Strategies_ by FE Harrell (Springer-Verlag, 2001) and the
     web pages <URL: http://biostat.mc.vanderbilt.edu/rms> and <URL:
     http://biostat.mc.vanderbilt.edu/s/Design.html>.  See also the
     Statistics in Medicine articles by Harrell _et al_ listed below
     for case studies of modeling and model validation using Design.
     Also see the free book by Alzola and Harrell at <URL:
     http://biostat.mc.vanderbilt.edu>.

     Several datasets useful for multivariable modeling with Design are
     found at <URL: http://biostat.mc.vanderbilt.edu/s/data>.

_E_x_a_m_p_l_e_s:

     ######################
     # Detailed Example 1 #
     ######################
     # May want to first invoke the Hmisc store function
     # so that new variables will go into a temporary directory
     set.seed(17)  # So can repeat random number sequence
     n <- 500

     sex    <- factor(sample(c('female','male'), n, rep=TRUE))
     age    <- rnorm(n, 50, 10)
     sys.bp <- rnorm(n, 120, 7)

     # Use two population models, one with a systolic
     # blood pressure effect and one without

     L    <- ifelse(sex=='female', .1*(pmin(age,50)-50), .005*(age-50)^2)
     L.bp <- L + .4*(pmax(sys.bp,120)-120)

     dz    <- ifelse(runif(n) <= plogis(L),    1, 0)
     dz.bp <- ifelse(runif(n) <= plogis(L.bp), 1, 0)

     # Use summary.formula in the Hmisc library to summarize the
     # data one predictor at a time

     s <- summary(dz.bp ~ age + sex + sys.bp) 
     options(digits=3)
     print(s)
     plot(s)

     plsmo(age, dz, group=sex, fun=qlogis, ylim=c(-3,3))
     plsmo(age, L,  group=sex, method='raw', add=TRUE, prefix='True', trim=0)
     title('Lowess-smoothed Estimates with True Regression Functions')

     dd <- datadist(age, sex, sys.bp)
     options(datadist='dd')
     # can also do: dd <- datadist(dd, newvar)

     f <- lrm(dz ~ rcs(age,5)*sex, x=TRUE, y=TRUE)
     f
     # x=TRUE, y=TRUE for pentrace

     fpred <- Function(f)
     fpred
     fpred(age=30, sex=levels(sex))

     anova(f)

     p <- plot(f, age=NA, sex=NA, conf.int=FALSE, ylim=c(-3,3))
     datadensity(p, age, sex)
     scat1d(age)

     plsmo(age, L, group=sex, method='raw', add=TRUE, prefix='True', trim=0)
     title('Spline Fits with True Regression Functions')

     f.bp <- lrm(dz.bp ~ rcs(age,5)*sex + rcs(sys.bp,5))

     for(method in c('persp','image')) 
       p <- plot(f.bp, age=NA, sys.bp=NA, method=method)
     # Legend(p)   # NOTE: Needs subplot - not in R

     cat('Doing 25 bootstrap repetitions to validate model\n')
     validate(f, B=25)   # in practice try to use 150

     cat('Doing 25 bootstrap reps to check model calibration\n')
     cal <- calibrate(f, B=25)   # use 150 in practice
     plot(cal)
     title('Calibration of Unpenalized Model')

     p <- if(.R.) pentrace(f, penalty=c(.009,.009903,.02,.2,.5,1)) else
                  pentrace(f, penalty=1, method='optimize')

     f <- update(f, penalty=p$penalty)
     f
     specs(f,long=TRUE)
     edf <- effective.df(f)

     p <- plot(f, age=NA, sex=NA, conf.int=FALSE, ylim=c(-3,3))
     datadensity(p, age, sex)
     scat1d(age)

     plsmo(age, L, group=sex, method='raw', add=TRUE, prefix='True', trim=0)
     title('Penalized Spline Fits with True Regression Functions')

     options(digits=3)
     s <- summary(f)
     s
     plot(s)

     s <- summary(f, sex='male')
     plot(s)

     fpred <- Function(f)
     fpred
     fpred(age=30, sex=levels(sex))
     sascode(fpred)

     cat('Doing 40 bootstrap reps to validate penalized model\n')
     validate(f, B=40)

     cat('Doing 40 bootstrap reps to check penalized model calibration\n')
     cal <- calibrate(f, B=40)
     plot(cal)
     title('Calibration of Penalized Model')

     nomogram(f.bp, fun=plogis,
              funlabel='Prob(dz)',
              fun.at=c(.15,.2,.3,.4,.5,.6,.7,.8,.9,.95,.975),
              fun.side=c(1,3,1,3,1,3,1,3,1,3,1))
     options(datadist=NULL)

      ######################
      # Detailed Example 2 #
      ######################
     # Simulate the data.  
     n <- 1000    # define sample size
     set.seed(17) # so can reproduce the results
     treat <- factor(sample(c('a','b','c'), n, TRUE))
     num.diseases <- sample(0:4, n, TRUE)
     age <- rnorm(n, 50, 10)
     cholesterol <- rnorm(n, 200, 25)
     weight <- rnorm(n, 150, 20)
     sex <- factor(sample(c('female','male'), n, TRUE))
     label(age) <- 'Age'      # label is in Hmisc
     label(num.diseases) <- 'Number of Comorbid Diseases'
     label(cholesterol) <- 'Total Cholesterol'
     label(weight) <- 'Weight, lbs.'
     label(sex) <- 'Sex'
     units(cholesterol) <- 'mg/dl'   # uses units.default in Hmisc

     # Specify population model for log odds that Y=1
     L <- .1*(num.diseases-2) + .045*(age-50) +
       (log(cholesterol - 10)-5.2)*(-2*(treat=='a') +
       3.5*(treat=='b')+2*(treat=='c'))
     # Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
     y <- ifelse(runif(n) < plogis(L), 1, 0)
     cholesterol[1:3] <- NA   # 3 missings, at random

     ddist <- datadist(cholesterol, treat, num.diseases,
                       age, weight, sex)
     # Could have used ddist <- datadist(data.frame.name)
     options(datadist="ddist") # defines data dist. to Design
     cholesterol <- impute(cholesterol) # see impute in Hmisc library
     # impute, describe, and several other basic functions are
     # distributed as part of the Hmisc library

     fit <- lrm(y ~ treat*log(cholesterol - 10) +
                scored(num.diseases) +  rcs(age))

     describe(y ~ treat + scored(num.diseases) + rcs(age))
     # or use describe(formula(fit)) for all variables used in fit
     # describe function (in Hmisc) gets simple statistics on variables
     #fit <- robcov(fit) # Would make all statistics which follow
                         # use a robust covariance matrix
                         # would need x=TRUE, y=TRUE in lrm
     specs(fit) # Describe the design characteristics
     a <- anova(fit)
     print(a, which='subscripts')          # print which parameters being tested
     plot(anova(fit)) # Depict Wald statistics graphically
     anova(fit, treat, cholesterol) # Test these 2 by themselves
     summary(fit) # Estimate effects using default ranges
     plot(summary(fit)) # Graphical display of effects with C.L.
     summary(fit, treat="b", age=60) 
     # Specify reference cell and adjustment val

     summary(fit, age=c(50,70)) # Estimate effect of increasing age from
                                # 50 to 70
     summary(fit, age=c(50,60,70)) # Increase age from 50 to 70, 
                                   # adjust to 60 when estimating 
                                   # effects of other factors
     # If had not defined datadist, would have to define
     # ranges for all var.

     # Estimate and test treatment (b-a) effect averaged
     # over 3 cholesterols
     contrast(fit, list(treat='b',cholesterol=c(150,200,250)),
                   list(treat='a',cholesterol=c(150,200,250)),
              type='average')
     # Remove type='average' to get 3 separate contrasts for b-a

     # Plot effects.  plot(fit) plots effects of all predictors,
     # showing values used for interacting factors as subtitles
     # The ref.zero parameter is helpful for showing effects of
     # predictors on a common scale for comparison of strength
     plot(fit, ref.zero=TRUE, ylim=c(-2,2))

     plot(fit, age=seq(20,80,length=100), treat=NA, conf.int=FALSE)
     # Plots relationship between age and log
     # odds, separate curve for each treat, no C.I.
     plot(fit, age=NA, cholesterol=NA)
     # 3-dimensional perspective plot for age, cholesterol, and
     # log odds using default ranges for both variables
     plot(fit, num.diseases=NA, fun=function(x) 1/(1+exp(-x)),  #or fun=plogis
          ylab="Prob", conf.int=.9)   
     # Plot estimated probabilities instead of log odds
     # Again, if no datadist were defined, would have to
     # tell plot all limits
     logit <- predict(fit, expand.grid(treat="b",num.diseases=1:3,
                      age=c(20,40,60),
                      cholesterol=seq(100,300,length=10)))
     #logit <- predict(fit, gendata(fit, nobs=12))
     # Interactively specify 12 predictor combinations using UNIX
     # For UNIX or Windows, generate 9 combinations with other variables
     # set to defaults, get predicted values
     logit <- predict(fit, gendata(fit, age=c(20,40,60),
                      treat=c('a','b','c')))

     # Since age doesn't interact with anything, we can quickly and
     # interactively try various transformations of age,
     # taking the spline function of age as the gold standard. We are
     # seeking a linearizing transformation.  Here age is linear in the
     # population so this is not very productive.  Also, if we simplify the
     # model the total degrees of freedom will be too small and
     # confidence limits too narrow

     ag <- 10:80
     logit <- predict(fit, expand.grid(treat="a",
                      num.diseases=0, age=ag,
                      cholesterol=median(cholesterol)),
                      type="terms")[,"age"]
     # Note: if age interacted with anything, this would be the age
     #               "main effect" ignoring interaction terms
     # Could also use
     #   logit <- plot(f, age=ag, ...)$x.xbeta[,2]
     # which allows evaluation of the shape for any level
     # of interacting factors.  When age does not interact with
     # anything, the result from
     # predict(f, ..., type="terms") would equal the result from
     # plot if all other terms were ignored
     # Could also use
     #   logit <- predict(fit, gendata(fit, age=ag, cholesterol=median...))

     plot(ag^.5, logit)  # try square root vs. spline transform.
     plot(ag^1.5, logit) # try 1.5 power

     # w <- latex(fit)  # invokes latex.lrm, creates fit.tex
     # print(w)         # display or print model on screen

     # Draw a nomogram for the model fit
     nomogram(fit, fun=plogis, funlabel="Prob[Y=1]")

     # Compose S function to evaluate linear predictors from fit
     g <- Function(fit)
     g(treat='b', cholesterol=260, age=50)
     # Leave num.diseases at reference value

     # Use the Hmisc dataRep function to summarize sample
     # sizes for subjects as cross-classified on 2 key
     # predictors
     drep <- dataRep(~ roundN(age,10) + num.diseases)
     print(drep, long=TRUE)

     # Some approaches to making a plot showing how
     # predicted values vary with a continuous predictor
     # on the x-axis, with two other predictors varying

     fit <- lrm(y ~ log(cholesterol - 10) + 
                num.diseases + rcs(age) + rcs(weight) + sex)

     combos <- gendata(fit, age=10:100,
                       cholesterol=c(170,200,230),
                       weight=c(150,200,250))
     # num.diseases, sex not specified -> set to mode
      # can also use expand.grid

     combos$pred <- predict(fit, combos)
     library(lattice)
     xyplot(pred ~ age | cholesterol*weight, data=combos)
     xYplot(pred ~ age | cholesterol, groups=weight,
            data=combos, type='l') # in Hmisc
     xYplot(pred ~ age, groups=interaction(cholesterol,weight),
            data=combos, type='l')

     # Can also do this with plot.Design but a single
     # plot may be busy:
     ch <- c(170, 200, 230)
     plot(fit, age=NA, cholesterol=ch, weight=150,
          conf.int=FALSE)
     plot(fit, age=NA, cholesterol=ch, weight=200,
          conf.int=FALSE, add=TRUE)
     plot(fit, age=NA, cholesterol=ch, weight=250,
          conf.int=FALSE, add=TRUE)

     #Here we use plot.Design to make 9 separate plots, with CLs
     d <- expand.grid(cholesterol=c(170,200,230),
                      weight=c(150,200,250))
     for(i in 1:nrow(d)) {
       plot(fit, age=NA, cholesterol=d$cholesterol[i],
            weight=d$weight[i])
       title(paste('Chol=',format(d$cholesterol[i]),' ',
                   'Wt=',format(d$weight[i]),sep=''))
     }
     options(datadist=NULL)

     ######################
     # Detailed Example 3 #
     ######################
     n <- 2000
     set.seed(731)
     age <- 50 + 12*rnorm(n)
     label(age) <- "Age"
     sex <- factor(sample(c('Male','Female'), n, 
                   rep=TRUE, prob=c(.6, .4)))
     cens <- 15*runif(n)
     h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
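      # Generate exponential survival times by the inverse transform;
      # h is each subject's constant hazard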
     t <- -log(runif(n))/h
     label(t) <- 'Follow-up Time'
     e <- ifelse(t<=cens,1,0)
     t <- pmin(t, cens)
     units(t) <- "Year"
     age.dec <- cut2(age, g=10, levels.mean=TRUE)
     dd <- datadist(age, sex, age.dec)
     options(datadist='dd')
     Srv <- Surv(t,e)

     # Fit a model that doesn't assume anything except
     # that deciles are adequate representations of age
     f <- cph(Srv ~ strat(age.dec)+strat(sex), surv=TRUE)
     # surv=TRUE speeds up computations, and confidence limits when
     # there are no covariables are still accurate.

     # Plot log(-log 3-year survival probability) vs. mean age
     # within age deciles and vs. sex
     plot(f, age.dec=NA, sex=NA, time=3, 
          loglog=TRUE, val.lev=TRUE, ylim=c(-5,-1))

     # Fit a model assuming proportional hazards for age and
     # absence of age x sex interaction
     f <- cph(Srv ~ rcs(age,4)+strat(sex), surv=TRUE)
     survplot(f, sex=NA, n.risk=TRUE)
      # Add ,age=60 after sex=NA to tell survplot to use age=60
     # Validate measures of model performance using the bootstrap
     # First must add data (design matrix and Srv) to fit object
     f <- update(f, x=TRUE, y=TRUE)
     validate(f, B=10, dxy=TRUE, u=5)  # use t=5 for Dxy (only)
     # Use B=150 in practice
     # Validate model for accuracy of predicting survival at t=1
      # Get Kaplan-Meier estimates by dividing subjects into groups
     # of size 200 (for other values of u must put time.inc=u in
     # call to cph)
     cal <- calibrate(f, B=10, u=1, m=200)  # B=150 in practice
     plot(cal)
     # Check proportional hazards assumption for age terms
     z <- cox.zph(f, 'identity')
     print(z); plot(z)

     # Re-fit this model without storing underlying survival
     # curves for reference groups, but storing raw data with
     # the fit (could also use f <- update(f, surv=FALSE, x=TRUE, y=TRUE))
     f <- cph(Srv ~ rcs(age,4)+strat(sex), x=TRUE, y=TRUE) 
     # Get accurate C.L. for any age
     # Note: for evaluating shape of regression, we would not ordinarily
     # bother to get 3-year survival probabilities - would just use X * beta
     # We do so here to use same scale as nonparametric estimates
     f
     anova(f)
     ages <- seq(20, 80, by=4)   # Evaluate at fewer points. Default is 100
                                 # For exact C.L. formula n=100 -> much memory
     plot(f, age=ages, sex=NA, time=3, loglog=TRUE, ylim=c(-5,-1))

     # Fit a model assuming proportional hazards for age but
     # allowing for general interaction between age and sex
     f <- cph(Srv ~ rcs(age,4)*strat(sex), x=TRUE, y=TRUE)
     anova(f)
     ages <- seq(20, 80, by=6)   
     # Still fewer points - more parameters in model

     # Plot 3-year survival probability (log-log and untransformed)
     # vs. age and sex, obtaining accurate confidence limits
     plot(f, age=ages, sex=NA, time=3, loglog=TRUE, ylim=c(-5,-1))
     plot(f, age=ages, sex=NA, time=3)
     # Having x=TRUE, y=TRUE in fit also allows computation of influence stats
     r <- resid(f, "dfbetas")
     which.influence(f)
     # Use survest to estimate 3-year survival probability and
     # confidence limits for selected subjects
     survest(f, expand.grid(age=c(20,40,60), sex=c('Female','Male')),
             times=c(2,4,6), conf.int=.95)

     # Create an S function srv that computes fitted
     # survival probabilities on demand, for non-interaction model
     f <- cph(Srv ~ rcs(age,4)+strat(sex), surv=TRUE)
     srv <- Survival(f)
     # Define functions to compute 3-year estimates as a function of
     # the linear predictors (X*Beta)
     surv.f <- function(lp) srv(3, lp, stratum="sex=Female")
     surv.m <- function(lp) srv(3, lp, stratum="sex=Male")
     # Create a function that computes quantiles of survival time
     # on demand
     quant <- Quantile(f)
     # Define functions to compute median survival time
     med.f <- function(lp) quant(.5, lp, stratum="sex=Female")
     med.m <- function(lp) quant(.5, lp, stratum="sex=Male")
     # Draw a nomogram to compute several types of predicted values
     nomogram(f, fun=list(surv.m, surv.f, med.m, med.f),
              funlabel=c("S(3 | Male)","S(3 | Female)",
                         "Median (Male)","Median (Female)"),
              fun.at=list(c(.8,.9,.95,.98,.99),c(.1,.3,.5,.7,.8,.9,.95,.98),
                        c(8,12),c(1,2,4,8,12)))
     options(datadist=NULL)

     ########################################################
     # Simple examples using small datasets for checking    #
     # calculations across different systems in which random#
     # number generators cannot be synchronized.            #
     ########################################################

     x1 <- 1:20
     x2 <- abs(x1-10)
     x3 <- factor(rep(0:2,length.out=20))
     y  <- c(rep(0:1,8),1,1,1,1)
     dd <- datadist(x1,x2,x3)
     options(datadist='dd')
     f  <- lrm(y ~ rcs(x1,3) + x2 + x3)
     f
     specs(f, TRUE)
     anova(f)
     anova(f, x1, x2)
     plot(anova(f))
     s <- summary(f)
     s
     plot(s, log=TRUE)
     par(mfrow=c(2,2))
     plot(f)
     par(mfrow=c(1,1))
     nomogram(f)
     g <- Function(f)
     g(11,7,'1')
     contrast(f, list(x1=11,x2=7,x3='1'), list(x1=10,x2=6,x3='2'))
     fastbw(f)
     gendata(f, x1=1:5)
     # w <- latex(f)

     f <- update(f, x=TRUE,y=TRUE)
     which.influence(f)
     residuals(f,'gof')
     robcov(f)$var
     validate(f, B=10)
     cal <- calibrate(f, B=10)
     plot(cal)

     f <- ols(y ~ rcs(x1,3) + x2 + x3, x=TRUE, y=TRUE)
     anova(f)
     anova(f, x1, x2)
     plot(anova(f))
     s <- summary(f)
     s
     plot(s, log=TRUE)
     par(mfrow=c(2,2))
     plot(f)
     par(mfrow=c(1,1))
     nomogram(f)
     g <- Function(f)
     g(11,7,'1')
     contrast(f, list(x1=11,x2=7,x3='1'), list(x1=10,x2=6,x3='2'))
     fastbw(f)
     gendata(f, x1=1:5)
     # w <- latex(f)

     f <- update(f, x=TRUE,y=TRUE)
     which.influence(f)
     residuals(f,'dfbetas')
     robcov(f)$var
     validate(f, B=10)
     cal <- calibrate(f, B=10)
     plot(cal)

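      # Artificial survival times; single-argument Surv treats every
      # observation as an event (no censoring)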
     S <- Surv(c(1,4,2,3,5,8,6,7,20,18,19,9,12,10,11,13,16,14,15,17))
     survplot(survfit(S ~ x3))
     f <- psm(S ~ rcs(x1,3)+x2+x3, x=TRUE,y=TRUE)
     f
     # NOTE: LR chi-sq of 39.67 disagrees with that from old survreg
     # and old psm (77.65); suspect were also testing sigma=1

     for(w in c('survival','hazard'))
      print(survest(f, data.frame(x1=7,x2=3,x3='1'), 
            times=c(5,7), conf.int=.95, what=w))
     # S-Plus 2000 using old survival library:
     #  S(t):.925 .684 SE:0.729 0.556 Hazard:0.0734 0.255

     plot(f, x1=NA, time=5)
     f$var
     set.seed(3)
     # robcov(f)$var when score residuals implemented
     bootcov(f, B=30)$var
     validate(f, B=10)
     cal <- calibrate(f, u=5, B=10, m=10)
     plot(cal)
     r <- resid(f)
     survplot(r)

     f <- cph(S ~ rcs(x1,3)+x2+x3, x=TRUE,y=TRUE,surv=TRUE,time.inc=5)
     f
     plot(f, x1=NA, time=5)
     robcov(f)$var
     bootcov(f, B=10)
     validate(f, B=10)
     cal <- calibrate(f, u=5, B=10, m=10)
     survplot(f, x1=c(2,19))
     options(datadist=NULL)

