datadist               package:Design               R Documentation

_D_i_s_t_r_i_b_u_t_i_o_n _S_u_m_m_a_r_i_e_s _f_o_r _P_r_e_d_i_c_t_o_r _V_a_r_i_a_b_l_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     For a given set of variables or a data frame, determines summaries
     of variables for effect and plotting ranges, values to adjust to,
     and overall ranges for 'plot.Design', 'summary.Design',
     'survplot', and 'nomogram.Design'. If 'datadist' is called before
     a model fit and the resulting object pointed to with
     'options(datadist="name")', the data characteristics will be
     stored with the fit by 'Design()', so that later predictions and
     summaries of the fit will not need to access the original data
     used in the fit.  Alternatively, you can specify the values for
     each variable in the model when using these functions, or
     specify the values of some of them and let the functions look up
     the remainder (say, the adjustment levels) from an object created
     by 'datadist'. The best method is probably to run 'datadist' once
     before any models are fitted, storing the distribution summaries
     for all potential variables. Adjustment values are '0' for binary
     variables, the most frequent category (or optionally the first
     category level) for categorical ('factor') variables, the middle
     level for 'ordered factor' variables, and medians for continuous
     variables. See descriptions of 'q.display' and 'q.effect' for how
     display and effect ranges are chosen for continuous variables.
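
     The adjustment rules above can be sketched in base R.  This is
     only an illustration, not the package's code: 'adjust_to' is a
     hypothetical helper, and taking the ceiling(k/2)-th of k levels as
     the "middle" level of an ordered factor is an assumption.

```r
# Hypothetical helper, not part of the package: picks an adjust-to value
# following the rules described in the text.
adjust_to <- function(x, adjto.cat = c("mode", "first")) {
  adjto.cat <- match.arg(adjto.cat)
  x <- x[!is.na(x)]
  if (is.ordered(x)) {
    lev <- levels(x)
    # Assumption: "middle" means the ceiling(k/2)-th of k levels
    return(lev[ceiling(length(lev) / 2)])
  }
  if (is.factor(x)) {
    if (adjto.cat == "first") return(levels(x)[1])
    tab <- table(x)
    # which.max returns the first maximum, so ties go to the first level
    return(names(tab)[which.max(tab)])
  }
  if (all(x %in% c(0, 1))) return(0)   # binary 0/1 variable
  median(x)                            # continuous variable
}

# Made-up illustration data
age   <- c(31, 45, 52, 60, 74)
sex   <- c(0, 1, 1, 0, 1)
race  <- factor(c("b", "a", "b", "c", "b"))
grade <- factor(c("low", "mid", "high", "mid", "low"),
                levels = c("low", "mid", "high"), ordered = TRUE)

adjust_to(age)             # median: 52
adjust_to(sex)             # binary: 0
adjust_to(race)            # most frequent level: "b"
adjust_to(race, "first")   # first level: "a"
adjust_to(grade)           # middle level: "mid"
```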

_U_s_a_g_e:

     datadist(..., data, q.display, q.effect=c(0.25, 0.75),
              adjto.cat=c('mode','first'), n.unique=10)

     ## S3 method for class 'datadist':
     print(x, ...)
     # options(datadist="dd")
     # used by summary, plot, survplot, sometimes predict
     # For dd substitute the name of the result of datadist

_A_r_g_u_m_e_n_t_s:

     ...: a list of variable names, separated by commas, a single data
          frame, or a fit with 'Design' information.  The first element
          in this list may also be an object created by an earlier call
          to 'datadist'; then the later variables are added to this
          'datadist' object. For a fit object, the variables named in
          the fit are retrieved from the active data frame or from the
          location pointed to by 'data=frame number' or 'data="data
          frame name"'. Ignored by 'print'.

    data: a data frame or a search position.  If 'data' is a search
          position, it is assumed that a data frame is attached in that
          position, and all its variables are used.  If you specify
          both individual variables in '...' and 'data', the two sets
          of variables are combined.  Unless the first argument is a
          fit object, 'data' must be an integer. 

q.display: set of two quantiles for computing the range of continuous
          variables to use in displaying regression relationships. 
          Defaults are q and 1-q, where q=10/max(n,200), and n is the
          number of non-missing observations.  Thus for n<200, the .05
          and .95 quantiles are used.  For n>=200, the 10th
          smallest and 10th largest values are used.  If you specify
          'q.display', those quantiles are used whether or not n<200.

q.effect: set of two quantiles for computing the range of continuous
          variables to use in estimating regression effects.  Defaults
          are c(.25,.75), which yields inter-quartile-range odds
          ratios, etc. 

adjto.cat: default is '"mode"', indicating that the modal (most
          frequent) category for categorical (factor) variables is the
          adjust-to setting. Specify '"first"' to use the first level
          of factor variables as the adjustment values.  In the case of
          many levels having the maximum frequency, the first such
          level is used for '"mode"'. 

n.unique: variables having 'n.unique' or fewer unique values are
          considered to be discrete variables in that their unique
          values are stored in the 'values' list.  This will affect how
          functions such as 'nomogram.Design' determine whether
          variables are discrete or not. 

       x: result of 'datadist'
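
     As a rough illustration of the 'q.display' default described
     above, the following base R sketch computes q and 1-q for a small
     and a large sample.  'default_q_display' is a hypothetical helper,
     and the use of 'quantile' with its default type is an assumption
     about how the ranges are obtained.

```r
# Hypothetical helper illustrating the documented default for q.display
default_q_display <- function(x) {
  n <- sum(!is.na(x))          # number of non-missing observations
  q <- 10 / max(n, 200)
  c(q, 1 - q)
}

set.seed(1)
small <- rnorm(150)    # made-up data, n < 200
large <- rnorm(1000)   # made-up data, n >= 200

default_q_display(small)   # 0.05 and 0.95, since max(n, 200) is 200
default_q_display(large)   # 0.01 and 0.99, since 10/1000 is 0.01

# Display and effect ranges for one continuous variable
display_range <- quantile(large, default_q_display(large))
effect_range  <- quantile(large, c(0.25, 0.75))   # default q.effect
```

     With the defaults, the display range is always at least as wide
     as the effect range, because q never exceeds .05.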

_D_e_t_a_i_l_s:

     For categorical variables, the 7 limits are set to character
     strings (factors) which correspond to
     'c(NA,adjto.level,NA,1,k,1,k)', where 'k' is the number of levels.
     For ordered variables with numeric levels, the limits are set to
     'c(L,M,H,L,H,L,H)', where 'L' is the lowest level, 'M' is the
     middle level, and 'H' is the highest level.
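
     The two patterns above can be written out in base R as follows.
     Both helpers are hypothetical and exist only for illustration;
     the choice of the ceiling(k/2)-th level as the "middle" level is
     an assumption.

```r
# Hypothetical helper: the 7 limits for an ordered variable with numeric levels
ordered_limits <- function(lev) {
  lev <- sort(lev)
  L <- lev[1]
  H <- lev[length(lev)]
  M <- lev[ceiling(length(lev) / 2)]   # assumption: middle = ceiling(k/2)-th level
  c(L, M, H, L, H, L, H)
}

# Hypothetical helper: the 7 limits for an unordered factor, following
# the pattern c(NA, adjto.level, NA, 1, k, 1, k) with levels as strings
factor_limits <- function(f, adjto.level) {
  lev <- levels(f)
  k <- length(lev)
  c(NA, adjto.level, NA, lev[1], lev[k], lev[1], lev[k])
}

ord_lim <- ordered_limits(c(1, 2, 3, 4, 5))
fac_lim <- factor_limits(factor(c("a", "b", "c")), "b")
```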

_V_a_l_u_e:

     a list of class '"datadist"' with the following components

  limits: a 7 by k data frame, where k is the number of variables. The 7
          rows correspond to the low value for estimating the effect of
          the variable, the value to adjust the variable to when
          examining other variables, the high value for effect, low
          value for displaying the variable, the high value for
          displaying it, and the overall lowest and highest values. 

  values: a named list, with one vector of unique values for each
          numeric variable having no more than 'n.unique' unique values 
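
     To make the shape of these two components concrete, here is a
     base R sketch that builds a list resembling the return value for
     two numeric variables.  It does not call 'datadist'; the data,
     the helper 'seven_limits', the fixed .05/.95 display quantiles,
     and all row names other than "Adjust to" are assumptions made for
     illustration.

```r
# Made-up data: 'age' has 12 unique values, 'dose' has 3
dat <- data.frame(age  = c(31, 45, 52, 60, 74, 38, 66, 49, 57, 43, 70, 35),
                  dose = c(1, 2, 4, 1, 2, 4, 1, 2, 4, 1, 2, 4))
n.unique <- 10

# Hypothetical helper: the 7 limits for one numeric variable
seven_limits <- function(x, q.effect = c(0.25, 0.75), q.display = c(0.05, 0.95)) {
  qe <- quantile(x, q.effect, names = FALSE)
  qd <- quantile(x, q.display, names = FALSE)
  c(qe[1], median(x), qe[2], qd[1], qd[2], min(x), max(x))
}

limits <- as.data.frame(lapply(dat, seven_limits))
# Only "Adjust to" is taken from the examples; the other names are illustrative
rownames(limits) <- c("Low:effect", "Adjust to", "High:effect",
                      "Low:display", "High:display", "Low", "High")

# Keep unique values only for variables with n.unique or fewer of them
values <- Filter(function(v) length(v) <= n.unique,
                 lapply(dat, function(x) sort(unique(x))))

dd_like <- list(limits = limits, values = values)
dd_like$limits["Adjust to", "age"]   # same element as dd_like$limits$age[2]
```

     Here only 'dose' ends up in 'values', since 'age' has more than
     'n.unique' unique values.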

_A_u_t_h_o_r(_s):

     Frank Harrell
      Department of Biostatistics
      Vanderbilt University
      f.harrell@vanderbilt.edu

_S_e_e _A_l_s_o:

     'Design', 'Design.trans', 'describe', 'plot.Design',
     'summary.Design'

_E_x_a_m_p_l_e_s:

     ## Not run: 
     d <- datadist(data=1)         # use all variables in search pos. 1
     d <- datadist(x1, x2, x3)
     page(d)                       # if your options(pager) leaves up a pop-up
                                   # window, this is a useful guide in analyses
     d <- datadist(data=2)         # all variables in search pos. 2
     d <- datadist(data=my.data.frame)
     d <- datadist(my.data.frame)  # same as previous.  Run for all potential vars.
     d <- datadist(x2, x3, data=my.data.frame)   # combine variables
     d <- datadist(x2, x3, q.effect=c(.1,.9), q.display=c(0,1))
     # uses inter-decile range odds ratios,
     # total range of variables for regression function plots
     d <- datadist(d, z)           # add a new variable to an existing datadist
     options(datadist="d")         #often a good idea, to store info with fit
     f <- ols(y ~ x1*x2*x3)

     options(datadist=NULL)        #default at start of session
     f <- ols(y ~ x1*x2)
     d <- datadist(f)              #info not stored in `f'
     d$limits["Adjust to","x1"] <- .5   #reset adjustment level to .5
     options(datadist="d")

     f <- lrm(y ~ x1*x2, data=mydata)
     d <- datadist(f, data=mydata)
     options(datadist="d")

     f <- lrm(y ~ x1*x2)           #datadist not used - specify all values for
     summary(f, x1=c(200,500,800), x2=c(1,3,5))         # obtaining predictions
     plot(f, x1=200:800, x2=3)

     # Change reference value to get a relative odds plot for a logistic model
     d$limits$age[2] <- 30    # make 30 the reference value for age
     # Could also do: d$limits["Adjust to","age"] <- 30
     fit <- update(fit)   # make new reference value take effect
     plot(fit, age=NA, ref.zero=TRUE, fun=exp, ylab='Age=x:Age=30 Odds Ratio')
     ## End(Not run)

