dataRep                package:Hmisc                R Documentation

_R_e_p_r_e_s_e_n_t_a_t_i_v_e_n_e_s_s _o_f _O_b_s_e_r_v_a_t_i_o_n_s _i_n _a _D_a_t_a _S_e_t

_D_e_s_c_r_i_p_t_i_o_n:

     These functions are intended to be used to describe how well a
     given set of new observations (e.g., new subjects) was
     represented in a dataset used to develop a predictive model. The
     'dataRep' function forms a data frame that contains all the unique
     combinations of variable values that existed in a given
     dataset.  Cross-classifications of values are created
     using exact values of variables, so for continuous numeric
     variables it is often necessary to round them to the nearest 'v'
     and to possibly curtail the values to some lower and upper limit
     before rounding. Here 'v' denotes a numeric constant specifying
     the matching tolerance that will be used.  'dataRep' also stores
     marginal distribution summaries for all the variables.  For
     numeric variables, all 101 percentiles are stored, and for all
     variables, the frequency distributions are also stored
     (frequencies are computed after any rounding and curtailment of
     numeric variables).  For the purposes of rounding and curtailing,
     the 'roundN' function is provided.  A 'print' method will
     summarize the calculations made by 'dataRep', and if 'long=TRUE'
     all unique combinations of values and their frequencies in the
     original dataset are printed.
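
     The rounding, curtailment, and collapsing steps described above
     can be sketched in base R.  This is only an illustration under
     stated assumptions, not the Hmisc implementation: the helper
     'round_to_tol' and the data are hypothetical, and the rounding
     rule 'tol * round(x / tol)' is assumed from the description of
     'tol' and 'clip'.

```r
# Hypothetical helper, NOT the Hmisc implementation of roundN():
# curtail x to [clip[1], clip[2]] when clip is given, then round
# to the nearest multiple of tol.
round_to_tol <- function(x, tol = 1, clip = NULL) {
  if (!is.null(clip)) x <- pmin(pmax(x, clip[1]), clip[2])
  tol * round(x / tol)
}

# Made-up data for illustration only
age <- c(41, 43, 47, 52, 58, 61, 44, 39, 97)
sex <- c("f", "m", "f", "f", "m", "m", "m", "f", "f")

# Curtail ages to [40, 80], then round to the nearest 5
age_r <- round_to_tol(age, tol = 5, clip = c(40, 80))

# Collapse to the unique combinations that occur, with frequencies
combos <- as.data.frame(table(age = age_r, sex = sex),
                        stringsAsFactors = FALSE)
combos <- combos[combos$Freq > 0, ]
print(combos)
```

     Each remaining row of 'combos' is one observed combination of
     rounded age and sex, with 'Freq' giving the number of original
     observations that share it.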

     The 'predict' method for 'dataRep' takes a new data frame having
     variables named the same as the original ones (but whose factor
     levels are not necessarily in the same order) and examines the
     collapsed cross-classifications created by 'dataRep' to find how
     many observations were similar to each of the new observations
     after any rounding or curtailment of limits is done.  'predict'
     also does some calculations to describe how the variable values of
     the new observations "stack up" against the marginal distributions
     of the original data.  For categorical variables, the percent of
     observations having a given variable with the value of the new
     observation (after rounding for variables that were passed through
     'roundN' in the formula given to 'dataRep') is computed.  For
     numeric variables, the percentile of the original distribution in
     which the current value falls will be computed.  For this purpose,
     the data are not rounded because the 101 original percentiles were
     retained; linear interpolation is used to estimate percentiles for
     values between two tabulated percentiles. The lowest marginal
     frequency of matching values across all variables is also
     computed.  For example, if an age, sex combination matches 10
     subjects in the original dataset but the age value matches 100
     ages (after rounding) and the sex value matches the sex code of
     300 observations, the lowest marginal frequency is 100, which is a
     "best case" upper limit for multivariable matching.  I.e.,
     matching on all variables cannot result in a frequency higher than
     this amount. A 'print' method for the output of 'predict.dataRep'
     prints all calculations done by 'predict' by default. 
     Calculations can be selectively suppressed.
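
     The quantities described above can be sketched in base R for a
     single new observation.  This is a hedged illustration, not the
     code used by 'predict.dataRep': the data and variable names are
     made up, and 'quantile' with 'approx' stands in for the stored
     percentiles and the linear interpolation between them.

```r
# Made-up original data; a sketch, NOT the predict.dataRep() code
orig <- data.frame(
  age = c(41, 43, 47, 52, 58, 61, 44, 39),
  sex = c("f", "m", "f", "f", "m", "m", "m", "f")
)
tol <- 5
orig$age_r <- tol * round(orig$age / tol)   # ages rounded to nearest 5

# One hypothetical new observation
new_age <- 46
new_sex <- "m"
new_age_r <- tol * round(new_age / tol)

# Number of original observations matching on all variables
n_joint <- sum(orig$age_r == new_age_r & orig$sex == new_sex)

# Marginal matching frequencies, and the lowest of them
n_age <- sum(orig$age_r == new_age_r)
n_sex <- sum(orig$sex == new_sex)
lowest_marginal <- min(n_age, n_sex)

# Categorical variable: percent of originals sharing the new value
pct_sex <- 100 * n_sex / nrow(orig)

# Numeric variable: percentile of the unrounded value, interpolated
# linearly between the 101 tabulated percentiles (0, 1, ..., 100)
pctl <- quantile(orig$age, probs = (0:100) / 100, names = FALSE)
age_percentile <- approx(pctl, 0:100, xout = new_age,
                         ties = "ordered")$y
```

     In this sketch 'n_joint' can never exceed 'lowest_marginal',
     which is the upper-limit property described above.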

_U_s_a_g_e:

     dataRep(formula, data, subset, na.action)

     roundN(x, tol=1, clip=NULL)

     ## S3 method for class 'dataRep':
     print(x, long=FALSE, ...)

     ## S3 method for class 'dataRep':
     predict(object, newdata, ...)

     ## S3 method for class 'predict.dataRep':
     print(x, prdata=TRUE, prpct=TRUE, ...)

_A_r_g_u_m_e_n_t_s:

 formula: a formula with no left-hand-side.  Continuous numeric
          variables in need of rounding should appear in the formula as
          e.g. 'roundN(x,5)' to have a tolerance of e.g. +/- 2.5 in
          matching.  Factor or character variables as well as numeric
          ones not passed through 'roundN' are matched on exactly. 

       x: a numeric vector (for 'roundN'), or an object created by
          'dataRep' or 'predict.dataRep' (for the 'print' methods)

  object: the object created by 'dataRep'

data, subset, na.action: standard modeling arguments.  Default
          'na.action' is 'na.delete', i.e., observations in the
          original dataset having any variables missing are deleted up
          front. 

     tol: rounding constant (tolerance is actually 'tol/2' as values
          are rounded to the nearest 'tol') 

    clip: a 2-vector specifying a lower and upper limit to curtail
          values of 'x' before rounding 

    long: set to 'TRUE' to see all unique combinations and frequency
          count 

 newdata: a data frame containing all the variables given to 'dataRep'
          but not necessarily in the same order or having factor levels
          in the same order 

  prdata: set to 'FALSE' to suppress printing 'newdata' and the count
          of matching observations (plus the lowest marginal
          frequency).

   prpct: set to 'FALSE' to not print percentiles and percents

     ...: unused

_V_a_l_u_e:

     'dataRep' returns a list of class '"dataRep"' containing the
     collapsed data frame and frequency counts along with marginal
     distribution information.  'predict' returns an object of class
     '"predict.dataRep"' containing information determined by matching
     observations in 'newdata' with the original (collapsed) data.

_S_i_d_e _E_f_f_e_c_t_s:

     'print.dataRep' prints.

_A_u_t_h_o_r(_s):

     Frank Harrell 
      Department of Biostatistics 
      Vanderbilt University School of Medicine 
      f.harrell@vanderbilt.edu

_S_e_e _A_l_s_o:

     'round', 'table'

_E_x_a_m_p_l_e_s:

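     library(Hmisc)

     set.seed(13)
     num.symptoms <- sample(1:4, 1000, TRUE)
     sex <- factor(sample(c('female','male'), 1000, TRUE))
     x    <- runif(1000)
     x[1] <- NA   # one missing value; dropped by the default na.action
     # Cross-classification using x rounded to the nearest 0.25
     table(num.symptoms, sex, .25*round(x/.25))

     # Collapse to unique combinations, rounding x to the nearest 0.25
     d <- dataRep(~ num.symptoms + sex + roundN(x,.25))
     print(d, long=TRUE)

     # Compare three new observations with the original data
     predict(d, data.frame(num.symptoms=1:3, sex=c('male','male','female'),
                           x=c(.03,.5,1.5)))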

