summarize               package:Hmisc               R Documentation

_S_u_m_m_a_r_i_z_e _S_c_a_l_a_r_s _o_r _M_a_t_r_i_c_e_s _b_y _C_r_o_s_s-_C_l_a_s_s_i_f_i_c_a_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     'summarize' is a fast version of 'summary(formula,
     method="cross",overall=FALSE)' for producing stratified summary
     statistics and storing them in a data frame for plotting
     (especially with trellis 'xyplot' and 'dotplot' and Hmisc
     'xYplot').  Unlike 'aggregate', 'summarize' accepts a matrix as
     its first argument and a multi-valued 'FUN' argument, and it
     labels the variables in the new data frame using their original
     names.  Unlike methods based on 'tapply',
     'summarize' stores the values of the stratification variables
     using their original types, e.g., a numeric 'by' variable will
     remain a numeric variable in the collapsed data frame. 'summarize'
     also retains '"label"' attributes for variables. 'summarize' works
     especially well with the Hmisc 'xYplot' function for displaying
     multiple summaries of a single variable on each panel, such as
     means and upper and lower confidence limits.
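
     The point about stratification variables keeping their types can
     be seen with base R alone.  The following is a minimal sketch on
     made-up toy data, not part of Hmisc: 'tapply' keeps the values of
     a numeric grouping variable only as character names, so a data
     frame suitable for plotting has to be rebuilt by hand.

```r
# Toy data (made up for illustration): a numeric stratification variable
temperature <- c(61, 72, 68, 75, 70, 66)
month       <- c(1, 1, 2, 2, 3, 3)

m <- tapply(temperature, month, mean)

# The group values survive only as character names of the result
is.character(names(m))
is.numeric(month)

# Manual rebuild of a data frame with a numeric 'month' column
collapsed <- data.frame(month       = as.numeric(names(m)),
                        temperature = as.vector(m))
collapsed
```

     With 'summarize' this conversion step should not be needed,
     because the 'by' variables are stored using their original types.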

     'mApply' is like 'tapply' except that the first argument can be a
     matrix, and the output is cleaned up if 'simplify=TRUE'.  It uses
     code adapted from Tony Plate (tplate@blackmesacapital.com) to
     operate on grouped submatrices.
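
     The idea of operating on grouped submatrices can be sketched in
     base R as follows.  This is an illustration only and is not the
     code used by 'mApply'; the helper 'group_apply' and the function
     'auc' are hypothetical names introduced for this sketch, and only
     a single grouping vector is handled.

```r
# Hypothetical helper, not part of Hmisc: apply FUN to the submatrix of
# rows of X that belong to each value of one grouping vector.
group_apply <- function(X, INDEX, FUN, ...) {
  X    <- as.matrix(X)
  rows <- split(seq_len(nrow(X)), INDEX)        # row numbers per group
  res  <- lapply(rows, function(i) FUN(X[i, , drop = FALSE], ...))
  do.call(rbind, res)                           # one row per group
}

x <- cbind(time     = c(1, 2, 4, 7,  1, 3, 5, 10),
           response = c(1, 3, 2, 4,  1, 3, 2, 4))
subject <- rep(1:2, each = 4)

# Area under one subject's (time, response) curve, trapezoidal rule
auc <- function(m) {
  tm <- m[, "time"]
  r  <- m[, "response"]
  sum(diff(tm) * (r[-1] + r[-length(r)]) / 2)
}

out <- group_apply(x, subject, auc)
out   # rows correspond to subjects 1 and 2
```

     Because 'FUN' receives a whole submatrix, it can combine several
     columns (here time and response), which 'tapply' on a single
     vector cannot do directly.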

     As 'mApply' can be much faster than using 'by', it is often worth
     the trouble of converting a data frame to a numeric matrix for
     processing by 'mApply'.  'asNumericMatrix' will do this, and
     'matrix2dataFrame' will convert a numeric matrix back into a data
     frame if attributes and storage modes of the original variables
     are saved by calling 'subsAttr'.  'subsAttr' saves attributes that
     are commonly preserved across row subsetting (i.e., it does not
     save 'dim', 'dimnames', or 'names' attributes).
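
     The round trip described above can be mimicked in base R.  The
     sketch below is an analogue under simplifying assumptions, not
     the Hmisc implementation: 'data.matrix' stands in for
     'asNumericMatrix', and the objects 'saved' and 'restore' are
     hypothetical stand-ins for what 'subsAttr' and 'matrix2dataFrame'
     do.  Only factor levels and storage modes are saved here.

```r
d <- data.frame(sex = factor(c("female", "male", "male")),
                age = c(34L, 51L, 47L),
                y   = c(1.2, 3.4, 2.2))

# Save what a numeric matrix cannot hold: factor levels, storage modes
saved <- lapply(d, function(v)
  list(levels = levels(v), mode = storage.mode(v)))

m <- data.matrix(d)   # factors are replaced by their integer codes

# Hypothetical helper: rebuild each column from the matrix and 'saved'
restore <- function(m, saved) {
  cols <- lapply(colnames(m), function(nm) {
    v  <- unname(m[, nm])
    at <- saved[[nm]]
    if (!is.null(at$levels)) {
      factor(at$levels[v], levels = at$levels)  # codes back to labels
    } else {
      storage.mode(v) <- at$mode                # e.g. back to integer
      v
    }
  })
  names(cols) <- colnames(m)
  as.data.frame(cols)
}

d2 <- restore(m, saved)
str(d2)
```

     The attributes must be saved from the original data frame before
     the matrix is processed, since the matrix itself no longer carries
     them.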

_U_s_a_g_e:

     summarize(X, by, FUN, ..., 
               stat.name=deparse(substitute(X)),
               type=c('variables','matrix'), subset=TRUE)

     mApply(X, INDEX, FUN=NULL, ..., simplify=TRUE)

     asNumericMatrix(x)

     subsAttr(x)

     matrix2dataFrame(x, at, restoreAll=TRUE)

_A_r_g_u_m_e_n_t_s:

       X: a vector or matrix capable of being operated on by the
          function specified as the 'FUN' argument 

      by: one or more stratification variables.  If a single variable,
          'by' may be a vector, otherwise it should be a list. Using
          the Hmisc 'llist' function instead of 'list' will result in
          individual variable names being accessible to 'summarize'. 
          For example, you can specify 'llist(age.group,sex)' or
          'llist(Age=age.group,sex)'.  The latter gives 'age.group' a
          new temporary name, 'Age'.  

     FUN: a function of a single vector argument (or a single matrix
          argument when 'X' is a matrix), used to create the
          statistical summaries for 'summarize'.  'FUN' may compute any
          number of statistics.  

simplify: set to 'FALSE' to suppress simplification of the result into
          an array, matrix, etc.

     ...: extra arguments are passed to 'FUN'

stat.name: the name to use when creating the main summary variable.  By
          default, the name of the 'X' argument is used. 

    type: Specify 'type="matrix"' to store the summary variables (if
          there is more than one) in a matrix. 

  subset: a logical vector or integer vector of subscripts used to
          specify the subset of data to use in the analysis.  The
          default is to use all observations. 

   INDEX: vector or list of vectors to cross-classify on, similar to
          'by'. See 'tapply'.

       x: a data frame (for 'asNumericMatrix') or a numeric matrix (for
          'matrix2dataFrame').  For 'subsAttr', 'x' may be a data
          frame, list, or a vector. 

      at: result of 'subsAttr' 

restoreAll: set to 'FALSE' to only restore attributes 'label', 'units',
          and 'levels' instead of all attributes 

_V_a_l_u_e:

     For 'summarize', a data frame containing the 'by' variables and
     the statistical summaries (the first of which is named the same as
     the 'X' variable unless 'stat.name' is given).  If
     'type="matrix"', the summaries are stored in a single variable in
     the data frame, and this variable is a matrix.

     For 'mApply', the returned value is a vector, matrix, or list.  If
     'FUN' returns more than one number, the result is an array if
     'simplify=TRUE' and is a list otherwise.  If a matrix is returned,
     its rows correspond to unique combinations of 'INDEX'.  If 'INDEX'
     is a list with more than one vector, 'FUN' returns more than one
     number, and 'simplify=FALSE', the returned value is a list that is
     an array with the first dimension corresponding to the last vector
     in 'INDEX', the second dimension corresponding to the next to last
     vector in 'INDEX', etc., and the elements of the list-array
     correspond to the values computed by 'FUN'.  In this situation the
     returned value is a regular array if 'simplify=TRUE'; the order of
     dimensions is the same, but an additional (last) dimension
     corresponds to the values computed by 'FUN'.

     'asNumericMatrix' returns a numeric matrix, and 'matrix2dataFrame'
     returns a data frame.  'subsAttr' returns a list of attribute
     lists if its argument is a list or data frame, and a list
     containing the attributes of a single variable otherwise.

_A_u_t_h_o_r(_s):

     Frank Harrell 
      Department of Biostatistics 
      Vanderbilt University 
      f.harrell@vanderbilt.edu

_S_e_e _A_l_s_o:

     'label', 'cut2', 'llist', 'by'

_E_x_a_m_p_l_e_s:

     ## Not run: 
     s <- summarize(ap>1, llist(size=cut2(sz, g=4), bone), mean,
                    stat.name='Proportion')
     dotplot(Proportion ~ size | bone, data=s)
     ## End(Not run)

     set.seed(1)
     temperature <- rnorm(300, 70, 10)
     month <- sample(1:12, 300, TRUE)
     year  <- sample(2000:2001, 300, TRUE)
     g <- function(x)c(Mean=mean(x,na.rm=TRUE),Median=median(x,na.rm=TRUE))
     summarize(temperature, month, g)
     mApply(temperature, month, g)

     mApply(temperature, month, mean, na.rm=TRUE)
     w <- summarize(temperature, month, mean, na.rm=TRUE)
     library(lattice)
     xyplot(temperature ~ month, data=w) # plot mean temperature by month

     w <- summarize(temperature, llist(year,month), 
                    quantile, probs=c(.5,.25,.75), na.rm=TRUE, type='matrix')
     xYplot(Cbind(temperature[,1],temperature[,-1]) ~ month | year, data=w)
     mApply(temperature, llist(year,month),
            quantile, probs=c(.5,.25,.75), na.rm=TRUE)

     # Compute the median and outer quartiles.  The outer quartiles are
     # displayed using "error bars"
     set.seed(111)
     dfr <- expand.grid(month=1:12, year=c(1997,1998), reps=1:100)
     attach(dfr)
     y <- abs(month-6.5) + 2*runif(length(month)) + year-1997
     s <- summarize(y, llist(month,year), smedian.hilow, conf.int=.5)
     s
     mApply(y, llist(month,year), smedian.hilow, conf.int=.5)

     xYplot(Cbind(y,Lower,Upper) ~ month, groups=year, data=s, 
            keys='lines', method='alt')
     # Can also do:
     s <- summarize(y, llist(month,year), quantile, probs=c(.5,.25,.75),
                    stat.name=c('y','Q1','Q3'))
     xYplot(Cbind(y, Q1, Q3) ~ month, groups=year, data=s, keys='lines')
     # To display means and bootstrapped nonparametric confidence intervals
     # use for example:
     s <- summarize(y, llist(month,year), smean.cl.boot)
     xYplot(Cbind(y, Lower, Upper) ~ month | year, data=s)

     # For each subject use the trapezoidal rule to compute the area under
     # the (time,response) curve using the Hmisc trap.rule function
     x <- cbind(time=c(1,2,4,7, 1,3,5,10),response=c(1,3,2,4, 1,3,2,4))
     subject <- c(rep(1,4),rep(2,4))
     trap.rule(x[1:4,1],x[1:4,2])
     summarize(x, subject, function(y) trap.rule(y[,1],y[,2]))

     ## Not run: 
     # Another approach would be to properly re-shape the mm array below
     # This assumes no missing cells.  There are many other approaches.
     # mApply will do this well while allowing for missing cells.
     m <- tapply(y, list(year,month), quantile, probs=c(.25,.5,.75))
     mm <- array(unlist(m), dim=c(3,2,12), 
                 dimnames=list(c('lower','median','upper'),c('1997','1998'),
                               as.character(1:12)))
     # aggregate will help but it only allows you to compute one quantile
     # at a time; see also the Hmisc mApply function
     dframe <- aggregate(y, list(Year=year,Month=month), quantile, probs=.5)

     # Compute expected life length by race assuming an exponential
     # distribution - can also use summarize
     g <- function(y) { # computations for one race group
       futime <- y[,1]; event <- y[,2]
       sum(futime)/sum(event)  # assume event=1 for death, 0=alive
     }
     mApply(cbind(followup.time, death), race, g)

     # To run mApply on a data frame:
     m <- mApply(asNumericMatrix(x), race, h)
     # Here assume h is a function that returns a matrix similar to x
     at <- subsAttr(x)  # get original attributes and storage modes
     matrix2dataFrame(m, at)

     # Get stratified weighted means
     g <- function(y) wtd.mean(y[,1],y[,2])
     summarize(cbind(y, wts), llist(sex,race), g, stat.name='y')
     mApply(cbind(y,wts), llist(sex,race), g)

     # Compare speed of mApply vs. by for computing stratified medians
     d <- data.frame(sex=sample(c('female','male'),100000,TRUE),
                     country=sample(letters,100000,TRUE),
                     y1=runif(100000), y2=runif(100000))
     g <- function(x) {
       # named vector: medians of the difference and of the sum
       c(med.diff=median(x[,'y1']-x[,'y2']),
         med.sum =median(x[,'y1']+x[,'y2']))
     }

     system.time(by(d, llist(sex=d$sex,country=d$country), g))
     system.time({
                  x <- asNumericMatrix(d)
                  a <- subsAttr(d)
                  m <- mApply(x, llist(sex=d$sex,country=d$country), g)
                 })
     system.time({
                  x <- asNumericMatrix(d)
                  summarize(x, llist(sex=d$sex, country=d$country), g)
                 })

     # An example where each subject has one record per diagnosis but sex of
     # subject is duplicated for all the rows a subject has.  Get the cross-
     # classified frequencies of diagnosis (dx) by sex and plot the results
     # with a dot plot

     count <- rep(1,length(dx))
     d <- summarize(count, llist(dx,sex), sum)
     Dotplot(dx ~ count | sex, data=d)
     ## End(Not run)
     detach('dfr')

