describe                package:Hmisc                R Documentation

_C_o_n_c_i_s_e _S_t_a_t_i_s_t_i_c_a_l _D_e_s_c_r_i_p_t_i_o_n _o_f _a _V_e_c_t_o_r, _M_a_t_r_i_x, _D_a_t_a _F_r_a_m_e, _o_r _F_o_r_m_u_l_a

_D_e_s_c_r_i_p_t_i_o_n:

     'describe' is a generic method that invokes 'describe.data.frame',
     'describe.matrix', 'describe.vector', or 'describe.formula'.
     'describe.vector' is the basic function for handling a single
     variable. This function determines whether the variable is
     character, factor, category, binary, discrete numeric, or
     continuous numeric, and prints a concise statistical summary
     appropriate to that type. A numeric variable is deemed discrete if
     it has <= 10 unique values. In this case, quantiles are not
     printed. A frequency table is printed for any non-binary variable
     if it has no more than 20 unique values. For any variable with at
     least 20 unique values, the 5 lowest and highest values are printed.
     'describe' is especially useful for describing data frames created
     by 'sas.get', as SAS labels, formats, value labels, and
     frequencies of special missing values are printed.
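
     The thresholds above can be illustrated with a small base R
     sketch.  'classify_variable' is a hypothetical helper written for
     this illustration; it is not part of Hmisc and is not how
     'describe.vector' is implemented.  It assumes that "binary" means
     a numeric variable coded 0/1, and it ignores the "category" type.

```r
# Hypothetical helper, not part of Hmisc: applies the thresholds quoted in
# the text above to one variable and reports what would be printed.
classify_variable <- function(x) {
  x <- x[!is.na(x)]                      # ignore missing values
  n_unique <- length(unique(x))

  type <- if (is.character(x)) {
    "character"
  } else if (is.factor(x)) {
    "factor"
  } else if (is.numeric(x) && n_unique == 2 && all(x %in% c(0, 1))) {
    "binary"                             # assumption: binary means 0/1 coded
  } else if (is.numeric(x) && n_unique <= 10) {
    "discrete numeric"
  } else if (is.numeric(x)) {
    "continuous numeric"
  } else {
    "other"
  }

  list(type            = type,
       n_unique        = n_unique,
       quantiles       = type == "continuous numeric",
       frequency_table = type != "binary" && n_unique <= 20,
       extreme_values  = n_unique >= 20)
}

classify_variable(c(0, 1, 1, 0, NA))$type      # binary
classify_variable(c(3, 1, 4, 1, 5))$type       # discrete numeric
str(classify_variable(seq(0.5, 100, by = 0.5)))
```

     Note that a variable with exactly 20 unique values satisfies both
     the frequency-table rule and the lowest/highest-values rule as
     they are worded above.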

     For a binary variable, the sum (number of 1's) and mean
     (proportion of 1's) are printed. If the first argument is a
     formula, a model frame is created and passed to
     'describe.data.frame'.  If a variable is of class '"impute"', a
     count of the number of imputed values is printed.  If a date
     variable has an attribute 'partial.date' (this is set up by
     'sas.get'), counts of how many partial dates are actually present
     (missing month, missing day, missing both) are also presented. If
     a variable was created by the special-purpose function 'substi'
     (which substitutes values of a second variable if the first
     variable is NA), the frequency table of substitutions is also
     printed.  

     A latex method exists for converting the 'describe' object to a
     LaTeX file.  For numeric variables having at least 20 unique
     values, 'describe' saves in its returned object the frequencies of
     100 evenly spaced bins running from minimum observed value to the
     maximum.  'latex' inserts a spike histogram displaying these
     frequency counts in the tabular material using the LaTeX picture
     environment.  For example output see <URL:
     hesweb1.med.virginia.edu/s/doc/describe.example.pdf>.

     Sample weights may be specified to any of the functions, resulting
     in weighted means, quantiles, and frequency tables.
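
     As a rough illustration of what weighting means here, the
     following base R sketch computes a weighted mean and the sample
     size under the two conventions controlled by 'normwt'.  The data
     and variable names are made up for the example, and the
     arithmetic is a simplification rather than Hmisc's own code.

```r
x <- c(10, 20, 30, 40)
w <- c(1, 1, 2, 4)                       # hypothetical frequency weights

# normwt=FALSE convention: weights used as given, so the sample size is
# taken to be the sum of the weights
n_unnormalized    <- sum(w)
mean_unnormalized <- sum(w * x) / sum(w)

# normwt=TRUE convention: rescale the weights so they sum to the number
# of records actually supplied
w_norm          <- w * length(w) / sum(w)
n_normalized    <- sum(w_norm)
mean_normalized <- sum(w_norm * x) / sum(w_norm)

c(n_unnormalized = n_unnormalized, n_normalized = n_normalized)
c(mean_unnormalized = mean_unnormalized, mean_normalized = mean_normalized)
```

     Rescaling the weights by a constant leaves the weighted mean
     unchanged; what differs between the two conventions is the sample
     size that is assumed.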

_U_s_a_g_e:

     ## S3 method for class 'vector':
     describe(x, descript, exclude.missing=TRUE, digits=4,
              weights, normwt, ...)
     ## S3 method for class 'matrix':
     describe(x, descript, exclude.missing=TRUE, digits=4, ...)
     ## S3 method for class 'data.frame':
     describe(x, descript, exclude.missing=TRUE,
         digits=4, ...)
     ## S3 method for class 'formula':
     describe(x, descript, data, subset, na.action,
         digits=4, weights, ...)
     ## S3 method for class 'describe':
     print(x, condense=TRUE, ...)
     ## S3 method for class 'describe':
     latex(object, title=NULL, condense=TRUE, 
           file=paste('describe',first.word(expr=attr(object,'descript')),'tex',sep='.'),
           append=FALSE, size='small', tabular=TRUE, ...)
     ## S3 method for class 'describe.single':
     latex(object, title=NULL, condense=TRUE, vname,
           file, append=FALSE, size='small', tabular=TRUE, ...)

_A_r_g_u_m_e_n_t_s:

       x: a data frame, matrix, vector, or formula.  For a data frame,
          the 'describe.data.frame' function is automatically invoked.
          For a matrix, 'describe.matrix' is called.  For a formula,
          'describe.data.frame(model.frame(x))' is invoked. The formula
          may or may not have a response variable.  For 'print' or
          'latex', 'x' is an object created by 'describe'. 

descript: optional title to print for x. The default is the name of the
          argument or the "label" attributes of individual variables.
          When the first argument is a formula, 'descript' defaults to
          a character representation of the formula. 

exclude.missing: set to 'TRUE' to print the names of variables that
          contain only missing values. This list appears at the bottom
          of the printout, and no space is taken up for such variables
          in the main listing. 

  digits: number of significant digits to print 

 weights: a numeric vector of frequencies or sample weights.  Each
          observation will be treated as if it were sampled 'weights'
          times. 

  normwt: The default, 'normwt=FALSE', results in the use of 'weights'
          as weights in computing various statistics.  In this case the
          sample size is assumed to be equal to the sum of 'weights'. 
          Specify 'normwt=TRUE' to divide 'weights' by a constant so
          that 'weights' sum to the number of observations (length of
          vectors specified to 'describe').  In this case the number of
          observations is taken to be the actual number of records
          given to 'describe'. 

  object: a result of 'describe'

   title: unused

condense: default is 'TRUE' to condense the output with regard to the
          5 lowest and highest values and the frequency table 

    data: 

  subset: 

na.action: These are used if a formula is specified.  'na.action'
          defaults to 'na.retain' which does not delete any 'NA's from
          the data frame. Use 'na.action=na.omit' or 'na.delete' to
          drop any observation with any 'NA' before processing. 

     ...: arguments passed to 'describe.default' which are passed to
          calls to 'format' for numeric variables.  For example if
          using R 'POSIXct' date/time formats, specifying
          'describe(d,format='%d%b%y')' will print date/time variables
          as '"01Jan2000"'.  This is useful for omitting the time
          component.  See the help file for 'format.POSIXct' for more
          information.  For 'latex' methods, ... is ignored.

    file: name of output file (should have a suffix of .tex).  Default
          name is formed from the first word of the 'descript' element
          of the 'describe' object, prefixed by '"describe"'.  Set
          'file=""' to send LaTeX code to standard output instead of a
          file. 

  append: set to 'TRUE' to have 'latex' append text to an existing file
          named 'file' 

    size: LaTeX text size ('"small"', the default, or '"normalsize"',
          '"tiny"', '"scriptsize"', etc.) for the 'describe' output in
          LaTeX. 

 tabular: set to 'FALSE' to use verbatim rather than tabular
          environment for the summary statistics output.  By default,
          tabular is used if the output is not too wide.

   vname: unused argument in 'latex.describe.single'

_D_e_t_a_i_l_s:

     If 'options(na.detail.response=TRUE)' has been set and 'na.action'
     is '"na.delete"' or '"na.keep"', summary  statistics on the
     response variable are printed separately for missing and
     non-missing values of each predictor.  The default summary
     function returns the number of non-missing response values and the
     mean of the last column of the response values, with a 'names'
     attribute of 'c("N","Mean")'. When the response is a 'Surv' object
     and the mean is used, this will result in the crude proportion of
     events being used to summarize the response.  The actual summary
     function can be designated through 'options(na.fun.response =
     "function name")'.

_V_a_l_u_e:

     a list containing elements 'descript', 'counts', 'values'.  The
     list is of class 'describe'.  If the input object was a matrix or
     a data frame, the list is a list of lists, one list for each
     variable analyzed. 'latex' returns a standard 'latex' object.  For
     numeric variables having at least 20 unique values, the list has
     an additional component 'intervalFreq'.  This component is a list
     with two elements, 'range' (containing two values) and 'count', a
     vector of 100 integer frequency counts.
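
     The shape of 'intervalFreq' can be sketched in base R as follows.
     This is only an illustration of counting observations in 100
     evenly spaced bins between the observed minimum and maximum; the
     exact bin boundaries and rounding used by Hmisc may differ, and
     'interval_freq' is a name chosen for this example.

```r
set.seed(1)
x <- runif(200)                          # more than 20 unique values

rng    <- range(x)
breaks <- seq(rng[1], rng[2], length.out = 101)   # 101 edges, 100 bins

# all.inside = TRUE keeps every index in 1..100, so the maximum falls in
# the last bin even in the presence of floating-point rounding
bin   <- findInterval(x, breaks, rightmost.closed = TRUE, all.inside = TRUE)
count <- tabulate(bin, nbins = 100)      # integer counts, one per bin

interval_freq <- list(range = rng, count = count)
str(interval_freq)
```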

_A_u_t_h_o_r(_s):

     Frank Harrell 
      Vanderbilt University 
      f.harrell@vanderbilt.edu

_S_e_e _A_l_s_o:

     'sas.get', 'quantile', 'table', 'summary',  'model.frame.default',
     'naprint', 'lapply', 'tapply', 'Surv', 'na.delete', 'na.keep',
     'na.detail.response', 'latex'

_E_x_a_m_p_l_e_s:

     set.seed(1)
     describe(runif(200),digits=2) #single variable, continuous
                                   #get quantiles .05,.10,...

     dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE))
     describe(dfr)

     ## Not run: 
     d <- sas.get(".","mydata",special.miss=TRUE,recode=TRUE)
     describe(d)      #describe entire data frame
     attach(d, 1)
     describe(relig)  #Has special missing values .D .F .M .R .T
                      #attr(relig,"label") is "Religious preference"

     #relig : Religious preference  Format:relig
     #    n missing  D  F M R T unique 
     # 4038     263 45 33 7 2 1      8
     #
     #0:none (251, 6%), 1:Jewish (372, 9%), 2:Catholic (1230, 30%) 
     #3:Jehovah's Witnes (25, 1%), 4:Christ Scientist (7, 0%) 
     #5:Seventh Day Adv (17, 0%), 6:Protestant (2025, 50%), 7:other (111, 3%) 

     # Method for describing part of a data frame:
      describe(death.time ~ age*sex + rcs(blood.pressure))
      describe(~ age+sex)
      describe(~ age+sex, weights=freqs)  # weighted analysis

      fit <- lrm(y ~ age*sex + log(height))
      describe(formula(fit))
      describe(y ~ age*sex, na.action=na.delete)   
     # report on number deleted for each variable
      options(na.detail.response=TRUE)  
     # keep missings separately for each x, report on dist of y by x=NA
      describe(y ~ age*sex)
      options(na.fun.response="quantile")
      describe(y ~ age*sex)   # same but use quantiles of y by x=NA

      d <- describe(my.data.frame)
      d$age                   # print description for just age
      d[c('age','sex')]       # print description for two variables
      d[sort(names(d))]       # print in alphabetic order by var. names
      d2 <- d[20:30]          # keep variables 20-30
      page(d2)                # pop-up window for these variables

     # Test date/time formats and suppression of times when they don't vary
      library(chron)
      d <- data.frame(a=chron((1:20)+.1),
                      b=chron((1:20)+(1:20)/100),
                      d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                                    hour=rep(11,20),min=rep(17,20),sec=rep(11,20)),
                      f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                                    hour=1:20,min=1:20,sec=1:20),
                      g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20))
      describe(d)

     ## End(Not run)

