varclus                package:Hmisc                R Documentation

_V_a_r_i_a_b_l_e _C_l_u_s_t_e_r_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     Does a hierarchical cluster analysis on variables, using the
     Hoeffding D statistic, squared Pearson or Spearman correlations,
     or proportion of observations for which two variables are both
     positive as similarity measures.  Variable clustering is used for
     assessing collinearity, redundancy, and for separating variables
     into clusters that can be scored as a single variable, thus
     resulting in data reduction.  For computing any of the
     similarity measures, pairwise deletion of NAs is done.  The
     clustering is done by 'hclust()'.  A small function 'naclus' is
     also provided; it depicts how similar variables in a data frame
     are in terms of which observations are missing.  The similarity
     measure is the fraction of NAs in common between any two
     variables.  The
     diagonals of this 'sim' matrix are the fraction of NAs in each
     variable by itself.  'naclus' also computes 'na.per.obs', the
     number of missing variables in each observation, and 'mean.na', a
     vector whose ith element is the mean number of missing variables
     other than variable i, for observations in which variable i is
     missing.  The 'naplot' function makes several plots (see the
     'which' argument).
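
     As a rough illustration of the quantities just described, the
     following base R sketch computes them by hand.  It is not the
     'naclus' implementation: 'df' is a made-up data frame, and the
     names 'm', 'sim', 'na.per.obs' and 'mean.na' are used here only
     to mirror the text.

```r
# Made-up data frame with assorted missingness
df <- data.frame(a = c(1, 2, 3), b = c(1, 2, NA),
                 c = c(NA, NA, 3), d = c(NA, 2, NA))

m <- is.na(df) * 1        # 0/1 indicator matrix of missingness

# Fraction of observations in which both variables are NA;
# the diagonal is the fraction of NAs in each variable by itself
sim <- crossprod(m) / nrow(m)

# Number of missing variables in each observation
na.per.obs <- rowSums(m)

# For each variable, the mean number of OTHER variables missing
# among observations in which that variable is missing
mean.na <- sapply(colnames(m), function(v) {
  rows <- m[, v] == 1
  mean(rowSums(m[rows, colnames(m) != v, drop = FALSE]))
})

round(sim, 2)
na.per.obs
mean.na
```

     In this sketch a variable that is never missing (here 'a') gets
     'NaN' for 'mean.na', because there are no observations to
     average over.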

     To avoid generating too many dummy variables for multi-valued
     character or categorical predictors, 'varclus' automatically
     combines infrequent cells of such variables using an auxiliary
     function 'combine.levels' that is defined here.
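
     The pooling rule given under the 'minlev' argument can be
     sketched in base R as follows.  This is only an illustration
     under stated assumptions, not the 'combine.levels' source:
     'pool_levels' is a hypothetical name, and the comma used to join
     two merged level names is likewise an assumption.

```r
# Hypothetical helper illustrating the pooling rule for rare levels
pool_levels <- function(x, minlev = 0.05) {
  x <- as.factor(x)
  cnt <- table(x)                          # counts per level (NAs excluded)
  small <- names(cnt)[cnt < minlev * sum(cnt)]
  lev <- levels(x)
  if (length(small) > 1) {
    # several rare levels: pool them all into "OTHER"
    lev[lev %in% small] <- "OTHER"
  } else if (length(small) == 1) {
    # one rare level: merge it with the next least frequent level
    two <- names(sort(cnt))[1:2]
    lev[lev %in% two] <- paste(two, collapse = ",")   # separator assumed
  }
  levels(x) <- lev    # duplicated labels merge the corresponding levels
  x
}

r1 <- table(pool_levels(c(1, 2, rep(3, 20))))          # two rare levels
r2 <- table(pool_levels(c(1, rep(2, 11), rep(3, 9))))  # one rare level
r1
r2
```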

     'plotMultSim' plots multiple similarity matrices, with the
     similarity measure being on the y-axis of each subplot.

     'na.pattern' prints a frequency table of all combinations of
     missingness for multiple variables.  If there are 3 variables, a
     frequency table entry labeled '110' corresponds to the number of
     observations for which the first and second variables were missing
     but the third variable was not missing.
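
     The labeling scheme can be reproduced with a few lines of base
     R.  This is a sketch of the idea rather than the 'na.pattern'
     code itself; the data frame 'df3' and the object names are made
     up for the illustration.

```r
# Three variables; 1 in the label means "missing", 0 means "observed"
df3 <- data.frame(a = c(NA, NA, 1, 2),
                  b = c(NA, NA, NA, 4),
                  c = c(1, 2, 3, NA))

miss <- is.na(df3) * 1                          # 0/1 missingness matrix
pattern <- apply(miss, 1, paste, collapse = "") # one label per observation
freq <- table(pattern)                          # frequency of each pattern
freq
```

     Here the entry labeled '110' counts the two observations in
     which 'a' and 'b' are missing but 'c' is not.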

_U_s_a_g_e:

     varclus(x, similarity=c("spearman","pearson","hoeffding","bothpos","ccbothpos"),
             type=c("data.matrix","similarity.matrix"), 
             method=if(.R.)"complete" else "compact",
             data, subset, na.action, minlev=0.05)
     ## S3 method for class 'varclus':
     print(x, ...)
     ## S3 method for class 'varclus':
     plot(x, ylab, abbrev=FALSE, legend.=FALSE, loc, maxlen, labels, ...)

     naclus(df, method)
     naplot(obj, which=c('all','na per var','na per obs','mean na',
                         'na per var vs mean na'), ...)

     combine.levels(x, minlev=.05)

     plotMultSim(s, x=1:dim(s)[3],
                 slim=range(pretty(c(0,max(s,na.rm=TRUE)))),
                 slimds=FALSE,
                 add=FALSE, lty=par('lty'), col=par('col'),
                 lwd=par('lwd'), vname=NULL, h=.5, w=.75, u=.05,
                 labelx=TRUE, xspace=.35)

     na.pattern(x)

_A_r_g_u_m_e_n_t_s:

       x: a formula, a numeric matrix of predictors, or a similarity
          matrix.  If 'x' is a formula, 'model.matrix' is used to
          convert it to a design matrix. If the formula excludes an
          intercept (e.g., '~ a + b -1'), the first categorical
          ('factor') variable in the formula will have dummy variables
          generated for all levels instead of omitting one for the
          first level. For 'combine.levels', 'x' is a character,
          category, or factor vector (or other vector that is converted
          to factor).  For 'plot' and 'print', 'x' is an object created
          by 'varclus'.  For 'na.pattern', 'x' is a list, data frame,
          or numeric matrix.

          For 'plotMultSim', 'x' is a numeric vector specifying the ordered
          unique values on the x-axis, corresponding to the third
          dimension of 's'. 

      df: a data frame

       s: an array of similarity matrices.  The third dimension of this
          array corresponds to different computations of similarities. 
          The first two dimensions come from a single similarity
          matrix.  This is useful for displaying similarity matrices
          computed by 'varclus', for example.  A use for this might be
          to show pairwise similarities of variables across time in a
          longitudinal study (see the example below).  If 'vname' is
          not given, 's' must have 'dimnames'. 

similarity: the default is to use squared Spearman correlation
          coefficients, which will detect monotonic but nonlinear
          relationships.  You can also specify linear correlation or
          Hoeffding's (1948) D statistic, which has the advantage of
          being sensitive to many types of dependence, including highly
          non-monotonic relationships.  For binary data, or data to be
          made binary, 'similarity="bothpos"' uses as a similarity
          measure the proportion of observations for which two
          variables are both positive.  'similarity="ccbothpos"' uses a
          chance-corrected measure which is the proportion of
          observations for which both variables are positive minus the
          product of the two marginal proportions.  This difference is
          expected to be zero under independence.  For diagonals,
          '"ccbothpos"' still uses the proportion of positives for the
          single variable.  So '"ccbothpos"' is not really a similarity
          measure, and clustering is not done.  This measure is useful
          for plotting with 'plotMultSim' (see the last example). 

    type: if 'x' is not a formula, it may be a data matrix or a
          similarity matrix. By default, it is assumed to be a data
          matrix. 

  method: see 'hclust'.  The default, for both 'varclus' and 'naclus',
          is '"compact"' (for R it is '"complete"'). 

    data: 

  subset: 

na.action: These may be specified if 'x' is a formula.  The default
          'na.action' is 'na.retain', defined by 'varclus'.  This
          causes all observations to be kept in the model frame, with
          later pairwise deletion of 'NA's. 

    ylab: y-axis label.  Default is constructed on the basis of
          'similarity'. 

  abbrev: set to 'TRUE' to abbreviate variable names for plotting.  Is
          set to 'TRUE' automatically if 'legend.=TRUE'. 

 legend.: set to 'TRUE' to plot a legend defining the abbreviations 

     loc: a list with elements 'x' and 'y' defining coordinates of the
          upper left corner of the legend.  Default is 'locator(1)'. 

  maxlen: if a legend is plotted describing abbreviations, original
          labels longer than 'maxlen' characters are truncated at
          'maxlen'. 

  labels: a vector of character strings containing labels corresponding
          to columns in the similarity matrix, if the column names of
          that matrix are not to be used 

     ...: passed to 'plclust' (or to 'dotchart' or 'dotchart2' for
          'naplot'). 

     obj: an object created by 'naclus'

   which: defaults to '"all"', meaning that 'naplot' makes 4 separate
          plots.  To make only one of the plots, use 'which="na per
          var"' (dot chart of fraction of NAs for each variable), '"na
          per obs"' (dot chart showing frequency distribution of number
          of variables having NAs in an observation), '"mean na"' (dot
          chart showing mean number of other variables missing when the
          indicated variable is missing), or '"na per var vs mean
          na"', a scatterplot showing on the x-axis the fraction of NAs
          in the variable and on the y-axis the mean number of other
          variables that are NA when the indicated variable is NA. 

  minlev: the minimum proportion of observations in a cell before that
          cell is combined with one or more cells.  If more than one
          cell has fewer than minlev*n observations, all such cells are
          combined into a new cell labeled '"OTHER"'.  Otherwise, the
          lowest frequency cell is combined with the next lowest
          frequency cell, and the level name is the combination of the
          two old level names. 

    slim: 2-vector specifying the range of similarity values for
          scaling the y-axes.  By default this is the observed range
          over all of 's'. 

  slimds: set to 'TRUE' to scale diagonals and off-diagonals
          separately

     add: set to 'TRUE' to add similarities to an existing plot
          (usually specifying 'lty' or 'col') 

     lty: 

     col: 

     lwd: line type, color, or line thickness for 'plotMultSim' 

   vname: optional vector of variable names, in order, used in 's' 

       h: relative height for subplot 

       w: relative width for subplot 

       u: relative extra height and width to leave unused inside the
          subplot. Also used as the space between y-axis tick mark
          labels and graph border. 

  labelx: set to 'FALSE' to suppress drawing of labels in the x
          direction 

  xspace: amount of space, on a scale of 1:'n' where 'n' is the number
          of variables, to set aside for y-axis labels 
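
     For the 'similarity' argument, the '"bothpos"' and '"ccbothpos"'
     measures can be worked out by hand.  The base R sketch below
     follows the verbal definitions given above for a small made-up
     0/1 matrix with no missing values; it is an illustration, not
     the code used inside 'varclus', and all object names are
     assumptions.

```r
# Made-up binary data: 5 observations on 3 variables
x <- cbind(x1 = c(1, 1, 0, 0, 1),
           x2 = c(1, 0, 0, 1, 1),
           x3 = c(0, 0, 0, 1, 1))

pos <- (x > 0) * 1                     # 1 where the variable is positive

# "bothpos": proportion of observations where both variables are positive
bothpos <- crossprod(pos) / nrow(pos)

# "ccbothpos": subtract the product of the two marginal proportions
# from the off-diagonals; diagonals keep the marginal proportion
p <- colMeans(pos)
ccbothpos <- bothpos - outer(p, p)
diag(ccbothpos) <- p

bothpos
ccbothpos
```

     For 'x1' and 'x2' the proportion both positive is 0.4 and each
     marginal proportion is 0.6, so the chance-corrected value is
     0.4 - 0.36 = 0.04.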

_D_e_t_a_i_l_s:

     'options(contrasts= c("contr.treatment", "contr.poly"))' is issued
     temporarily by 'varclus' to make sure that ordinary dummy
     variables are generated for 'factor' variables.  If a categorical
     or character variable has no level containing at least a fraction
     'minlev' of the data, that variable is omitted from consideration
     and a warning is printed.
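
     The effect of treatment contrasts, and of dropping the intercept
     as described for the 'x' argument, can be seen directly with
     'model.matrix'.  This sketch uses a made-up data frame 'd' and
     does not call 'varclus'; it only shows which dummy columns base
     R generates.

```r
# Temporarily request ordinary (treatment) dummy variables
old <- options(contrasts = c("contr.treatment", "contr.poly"))

d <- data.frame(age = c(30, 40, 50, 60),
                country = factor(c("US", "UK", "FR", "US")))

# With an intercept, the first level (FR) has no dummy variable
with_int <- colnames(model.matrix(~ age + country, data = d))

# With '- 1', the first factor gets a dummy for every level
no_int <- colnames(model.matrix(~ age + country - 1, data = d))

options(old)              # restore the previous contrasts setting

with_int
no_int
```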

_V_a_l_u_e:

     For 'varclus' or 'naclus', a list of class 'varclus' with elements
     'call' (containing the calling statement), 'sim' (similarity
     matrix), 'n' (sample size used if 'x' was not a similarity matrix
     already; 'n' is a matrix), 'hclust' (the object created by
     'hclust'), 'similarity', and 'method'.  'plot' returns the
     object created by 'plclust'.  'naclus' also returns the two
     vectors 'na.per.obs' and 'mean.na' described under Description,
     and 'naplot' returns an invisible vector that is the frequency
     table of the number of missing variables per observation.
     'plotMultSim' invisibly returns the limits of similarities used
     in constructing the y-axes of each subplot.  For
     'similarity="ccbothpos"' the 'hclust' object is 'NULL'.

     'na.pattern' creates an integer vector of frequencies.
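
     To make the 'sim' and 'hclust' elements more concrete, the
     following base R sketch builds comparable objects for the
     default squared Spearman similarity.  It assumes that the
     dissimilarity handed to 'hclust' is one minus the similarity,
     which this page does not state, and it is not a substitute for
     'varclus' (no formula interface, no level pooling).  The cut
     height of 0.7 is arbitrary.

```r
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x <- cbind(x1, x2, x3, x4)

# Squared Spearman correlations with pairwise deletion of NAs
sim <- cor(x, method = "spearman", use = "pairwise.complete.obs")^2

# Assumed conversion: dissimilarity = 1 - similarity
hc <- hclust(as.dist(1 - sim), method = "complete")

# Cluster membership when the tree is cut at dissimilarity 0.7
groups <- cutree(hc, h = 0.7)

round(sim, 2)
groups
```

     Variables that land in the same group are candidates for being
     scored as a single variable.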

_S_i_d_e _E_f_f_e_c_t_s:

     plots

_A_u_t_h_o_r(_s):

     Frank Harrell
     Department of Biostatistics, Vanderbilt University
     f.harrell@vanderbilt.edu

_R_e_f_e_r_e_n_c_e_s:

     Sarle WS (1990): The VARCLUS Procedure.  SAS/STAT User's Guide,
     4th Edition.  Cary NC: SAS Institute, Inc.

     Hoeffding W. (1948): A non-parametric test of independence.  Ann
     Math Stat 19:546-57.

_S_e_e _A_l_s_o:

     'hclust', 'plclust', 'hoeffd', 'rcorr', 'cor', 'model.matrix',
     'locator', 'na.pattern'

_E_x_a_m_p_l_e_s:

     set.seed(1)
     x1 <- rnorm(200)
     x2 <- rnorm(200)
     x3 <- x1 + x2 + rnorm(200)
     x4 <- x2 + rnorm(200)
     x <- cbind(x1,x2,x3,x4)
     v <- varclus(x, similarity="spear")  # spearman is the default anyway
     v    # invokes print.varclus
     print(round(v$sim,2))
     plot(v)

     # plot(varclus(~ age + sys.bp + dias.bp + country - 1), abbrev=TRUE)
     # the -1 causes k dummies to be generated for k countries
     # plot(varclus(~ age + factor(disease.code) - 1))
     #

     df <- data.frame(a=c(1,2,3),b=c(1,2,3),c=c(1,2,NA),d=c(1,NA,3),
                      e=c(1,NA,3),f=c(NA,NA,NA),g=c(NA,2,3),h=c(NA,NA,3))
     par(mfrow=c(2,2))
     for(m in if(.R.)c("ward","complete","median") else 
                     c("compact","connected","average")) {
       plot(naclus(df, method=m))
       title(m)
     }
     naplot(naclus(df))
     n <- naclus(df)
     plot(n); naplot(n)
     na.pattern(df)      # builtin function

     x <- c(1, rep(2,11), rep(3,9))
     combine.levels(x)
     x <- c(1, 2, rep(3,20))
     combine.levels(x)

     # plotMultSim example: Plot proportion of observations
     # for which two variables are both positive (diagonals
     # show the proportion of observations for which the
     # one variable is positive).  Chance-correct the
     # off-diagonals by subtracting the product of the
     # marginal proportions.  On each subplot the x-axis
     # shows month (0, 4, 8, 12) and there is a separate
     # curve for females and males
     d <- data.frame(sex=sample(c('female','male'),1000,TRUE),
                     month=sample(c(0,4,8,12),1000,TRUE),
                     x1=sample(0:1,1000,TRUE),
                     x2=sample(0:1,1000,TRUE),
                     x3=sample(0:1,1000,TRUE))
     s <- array(NA, c(3,3,4))
     opar <- par(mar=c(0,0,4.1,0))  # waste less space
     for(sx in c('female','male')) {
       for(i in 1:4) {
         mon <- (i-1)*4
         s[,,i] <- varclus(~x1 + x2 + x3, sim='ccbothpos', data=d,
                           subset=month==mon & sex==sx)$sim
       }
       plotMultSim(s, c(0,4,8,12), vname=c('x1','x2','x3'),
                   add=sx=='male', slimds=TRUE,
                   lty=1+(sx=='male'))
       # slimds=TRUE causes separate scaling for diagonals and
       # off-diagonals
     }
     par(opar)

