scat1d                 package:Hmisc                 R Documentation

_O_n_e-_D_i_m_e_n_s_i_o_n_a_l _S_c_a_t_t_e_r _D_i_a_g_r_a_m, _S_p_i_k_e _H_i_s_t_o_g_r_a_m, _o_r _D_e_n_s_i_t_y

_D_e_s_c_r_i_p_t_i_o_n:

     'scat1d' adds tick marks (bar codes, rug plot) on any of the four
     sides of an existing plot, corresponding to non-missing values
     of a vector 'x'.  This is used to show the data density.  It can
     also place the tick marks along a curve when y-coordinates are
     specified to go along with the 'x' values.

     If any two values of 'x' are within 'eps*w' of each other, where
     'eps' defaults to .001 and 'w' is the span of the intended axis,
     values of 'x' are jittered by adding a value uniformly distributed
     in '[-jitfrac*w, jitfrac*w]', where 'jitfrac' defaults to .008.
     Specifying 'preserve=TRUE' invokes 'jitter2', which uses a
     different jittering logic. 'scat1d' can also plot random
     sub-segments of the tick marks to handle very large 'x' vectors
     (see 'tfrac').
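
     The default jittering rule described above can be sketched in base
     R.  This is only an illustration under stated assumptions:
     'jitter_if_close' is a hypothetical helper, not an Hmisc function;
     it uses the range of 'x' as a stand-in for the axis span 'w'
     (whereas 'scat1d' takes the span from the current plot), and it
     jitters every value once any close pair is found.

```r
# Hypothetical helper (not part of Hmisc): jitter x when some pair of
# values lies within eps * w of each other.
jitter_if_close <- function(x, w = diff(range(x, na.rm = TRUE)),
                            eps = 0.001, jitfrac = 0.008) {
  x <- x[!is.na(x)]                 # only non-missing values get tick marks
  gaps <- diff(sort(x))             # distances between neighbouring values
  # If any two values are close, some pair of sorted neighbours is close too.
  if (jitfrac > 0 && length(gaps) > 0 && any(gaps <= eps * w)) {
    x <- x + runif(length(x), -jitfrac * w, jitfrac * w)
  }
  x
}

set.seed(1)
x  <- c(1, 2, 2, 3, 10, NA)         # the duplicated 2 triggers jittering
xj <- jitter_if_close(x)
```

     With well-separated values, such as 'c(1, 5, 10)', the helper
     returns its input unchanged.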

     'jitter2' is a generic method for jittering, which does not add
     random noise. It retains unique values and ranks, and randomly
     spreads duplicate values at equidistant positions within limits of
     enclosing values. 'jitter2' is especially useful for numeric
     variables with discrete values, like rating scales. Missing values
     are allowed and are returned. Currently implemented methods are
     'jitter2.default' for vectors and 'jitter2.data.frame' which
     returns a data.frame with each numeric column jittered.

     'datadensity' is a generic method used to show data densities in
     more complex situations.  In the Design library there is a
     'datadensity' method for use with 'plot.Design'.  Here, another
     'datadensity' method is defined for data frames.  Depending on the
     'which' argument, some or all of the variables in a data frame
     will be displayed, with 'scat1d' used to display continuous
     variables and, by default, bars used to display frequencies of
     categorical, character, or discrete numeric variables.  For such
     variables, when the total length of value labels exceeds 200, only
     the first few characters from each level are used. By default,
     'datadensity.data.frame' will construct one axis (i.e., one strip)
     per variable in the data frame.  Variable names appear to the left
     of the axes, and the number of missing values (if greater than
     zero) appears to the right of the axes.  An optional 'group'
     variable can be used for stratification, where the different
     strata are depicted using different colors.  If the 'q' vector is
     specified, the desired quantiles (over all 'group's) are displayed
     with solid triangles below each axis.

     When the sample size reaches 2000 (this threshold may be modified
     using the 'nhistSpike' argument), 'datadensity' calls 'histSpike'
     instead of 'scat1d' to show the data density for numeric
     variables.  This results in a histogram-like display that makes
     the resulting graphics file much smaller.  In this case,
     'datadensity' uses the 'minf' argument (see below) so that very
     infrequent data values will not be lost on the variable's axis,
     although this will slightly distort the histogram.
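
     The effect of 'minf' on low bin frequencies can be illustrated
     with a few lines of base R.  This is a sketch, not the Hmisc
     source: the frequencies in 'f' are made up, and only non-empty
     bins are shown, since the treatment of empty bins is not
     described here.

```r
# Made-up bin frequencies; the last two bins hold very few observations.
f    <- c(400, 250, 90, 3, 1)
minf <- 0.075                       # value passed by datadensity.data.frame

# Raise low frequencies to minf times the maximum bin frequency.
f_shown <- pmax(f, minf * max(f))
f_shown
```

     The two rare bins are drawn as if they held 30 observations each,
     which keeps them visible at the cost of a small distortion.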

     'histSpike' is another method for showing a high-resolution data
     distribution that is particularly good for very large datasets
     (say 'n' > 1000).  By default, 'histSpike' bins the continuous 'x'
     variable into 100 equal-width bins and then computes the frequency
     counts within bins. If 'add=FALSE' (the default), the function
     displays either proportions or frequencies as in a vertical
     histogram.  Instead of bars, spikes are used to depict the
     frequencies.  If 'add=TRUE', the function assumes you are adding
     small density displays that are intended to take up a small amount
     of space in the margins of the overall plot.  The 'frac' argument
     is used as with 'scat1d' to determine the relative length of the
     whole plot that is used to represent the maximum frequency.  No
     jittering is done by 'histSpike'.
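
     The binning step can be approximated in base R as follows.  This
     is a rough sketch rather than the 'histSpike' implementation: the
     variable names are illustrative, and the choice of
     'findInterval' with left-closed bins is an assumption.

```r
set.seed(2)
x    <- rnorm(5000)
nint <- 100

# nint equal-width bins spanning the range of x
breaks <- seq(min(x), max(x), length.out = nint + 1)
bin    <- findInterval(x, breaks, rightmost.closed = TRUE, all.inside = TRUE)

counts <- tabulate(bin, nbins = nint)           # frequency count per bin
props  <- counts / sum(counts)                  # proportions
mids   <- (breaks[-1] + breaks[-(nint + 1)]) / 2  # bin midpoints
```

     Plotting 'mids' against 'props' with 'type="h"' in base graphics
     gives a comparable spike display.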

     'histSpike' can also graph a kernel density estimate for 'x', or
     add a small density curve to any of 4 sides of an existing plot. 
     When 'y' or 'curve' is specified, the density or spikes are drawn
     with respect to the curve rather than the x-axis.

_U_s_a_g_e:

     scat1d(x, side=3, frac=0.02, jitfrac=0.008, tfrac,
            eps=ifelse(preserve,0,.001),
            lwd=0.1, col=par("col"),
            y=NULL, curve=NULL,
            bottom.align=FALSE,
            preserve=FALSE, fill=1/3, limit=TRUE, nhistSpike=2000, nint=100,
            type=c('proportion','count','density'), grid=FALSE, ...)

     jitter2(x, ...)

     ## Default S3 method:
     jitter2(x, fill=1/3, limit=TRUE, eps=0, presorted=FALSE, ...)

     ## S3 method for class 'data.frame':
     jitter2(x, ...)

     datadensity(object, ...)

     ## S3 method for class 'data.frame':
     datadensity(object, group,
                 which=c("all","continuous","categorical"),
                 method.cat=c("bar","freq"),
                 col.group=1:10,
                 n.unique=10, show.na=TRUE, nint=1, naxes,
                 q, bottom.align=nint>1,
                 cex.axis=sc(.5,.3), cex.var=sc(.8,.3),
                 lmgp=NULL, tck=sc(-.009,-.002),
                 ranges=NULL, labels=NULL, ...)
     # sc(a,b) means default to a if number of axes <= 3, b if >=50, use
     # linear interpolation within 3-50

     histSpike(x, side=1, nint=100, frac=.05, minf=NULL, mult.width=1,
               type=c('proportion','count','density'),
               xlim=range(x), ylim=c(0,max(f)), xlab=deparse(substitute(x)), 
               ylab=switch(type,proportion='Proportion',
                                count     ='Frequency',
                                density   ='Density'),
               y=NULL, curve=NULL, add=FALSE, 
               bottom.align=type=='density', col=par('col'), lwd=par('lwd'),
               grid=FALSE, ...)
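
     The 'sc(a,b)' notation in the 'datadensity' usage can be read as
     linear interpolation on the number of axes.  A minimal sketch,
     assuming an explicit third argument 'naxes' (in the notation
     above, the number of axes is implicit, so this signature is
     hypothetical):

```r
# Hypothetical stand-alone version of sc(a, b), for illustration only.
sc <- function(a, b, naxes) {
  # a at 3 or fewer axes, b at 50 or more, linear in between
  approx(x = c(3, 50), y = c(a, b), xout = naxes, rule = 2)$y
}

sc(0.5, 0.3, 3)     # 0.5
sc(0.5, 0.3, 50)    # 0.3
sc(0.5, 0.3, 26.5)  # 0.4, halfway between 3 and 50
```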

_A_r_g_u_m_e_n_t_s:

       x: a vector of numeric data, or a data frame (for 'jitter2') 

  object: a data frame or list (even with unequal number of
          observations per variable, as long as 'group' is not
          specified) 

    side: axis side to use (1=bottom (default for 'histSpike'), 2=left,
          3=top (default for 'scat1d'), 4=right) 

    frac: fraction of smaller of vertical and horizontal axes for tick
          mark lengths. Can be negative to move tick marks outside of
          plot.  For 'histSpike', this is the relative length to be
          used for the largest frequency. When 'scat1d' calls
          'histSpike', it multiplies its 'frac' argument by 2.5. 

 jitfrac: fraction of axis for jittering.  If <=0, no jittering is
          done. If 'preserve=TRUE', the amount of jittering is
          independent of jitfrac. 

   tfrac: fraction of tick mark to actually draw.  If 'tfrac<1', will
          draw a random fraction 'tfrac' of the line segment at each
          point. This is useful for very large samples or ones with
          some very dense points. The default value is 1 if the number
          of non-missing observations 'n' is less than 125, and
          'max(.1, 125/n)' otherwise. 

     eps: fraction of axis for determining overlapping points in 'x'.
          For 'preserve=TRUE' the default is 0 and original unique
          values are retained; larger values of 'eps' tend to bias
          observations from dense to sparse regions, but ranks are
          still preserved. 

     lwd: line width for tick marks, passed to 'segments' 

     col: color for tick marks, passed to 'segments' 

       y: specify a vector the same length as 'x' to draw tick marks
          along a curve instead of by one of the axes.  The 'y' values
          are often predicted values from a model.  The 'side' argument
          is ignored when 'y' is given.  If the curve is already
          represented as a table look-up, you may specify it using the
          'curve' argument instead.  'y' may be a scalar to use a
          constant vertical placement. 

   curve: a list containing elements 'x' and 'y' for which linear
          interpolation is used to derive 'y' values corresponding to
          values of 'x'.  This results in tick marks being drawn along
          the curve.  For 'histSpike', interpolated 'y' values are
          derived for bin midpoints. 

bottom.align: set to 'TRUE' to have the bottoms of tick marks (for
          'side=1' or 'side=3') aligned at the y-coordinate.  The
          default behavior is to center the tick marks.  For
          'datadensity.data.frame', 'bottom.align' defaults to 'TRUE'
          if 'nint>1'.  In other words, if you are only labeling the
          first and last axis tick mark, the 'scat1d' tick marks are
          centered on the variable's axis. 

preserve: set to 'TRUE' to invoke 'jitter2' 

    fill: maximum fraction of the axis filled by jittered values. If
          'd' are duplicated values between a lower value 'l' and upper
          value 'u', then 'd' will be spread within '+/-
          fill*min(u-d,d-l)/2'. 

   limit: specifies a limit for maximum shift in jittered values.
          Duplicate values will be spread within '+/-
          fill*min(limit,min(u-d,d-l)/2)'. The default 'TRUE' restricts
          jittering to the smallest min(u-d,d-l)/2 observed and results
          in equal amount of jittering for all d. Setting to 'FALSE'
          allows for locally different amount of jittering, using
          maximum space available. 

nhistSpike: If the number of observations exceeds or equals
          'nhistSpike', 'scat1d' will automatically call 'histSpike' to
          draw the data density, to prevent the graphics file from
          being too large. 

    type: used by or passed to 'histSpike'.  Set to '"count"' to
          display frequency counts rather than relative frequencies, or
          '"density"' to display a kernel density estimate computed
          using the 'density' function. 

    grid: set to 'TRUE' if the R 'grid' package is in effect for the
          current plot 

    nint: number of intervals to divide each continuous variable's axis
          for 'datadensity'.  For 'histSpike', this is the number of
          equal-width intervals into which to bin 'x'; if instead
          'nint' is a character string (e.g., 'nint="all"'), the
          frequency tabulation is done with no binning.  In other
          words, frequencies for all unique values of 'x' are derived
          and plotted. 

     ...: optional arguments passed to 'scat1d' from 'datadensity' or
          to 'histSpike' from 'scat1d' 

presorted: set to 'TRUE' to skip the sorting used to determine the
          order l<d<u. This is useful if an existing meaningful local
          order would be destroyed by sorting, as in
          'sin(pi*sort(round(runif(1000,0,10),1)))'. 

   group: an optional stratification variable, which is converted to a
          'factor' vector if it is not one already 

   which: set 'which="continuous"' to only plot continuous variables,
          or 'which="categorical"' to only plot categorical, character,
          or discrete numeric ones.  By default, all types of variables
          are depicted. 

method.cat: set 'method.cat="freq"' to depict frequencies of
          categorical variables with digits representing the cell
          frequencies, with size proportional to the square root of the
          frequency.  By default, vertical bars are used. 

col.group: colors representing the 'group' strata.  The vector of
          colors is recycled to be the same length as the levels of
          'group'. 

n.unique: number of unique values a numeric variable must have before
          it is considered to be a continuous variable 

 show.na: set to 'FALSE' to suppress drawing the number of 'NA's to the
          right of each axis 

   naxes: number of axes to draw on each page before starting a new
          plot.  You can set 'naxes' larger than the number of
          variables in the data frame if you want to compress the plot
          vertically. 

       q: a vector of quantiles to display.  By default, quantiles are
          not shown. 

cex.axis: character size for labels of axis tick marks 

 cex.var: character size for variable names and frequency of 'NA's 

    lmgp: spacing between numeric axis labels and axis (see 'par' for
          'mgp') 

     tck: see 'tck' under 'par' 

  ranges: a list containing ranges for some or all of the numeric
          variables.  If 'ranges' is not given or if a certain variable
          is not found in the list, the empirical range, modified by
          'pretty', is used.  Example: 'ranges=list(age=c(10,100),
          pressure=c(50,150))'. 

  labels: a vector of labels to use in labeling the axes for
          'datadensity.data.frame'.  Default is to use the names of the
          variables in the input data frame.  Note: margin widths
          computed for setting aside names of variables use the names,
          and not these labels. 

    minf: For 'histSpike', if 'minf' is specified, low bin frequencies
          are raised to a minimum of 'minf' times the maximum bin
          frequency, so that rare data points will remain visible.  A
          good choice of 'minf' is 0.075.  'datadensity.data.frame'
          passes 'minf=0.075' to 'scat1d' to pass to 'histSpike'.  Note
          that specifying 'minf' will cause the shape of the histogram
          to be distorted somewhat. 

mult.width: multiplier for the smoothing window width computed by
          'histSpike' when 'type="density"' 

    xlim: a 2-vector specifying the outer limits of 'x' for binning
          (and plotting, if 'add=FALSE' and 'nint' is a number) 

    ylim: 'y'-axis range for plotting (if 'add=FALSE') 

    xlab: 'x'-axis label ('add=FALSE'); default is name of input
          argument 'x' 

    ylab: 'y'-axis label ('add=FALSE') 

     add: set to 'TRUE' to add the spike-histogram to an existing plot,
          to show marginal data densities 
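
     The default for 'tfrac' described above amounts to the following
     rule.  'default_tfrac' is a hypothetical helper written for
     illustration; it is not exported by Hmisc.

```r
# Hypothetical helper: default fraction of each tick mark to draw.
default_tfrac <- function(x) {
  n <- sum(!is.na(x))               # number of non-missing observations
  if (n < 125) 1 else max(0.1, 125 / n)
}

default_tfrac(rnorm(50))    # 1
default_tfrac(rnorm(250))   # 0.5
default_tfrac(rnorm(1500))  # 0.1
```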

_D_e_t_a_i_l_s:

     For 'scat1d' the length of line segments used is
     'frac*min(par()$pin) / par()$uin[opp]' data units, where 'opp' is
     the index of the opposite axis and 'frac' defaults to .02. 
     Assumes that 'plot' has already been called.  Current 'par("usr")'
     is used to determine the range of data for the axis of the current
     plot.  This range is used in jittering and in constructing line
     segments.
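
     The segment-length formula can be worked through numerically.
     The sketch below rests on two assumptions: that 'uin' means
     inches per data unit (plot size in inches divided by the axis
     span in 'usr'), and that 'opp' is 2 for 'side=1' or 'side=3' and
     1 otherwise.  'tick_length' is a hypothetical helper.

```r
# Hypothetical helper: tick length in data units for given plot geometry.
#   pin: plot width and height in inches, as in par("pin")
#   usr: c(x1, x2, y1, y2) axis limits, as in par("usr")
tick_length <- function(frac, pin, usr, side) {
  # Assumed meaning of uin: inches per data unit on the x and y axes
  uin <- c(pin[1] / (usr[2] - usr[1]), pin[2] / (usr[4] - usr[3]))
  opp <- if (side %in% c(1, 3)) 2 else 1   # ticks on top/bottom run along y
  frac * min(pin) / uin[opp]
}

# A 5 x 4 inch plot region with x in [0, 10] and y in [0, 100]
tick_length(0.02, pin = c(5, 4), usr = c(0, 10, 0, 100), side = 3)  # 2
tick_length(0.02, pin = c(5, 4), usr = c(0, 10, 0, 100), side = 2)  # 0.16
```

     In an active plot, 'par("pin")' and 'par("usr")' supply the
     geometry.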

_V_a_l_u_e:

     'histSpike' returns the actual range of 'x' used in its binning

_S_i_d_e _E_f_f_e_c_t_s:

     'scat1d' adds line segments to plot.  'datadensity.data.frame'
     draws a complete plot.  'histSpike' draws a complete plot or adds
     to an existing plot.

_A_u_t_h_o_r(_s):

     Frank Harrell 
      Department of Biostatistics 
      Vanderbilt University 
      Nashville TN, USA 
      f.harrell@vanderbilt.edu

     Martin Maechler (improved 'scat1d') 
      Seminar fuer Statistik 
      ETH Zurich SWITZERLAND 
      maechler@stat.math.ethz.ch

     Jens Oehlschlaegel-Akiyoshi (wrote 'jitter2') 
      Center for Psychotherapy Research 
      Christian-Belser-Strasse 79a 
      D-70597 Stuttgart Germany 
      oehl@psyres-stuttgart.de

_S_e_e _A_l_s_o:

     'segments', 'jitter', 'rug', 'plsmo', 'stripplot',
     'hist.data.frame', 'ecdf', 'hist', 'histogram', 'table', 'density'

_E_x_a_m_p_l_e_s:

     plot(x <- rnorm(50), y <- 3*x + rnorm(50)/2 )
     scat1d(x)                 # density bars on top of graph
     scat1d(y, 4)              # density bars at right
     histSpike(x, add=TRUE)       # histogram instead, 100 bins
     histSpike(y, 4, add=TRUE)
     histSpike(x, type='density', add=TRUE)  # smooth density at bottom
     histSpike(y, 4, type='density', add=TRUE)

     smooth <- lowess(x, y)    # add nonparametric regression curve
     lines(smooth)             # Note: plsmo() does this
     scat1d(x, y=approx(smooth, xout=x)$y) # data density on curve
     scat1d(x, curve=smooth)   # same effect as previous command
     histSpike(x, curve=smooth, add=TRUE) # same as previous but with histogram
     histSpike(x, curve=smooth, type='density', add=TRUE)  
     # same but smooth density over curve

     plot(x <- rnorm(250), y <- 3*x + rnorm(250)/2)
     scat1d(x, tfrac=0)        # dots randomly spaced from axis
     scat1d(y, 4, frac=-.03)   # bars outside axis
     scat1d(y, 2, tfrac=.2)    # same bars with smaller random fraction

     x <- c(0:3,rep(4,3),5,rep(7,10),9)
     plot(x, jitter2(x))       # original versus jittered values
     abline(0,1)               # unique values unjittered on abline
     points(x+0.1, jitter2(x, limit=FALSE), col=2)
                               # allow locally maximum jittering
     points(x+0.2, jitter2(x, fill=1), col=3); abline(h=seq(0.5,9,1), lty=2)
                               # fill 3/3 instead of 1/3
     x <- rnorm(200,0,2)+1; y <- x^2
     x2 <- round((x+rnorm(200))/2)*2
     x3 <- round((x+rnorm(200))/4)*4
     dfram <- data.frame(y,x,x2,x3)
     plot(dfram$x2, dfram$y)   # jitter2 via scat1d
     scat1d(dfram$x2, y=dfram$y, preserve=TRUE, col=2)
     scat1d(dfram$x2, preserve=TRUE, frac=-0.02, col=2)
     scat1d(dfram$y, 4, preserve=TRUE, frac=-0.02, col=2)

     pairs(jitter2(dfram))     # pairs for jittered data.frame
     # This gets reasonable pairwise scatter plots for all combinations of
     # variables where
     #
     # - continuous variables (with unique values) are not jittered at all, thus
     #   all relations between continuous variables are shown as they are,
     #   extreme values have exact positions.
     #
     # - discrete variables get a reasonable amount of jittering, whether they
     #   have 2, 3, 5, 10, 20 ... levels
     #
     # - different from adding noise, jitter2() will use the available space
     #   optimally and no value will randomly mask another
     #
     # If you want a scatterplot with lowess smooths on the *exact* values and
     # the point clouds shown jittered, you just need
     #
     pairs( dfram ,panel=function(x,y) { points(jitter2(x),jitter2(y))
                                         lines(lowess(x,y)) } )



     datadensity(dfram)     # graphical snapshot of entire data frame
     datadensity(dfram, group=cut2(dfram$x2,g=3))
                               # stratify points and frequencies by
                               # x2 tertiles and use 3 colors

     # datadensity.data.frame(split(x, grouping.variable))
     # need to explicitly invoke datadensity.data.frame when the
     # first argument is a list

