ecdf                  package:Hmisc                  R Documentation

_E_m_p_i_r_i_c_a_l _C_u_m_u_l_a_t_i_v_e _D_i_s_t_r_i_b_u_t_i_o_n _P_l_o_t

_D_e_s_c_r_i_p_t_i_o_n:

     Computes coordinates of the cumulative distribution function of x,
     and by default plots it as a step function.  A grouping variable may
     be specified so that stratified estimates are computed and (by
     default) plotted.  If there is more than one group, the 'labcurve'
     function is used (by default) to label the multiple step functions
     or to draw a legend defining line types, colors, or symbols by
     linking them with group labels.  A 'weights' vector may be
     specified to get weighted estimates.  Specify 'normwt=TRUE' to make
     'weights' sum to the length of 'x' (after removing NAs).
     Otherwise the total sample size is taken to be the sum of the
     weights.

     'ecdf' is a generic function; 'ecdf.default' is the method called
     for a vector argument.  'ecdf.data.frame' is called when the first
     argument is a data frame.  This function can automatically set up
     a matrix of ECDFs and wait for a mouse click if the matrix
     requires more than one page.  Categorical variables, character
     variables, and variables having fewer than a set number of unique
     values are ignored. If 'par(mfrow=..)' is not set up before
     'ecdf.data.frame' is called, the function will try to figure the
     best layout depending on the number of variables in the data
     frame.  Upon return the original 'mfrow' is left intact.

     When the first argument to 'ecdf' is a formula, a Trellis/Lattice
     function 'ecdf.formula' is called.  This allows for multi-panel
     conditioning, superposition using a 'groups' variable, and other
     Trellis features, along with the ability to easily plot
     transformed ECDFs using the 'fun' argument.  For example, if
     'fun=qnorm', the inverse normal transformation will be used for
     the y-axis.  If the transformed curves are linear this indicates
     normality.  Like the 'xYplot' function, 'ecdf' will create a
     function 'Key' if the 'groups' variable is used.  This function
     can be invoked by the user to define the keys for the groups.

_U_s_a_g_e:

     ecdf(x, ...)

     ## Default S3 method:
     ecdf(x, what=c('F','1-F','f'), weights, normwt=FALSE,
          xlab, ylab, q, pl=TRUE, add=FALSE, lty=1, 
          col=1, group=rep(1,length(x)), label.curves=TRUE, xlim, 
          subtitles=TRUE, datadensity=c('none','rug','hist','density'),
          side=1, 
          frac=switch(datadensity,none=NA,rug=.03,hist=.1,density=.1),
          dens.opts=NULL, lwd, ...)

     ## S3 method for class 'data.frame':
     ecdf(x, group=rep(1,nrows), weights, normwt,
          label.curves=TRUE, n.unique=10, na.big=FALSE, subtitles=TRUE, 
          vnames=c('labels','names'),...)

     ## S3 method for class 'formula':
     ecdf(x, data, groups, prepanel=prepanel.ecdf,
          panel=panel.ecdf, ..., xlab, ylab, fun=function(x)x, subset=TRUE)

_A_r_g_u_m_e_n_t_s:

       x: a numeric vector, data frame, or Trellis/Lattice formula

    what: The default is '"F"' which results in plotting the fraction
          of values <= x.  Set to '"1-F"' to plot the fraction > x or
          '"f"' to plot the cumulative frequency of values <= x. 

 weights: numeric vector of weights.  Omit or specify a zero-length
          vector or NULL to get unweighted estimates. 

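  normwt: set to 'TRUE' to normalize 'weights' so that they sum to the
          length of 'x' after removing NAs (see Description). 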

    xlab: x-axis label.  Default is label(x) or name of calling
          argument.  For 'ecdf.formula', 'xlab' defaults to the 'label'
          attribute of the x-axis variable. 

    ylab: y-axis label.  Default is '"Proportion <= x"', '"Proportion >
          x"', or '"Frequency <= x"' depending on the value of 'what'. 

       q: a vector for quantiles for which to draw reference lines on
          the plot. Default is not to draw any. 

      pl: set to 'FALSE' to omit the plot and just return the estimates. 

     add: set to 'TRUE' to add the cdf to an existing plot. 

     lty: integer line type for plot.  If 'group' is specified, this
          can be a vector. 

     lwd: line width for plot.  Can be a vector corresponding to
          'group's. 

     col: color for step function.  Can be a vector. 

   group: a numeric, character, or 'factor' categorical variable used
          for stratifying estimates.  If 'group' is present, as many
          ECDFs are drawn as there are non-missing group levels. 

label.curves: applies if more than one 'group' exists. Default is
          'TRUE' to use 'labcurve' to label curves where they are
          farthest apart.  Set 'label.curves' to a 'list' to specify
          options to 'labcurve', e.g.,
          'label.curves=list(method="arrow", cex=.8)'. These option
          names may be abbreviated in the usual way arguments are
          abbreviated.  Use for example 'label.curves=list(keys=1:5)'
          to draw symbols periodically (as in 'pch=1:5' - see 'points')
          on the curves and automatically position a legend in the most
          empty part of the plot.  Set 'label.curves=FALSE' to suppress
          drawing curve labels.  The 'col', 'lty', and 'type'
          parameters are automatically passed to 'labcurve', although
          you can override them here.  You can set
          'label.curves=list(keys="lines")' to have different line
          types defined in an automatically positioned key. 

    xlim: x-axis limits.  Default is entire range of 'x'. 

subtitles: set to 'FALSE' to suppress putting a subtitle at the bottom
          left of each plot.  The subtitle indicates the numbers of
          non-missing and missing observations, which are labeled 'n',
          'm'. 

datadensity: If 'datadensity' is not '"none"', either 'scat1d' or
          'histSpike' is called to add a rug plot
          ('datadensity="rug"'), spike histogram
          ('datadensity="hist"'), or smooth density estimate
          ('"density"') to the bottom or top of the ECDF. 

    side: If 'datadensity' is not '"none"', the default is to place the
          additional information on top of the x-axis ('side=1').  Use
          'side=3' to place at the top of the graph. 

    frac: passed to 'histSpike' 

dens.opts: a list of optional arguments for 'histSpike' 

     ...: other parameters passed to 'plot' if 'add=FALSE'.  For data frames,
          other parameters to pass to 'ecdf.default'. For
          'ecdf.formula', if 'groups' is not used, you can also add
          data density information to each panel's ECDF by specifying
          the 'datadensity' and optional 'frac', 'side', 'dens.opts'
          arguments.  

n.unique: minimum number of unique values before an ECDF is drawn for a
          variable in a data frame.  Default is 10. 

  na.big: set to 'TRUE' to draw the number of NAs in larger letters in
          the middle of the plot for 'ecdf.data.frame' 

  vnames: By default, variable labels are used to label x-axes.  Set
          'vnames="names"' to instead use variable names. 

  method: method for computing the empirical cumulative distribution. 
          See 'wtd.ecdf'.  The default is to use the standard '"i/n"'
          method as is used by the non-Trellis versions of 'ecdf'. 

     fun: a function to transform the cumulative proportions, for the
          Trellis-type usage of 'ecdf' 

    data: 

  groups: 

  subset: 

prepanel: 

   panel: the usual Trellis/Lattice parameters, with 'groups' causing
          'ecdf.formula' to overlay multiple ECDFs on one panel.

_V_a_l_u_e:

     For 'ecdf.default', an invisible list with elements 'x' and 'y' giving
     the coordinates of the cdf.  If there is more than one 'group', a
     list of such lists is returned.  An attribute, 'N', is in the
     returned object.  It contains the elements 'n' and 'm', the number
     of non-missing and missing observations, respectively.

_S_i_d_e _E_f_f_e_c_t_s:

     plots

_A_u_t_h_o_r(_s):

     Frank Harrell 
      Department of Biostatistics, Vanderbilt University 
      f.harrell@vanderbilt.edu

_S_e_e _A_l_s_o:

     'wtd.ecdf', 'label', 'table', 'cumsum', 'labcurve', 'xYplot',
     'histSpike'

_E_x_a_m_p_l_e_s:

     set.seed(1)
     ch <- rnorm(1000, 200, 40)
     ecdf(ch, xlab="Serum Cholesterol")
     scat1d(ch)                       # add rug plot
     histSpike(ch, add=TRUE, frac=.15)   # add spike histogram
     # Better: add a data density display automatically:
     ecdf(ch, datadensity='density')

     label(ch) <- "Serum Cholesterol"
     ecdf(ch)
     other.ch <- rnorm(500, 220, 20)
     ecdf(other.ch,add=TRUE,lty=2)

     sex <- factor(sample(c('female','male'), 1000, TRUE))
     ecdf(ch, q=c(.25,.5,.75))  # show quartiles
     ecdf(ch, group=sex,
          label.curves=list(method='arrow'))

     # Example showing how to draw multiple ECDFs from paired data
     pre.test <- rnorm(100,50,10)
     post.test <- rnorm(100,55,10)
     x <- c(pre.test, post.test)
     g <- c(rep('Pre',length(pre.test)),rep('Post',length(post.test)))
     ecdf(x, group=g, xlab='Test Results', label.curves=list(keys=1:2))
     # keys=1:2 causes symbols to be drawn periodically on top of curves

     # Draw a matrix of ECDFs for a data frame
     m <- data.frame(pre.test, post.test, 
                     sex=sample(c('male','female'),100,TRUE))
     ecdf(m, group=m$sex, datadensity='rug')

     freqs <- sample(1:10, 1000, TRUE)
     ecdf(ch, weights=freqs)  # weighted estimates
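
     # Sketch only (base R, not the Hmisc implementation): how 'weights' and
     # 'normwt' set the cumulative scale.  'wtd_cum', 'x0', and 'w0' are
     # hypothetical names introduced for illustration; dropping NAs in 'x'
     # (with their weights) before normalizing is an assumption based on
     # the Description.
     wtd_cum <- function(x, weights, normwt = FALSE) {
       keep <- !is.na(x)
       x <- x[keep]
       w <- weights[keep]
       if (normwt) w <- w * length(x) / sum(w)  # weights now sum to length(x)
       tot <- as.vector(tapply(w, x, sum))      # total weight per distinct x
       list(x = sort(unique(x)),
            f = cumsum(tot),                    # cumulative frequency
            F = cumsum(tot) / sum(w))           # cumulative proportion
     }
     x0 <- c(1, 2, 2, 3, NA)
     w0 <- c(1, 2, 1, 4, 5)
     r1 <- wtd_cum(x0, w0)                 # sample size taken as sum of weights
     r2 <- wtd_cum(x0, w0, normwt = TRUE)  # sample size taken as length of x
     all.equal(r1$F, r2$F)   # proportions agree under either choice
     rbind(r1$f, r2$f)       # only the cumulative frequency scale differs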

     # Trellis/Lattice examples:

     region <- factor(sample(c('Europe','USA','Australia'),1000,TRUE))
     year <- factor(sample(2001:2002,1000,TRUE))
     ecdf(~ch | region*year, groups=sex)
     Key()           # draw a key for sex at the default location
     # Key(locator(1)) # user-specified positioning of key
     age <- rnorm(1000, 50, 10)
     ecdf(~ch | equal.count(age), groups=sex)  # use overlapping shingles
     ecdf(~ch | sex, datadensity='hist', side=3)  # add spike histogram at top
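
     # Sketch only (base R, not the Hmisc implementation) of the idea behind
     # 'fun=qnorm': transform the cumulative proportions with the inverse
     # normal and look for a straight line.  The names 'z', 'p', and 'ok'
     # are introduced here for illustration.  Per the Description, the
     # corresponding Trellis-type call would pass 'fun=qnorm' to 'ecdf'.
     set.seed(2)
     z <- sort(rnorm(200, 200, 40))
     p <- seq_along(z) / length(z)   # "i/n" cumulative proportions
     ok <- p < 1                     # qnorm(1) is Inf, so drop the last point
     plot(z[ok], qnorm(p[ok]), type = 's',
          xlab = 'z', ylab = 'qnorm(Proportion <= z)')
     cor(z[ok], qnorm(p[ok]))        # near 1 when the data are close to normal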

