fixreg                  package:fpc                  R Documentation

_L_i_n_e_a_r _R_e_g_r_e_s_s_i_o_n _F_i_x_e_d _P_o_i_n_t _C_l_u_s_t_e_r_s

_D_e_s_c_r_i_p_t_i_o_n:

     Computes linear regression fixed point clusters (FPCs), i.e.,
     subsets of the data that consist exactly of the non-outliers
     w.r.t. themselves and may be interpreted as generated from a
     homogeneous linear regression relation between independent and
     dependent variable.  FPCs may overlap, are not necessarily
     exhaustive, and do not require the number of clusters to be
     specified.

     Note that while 'fixreg' has many parameters, usually only one
     (or a few) of them need to be specified, cf. the examples. The
     philosophy is to allow much flexibility, but to always provide
     sensible defaults.

_U_s_a_g_e:

     fixreg(indep=rep(1,n), dep, n=length(dep),
                         p=ncol(as.matrix(indep)),
                         ca=NA, mnc=NA, mtf=3, ir=NA, irnc=NA,
                         irprob=0.95, mncprob=0.5, maxir=20000, maxit=5*n,
                         distcut=0.85, init.group=list(), 
                         ind.storage=FALSE, countmode=100, 
                         plot=FALSE)

     ## S3 method for class 'rfpc':
     summary(object, ...)

     ## S3 method for class 'summary.rfpc':
     print(x, maxnc=30, ...)

     ## S3 method for class 'rfpc':
     plot(x, indep=rep(1,n), dep, no, bw=TRUE,
                           main=c("Representative FPC No. ",no),
                           xlab="Linear combination of independents",
                           ylab=deparse(substitute(indep)),
                           xlim=NULL, ylim=range(dep), 
                           pch=NULL, col=NULL,...)

     ## S3 method for class 'rfpc':
     fpclusters(object, indep=NA, dep=NA, ca=object$ca, ...)

     rfpi(indep, dep, p, gv, ca, maxit, plot) 

_A_r_g_u_m_e_n_t_s:

   indep: numerical matrix or vector. Independent variables. Leave out
          for clustering one-dimensional data. 'fpclusters.rfpc' does
          not need specification of 'indep' if 'fixreg' was run with
          'ind.storage=TRUE'.

     dep: numerical vector. Dependent variable. 'fpclusters.rfpc' does
          not need specification of 'dep' if 'fixreg' was run with
          'ind.storage=TRUE'.

       n: optional positive integer. Number of cases.

       p: optional positive integer. Number of independent variables.

      ca: optional positive number. Tuning constant, specifying
          required cluster separation. By default determined
          automatically as a function of 'n' and 'p', see function
          'can', Hennig (2002a).

     mnc: optional positive integer. Minimum size of clusters to be
          reported. By default determined automatically as a function
          of 'mncprob'. See Hennig (2002a).

     mtf: optional positive integer. FPCs must be found at least 'mtf'
          times to be reported by 'summary.rfpc'.

      ir: optional positive integer. Number of algorithm runs. By
          default determined automatically as a function of 'n', 'p',
          'irnc', 'irprob', 'mtf', 'maxir'. See function 'itnumber' and
          Hennig (2002a).

    irnc: optional positive integer. Size of the smallest cluster to be
          found with approximated probability 'irprob'.

  irprob: optional value between 0 and 1. Approximated probability for
          a cluster of size 'irnc' to be found.

 mncprob: optional value between 0 and 1. Approximated probability for
          a cluster of size 'mnc' to be found.

   maxir: optional integer. Maximum number of algorithm runs.

   maxit: optional integer. Maximum number of iterations per algorithm
          run (usually an FPC is found much earlier).

 distcut: optional value between 0 and 1. Similarity cutoff. A
          similarity measure between FPCs, given in Hennig (2002a), is
          computed, and FPCs with similarity larger than 'distcut' are
          joined into Single Linkage groups. A single representative
          FPC is selected for each group.

init.group: optional list of logical vectors of length 'n'. Every
          vector indicates a starting configuration for the fixed point
          algorithm. This can be used for datasets with high dimension,
          where the vectors of 'init.group' indicate cluster candidates
          found by graphical inspection or background knowledge.

ind.storage: optional logical. If 'TRUE', then all indicator vectors of
          found FPCs are given in the value of 'fixreg'. May need lots
          of memory, but is a bit faster.

countmode: optional positive integer. 'fixreg' shows a message after
          every 'countmode' algorithm runs.

    plot: optional logical. If 'TRUE', a scatterplot of the first
          independent variable vs. the dependent variable is shown at
          each iteration.

  object: object of class 'rfpc', output of 'fixreg'.

       x: object of class 'rfpc' (output of 'fixreg') for 'plot.rfpc';
          object of class 'summary.rfpc' for the 'print' method.

   maxnc: positive integer. Maximum number of FPCs to be reported.

      no: positive integer. Number of the representative FPC to be
          plotted.

      bw: optional logical. If 'TRUE', the plot is black/white and the
          FPC is indicated by a different symbol. Otherwise the FPC is
          shown in red.

    main: plot title.

    xlab: label for x-axis.

    ylab: label for y-axis.

    xlim: plotted range of x-axis. If 'NULL', the range of the plotted
          linear combination of independent variables is used.

    ylim: plotted range of y-axis.

     pch: plotting symbol, see 'par'. If 'NULL', the default is used.

     col: plotting color, see 'par'. If 'NULL', the default is used.

      gv: logical vector of length 'n'. Indicates the initial
          configuration for the fixed point algorithm.

     ...: additional parameters to be passed to 'plot' (no effects
          elsewhere).

_D_e_t_a_i_l_s:

     A linear regression FPC is a data subset that reproduces itself
     under the following operation: 
      Compute the linear regression and the error variance estimator
     for the data subset, and determine all points of the dataset for
     which the squared residual is smaller than 'ca' times the error
     variance.
      Fixed points of this operation can be considered as clusters,
     because they contain only non-outliers (as defined by the
     procedure above) and all other points are outliers w.r.t. the
     subset. 
      'fixreg' performs 'ir' fixed point algorithms started from random
     subsets of size 'p+2' to look for FPCs. Additionally an algorithm
     is started from the whole dataset, and algorithms are started from
     the subsets specified in 'init.group'. 
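
     The operation and a single fixed point algorithm described above
     can be sketched in base R as follows. This is only an
     illustration under stated assumptions, not the code of 'rfpi':
     the helper names 'fpc_step' and 'fpc_iterate' are hypothetical,
     a model with intercept is assumed, and the error variance
     estimator (residual sum of squares divided by subset size minus
     number of coefficients) is an assumption as well. Collinear
     subsets are not handled.

```r
# Hypothetical helpers for illustration; these are not functions of package fpc.

# One application of the operation: fit on subset g, then flag every point of
# the full dataset whose squared residual is below ca times the error variance.
fpc_step <- function(x, y, g, ca) {
  X <- cbind(1, as.matrix(x))                      # assumed: model with intercept
  fit <- lm.fit(X[g, , drop = FALSE], y[g])        # least squares on the subset
  s2 <- sum(fit$residuals^2) / (sum(g) - ncol(X))  # assumed variance estimator
  res <- drop(y - X %*% fit$coefficients)          # residuals of all points
  res^2 < ca * s2
}

# Repeat the operation until the subset reproduces itself (a fixed point),
# or stop early if the subset becomes too small to estimate a variance.
fpc_iterate <- function(x, y, g, ca, maxit = 100) {
  p <- ncol(as.matrix(x))
  for (i in seq_len(maxit)) {
    g_new <- fpc_step(x, y, g, ca)
    if (identical(g_new, g) || sum(g_new) < p + 2) break
    g <- g_new
  }
  g
}

# Artificial data: points close to one line plus two shifted points.
x <- 1:20
y <- 2 * x + rep(c(0.1, -0.1), 10)
y[c(5, 15)] <- y[c(5, 15)] + 30

g0 <- x <= 4                          # small starting subset
fp <- fpc_iterate(x, y, g0, ca = 10)
which(fp)                             # members of the subset that was reached
identical(fpc_step(x, y, fp, ca = 10), fp)  # TRUE if fp is a fixed point
```

     'fixreg' repeats such runs from many random starting subsets and
     then post-processes the FPCs that were found.
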
      Usually some of the FPCs are unstable, and more than one FPC may
     correspond to the same significant pattern in the data. Therefore
     the number of FPCs is reduced: FPCs with fewer than 'mnc' points
     are ignored. Then a similarity matrix is computed between the
     remaining FPCs. The similarity between two sets is defined as
     twice the size of their intersection divided by the sum of the
     sizes of both sets. The Single Linkage clusters (groups) of level
     'distcut' are
     computed, i.e. the connectivity components of the graph where
     edges are drawn between FPCs with similarity larger than
     'distcut'. Groups of FPCs whose members are found 'mtf' times or
     more are considered as stable enough. A representative FPC is
     chosen for every Single Linkage cluster of FPCs according to the
     maximum expectation ratio 'ser'. 'ser' is the ratio between the
     number of findings of an FPC and the estimated expected number of
     findings of an FPC of this size (the latter is computed by
     'clusexpect'); it is called the _expectation ratio_.
      Usually only the representative FPCs of stable groups are of
     interest. 
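
     The similarity and the grouping at level 'distcut' can be
     illustrated in base R. This is a sketch, not the internal code of
     'fixreg': the function name 'fpc_similarity' and the three
     indicator vectors are made up for the example, and
     'hclust'/'cutree' are used as a stand-in for the connectivity
     components (they treat a similarity exactly equal to 'distcut'
     as an edge, unlike the strict definition above).

```r
# Hypothetical helper: similarity of two FPCs given as logical indicator vectors.
fpc_similarity <- function(a, b) 2 * sum(a & b) / (sum(a) + sum(b))

# Three made-up indicator vectors over 10 points.
a <- rep(c(TRUE, FALSE), c(7, 3))   # points 1..7
b <- rep(c(TRUE, FALSE), c(6, 4))   # points 1..6
d <- rep(c(FALSE, TRUE), c(7, 3))   # points 8..10
sets <- list(a, b, d)

# Pairwise similarity matrix.
S <- sapply(sets, function(u) sapply(sets, function(v) fpc_similarity(u, v)))

# Single Linkage groups: cut the tree built on dissimilarity 1 - S
# at height 1 - distcut.
distcut <- 0.85
groups <- cutree(hclust(as.dist(1 - S), method = "single"), h = 1 - distcut)
groups   # group label for each of the three sets
```

     Here 'a' and 'b' share 6 points, so their similarity is
     2*6/(7+6), which exceeds 0.85, while 'd' is disjoint from both.
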
      The choice of the involved tuning constants such as 'ca' and 'ir'
     is discussed in detail in Hennig (2002a). Statistical theory is
     presented in Hennig (2003).
      Generally, the default settings are recommended for 'fixreg'. In
     cases where they lead to too many algorithm runs (e.g., always
     for 'p>4'), using 'init.group' together with 'mtf=1' and 'ir=0'
     is useful. Occasionally, 'irnc' may be chosen smaller than the
     default if smaller clusters are of interest, but this may lead
     to too many clusters and too many algorithm runs. Decreasing
     'ca' will often lead to too many clusters, even for homogeneous
     data. Increasing 'ca' will produce only very strongly separated
     clusters. Both may be of interest occasionally.

     'rfpi' is called by 'fixreg' for a single fixed point algorithm
     and will usually not be executed alone.

     'summary.rfpc' summarizes the representative FPCs of stable
     groups.

     'plot.rfpc' is a plot method for the representative FPC of stable
     group no. 'no'. It produces a scatterplot of the linear
     combination of independent variables determined by the regression
     coefficients of the FPC vs. the dependent variable. The regression
     line and the region of non-outliers determined by 'ca' are plotted
     as well.
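
     Given the definition in the first paragraph of this section, the
     region of non-outliers is presumably the band of points whose
     distance from the regression line is below the square root of
     'ca' times the error variance. The following sketch computes that
     band for made-up numbers; the coefficient values, the variance,
     and the (intercept, slope) ordering are assumptions for
     illustration, not output of 'fixreg'.

```r
# Hypothetical values standing in for one FPC's coefficients and error
# variance; the (intercept, slope) ordering is an assumption.
coefs <- c(0.1, 1.96)
s2    <- 0.016
ca    <- 10

# Artificial data close to the line y = 2 * x.
x <- 1:20
y <- 2 * x + rep(c(0.1, -0.1), 10)

fitted <- coefs[1] + coefs[2] * x
half   <- sqrt(ca * s2)             # half-width of the band around the line
inside <- abs(y - fitted) < half    # equivalent to (y - fitted)^2 < ca * s2

# Band limits that could be drawn next to the regression line.
head(cbind(lower = fitted - half, upper = fitted + half, inside = inside))
```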

     'fpclusters.rfpc' produces a list of indicator vectors for the
     representative FPCs of stable groups.

_V_a_l_u_e:

     'fixreg' returns an object of class 'rfpc'. This is a list
     containing the components 'nc, g, coefs, vars, nfound, er, tsc,
     ncoll, grto, imatrix, smatrix, stn, stfound, sfpc, ssig, sto,
     struc, n, p, ca, ir, mnc, mtf, distcut'.

     'summary.rfpc' returns an object of class 'summary.rfpc'. This is
     a list containing the components 'coefs, vars, stfound, stn, sn,
     ser, tsc, sim, ca, ir, mnc, mtf'.

     'fpclusters.rfpc' returns a list of indicator vectors for the
     representative FPCs of stable groups.

     'rfpi' returns a list with the components 'coef, var, g, coll,
     ca'.

      nc: integer. Number of FPCs.

       g: list of logical vectors. Indicator vectors of FPCs. 'FALSE'
          if 'ind.storage=FALSE'.

   coefs: list of numerical vectors. Regression coefficients of FPCs.
          In 'summary.rfpc', only for representative FPCs of stable
          groups and sorted according to 'stfound'.

    vars: list of numbers. Error variances of FPCs. In 'summary.rfpc',
          only for representative FPCs of stable groups and sorted
          according to 'stfound'.

  nfound: vector of integers. Number of findings for the FPCs.

      er: numerical vector. Expectation ratios of FPCs. Can be taken as
          a stability measure.

     tsc: integer. Number of algorithm runs leading to FPCs that are
          too small or found too rarely.

   ncoll: integer. Number of algorithm runs where collinear regressor
          matrices occurred.

    grto: vector of integers. Numbers of the FPCs reached by the
          algorithm runs that were started from 'init.group'.

 imatrix: vector of integers. Size of intersection between FPCs. See
          'sseg'.

 smatrix: numerical vector. Similarities between FPCs. See 'sseg'.

     stn: integer. Number of representative FPCs of stable groups. In
          'summary.rfpc' sorted according to 'stfound'.

 stfound: vector of integers. Number of findings of members of all
          groups of FPCs. In 'summary.rfpc' sorted according to
          'stfound'.

    sfpc: vector of integers. Numbers of representative FPCs.

    ssig: vector of integers. As 'sfpc', but only for stable groups.

     sto: vector of integers. Numbers of the representative FPCs of the
          groups of FPCs, ordered from the most often found group
          downwards.

   struc: vector of integers. Number of group an FPC belongs to.

       n: see arguments.

       p: see arguments.

      ca: see arguments.

      ir: see arguments.

     mnc: see arguments.

     mtf: see arguments.

 distcut: see arguments.

      sn: vector of integers. Number of points of representative FPCs.

     ser: numerical vector. Expectation ratio for stable groups.

     sim: vector of integers. Size of intersections between
          representative FPCs of stable groups. See 'sseg'.

    coef: vector of regression coefficients.

     var: error variance.

       g: logical indicator vector of iterated FPC.

    coll: logical. 'TRUE' means that singular covariance matrices
          occurred during the iterations.

_A_u_t_h_o_r(_s):

     Christian Hennig chrish@stats.ucl.ac.uk <URL:
     http://www.homepages.ucl.ac.uk/~ucakche/>

_R_e_f_e_r_e_n_c_e_s:

     Hennig, C. (2002a) Fixed point clusters for linear regression:
     computation and comparison, _Journal of Classification_ 19,
     249-276.

     Hennig, C. (2003) Clusters, outliers and regression: fixed point
     clusters, _Journal of Multivariate Analysis_ 86, 183-212.

_S_e_e _A_l_s_o:

     'fixmahal' for fixed point clusters in the usual setup
     (non-regression).

     'regmix' for clusterwise linear regression by mixture modeling
     ML.

     'can', 'itnumber' for computation of the default settings.  

     'clusexpect' for estimation of the expected number of findings of
     an FPC of given size.

     'itnumber' for the generation of the number of fixed point
     algorithms.

     'minsize' for the smallest FPC size to be found with a given
     probability.

     'sseg' for indexing the similarity/intersection vectors computed
     by 'fixreg'.

_E_x_a_m_p_l_e_s:

     set.seed(190000)
     data(tonedata)
     # Note: If you do not use the installed package, replace this by
     # tonedata <- read.table("(path/)tonedata.txt", header=TRUE)
     attach(tonedata)
     tonefix <- fixreg(stretchratio,tuned,mtf=1,ir=20)
     summary(tonefix)
     # This is designed to have a fast example; default setting would be better.
     # If you want to see more (and you have a bit more time),
     # try out the following:
     # set.seed(1000)
     # tonefix <- fixreg(stretchratio,tuned)
     ## Default - good for these data
     # summary(tonefix)
     # plot(tonefix,stretchratio,tuned,1)
     # plot(tonefix,stretchratio,tuned,2)
     # plot(tonefix,stretchratio,tuned,3,bw=FALSE,pch=5) 
     # toneclus <- fpclusters(tonefix,stretchratio,tuned)
     # plot(stretchratio,tuned,col=1+toneclus[[2]])
     # tonefix2 <- fixreg(stretchratio,tuned,distcut=1,mtf=1,countmode=50)
     ## Every found fixed point cluster is reported,
     ## no matter how unstable it may be.
     # summary(tonefix2)
     # tonefix3 <- fixreg(stretchratio,tuned,ca=7)
     ## ca defaults to 10.07 for these data.
     # summary(tonefix3)
     # subset <- c(rep(FALSE,5),rep(TRUE,24),rep(FALSE,121))
     # tonefix4 <- fixreg(stretchratio,tuned,
     #                    mtf=1,ir=0,init.group=list(subset))
     # summary(tonefix4)

