fixmahal                 package:fpc                 R Documentation

_M_a_h_a_l_a_n_o_b_i_s _F_i_x_e_d _P_o_i_n_t _C_l_u_s_t_e_r_s

_D_e_s_c_r_i_p_t_i_o_n:

     Computes Mahalanobis fixed point clusters (FPCs), i.e., subsets of
     the data, which consist exactly of the non-outliers w.r.t.
     themselves, and may be interpreted as generated from a homogeneous
     normal population. FPCs may overlap, are not necessarily
     exhausting and do not need a specification of the number of
     clusters.

     Note that while 'fixmahal' has lots of parameters, only one (or
     few) of them have usually to be specified, cf. the examples. The
     philosophy is to allow much flexibility, but to always provide 
     sensible defaults.

_U_s_a_g_e:

     fixmahal(dat, n = nrow(as.matrix(dat)), p = ncol(as.matrix(dat)), 
                           method = "fuzzy", cgen = "fixed",
                           ca = NA, ca2 = NA,
                           calpha = ifelse(method=="fuzzy",0.95,0.99),
                           calpha2 = 0.995,
                           pointit = TRUE, subset = n,
                           nc1 = 100+20*p,
                           startn = 18+p, mnc = floor(startn/2), 
                           mer = ifelse(pointit,0.1,0), 
                           distcut = 0.85, maxit = 5*n, iter = n*1e-5,
                           init.group = list(), 
                           ind.storage = TRUE, countmode = 100, 
                           plot = "none")

     ## S3 method for class 'mfpc':
     summary(object, ...)

     ## S3 method for class 'summary.mfpc':
     print(x, maxnc=30, ...)

     ## S3 method for class 'mfpc':
     plot(x, dat, no, bw=FALSE, main=c("Representative FPC No. ",no),
                         xlab=NULL, ylab=NULL,
                         pch=NULL, col=NULL, ...)

     ## S3 method for class 'mfpc':
     fpclusters(object, dat=NA, ca=object$ca, p=object$p, ...)

     fpmi(dat, n = nrow(as.matrix(dat)), p = ncol(as.matrix(dat)),
                       gv, ca, ca2, method = "ml", plot,
                       maxit = 5*n, iter = n*1e-6) 

_A_r_g_u_m_e_n_t_s:

     dat: something that can be coerced to a numerical matrix or
          vector. Data matrix, rows are points, columns are variables.
          'fpclusters.rfpc' does not need specification of 'dat' if
          'fixmahal' has been run with 'ind.storage=TRUE'.

       n: optional positive integer. Number of cases.

       p: optional positive integer. Number of independent variables.

  method: a string. 'method="classical"' means 0-1 weighting of
          observations by Mahalanobis distances and use of the
          classical normal covariance estimator. 'method="ml"' uses the
          ML-covariance estimator (division by 'n' instead of 'n-1') 
          This is used in Hennig and Christlieb (2002). 'method' can
          also be '"mcd"' or '"mve"', to enforce the use of robust
          centers and covariance matrices, see 'cov.rob'. This is
          experimental, not recommended at the moment, may be very
          slowly and requires library 'lqs'. The default is
          'method="fuzzy"', where weighted means and covariance
          matrices are used (Hennig, 2005). The weights are computed by
          'wfu', i.e., a function that is constant 1 for arguments
          smaller than 'ca', 0 for arguments larger than 'ca2' and
          continuously linear in between. Convergence is only proven
          for 'method="ml"' up to now.

    cgen: optional string. '"fixed"' means that the same tuning
          constant 'ca' is used for all iterations. '"auto"' means that
          'ca' is generated dependently on the size of the current data
          subset in each iteration by 'cmahal'. This is  experimental.

      ca: optional positive number. Tuning constant, specifying
          required cluster separation. By default determined as
          'calpha'-quantile of the chisquared distribution with 'p'
          degrees of freedom.

     ca2: optional positive number. Second tuning constant needed if
          'method="fuzzy"'. By default determined as 'calpha2'-quantile
          of the chisquared distribution with 'p' degrees of freedom.

  calpha: number between 0 and 1. See 'ca'.

 calpha2: number between 0 and 1, larger than 'calpha'. See 'ca2'.

 pointit: optional logical. If 'TRUE', 'subset' fixed point algorithms
          are started from initial configurations, which are built
          around single points of the dataset, cf. 'mahalconf'.
          Otherwise, initial configurations are only specified by
          'init.group'.

  subset: optional positive integer smaller or equal than 'n'. Initial
          configurations for the fixed point algorithm (cf.
          'mahalconf') are built from a subset of 'subset' points from
          the data. No effect if 'pointit=FALSE'. Default: all points.

     nc1: optional positive integer. Tuning constant needed by 'cmahal'
          to generate 'ca' automatically. Only  needed for
          'cgen="auto"'.

  startn: optional positive integer. Size of the initial
          configurations. The default value is chosen to prevent that
          small  meaningless FPCs are found, but it should be decreased
          if  clusters of size smaller than the default value are of
          interest.

     mnc: optional positive integer. Minimum size of clusters to be
          reported.

     mer: optional nonnegative number. FPCs (groups of them,
          respectively, see details) are only reported as stable if the
          ratio  of the number of their findings to their number of
          points exceeds 'mer'. This holds under 'pointit=TRUE' and
          'subset=n'. For 'subset<n', the ratio is adjusted, but for
          small 'subset', the results may extremely vary and have to be
          taken with care.

 distcut: optional value between 0 and 1. A similarity measure between
          FPCs, given in Hennig (2002), and the corresponding Single
          Linkage groups of FPCs with similarity larger than 'distcut'
          are computed. A single representative FPC is selected for
          each group.

   maxit: optional integer. Maximum number of iterations per algorithm
          run (usually an FPC is found much earlier).

    iter: positive number. Algorithm stops when difference between
          subsequent weight vectors is smaller than 'iter'. Only needed
          for 'method="fuzzy"'.

init.group: optional list of logical vectors of length 'n'. Every
          vector indicates a starting configuration for the fixed point
          algorithm. This can be used for datasets with high dimension,
          where the vectors of 'init.group' indicate cluster candidates
          found by graphical inspection or background knowledge, as in
          Hennig and Christlieb (2002).

ind.storage: optional logical. If 'TRUE', then all indicator vectors of
          found FPCs are given in the value of 'fixmahal'. May need
          lots of memory, but is a bit faster.

countmode: optional positive integer. Every 'countmode' algorithm runs
          'fixmahal' shows a message.

    plot: optional string. If '"start"', you get a scatterplot of the
          first two variables to highlight the initial configuration,
          '"iteration"' generates such a plot at each iteration,
          '"both"' does both (this may be very time consuming). The
          default is '"none"'.

  object: object of class 'mfpc', output of 'fixmahal'.

       x: object of class 'mfpc', output of 'fixmahal'.

   maxnc: positive integer. Maximum number of FPCs to be reported.

      no: positive integer. Number of the representative FPC to be
          plotted.

      bw: optional logical. If 'TRUE', plot is black/white, FPC is
          indicated by different symbol. Else FPC is indicated red.

    main: plot title.

    xlab: label for x-axis. If 'NULL', a default text is used.

    ylab: label for y-axis. If 'NULL', a default text is used.

     pch: plotting symbol, see 'par'. If 'NULL', the default is used.

     col: plotting color, see 'par'. If 'NULL', the default is used.

      gv: logical vector (or, with 'method="fuzzy"', vector of weights
          between 0 and 1) of length 'n'. Indicates the initial
          configuration for the fixed point algorithm.

     ...: additional parameters to be passed to 'plot' (no effects
          elsewhere).

_D_e_t_a_i_l_s:

     A (crisp) Mahalanobis FPC is a data subset  that reproduces itself
     under the following operation: 
      Compute mean and covariance matrix estimator for the data subset,
     and compute all points of the dataset for which the squared
     Mahalanobis distance is smaller than 'ca'.
      Fixed points of this operation can be considered as clusters,
     because they contain only non-outliers (as defined by the above
     mentioned procedure) and all other points are outliers w.r.t. the
     subset. 
      The current default is to compute fuzzy Mahalanobis FPCs, where
     the points in the subset have a membership weight between 0 and 1
     and give rise to weighted means and covariance matrices. The new
     weights are then obtained by computing the weight function 'wfu'
     of the squared Mahalanobis distances, i.e., full weight for
     squared distances smaller than 'ca', zero weight for squared
     distances larger than 'ca2' and  decreasing weights (linear
     function of squared distances) in between.
      A fixed point algorithm is started from the whole dataset,
     algorithms are started from the subsets specified in 'init.group',
     and further algorithms are started from further initial
     configurations as explained under 'subset' and in the function
     'mahalconf'. 
      Usually some of the FPCs are unstable, and more than one FPC may
     correspond to the same significant pattern in the data. Therefore
     the number of FPCs is reduced: A similarity matrix is computed
     between FPCs. Similarity between sets is defined as the ratio
     between 2 times size of intersection and the sum of sizes of both
     sets. The Single Linkage clusters (groups) of level 'distcut' are
     computed, i.e. the connectivity components of the graph where
     edges are drawn between FPCs with similarity larger than
     'distcut'. Groups of FPCs whose members are found often enough
     (cf. parameter 'mer') are considered as stable enough. A
     representative FPC is chosen for every Single Linkage cluster of
     FPCs according to the maximum expectation ratio 'ser'. 'ser' is
     the ratio between the number of findings of an FPC and the number
     of points of an FPC, adjusted suitably if 'subset<n'. Usually only
     the representative FPCs of stable groups are of interest. 
      Default tuning constants are taken from Hennig (2005).
      Generally, the default settings are recommended for 'fixmahal'.
     For large datasets, the use of 'init.group' together with
     'pointit=FALSE' is useful. Occasionally, 'mnc' and 'startn' may be
     chosen smaller than the default, if smaller clusters are of
     interest, but this may lead to too many clusters. Decrease of 'ca'
     will often lead to too many clusters, even for homogeneous data.
     Increase of 'ca' will produce only very strongly separated
     clusters. Both may be of interest occasionally.
      Singular covariance matrices during the iterations are handled by
     'solvecov'.

     'summary.mfpc' gives a summary about the representative FPCs of
     stable groups.

     'plot.mfpc' is a plot method for the representative FPC of stable
     group no. 'no'. It produces a scatterplot, where the points
     belonging to the FPC are highlighted, the mean is and for 'p<3'
     also the region of the FPC is shown. For 'p>=3', the optimal
     separating projection computed by 'batcoord' is shown.  

     'fpclusters.mfpc' produces a list of indicator vectors for the
     representative FPCs of stable groups.

     'fpmi' is called by 'fixmahal' for a single fixed point algorithm
     and will usually not be executed alone.

_V_a_l_u_e:

     'fixmahal' returns an object of class 'mfpc'. This is a list
     containing the components 'nc, g, means, covs, nfound, er, tsc,
     ncoll, skc, grto, imatrix, smatrix, stn, stfound, ser, sfpc, ssig,
     sto, struc, n, p, method, cgen, ca, ca2, cvec, calpha, pointit,
     subset, mnc, startn, mer, distcut'.

     'summary.mfpc' returns an object of class 'summary.mfpc'. This is
     a list containing the components 'means, covs, stn, stfound, sn,
     ser, tskip, skc, tsc, sim, ca, ca2, calpha, mer, method, cgen,
     pointit'.

     'fpclusters.mfpc' returns a list of indicator vectors for the
     representative FPCs of stable groups.

     'fpmi' returns a list with the components 'mg, covg, md, gv, coll,
     method, ca'.

      nc: integer. Number of FPCs.

       g: list of logical vectors. Indicator vectors of FPCs. 'FALSE'
          if 'ind.storage=FALSE'.

   means: list of numerical vectors. Means of FPCs. In 'summary.mfpc',
          only for representative FPCs of stable groups and sorted
          according to 'ser'.

    covs: list of numerical matrices. Covariance matrices of FPCs. In
          'summary.mfpc', only for representative FPCs of stable groups
          and sorted according to 'ser'.

  nfound: vector of integers. Number of findings for the FPCs.

      er: numerical vector. Ratio of number of findings of FPCs to
          their size. Under 'pointit=TRUE', this can be taken as a
          measure of stability of FPCs.

     tsc: integer. Number of algorithm runs leading to too small or too
          seldom found FPCs.

   ncoll: integer. Number of algorithm runs where collinear covariance
          matrices occurred.

     skc: integer. Number of skipped clusters.

    grto: vector of integers. Numbers of FPCs to which algorithm runs
          led, which were started by 'init.group'.

 imatrix: vector of integers. Size of intersection between FPCs. See
          'sseg'.

 smatrix: numerical vector. Similarities between FPCs. See 'sseg'.

     stn: integer. Number of representative FPCs of stable groups. In
          'summary.mfpc', sorted according to 'ser'.

 stfound: vector of integers. Number of findings of members of all
          groups of FPCs. In 'summary.mfpc', sorted according to 'ser'.

     ser: numerical vector. Ratio of number of findings of groups of
          FPCs to their size. Under 'pointit=TRUE', this can be taken
          as a measure of stability of FPCs. In 'summary.mfpc', sorted
          from largest to smallest.

    sfpc: vector of integers. Numbers of representative FPCs of all
          groups.

    ssig: vector of integers of length 'stn'. Numbers of representative
          FPCs of the stable groups.

     sto: vector of integers. Numbers of groups ordered according to
          largest 'ser'.

   struc: vector of integers. Number of group an FPC belongs to.

       n: see arguments.

       p: see arguments.

  method: see arguments.

    cgen: see arguments.

      ca: see arguments, if 'cgen' has been '"fixed"'. Else numerical
          vector of length 'nc' (see below), giving the final values of
          'ca' for all FPC. In 'fpmi', tuning constant for the iterated
          FPC.

     ca2: see arguments.

    cvec: numerical vector of length 'n' for 'cgen="auto"'. The values
          for the tuning constant 'ca' corresponding to the cluster
          sizes from '1' to 'n'.

  calpha: see arguments.

 pointit: see arguments.

  subset: see arguments.

     mnc: see arguments.

  startn: see arguments.

     mer: see arguments.

 distcut: see arguments.

      sn: vector of integers. Number of points of representative FPCs.

   tskip: integer. Number of algorithm runs leading to skipped FPCs.

     sim: vector of integers. Size of intersections between
          representative FPCs of stable groups. See 'sseg'.

      mg: mean vector.

    covg: covariance matrix.

      md: Mahalanobis distances.

      gv: logical (numerical, respectively, if 'method="fuzzy"')
          indicator vector of iterated FPC.

    coll: logical. 'TRUE' means that singular covariance matrices
          occurred during the iterations.

_A_u_t_h_o_r(_s):

     Christian Hennig chrish@stats.ucl.ac.uk <URL:
     http://www.homepages.ucl.ac.uk/~ucakche/>

_R_e_f_e_r_e_n_c_e_s:

     Hennig, C. (2002) Fixed point clusters for linear regression:
     computation and comparison, _Journal of Classification_ 19,
     249-276.

     Hennig, C. (2005) Fuzzy and Crisp Mahalanobis Fixed Point
     Clusters, in Baier, D., Decker, R., and Schmidt-Thieme, L. (eds.):
     _Data Analysis and Decision Support_. Springer, Heidelberg, 47-56,
     <URL: http://www.homepages.ucl.ac.uk/~ucakche/papers/fuzzyfix.ps>

     Hennig, C. and Christlieb, N. (2002) Validating visual clusters in
     large datasets: Fixed point clusters of spectral features,
     _Computational Statistics and Data Analysis_ 40, 723-739.

_S_e_e _A_l_s_o:

     'fixreg' for linear regression fixed point clusters.

     'mahalconf', 'wfu', 'cmahal' for computation of initial
     configurations, weights, tuning constants.  

     'sseg' for indexing the similarity/intersection vectors computed
     by 'fixmahal'.

     'batcoord', 'cov.rob', 'solvecov', 'cov.wml', 'plotcluster' for
     computation of projections, (inverted) covariance matrices,
     plotting.

     'rFace' for generation of example data, see below.

_E_x_a_m_p_l_e_s:

       set.seed(20000)
       face <- rFace(400,dMoNo=2,dNoEy=0, p=3)
       # The first example uses grouping information via init.group.
       initg <- list()
       grface <- as.integer(attr(face,"grouping"))
       for (i in 1:5) initg[[i]] <- (grface==i)
       ff0 <- fixmahal(face, pointit=FALSE, init.group=initg)
       summary(ff0)
       cff0 <- fpclusters(ff0)
       plot(face, col=1+cff0[[1]])
       plot(face, col=1+cff0[[4]]) # Why does this come out as a cluster? 
       plot(ff0, face, 4) # A bit clearer...
       # Without grouping information, examples need more time:
       # ff1 <- fixmahal(face)
       # summary(ff1)
       # cff1 <- fpclusters(ff1)
       # plot(face, col=1+cff1[[1]])
       # plot(face, col=1+cff1[[6]]) # Why does this come out as a cluster? 
       # plot(ff1, face, 6) # A bit clearer...
       # ff2 <- fixmahal(face,method="ml")
       # summary(ff2)
       # ff3 <- fixmahal(face,method="ml",calpha=0.95,subset=50)
       # summary(ff3)
       ## ...fast, but lots of clusters. mer=0.3 may be useful here.
       # set.seed(3000)
       # face2 <- rFace(400,dMoNo=2,dNoEy=0)
       # ff5 <- fixmahal(face2)
       # summary(ff5)
       ## misses right eye of face data; with p=6,
       ## initial configurations are too large for 40 point clusters 
       # ff6 <- fixmahal(face2, startn=30)
       # summary(ff6)
       # cff6 <- fpclusters(ff6)
       # plot(face2, col=1+cff6[[3]])
       # plot(ff6, face2, 3)
       # x <- c(1,2,3,6,6,7,8,120)
       # ff8 <- fixmahal(x)
       # summary(ff8)
       # ...dataset a bit too small for the defaults...
       # ff9 <- fixmahal(x, mnc=3, startn=3)
       # summary(ff9)

