clustIndex              package:cclust              R Documentation

_C_l_u_s_t_e_r _I_n_d_e_x_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     Calculates the values of several clustering indexes for `y', the
     result of a clustering algorithm of class `"cclust"'. The index
     values can be used independently to determine the number of
     clusters in a data set.

_U_s_a_g_e:

      clustIndex ( y, x, index = "all" ) 

_A_r_g_u_m_e_n_t_s:

       y: Object of class `"cclust"' returned by a clustering algorithm
          such as `kmeans'

       x: Data matrix where columns correspond to variables and rows to
          observations

   index: The indexes to be calculated: `"calinski"', `"cindex"',
          `"db"', `"hartigan"', `"ratkowsky"', `"scott"', `"marriot"',
          `"ball"', `"trcovw"', `"tracew"', `"friedman"', `"rubin"',
          `"ssi"', `"likelihood"', and `"all"' for all the indexes.
          Abbreviations of these names are also accepted.

_D_e_t_a_i_l_s:

     The description of the indexes is categorized into 3 groups, based
     on the statistics mainly used to compute them.
     The first group is based on the sum of squares within (SSW) and
     between (SSB) the clusters. These statistics measure the
     dispersion of the data points in a cluster and between the
     clusters respectively. These indexes are:

          *  calinski: (SSB/(k-1))/(SSW/(n-k)), where n is the number
             of data points and k is the number of clusters.

          *  hartigan: log(SSB/SSW).

          *  ratkowsky: mean(sqrt(varSSB/varSST)), where varSSB stands
             for the SSB for every variable and varSST for the total
             sum of squares for every variable.

          *  ball: SSW/k, where k is the number of clusters.
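
     The first-group formulas can be traced by hand. The sketch below
     is not the package's code: it assumes base R's `kmeans' from the
     stats package in place of `cclust', a fixed seed, and helper
     names (`ssw', `ssb', `var_ssb', ...) chosen only for
     illustration. It follows the formulas exactly as written above,
     so values returned by `clustIndex' may differ if the package
     applies additional scaling.

```r
# Sketch only: recompute the SSW/SSB-based indexes with base R.
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 5)  # stats::kmeans, not cclust

n   <- nrow(x)             # number of data points
k   <- nrow(km$centers)    # number of clusters
ssw <- km$tot.withinss     # sum of squares within the clusters (SSW)
ssb <- km$betweenss        # sum of squares between the clusters (SSB)

calinski <- (ssb / (k - 1)) / (ssw / (n - k))
hartigan <- log(ssb / ssw)
ball     <- ssw / k

# Per-variable sums of squares needed for ratkowsky
var_sst <- colSums(sweep(x, 2, colMeans(x))^2)        # total
var_ssw <- colSums((x - km$centers[km$cluster, ])^2)  # within
var_ssb <- var_sst - var_ssw                          # between
ratkowsky <- mean(sqrt(var_ssb / var_sst))

first_group <- c(calinski = calinski, hartigan = hartigan,
                 ratkowsky = ratkowsky, ball = ball)
first_group
```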

     The second group is based on the statistics of T, i.e., the
     scatter matrix of the data points, and W, which is the sum of the
     scatter matrices in every group. These indexes are:

          *  scott: n log(|T|/|W|), where n is the number of data
             points and |.| stands for the determinant of a matrix.

          *  marriot: k^2 |W|, where k is the number of clusters.

          *  trcovw: Trace Cov W.

          *  tracew: Trace W.

          *  friedman: Trace W^(-1) B, where B is the scatter matrix
             of the cluster centers.

          *  rubin: |T|/|W|.
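
     The scatter matrices behind the second group can be built
     directly from the data. The following is a sketch under stated
     assumptions, not the package's implementation: it uses base R's
     `kmeans' rather than `cclust', takes B as the scatter of the
     cluster centers around the overall mean weighted by cluster size
     (an assumption, chosen so that T = W + B), and uses illustrative
     names (`T_mat', `W_mat', `B_mat'). `trcovw' is omitted because
     the text above does not spell out its formula.

```r
# Sketch only: scatter matrices T, W, B and the indexes derived from them.
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 5)  # stats::kmeans, not cclust

n <- nrow(x)
k <- nrow(km$centers)

# T: scatter of all points around the overall mean
T_mat <- crossprod(sweep(x, 2, colMeans(x)))

# W: sum of the scatter matrices of the groups around their own centers
W_mat <- crossprod(x - km$centers[km$cluster, ])

# B: scatter of the cluster centers around the overall mean,
# each center weighted by its cluster size (assumption, see lead-in)
cen   <- sweep(km$centers, 2, colMeans(x))
B_mat <- crossprod(cen * sqrt(km$size))

scott    <- n * log(det(T_mat) / det(W_mat))
marriot  <- k^2 * det(W_mat)
tracew   <- sum(diag(W_mat))
friedman <- sum(diag(solve(W_mat) %*% B_mat))
rubin    <- det(T_mat) / det(W_mat)

second_group <- c(scott = scott, marriot = marriot, tracew = tracew,
                  friedman = friedman, rubin = rubin)
second_group
```

     With centers equal to the cluster means, `tracew' coincides with
     the SSW of the first group, which gives a quick consistency check
     between the two groups.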

     The third group consists of four indexes that do not belong to
     the previous groups and have nothing in common with one another.

          *  cindex: the C-Index is a cluster similarity measure which,
             for binary data sets, is expressed as:
             [d_w - min(d_w)] / [max(d_w) - min(d_w)], where d_w is the
             sum of all n_d within-cluster distances, min(d_w) is the
             sum of the n_d smallest pairwise distances in the data
             set, and max(d_w) is the sum of the n_d largest pairwise
             distances. Computing the C-Index requires all pairwise
             distances in the data set to be computed and stored. For
             binary data, storing the distances poses no problem since
             only a few distinct distance values are possible. However,
             computing all the distances can make this index
             prohibitive for large data sets.

          *  db: R = (1/n) * sum(R_i), where R_i is the maximum value
             of R_ij over all j != i,
             R_ij = (SSW_i + SSW_j) / DC_ij, and DC_ij is the distance
             between the centers of the two clusters i and j. Since the
             sum runs over the clusters, n here denotes the number of
             clusters.

          *  likelihood: under the assumption that the variables are
             independent within a cluster, a cluster solution can be
             regarded as a mixture model for the data, where the
             cluster centers give the probabilities for each variable
             to be 1. The negative log-likelihood can therefore be
             computed and used as a quality measure for a cluster
             solution. Note that the assumptions for applying special
             penalty terms, as in AIC or BIC, are not fulfilled in this
             model, and such terms also show no effect for these data
             sets.

          *  ssi: this ``Simple Structure Index'' combines three
             elements which influence the interpretability of a
             solution: the maximum difference of each variable between
             the clusters, the sizes of the most contrasting clusters,
             and the deviation of a variable in the cluster centers
             compared to its overall mean. These three elements are
             combined multiplicatively and normalized to give a value
             between 0 and 1.
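
     The `cindex' and `db' formulas above can be followed step by step
     in base R. This is a minimal sketch, not the package's code: it
     assumes Euclidean distances from `dist', a three-cluster solution
     from base R's `kmeans', continuous rather than binary data, and
     illustrative names (`d_w', `n_d', `ssw_i', `dc'). It applies the
     formulas literally as stated, so results need not match
     `clustIndex' if the package normalizes differently.

```r
# Sketch only: C-Index and db computed from their textual definitions.
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
km <- kmeans(x, centers = 3, nstart = 5)  # stats::kmeans, not cclust

## cindex
dv <- as.vector(dist(x))                     # all pairwise distances
same <- outer(km$cluster, km$cluster, "==")  # TRUE if same cluster
is_within <- same[lower.tri(same)]           # same pair order as dist()
n_d <- sum(is_within)                        # number of within-cluster pairs
d_w <- sum(dv[is_within])                    # sum of within-cluster distances
sorted <- sort(dv)
min_dw <- sum(head(sorted, n_d))             # n_d smallest distances
max_dw <- sum(tail(sorted, n_d))             # n_d largest distances
cindex <- (d_w - min_dw) / (max_dw - min_dw)

## db
ssw_i <- km$withinss                         # SSW of each cluster
dc <- as.matrix(dist(km$centers))            # distances between centers
r <- outer(ssw_i, ssw_i, "+") / dc           # R_ij
diag(r) <- -Inf                              # exclude the case i == j
db <- mean(apply(r, 1, max))                 # average of the R_i

c(cindex = cindex, db = db)
```

     Keeping all pairwise distances in `dv' is what makes the C-Index
     costly: the vector has n(n-1)/2 entries for n data points.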

_V_a_l_u_e:

     Returns a vector with the index values.

_A_u_t_h_o_r(_s):

     Evgenia Dimitriadou and Andreas Weingessel

_R_e_f_e_r_e_n_c_e_s:

     Andreas Weingessel, Evgenia Dimitriadou and Sara Dolnicar, An
     Examination Of Indexes For Determining The Number Of Clusters In
     Binary Data Sets,
     <URL: http://www.wu-wien.ac.at/am/wp99.htm#29>
     and the references therein.

_S_e_e _A_l_s_o:

     `cclust', `kmeans'

_E_x_a_m_p_l_e_s:

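     # a 2-dimensional example: two Gaussian clouds around 0 and 1
     x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
                matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
     # k-means clustering with 2 centers and at most 20 iterations
     cl <- cclust(x, 2, 20, verbose = TRUE, method = "kmeans")
     # compute all indexes for this cluster solution
     resultindexes <- clustIndex(cl, x, index = "all")
     resultindexes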
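
     As noted in the Description, index values can be compared across
     solutions to choose the number of clusters. The sketch below
     illustrates the idea for the calinski formula only. It is an
     assumption-laden illustration: it uses base R's `kmeans' instead
     of `cclust', a hypothetical helper `calinski_for_k', and the
     common convention that a larger calinski value indicates a better
     solution.

```r
# Sketch only: scan several values of k and compare the calinski index.
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# Hypothetical helper: calinski index of a k-means solution with k clusters
calinski_for_k <- function(x, k) {
  km <- kmeans(x, centers = k, nstart = 10)  # stats::kmeans, not cclust
  n <- nrow(x)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

ks <- 2:6
ch <- sapply(ks, function(k) calinski_for_k(x, k))
names(ch) <- ks
ch

# Candidate number of clusters under this single criterion
best_k <- ks[which.max(ch)]
best_k
```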

