skmeans               package:skmeans               R Documentation

_C_o_m_p_u_t_e _S_p_h_e_r_i_c_a_l _k-_M_e_a_n_s _P_a_r_t_i_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

     Partition given vectors x_b by minimizing the criterion sum_{b,j}
     w_b u_{bj}^m d(x_b, p_j) where the w_b are case weights, u_{bj} is
     the membership of x_b to class j, p_j is the _prototype_ of class
     j (thus minimizing sum_b w_b u_{bj}^m d(x_b, p) over p), and d is
     the cosine dissimilarity d(x, p) = 1 - cos(x, p).

_U_s_a_g_e:

     skmeans(x, k, method = NULL, m = 1, weights = 1, control = list())

_A_r_g_u_m_e_n_t_s:

       x: A numeric data matrix, with rows corresponding to the objects
          to be partitioned (such that row b contains x_b).  Can be a
          dense matrix, a simple triplet matrix (package 'slam'), or a
          dgTMatrix (package 'Matrix').

       k: an integer giving the number of classes to be used in the
          partition.

  method: a character string specifying one of the built-in methods for
          computing spherical k-means partitions, or a function to be
          taken as a user-defined method, or 'NULL' (default value). 
          If a character string, its lower-cased version is matched
          against the lower-cased names of the available built-in
          methods using 'pmatch'.  See *Details* for available built-in
          methods and defaults.

       m: a number not less than 1 controlling the softness of the
          partition (as the fuzzification parameter of the fuzzy
          c-means algorithm).  The default value of 1 corresponds to
          hard partitions obtained from the standard spherical k-means
          problem; values greater than one give partitions of
          increasing softness obtained from a generalized fuzzy
          spherical k-means problem.

 weights: a numeric vector of non-negative case weights. Recycled to
          the number of elements given by 'x' if necessary.

 control: a list of control parameters.  See *Details*.

_D_e_t_a_i_l_s:

     The simple spherical k-means problem where all case weights are
     one and m = 1 is equivalent to maximizing the criterion sum_j
     sum_{b in C_j} cos(x_b, p_j), where C_j is the j-th class of the
     partition.  This is the formulation used in Dhillon & Modha (2001)
     and related references, and is criterion function mathcal{I}_2 in
     the CLUTO documentation.

     Obtaining optimal spherical k-means partitions obviously is a
     computationally hard problem, and several methods are available
     which attempt to obtain optimal partitions.  The built-in methods
     are as follows.


     '"_g_e_n_e_t_i_c"' a genetic algorithm patterned after the genetic
          k-means algorithm of Krishna & Narasimha Murty (1999).

     '"_p_c_l_u_s_t"' a Lloyd-Forgy style fixed-point algorithm which
          iterates between determining optimal memberships for fixed
          prototypes, and computing optimal prototypes for fixed
          memberships, using the general-purpose prototype-based
          partitioning framework of 'pclust' in package 'clue'.

     '"_C_L_U_T_O"' an interface to the 'vcluster' partitional clustering
          program from CLUTO, the CLUstering TOolkit by George Karypis.

     '"_l_i_h"' the Local Improvement Heuristic of Dhillon, Guan and Kogan
          (2002).  If a fixed-point iteration does not (sufficiently)
          improve the criterion, improvement by swapping the clusters
          of two elements is attempted.

     '"_l_i_h_c"' the Local Improvement Heuristic with Chains method of
          Dhillon, Guan and Kogan (2002).  This attempts further
          improvements via Kernighan-Lin object moves chains.

     Method '"pclust"' is the only method available for soft spherical
     k-means problems.  Method '"genetic"' can handle case weights.  By
     default, the genetic algorithm is used for obtained hard
     partitions, and the fixed-point algorithm otherwise.

     Control parameters for method '"genetic"' are as follows.

     '_m_a_x_i_t_e_r' an integer giving the maximum number of iterations for
          the genetic algorithm.  Defaults to 12.

     '_p_o_p_s_i_z_e' an integer giving the population size for the genetic
          algorithm.  Default: 6.

     '_m_u_t_a_t_i_o_n_s' a number between 0 and 1 giving the probability of
          mutation per iteration.  Defaults to 0.1.

     '_s_t_a_r_t' a list with the prototypes for the initial population.

     '_r_e_l_t_o_l' The minimum relative improvement per iteration.  If
          improvement is less, the algorithm will stop under the
          assumption that no further significant improvement can be
          made.  Defaults to 1e-8.

     '_v_e_r_b_o_s_e' a logical indicating whether to provide some output on
          minimization progress. Defaults to 'getOption("verbose")'.

     See the documentation of 'pclust' in package 'clue' for the
     control parameters for method '"pclust"'.

     Control parameters for method '"CLUTO"' are as follows.

     '_v_c_l_u_s_t_e_r' the path to the CLUTO 'vcluster' executable.

     '_c_o_l_m_o_d_e_l' a specification of the CLUTO column model. See the
          CLUTO documentation for more details.

     '_v_e_r_b_o_s_e' as for the genetic algorithm.

     '_c_o_n_t_r_o_l' a character string specifying arguments passed over to
          the 'vcluster' executable.

     Control parameters for the local improvement heuristics are as
     follows.

     '_m_a_x_i_t_e_r' an integer giving the maximal number of iterations to be
          performed. Defaults to 100.

     '_s_t_a_r_t' a single prototype to be used as a starting  value.

     '_r_e_l_t_o_l' as for the genetic algorithm.

     '_v_e_r_b_o_s_e' as for the genetic algorithm.

     The enhanced chain-based heuristic has the following additional
     control parameter.

     '_m_a_x_c_h_a_i_n_s' an integer giving the maximal length of the
          Kernighan-Lin chains.  Defaults to 10.

     Method '"CLUTO"' requires that the CLUTO 'vcluster' executable is
     available.  CLUTO binaries for the Linux, Sun, OSX, and MS Windows
     platforms can be obtained from <URL:
     http://www-users.cs.umn.edu/~karypis/cluto/>.

     User-defined methods must have formals 'x', 'k' and 'control', and
     optionally may have formals 'weights'  or 'm' if providing support
     for case weights or soft spherical k-means partitions,
     respectively.

_V_a_l_u_e:

     An object of class 'pclust' (see the information on pclust objects
     in package 'clue' for further details) representing the obtained
     spherical k-means partition, which is a list with components
     including the following: 

prototypes: a dense matrix with 'k' rows giving the prototypes.

membership: cluster membership as a matrix with 'k' columns.

 cluster: the class ids of the closest hard partition (the partition
          itself if m = 1).

   value: the value of the criterion.

_A_u_t_h_o_r(_s):

     Kurt Hornik Kurt.Hornik@wu.ac.at, 
      Ingo Feinerer feinerer@logic.at, 
      Martin Kober martin.kober@wu.ac.at.

_R_e_f_e_r_e_n_c_e_s:

     I. S. Dhillon and D. S. Modha (2001). Concept decompositions for
     large sparse text data using clustering. _Machine Learning_, *42*,
     143-175.

     I. S. Dhillon and Y. Guan and J. Kogan (2002). Iterative
     clustering of high dimensional text data augmented by local
     search. In _Proceedings of the Second IEEE International
     Conference on Data Mining_, pages 131-138. <URL:
     http://www.cs.utexas.edu/users/inderjit/public_papers/iterative_icdm02.pdf>.

     K. Krishna and M. Narasimha Murty (1999). Genetic K-means
     algorithm. _IEEE Transactions on Systems, Man, and Cybernetics -
     Part B: Cybernetics_, *29*/3, 433-439. <URL:
     http://eprints.iisc.ernet.in/2937/1/genetic-k.pdf>.

     G. Karypis (2003). _CLUTO: A Clustering Toolkit_. Technical Report
     #02-017, Department of Computer Science, University of Minnesota.
     <URL: http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/manual.pdf>.

_E_x_a_m_p_l_e_s:

     ## Use CLUTO dataset 're0':
     x <- readCM(system.file("cluto", "re0.mat",
                             package = "skmeans"),
                 system.file("cluto", "re0.mat.clabel",
                             package = "skmeans"))
     ## Which is not really small:
     dim(x)
     ## Partition into 5 clusters.
     party <- skmeans(x, 5, control = list(verbose = TRUE))
     ## Criterion value obtained:
     party$value
     ## Compare with "true" classifications:
     class_ids <-
         readLines(system.file("cluto", "re0.mat.rclass",
                               package = "skmeans"))
     table(class_ids, party$cluster)

