gap                   package:lga                   R Documentation

_P_e_r_f_o_r_m _g_a_p _a_n_a_l_y_s_i_s

_D_e_s_c_r_i_p_t_i_o_n:

     Performs the gap analysis using lga to estimate the number of
     clusters.

_U_s_a_g_e:

     gap(x, K, B, criteria=c("tibshirani", "DandF","none"),
         nnode=NULL, scale=TRUE)

_A_r_g_u_m_e_n_t_s:

       x: a numeric matrix.

       K: an integer giving the maximum number of clusters to consider.

       B: an integer giving the number of bootstraps.

criteria: a character string indicating which criterion to use to
          evaluate the gap data. One of '"tibshirani"' (default),
          '"DandF"' or '"none"'.  Can be abbreviated.

   nnode: an integer giving the number of CPUs to use for parallel
          processing.  Defaults to NULL, i.e. no parallel processing.

   scale: logical.  Should the data be scaled?

_D_e_t_a_i_l_s:

     This code performs the gap analysis using lga.  The gap statistic
     is defined as the difference between the log of the Residual
     Orthogonal Sum of Squared Distances (denoted log(W_k)) and its
     expected value derived using bootstrapping under the null
     hypothesis that there is only one cluster.  In this
     implementation, the reference distribution used for the
     bootstrapping is a random uniform hypercube, transformed by the
     principal components of the underlying data set. For further
     details see Tibshirani et al (2001).
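
     As an illustration of the reference distribution described above,
     the following is a minimal sketch in base R.  The helper
     'ref_sample' is hypothetical and not part of lga; the code used
     internally by 'gap' may differ in detail.  It centres the data,
     rotates them onto their principal components, draws uniformly
     within the range of each component, and rotates the draws back.

```r
## Hypothetical helper, not part of lga: draw one reference data set
## from a uniform box aligned with the principal components of x.
ref_sample <- function(x) {
  x <- as.matrix(x)
  centre <- colMeans(x)
  xc <- sweep(x, 2, centre)              # centre the columns
  V <- svd(xc)$v                         # principal directions
  scores <- xc %*% V                     # data in PC coordinates
  rng <- apply(scores, 2, range)         # row 1 = min, row 2 = max
  ## uniform draws within the range of each PC score
  z <- sapply(seq_len(ncol(scores)), function(j)
    runif(nrow(scores), min = rng[1, j], max = rng[2, j]))
  ## rotate back to the original axes and restore the centre
  sweep(z %*% t(V), 2, centre, "+")
}

set.seed(1)
x <- cbind(rnorm(50), rnorm(50, sd = 3))  # made-up data
z <- ref_sample(x)
dim(z)                                    # same shape as x
```

     Repeating the draw B times and clustering each reference data set
     gives the bootstrapped values of log(W_k).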

     Different criteria apply different rules. With '"tibshirani"'
     (Tibshirani et al, 2001) we calculate the gap statistic for
     k = 1, ..., K, stopping when

                     gap(k) >= gap(k+1) - s_(k+1)

     where s_(k+1) is a function of the standard deviation of the
     bootstrapped estimates.
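
     The stopping rule above can be sketched as follows.  The helper
     'gap_tibshirani' and the numbers fed to it are hypothetical and
     not part of lga.  The sketch assumes the conventions of
     Tibshirani et al (2001): the gap is the bootstrap mean of
     log(W_k) minus the observed log(W_k), and s_k is the bootstrap
     standard deviation multiplied by sqrt(1 + 1/B).

```r
## Hypothetical helper, not part of lga.
## logW:      observed log(W_k) for k = 1, ..., K
## logW_boot: B x K matrix of log(W_k) from the reference data sets
gap_tibshirani <- function(logW, logW_boot) {
  B <- nrow(logW_boot)
  K <- length(logW)
  gap <- colMeans(logW_boot) - logW
  ## standard deviation over the B bootstraps (divisor B)
  sdk <- apply(logW_boot, 2, function(v) sqrt(mean((v - mean(v))^2)))
  s <- sdk * sqrt(1 + 1 / B)
  ## values of k in 1, ..., K-1 with gap(k) >= gap(k+1) - s_(k+1)
  ok <- which(gap[-K] >= gap[-1] - s[-1])
  list(gap = gap, s = s,
       finished = length(ok) > 0,
       nclust = if (length(ok) > 0) ok[1] else NA)
}

## Made-up values for illustration only
logW <- c(5.0, 3.8, 3.6, 3.5)
base <- c(5.2, 4.9, 4.7, 4.6)
logW_boot <- rbind(base - 0.1, base + 0.1, base - 0.1, base + 0.1)

res <- gap_tibshirani(logW, logW_boot)
res$nclust   # 2 for these made-up values
```

     When no k satisfies the inequality, the sketch returns NA with
     'finished' set to FALSE.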

     With the '"DandF"' criterion from Dudoit and Fridlyand (2002), we
     calculate the gap statistic for all values of k = 1, ..., K,
     selecting the number of clusters as

  khat = smallest k >= 1 such that gap(k) >= gap(kstar) - s_(kstar)

     where kstar = argmax_(k >= 1) gap(k).
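
     A minimal sketch of this selection rule, given vectors of gap
     values and of s values, is shown below.  The helper
     'select_dandf' and the numbers are hypothetical and not part of
     lga.

```r
## Hypothetical helper, not part of lga.
## gap, s: numeric vectors of length K, indexed by k = 1, ..., K
select_dandf <- function(gap, s) {
  kstar <- which.max(gap)                  # first k maximising gap(k)
  which(gap >= gap[kstar] - s[kstar])[1]   # smallest k within s_(kstar)
}

## Made-up values for illustration only
gap_vals <- c(0.20, 1.05, 1.10, 0.90)
s_vals   <- c(0.05, 0.08, 0.10, 0.09)
khat <- select_dandf(gap_vals, s_vals)
khat   # 2: kstar is 3, and k = 2 is the smallest k with gap(k) >= 1.00
```

     Unlike the '"tibshirani"' rule, this rule always returns a value,
     since kstar itself satisfies the inequality.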

     Finally, for the criterion '"none"', no rules are applied and
     only the gap data are returned.

     As lga is ostensibly unsupervised in this case, the parameter
     niter is set to 20 to ensure convergence.

     This function is parallel computing aware via the 'nnode'
     argument, and works with the package 'snow'.  In order to use
     parallel computing, one of MPI (e.g. lamboot) or PVM is necessary.
     For further details, see the documentation for 'snow'.

_V_a_l_u_e:

     An object of class '"gap"' with components 

finished: a logical.  For the '"tibshirani"' criterion, was a solution
          found?

  nclust: an integer giving the estimated number of clusters.  Returns
          NA if nothing conclusive is found.

    data: the original data set, scaled if specified in the arguments.

criteria: the criteria used.

_A_u_t_h_o_r(_s):

     Justin Harrington harringt@stat.ubc.ca

_R_e_f_e_r_e_n_c_e_s:

     Tibshirani, R. and Walther, G. and Hastie, T. (2001) 'Estimating
     the number of clusters in a data set via the gap statistic', _J.
     R. Statist. Soc. B_ *63*, 411-423.

     Dudoit, S. and Fridlyand, J. (2002) 'A prediction-based resampling
     method for estimating the number of clusters in a dataset',
     _Genome Biology_ *3*.

     Van Aelst, S. and Wang, X. and Zamar, R. and Zhu, R. (2006)
     'Linear Grouping Using Orthogonal Regression', _Computational
     Statistics & Data Analysis_ *50*, 1287-1312.

_S_e_e _A_l_s_o:

     'lga'

_E_x_a_m_p_l_e_s:

     ## Synthetic example
     ## Make a dataset with 2 clusters in 2 dimensions

     library(MASS)
     set.seed(1234)
     X <- rbind(mvrnorm(n=100, mu=c(1, -2), Sigma=diag(0.1, 2) + 0.9),
                mvrnorm(n=100, mu=c(1, 1), Sigma=diag(0.1, 2) + 0.9))

     gap(X, K=4, B=20)

     ## to run this using parallel processing with 4 nodes, the equivalent
     ## code would be

     ## Not run: gap(X, K=4, B=20, nnode=4)

     ## Quakes data (from package:datasets)
     ## Including the first two dimensions versus three dimensions
     ## yields different results

     set.seed(1234)
     ## Not run: 
     gap(quakes[,1:2], K=4, B=20)
     gap(quakes[,1:3], K=4, B=20)
     ## End(Not run)

     library(maps)
     lgaout1 <- lga(quakes[,1:2], k=3)
     plot(lgaout1)

     lgaout2 <- lga(quakes[,1:3], k=2)
     plot(lgaout2)

     ## Let's put this in context
     par(mfrow=c(1,2))
     map("world", xlim=range(quakes[,2]), ylim=range(quakes[,1])); box()
     points(quakes[,2], quakes[,1], pch=lgaout1$cluster, col=lgaout1$cluster)

     map("world", xlim=range(quakes[,2]), ylim=range(quakes[,1])); box()
     points(quakes[,2], quakes[,1], pch=lgaout2$cluster, col=lgaout2$cluster)

