simClustDesign       package:clusterGeneration       R Documentation

_D_E_S_I_G_N _F_O_R _R_A_N_D_O_M _C_L_U_S_T_E_R _G_E_N_E_R_A_T_I_O_N _W_I_T_H _S_P_E_C_I_F_I_E_D _D_E_G_R_E_E _O_F _S_E_P_A_R_A_T_I_O_N

_D_e_s_c_r_i_p_t_i_o_n:

     Generates data sets via a factorial design with four factors:
     degree of separation, number of clusters, number of non-noisy
     variables, and number of noisy variables. The separation between
     any cluster and its nearest neighboring cluster can be set to a
     specified value. The covariance matrices of clusters can have
     arbitrary diameters, shapes, and orientations.

_U_s_a_g_e:

     simClustDesign(numClust=c(3,6,9), 
                    sepVal=c(0.01, 0.21, 0.342), 
                    sepLabels=c("L", "M", "H"), 
                    numNonNoisy=c(4,8,20), 
                    numNoisy=NULL, 
                    numOutlier=0, 
                    numReplicate=3, 
                    fileName="test", 
                    clustszind=2, 
                    clustSizeEq=50, 
                    rangeN=c(50,200), 
                    clustSizes=NULL,
                    covMethod=c("eigen", "onion", "c-vine", "unifcorrmat"), 
                    rangeVar=c(1, 10), 
                    lambdaLow=1, 
                    ratioLambda=10, 
                    alphad=1, 
                    eta=1, 
                    rotateind=TRUE, 
                    iniProjDirMethod=c("SL", "naive"), 
                    projDirMethod=c("newton", "fixedpoint"), 
                    alpha=0.05, 
                    ITMAX=20, 
                    eps=1.0e-10, 
                    quiet=TRUE, 
                    outputDatFlag=TRUE, 
                    outputLogFlag=TRUE, 
                    outputEmpirical=TRUE, 
                    outputInfo=TRUE)

_A_r_g_u_m_e_n_t_s:

numClust: Vector of the number of clusters for data sets in the design. 

  sepVal: Vector of desired values of the separation index between
          clusters and their nearest neighboring clusters. Each element
          of 'sepVal' can take values within the interval '[-1, 1)'.
          The closer an element of 'sepVal' is to 1, the more separated
          the pair of clusters is. The values 0.01, 0.21, and 0.342
          (0.34 to two decimals) are the values of the separation index
          for two univariate clusters generated from N(0, 1) and
          N(A, 1), where A=4, 6, 8, respectively. 'sepVal'=0.01 (A=4)
          indicates a close cluster structure. 'sepVal'=0.21 (A=6)
          indicates a separated cluster structure. 'sepVal'=0.342 (A=8)
          indicates a well-separated cluster structure. 

sepLabels: Labels for "close", "separated", and "well-separated"
          cluster structures. By default, "L" (low) means "close", "M"
          (medium) means "separated", "H" (high) means
          "well-separated". 

numNonNoisy: Vector of the number of non-noisy variables. 

numNoisy: Vector of the numbers of noisy variables. The default value
          of 'numNoisy' is 'NULL', in which case the program
          automatically sets 'numNoisy' to a vector with elements 1,
          round(p1/2), p1, where p1 is the number of non-noisy
          variables. 

numOutlier: The number or ratio of outliers. If 'numOutlier' is a
          positive integer, then 'numOutlier' is the number of
          outliers. If 'numOutlier' is a real number in the interval
          (0, 1), then 'numOutlier' is the ratio of outliers, i.e. the
          number of outliers is equal to 'round'('numOutlier'*n_1),
          where n_1 is the total number of non-outliers. If
          'numOutlier' is a real number greater than 1, then
          'numOutlier' is rounded to an integer. 

numReplicate: Number of data sets to be generated for the same cluster
          structure specified by the other arguments of the function
          'simClustDesign'. The default value 3 follows the design in
          Milligan (1985). 

fileName: The first part of the names of data files that record the
          generated data sets  and associated information, such as
          cluster membership of data points, labels  of noisy
          variables, separation index matrix, projection directions,
          etc.  (see details). The default value of 'fileName' is
          'test'. 

clustszind: Cluster size indicator. 'clustszind'=1 indicates that all
          clusters have equal size. The size is specified by the
          argument 'clustSizeEq'. 'clustszind'=2 indicates that the
          cluster sizes are randomly generated from the range
          specified by the argument 'rangeN'. 'clustszind'=3 indicates
          that the cluster sizes are specified via the vector
          'clustSizes'. The default value is 2 so that the generated
          clusters are more realistic. 

clustSizeEq: Cluster size. If the argument 'clustszind'=1, then all
          clusters will have the same number 'clustSizeEq' of data
          points. The value of 'clustSizeEq' should be large enough to
          get non-singular cluster covariance matrices. We recommend
          that 'clustSizeEq' be at least 10*p, where p is the total
          number of variables (including both non-noisy and noisy
          variables). The default value is 50 (see usage). 

  rangeN: The range of cluster sizes. If 'clustszind'=2, then cluster
          sizes will be randomly generated from the range specified by
          'rangeN'. The lower bound of the cluster sizes should be
          large enough to get non-singular cluster covariance
          matrices. We recommend that the minimum cluster size be at
          least 10*p, where p is the total number of variables
          (including both non-noisy and noisy variables). The default
          range is [50, 200], which can produce reasonable variability
          of cluster sizes. 

clustSizes: The sizes of clusters. If 'clustszind'=3, then cluster
          sizes will be specified by the vector 'clustSizes'. We
          recommend that the minimum cluster size be at least 10*p,
          where p is the total number of variables (including both
          non-noisy and noisy variables). The user needs to specify
          the value of 'clustSizes'; therefore, the default value of
          'clustSizes' is 'NULL'. 

covMethod: Method to generate covariance matrices for clusters (see
          details). The default method is 'eigen' so that the user can
          directly  specify the range of the diameters of clusters. 

rangeVar: Range for variances of a covariance matrix (see details). The
          default range is [1, 10] which can generate reasonable
          variability of variances. 

lambdaLow: Lower bound of the eigenvalues of cluster covariance
          matrices.  If the argument 'covMethod="eigen"', we need to
          generate eigenvalues for  cluster covariance matrices. The
          eigenvalues are randomly generated from the interval
          ['lambdaLow', 'lambdaLow'*'ratioLambda'].  In our experience,
          'lambdaLow'=1 and 'ratioLambda'=10  can give reasonable
          variability of the diameters of clusters. 'lambdaLow' should
          be positive. 

ratioLambda: The ratio of the upper bound of the eigenvalues to the
          lower bound of the  eigenvalues of cluster covariance
          matrices.  If the argument 'covMethod="eigen"', we need to
          generate eigenvalues for  cluster covariance matrices. The
          eigenvalues are randomly generated from the interval
          ['lambdaLow', 'lambdaLow'*'ratioLambda'].  In our experience,
          'lambdaLow'=1 and 'ratioLambda'=10  can give reasonable
          variability of the diameters of clusters. 'ratioLambda'
          should be larger than 1. 

  alphad: Parameter for the "unifcorrmat" method of generating a random
          correlation matrix. 'alphad'=1 corresponds to the uniform
          case. 'alphad' should be positive.

     eta: Parameter for the "c-vine" and "onion" methods of generating
          a random correlation matrix. 'eta'=1 corresponds to the
          uniform case. 'eta' should be positive.

rotateind: Rotation indicator. 'rotateind=TRUE' indicates randomly
          rotating data in non-noisy  dimensions so that we may not
          detect the full cluster structure from  pair-wise scatter
          plots of the variables. 

iniProjDirMethod: Indicates the method used to get the initial
          projection direction when calculating the separation index
          between a pair of clusters (cf. Qiu and Joe, 2006a, 2006b). 
          'iniProjDirMethod'="SL", the default, indicates that the
          initial projection direction is the sample version of the
          SL projection direction (Su and Liu, 1993, JASA),
          (Sigma_1 + Sigma_2)^{-1} (mu_2 - mu_1). 
          'iniProjDirMethod'="naive" indicates that the initial
          projection direction is mu_2 - mu_1. 

projDirMethod: Indicates the method used to get the optimal projection
          direction when calculating the separation index between a
          pair of clusters (cf. Qiu and Joe, 2006a, 2006b). 
          'projDirMethod'="newton" indicates that the modified
          Newton-Raphson method is used to search for the optimal
          projection direction (cf. Qiu and Joe, 2006a). This requires
          the assumption that both covariance matrices of the pair of
          clusters are positive-definite. If this assumption is
          violated, the "fixedpoint" method can be used. The
          "fixedpoint" method iteratively searches for the optimal
          projection direction based on the first derivative of the
          separation index with respect to the projection direction
          (cf. Qiu and Joe, 2006b). 

   alpha: Tuning parameter reflecting the percentage in the two tails
          of a projected cluster that might be outlying. The default
          'alpha'=0.05 is chosen by analogy with the conventional 0.05
          significance level in hypothesis testing. 

   ITMAX: Maximum number of iterations allowed when iteratively
          calculating the optimal projection direction. The actual
          number of iterations is usually much less than the default
          value 20. 

     eps: Convergence threshold. A small positive number used to check
          whether a quantity q is equal to zero. If |q|<'eps', then we
          regard q as equal to zero. 'eps' is used to check whether an
          algorithm has converged. The default value is 1.0e-10. 

   quiet: A flag to switch on/off the outputs of intermediate results
          and/or possible warning messages. The default value is
          'TRUE'. 

outputDatFlag: Indicates if data set should be output to file. 

outputLogFlag: Indicates if log info should be output to file. 

outputEmpirical: Indicates if empirical separation indices and
          projection directions should be calculated. This option is
          useful when the generated clusters are so small that the
          sample covariance matrices may be singular. Hence, by
          default, 'outputEmpirical=TRUE'. 

outputInfo: Indicates if theoretical and empirical separation
          information data frames  should be output to a file with
          format '[fileName]_info.log'. 
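
     As a rough illustration of the argument conventions described
     above, the following base R sketch mimics how the default
     'numNoisy', the 'numOutlier' rule, and the three 'clustszind'
     modes could be resolved. The helper names are hypothetical and
     are not part of clusterGeneration; uniform sampling of integer
     sizes from 'rangeN' is an assumption.

```r
# Hypothetical helpers; they only restate the rules given in the text.

# Default levels of the noisy-variable factor for p1 non-noisy variables.
default_num_noisy <- function(p1) c(1, round(p1 / 2), p1)

# Number of outliers: a value in (0, 1) is a ratio of the n1 non-outliers,
# anything else is rounded to an integer count.
resolve_num_outlier <- function(numOutlier, n1) {
  if (numOutlier > 0 && numOutlier < 1) {
    round(numOutlier * n1)
  } else {
    round(numOutlier)
  }
}

# Cluster sizes under the three 'clustszind' modes.
# Mode 2 samples integers uniformly from rangeN (an assumption).
draw_cluster_sizes <- function(numClust, clustszind = 2, clustSizeEq = 50,
                               rangeN = c(50, 200), clustSizes = NULL) {
  switch(as.character(clustszind),
         "1" = rep(clustSizeEq, numClust),
         "2" = sample(rangeN[1]:rangeN[2], numClust, replace = TRUE),
         "3" = clustSizes,
         stop("clustszind must be 1, 2, or 3"))
}

noisy_levels <- default_num_noisy(8)
n_out_ratio  <- resolve_num_outlier(0.1, n1 = 300)
n_out_count  <- resolve_num_outlier(5, n1 = 300)
set.seed(1)
sizes <- draw_cluster_sizes(numClust = 3)
```

     With 8 non-noisy variables the default noisy levels are 1, 4,
     and 8; a ratio of 0.1 with 300 non-outliers gives 30 outliers.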

_D_e_t_a_i_l_s:

     The function 'simClustDesign' is an implementation of the design
     for generating random clusters proposed in Qiu and Joe (2006a).
     In the design, the degree of separation between any cluster and
     its nearest neighboring cluster can be set to a specified value,
     while the cluster covariance matrices can be arbitrary positive
     definite matrices, so that the generated clusters need not be
     visible in pair-wise scatterplots of the variables. The
     separation between a pair of clusters is measured by the
     separation index proposed in Qiu and Joe (2006b).
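
     The reference values of the separation index can be checked with
     a few lines of base R. The sketch below assumes that the index
     along a projection direction a takes the form
     (L2 - U1)/(U2 - L1), where L and U are the lower and upper
     alpha/2 normal quantiles of the projected clusters, and that the
     two univariate reference clusters are N(0, 1) and N(A, 1). The
     function name 'sep_index' is hypothetical and is not part of the
     package.

```r
# Separation index of two clusters along direction a (assumed form):
# (d - z * s) / (d + z * s), with d the projected mean difference and
# s the sum of the projected standard deviations.
sep_index <- function(mu1, mu2, Sigma1, Sigma2, a, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  d <- sum(a * (mu2 - mu1))
  s <- sqrt(drop(t(a) %*% Sigma1 %*% a)) + sqrt(drop(t(a) %*% Sigma2 %*% a))
  (d - z * s) / (d + z * s)
}

# Univariate reference clusters N(0, 1) and N(A, 1) for A = 4, 6, 8.
J_ref <- vapply(c(4, 6, 8),
                function(A) sep_index(0, A, matrix(1), matrix(1), 1),
                numeric(1))
round(J_ref, 3)  # 0.010 0.210 0.342

# Two-dimensional example with the "naive" and "SL" initial directions.
mu1 <- c(0, 0)
mu2 <- c(6, 2)
Sigma1 <- diag(c(1, 4))
Sigma2 <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
a_naive <- mu2 - mu1
a_SL    <- solve(Sigma1 + Sigma2, mu2 - mu1)
J_naive <- sep_index(mu1, mu2, Sigma1, Sigma2, a_naive)
J_SL    <- sep_index(mu1, mu2, Sigma1, Sigma2, a_SL)
```

     The index is invariant to the length of a, so only the direction
     matters; the package then refines the initial direction
     iteratively, which this sketch does not attempt.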

     The argument 'covMethod' selects the method used to generate
     covariance matrices for clusters. Two of the methods are
     described here. The first method, denoted by 'eigen', first
     randomly generates eigenvalues (lambda_1, ..., lambda_p) for the
     covariance matrix Sigma, then uses the columns of a randomly
     generated orthogonal matrix Q = (alpha_1, ..., alpha_p) as
     eigenvectors. The covariance matrix Sigma is then constructed as
     Q * diag(lambda_1, ..., lambda_p) * Q^T. The second method,
     denoted by 'unifcorrmat', first generates a random correlation
     matrix R via the method proposed in Joe (2006), then randomly
     generates variances (sigma_1^2, ..., sigma_p^2) from an interval
     specified by the argument 'rangeVar'. The covariance matrix Sigma
     is then constructed as
     diag(sigma_1, ..., sigma_p) * R * diag(sigma_1, ..., sigma_p).
     The 'onion' and 'c-vine' methods also generate a random
     correlation matrix, governed by the argument 'eta'.
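
     The two constructions can be sketched in base R as follows. This
     is only an illustration under stated assumptions: the orthogonal
     matrix is taken from the QR decomposition of a Gaussian matrix,
     and the correlation matrix is a simple stand-in obtained with
     'cov2cor', not the method of Joe (2006). Neither choice is
     claimed to match the package internals.

```r
set.seed(1)
p <- 4

## "eigen"-style construction: Sigma = Q diag(lambda) Q^T
lambdaLow   <- 1
ratioLambda <- 10
lambda <- runif(p, min = lambdaLow, max = lambdaLow * ratioLambda)
Q <- qr.Q(qr(matrix(rnorm(p * p), p, p)))   # random orthogonal matrix (assumed method)
Sigma_eigen <- Q %*% diag(lambda) %*% t(Q)

## Variance/correlation construction: Sigma = diag(sigma) R diag(sigma)
rangeVar <- c(1, 10)
R <- cov2cor(crossprod(matrix(rnorm(p * p), p, p)))  # stand-in correlation matrix
sigma <- sqrt(runif(p, min = rangeVar[1], max = rangeVar[2]))
Sigma_corr <- diag(sigma) %*% R %*% diag(sigma)

# The eigenvalues of Sigma_eigen are the generated lambda values, and
# the diagonal of Sigma_corr holds the generated variances.
eig_check <- sort(eigen(Sigma_eigen, symmetric = TRUE)$values)
var_check <- diag(Sigma_corr)
```

     In the first construction the eigenvalues control the cluster
     diameters directly, which is the reason given above for making
     'eigen' the default.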

     For each data set generated, the function 'simClustDesign' outputs
     four files: data file, log file, membership file, and noisy set
     file. The names of all four files share the same format: 

     '[fileName]J[j]G[g]v[p1]nv[p2]out[numOutlier]_[numReplicate].[extension]'       
      where 'extension' can be 'dat', 'log', 'mem', or  'noisy'. 'J'
     indicates separation index, with 'j'  indicating the level of the
     factor 'separation index';  'G' indicates number of clusters, with
     'g' indicating the  level of the factor 'number of clusters'; 'v'
     indicates  the number of non-noisy variables, with 'p1' indicating
     the level  of the factor 'number of non-noisy variables'; 'nv'
     indicates  the number of noisy variables, with 'p2' indicating the
     level of  the factor 'number of noisy variables'; 'out' indicates 
      number of outliers, with 'numOutlier' indicating the value of the
      argument 'numOutlier' of the function 'simClustDesign'; 
     'numReplicate' indicates the value of the argument 'numReplicate' 
     of the function 'simClustDesign'.
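
     The naming pattern can be reproduced with 'sprintf'. The helper
     below is hypothetical (it is not exported by the package), and
     the values passed for j, g, p1, and p2 are placeholders for the
     factor levels described above.

```r
# Hypothetical helper that assembles a file name following the pattern
# [fileName]J[j]G[g]v[p1]nv[p2]out[numOutlier]_[numReplicate].[extension]
make_file_name <- function(fileName, j, g, p1, p2, numOutlier,
                           numReplicate, extension) {
  sprintf("%sJ%sG%sv%snv%sout%s_%s.%s",
          fileName, j, g, p1, p2, numOutlier, numReplicate, extension)
}

# The four files belonging to one generated data set.
extensions <- c("dat", "log", "mem", "noisy")
file_names <- vapply(extensions, function(ext) {
  make_file_name("test", j = 2, g = 1, p1 = 3, p2 = 1,
                 numOutlier = 0, numReplicate = 1, extension = ext)
}, character(1))
unname(file_names)
```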

     The data file with file extension 'dat' contains n+1 rows and  p
     columns, where n is the number of data points and p is  the number
     of variables. The first row is the variable names. The log file 
     with file extension 'log' contains information such as cluster
     sizes,  mean vectors, covariance matrices, projection directions,
     separation index  matrices, etc. The membership file with file
     extension 'mem' contains  n rows and one column of cluster
     memberships for data points. The noisy  set file with file
     extension 'noisy' contains a row of labels of noisy  variables.
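
     A data file and its membership file can be read back with
     'read.table' and 'scan'. The sketch below writes small mock files
     that follow the layout described above (a header row of variable
     names, then n rows; one membership label per row) so that it
     runs without the package. The whitespace separator and the mock
     file names are assumptions.

```r
# Mock files laid out as described in the text (assumed separator: space).
dat_path <- file.path(tempdir(), "mock.dat")
mem_path <- file.path(tempdir(), "mock.mem")
mock <- data.frame(x1 = c(0.1, 0.2, 5.1), x2 = c(1.0, 1.1, 6.3))
write.table(mock, dat_path, row.names = FALSE, quote = FALSE)
writeLines(c("1", "1", "2"), mem_path)

# The first row of the data file holds the variable names.
dat <- read.table(dat_path, header = TRUE)
# One cluster label per data point.
mem <- scan(mem_path, quiet = TRUE)

# Every data point should have exactly one membership label.
stopifnot(nrow(dat) == length(mem))
cluster_counts <- table(mem)
```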

     When generating clusters, population covariance matrices are all
     positive-definite. However, sample covariance matrices might be
     only positive semi-definite due to small cluster sizes. In this
     case, the function 'genRandomClust' will automatically use the
     "fixedpoint" method to search for the optimal projection direction.

_V_a_l_u_e:

     The function outputs four data files for each data set (see
     details).

     This function also returns separation information data frames 
     'infoFrameTheory' and 'infoFrameData' based on population  and
     empirical mean vectors and covariance matrices of clusters for all
      the data sets generated. Both 'infoFrameTheory' and
     'infoFrameData'  contain the following seven columns:

Column 1: Labels of clusters (1, 2, ..., numClust), where numClust is
          the number of clusters for the data set. 

Column 2: Labels of the corresponding nearest neighbors. 

Column 3: Separation indices of the clusters to their nearest
          neighboring clusters. 

Column 4: Labels of the corresponding farthest neighboring clusters. 

Column 5: Separation indices of the clusters to their farthest
          neighbors. 

Column 6: Median separation indices of the clusters to their
          neighbors. 

Column 7: Data file names with format 
          '[fileName]J[j]G[g]v[p1]nv[p2]out[numOutlier]_[numReplicate]'
          (see details). 

     The returned value also contains the following components:

 datList: a list of lists of data matrices for generated data sets. 

 memList: a list of lists of cluster memberships for data points for
          generated data sets. 

noisyList: a list of lists of sets of noisy variables for generated
          data sets. 

_N_o_t_e:

     This function can be slow, because it generates 'numReplicate'
     data sets for each combination of factor levels in the design.
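
     To get a feel for the workload before running the design, the
     number of data sets can be estimated from the argument values.
     The sketch below uses the default arguments and assumes that all
     factors are fully crossed, with three levels of noisy variables
     for each value of 'numNonNoisy' (the default when 'numNoisy' is
     'NULL').

```r
# Default factor levels from the usage section.
numClust     <- c(3, 6, 9)
sepVal       <- c(0.01, 0.21, 0.342)
numNonNoisy  <- c(4, 8, 20)
numReplicate <- 3

# Cross the three explicitly specified factors.
design <- expand.grid(sepVal = sepVal, numClust = numClust,
                      numNonNoisy = numNonNoisy)

# Default 'numNoisy' yields three levels (1, round(p1/2), p1) per p1.
n_noisy_levels <- 3

n_data_sets <- nrow(design) * n_noisy_levels * numReplicate  # 243
n_files     <- 4 * n_data_sets                               # 972
```

     Under these assumptions the default call produces 243 data sets
     and, with all output flags on, four files per data set.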

_A_u_t_h_o_r(_s):

     Weiliang Qiu stwxq@channing.harvard.edu
      Harry Joe harry@stat.ubc.ca

_R_e_f_e_r_e_n_c_e_s:

     Joe, H. (2006) Generating Random Correlation Matrices Based on
     Partial Correlations.  _Journal of Multivariate Analysis_, *97*,
     2177-2189.

     Milligan, G. W. (1985) An Algorithm for Generating Artificial Test
     Clusters. _Psychometrika_, *50*, 123-127.

     Qiu, W.-L. and Joe, H. (2006a) Generation of Random Clusters with
     Specified Degree of Separation. _Journal of Classification_,
     *23*(2), 315-334.

     Qiu, W.-L. and Joe, H. (2006b) Separation Index and Partial
     Membership for Clustering. _Computational Statistics and Data
     Analysis_, *50*, 585-603.

     Su, J. Q. and Liu, J. S. (1993) Linear Combinations of Multiple
     Diagnostic Markers. _Journal of the American Statistical
     Association_, *88*, 1350-1355.

_E_x_a_m_p_l_e_s:

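     ## Not run: 
     # Small design: 3 clusters, two separation levels ("L" and "M"),
     # 4 non-noisy variables, 2 replicates, random cluster sizes.
     # With the default output flags, files named "test..." are written.
     tmp <- simClustDesign(numClust=3, 
                   sepVal=c(0.01, 0.21), 
                   sepLabels=c("L", "M"), 
                   numNonNoisy=4, 
                   numOutlier=0, 
                   numReplicate=2, 
                   clustszind=2)
     ## End(Not run)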

