genRandomClust       package:clusterGeneration       R Documentation

_R_A_N_D_O_M _C_L_U_S_T_E_R _G_E_N_E_R_A_T_I_O_N _W_I_T_H _S_P_E_C_I_F_I_E_D _D_E_G_R_E_E _O_F _S_E_P_A_R_A_T_I_O_N

_D_e_s_c_r_i_p_t_i_o_n:

     Generate cluster data sets with specified degree of separation. 
     The separation between any cluster and its nearest neighboring
     cluster can be  set to a specified value. The covariance matrices
     of clusters can have  arbitrary diameters, shapes and
     orientations.

_U_s_a_g_e:

     genRandomClust(numClust, 
                    sepVal=0.01, 
                    numNonNoisy=2, 
                    numNoisy=0, 
                    numOutlier=0, 
                    numReplicate=3, 
                    fileName="test",  
                    clustszind=2, 
                    clustSizeEq=50, 
                    rangeN=c(50,200), 
                    clustSizes=NULL, 
                    covMethod=c("eigen", "onion", "c-vine", "unifcorrmat"), 
                    rangeVar=c(1, 10), 
                    lambdaLow=1, 
                    ratioLambda=10,  
                    alphad=1,
                    eta=1,
                    rotateind=TRUE, 
                    iniProjDirMethod=c("SL", "naive"), 
                    projDirMethod=c("newton", "fixedpoint"), 
                    alpha=0.05, 
                    ITMAX=20, 
                    eps=1.0e-10, 
                    quiet=TRUE, 
                    outputDatFlag=TRUE, 
                    outputLogFlag=TRUE, 
                    outputEmpirical=TRUE, 
                    outputInfo=TRUE)

_A_r_g_u_m_e_n_t_s:

numClust: Number of clusters in a data set. 

  sepVal: Desired value of the separation index between a cluster and
          its nearest neighboring cluster. Theoretically, 'sepVal' can
          take values within the interval [-1, 1) (in practice, we
          set 'sepVal' in (-0.999, 0.999)). The closer 'sepVal' is to
          1, the more separated the clusters are. The default value is
          0.01, which is the value of the separation index for two
          univariate clusters generated from N(0, 1) and N(A, 1),
          respectively, where A=4. 'sepVal'=0.01 indicates a close
          cluster structure. 'sepVal'=0.21 (A=6) indicates a separated
          cluster structure. 'sepVal'=0.34 (A=8) indicates a
          well-separated cluster structure.

numNonNoisy: Number of non-noisy variables. 

numNoisy: Number of noisy variables. The default values of 'numNoisy'
          and 'numOutlier' are 0 so  that we get clean data sets.  

numOutlier: Number or ratio of outliers. If 'numOutlier' is a positive
          integer, then 'numOutlier' is the number of outliers. If
          'numOutlier' is a real number in (0, 1), then 'numOutlier'
          is the ratio of outliers, i.e. the number of outliers is
          equal to 'round'('numOutlier'*n_1), where n_1 is the total
          number of non-outliers. If 'numOutlier' is a real number
          greater than 1, then 'numOutlier' is rounded to an integer.
          The default values of 'numNoisy' and 'numOutlier' are 0 so
          that we get 'clean' data sets.

numReplicate: Number of data sets to be generated for the same cluster
          structure specified  by the other arguments of the function
          'genRandomClust'. The default value 3 follows the design in
          Milligan (1985). 

fileName: The first part of the names of data files that record the
          generated data sets  and associated information, such as
          cluster membership of data points, labels  of noisy
          variables, separation index matrix, projection directions,
          etc.  (see details). The default value of 'fileName' is
          'test'. 

clustszind: Cluster size indicator. 'clustszind'=1 indicates that all
          clusters have equal size. The size is specified by the
          argument 'clustSizeEq'. 'clustszind'=2 indicates that the
          cluster sizes are randomly generated from the range
          specified by the argument 'rangeN'. 'clustszind'=3 indicates
          that the cluster sizes are specified via the vector
          'clustSizes'. The default value is 2 so that the generated
          clusters are more realistic.

clustSizeEq: Cluster size. If the argument 'clustszind'=1, then all
          clusters will have the same number 'clustSizeEq' of data
          points. The value of 'clustSizeEq' should be large enough to
          get non-singular cluster covariance matrices. We recommend
          that 'clustSizeEq' be at least 10*p, where p is the total
          number of variables (including both non-noisy and noisy
          variables). The default value is 50.

  rangeN: The range of cluster sizes. If 'clustszind'=2, then cluster
          sizes will be randomly generated from the range specified by
          'rangeN'. The lower bound of the cluster sizes should be
          large enough to get non-singular cluster covariance
          matrices. We recommend that the minimum cluster size be at
          least 10*p, where p is the total number of variables
          (including both non-noisy and noisy variables). The default
          range is [50, 200], which can produce reasonable variability
          of cluster sizes.

clustSizes: The sizes of clusters. If 'clustszind'=3, then cluster
          sizes will be specified via the  vector 'clustSizes'.  We
          recommend the minimum cluster size is at least  10*p, where p
          is the total number of variables (including both  non-noisy
          and noisy variables). The user needs to specify the value of
          'clustSizes'. Therefore, we set the default value of
          'clustSizes' as 'NULL'. 

covMethod: Method to generate covariance matrices for clusters (see
          details). The default method is 'eigen' so that the user can
          directly  specify the range of the diameters of clusters. 

rangeVar: Range for variances of a covariance matrix (see details). The
          default range is [1, 10] which can generate reasonable
          variability of variances. 

lambdaLow: Lower bound of the eigenvalues of cluster covariance
          matrices. If the argument 'covMethod="eigen"', we need to
          generate eigenvalues for cluster covariance matrices. The
          eigenvalues are randomly generated from the interval
          ['lambdaLow', 'lambdaLow'*'ratioLambda']. In our experience,
          'lambdaLow'=1 and 'ratioLambda'=10 can give reasonable
          variability of the diameters of clusters. 'lambdaLow' should
          be positive.

ratioLambda: The ratio of the upper bound of the eigenvalues to the
          lower bound of the  eigenvalues of cluster covariance
          matrices.  If the argument 'covMethod="eigen"', we need to
          generate eigenvalues for cluster covariance matrices. The
          eigenvalues are randomly generated from the interval
          ['lambdaLow', 'lambdaLow'*'ratioLambda'].  In our experience,
          'lambdaLow'=1 and 'ratioLambda'=10  can give reasonable
          variability of the diameters of clusters. 'ratioLambda'
          should be larger than 1. 

  alphad: Parameter for the "unifcorrmat" method of generating a random
          correlation matrix; 'alphad'=1 gives the uniform case.
          'alphad' should be positive.

     eta: Parameter for the "c-vine" and "onion" methods of generating
          a random correlation matrix; 'eta'=1 gives the uniform case.
          'eta' should be positive.

rotateind: Rotation indicator. 'rotateind=TRUE' indicates randomly
          rotating data in non-noisy  dimensions so that we may not
          detect the full cluster structure from  pair-wise scatter
          plots of the variables. 

iniProjDirMethod: Indicates the method used to get the initial
          projection direction when calculating the separation index
          between a pair of clusters (cf. Qiu and Joe, 2006a, 2006b).
          'iniProjDirMethod="SL"', the default, indicates that the
          initial projection direction is the sample version of SL's
          projection direction (Su and Liu, 1993, JASA),
          (Sigma_1 + Sigma_2)^{-1} (mu_2 - mu_1).
          'iniProjDirMethod="naive"' indicates that the initial
          projection direction is mu_2 - mu_1.

projDirMethod: Indicates the method used to get the optimal projection
          direction when calculating the separation index between a
          pair of clusters (cf. Qiu and Joe, 2006a, 2006b).
          'projDirMethod="newton"' indicates that we use the modified
          Newton-Raphson method to search for the optimal projection
          direction (cf. Qiu and Joe, 2006a). This requires the
          assumption that both covariance matrices of the pair of
          clusters are positive-definite. If this assumption is
          violated, the "fixedpoint" method could be used. The
          "fixedpoint" method iteratively searches for the optimal
          projection direction based on the first derivative of the
          separation index with respect to the projection direction
          (cf. Qiu and Joe, 2006b).

   alpha: Tuning parameter reflecting the percentage in the two tails
          of a projected cluster that might be outlying. We set
          'alpha'=0.05 by analogy with the conventional significance
          level of 0.05 in hypothesis testing.

   ITMAX: Maximum number of iterations allowed when iteratively
          calculating the optimal projection direction. The actual
          number of iterations is usually much less than the default
          value 20.

     eps: Convergence threshold. A small positive number used to check
          if a quantity q is equal to zero. If |q|<'eps', then we
          regard q as equal to zero. 'eps' is used to check if an
          algorithm converges. The default value is 1.0e-10.

   quiet: A flag to switch on/off the outputs of intermediate results
          and/or possible warning messages. The default value is
          'TRUE'. 

outputDatFlag: Indicates if data set should be output to file. 

outputLogFlag: Indicates if log info should be output to file. 

outputEmpirical: Indicates if empirical separation indices and
          projection directions should be calculated. This option is
          useful when generating clusters whose sizes are not large
          enough, so that the sample covariance matrices may be
          singular. By default, 'outputEmpirical=TRUE'.

outputInfo: Indicates if theoretical and empirical separation
          information data frames  should be output to a file with
          format '[fileName]_info.log'. 
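
     The rules stated above for 'numOutlier' and for the cluster size
     arguments ('clustszind', 'clustSizeEq', 'rangeN', 'clustSizes')
     can be sketched in base R as follows. The helper names
     'resolveNumOutlier' and 'chooseClusterSizes' are hypothetical and
     not part of the package, and drawing sizes uniformly from the
     integers in 'rangeN' is an assumption about how "randomly
     generated" is realized.

```r
# Hypothetical helpers for illustration; not part of clusterGeneration.

# Turn 'numOutlier' into a count, given n1 non-outlying data points.
resolveNumOutlier <- function(numOutlier, n1) {
  stopifnot(is.numeric(numOutlier), length(numOutlier) == 1, numOutlier >= 0)
  if (numOutlier > 0 && numOutlier < 1) {
    round(numOutlier * n1)   # a value in (0, 1) is a ratio of outliers
  } else {
    round(numOutlier)        # otherwise a count; non-integers are rounded
  }
}

# Choose cluster sizes according to the indicator 'clustszind'.
chooseClusterSizes <- function(numClust, clustszind = 2, clustSizeEq = 50,
                               rangeN = c(50, 200), clustSizes = NULL) {
  if (clustszind == 1) {
    rep(clustSizeEq, numClust)                 # all clusters equal in size
  } else if (clustszind == 2) {
    # Assumption: sizes are drawn uniformly from the integers in rangeN.
    lo <- rangeN[1]
    hi <- rangeN[2]
    lo - 1 + sample.int(hi - lo + 1, numClust, replace = TRUE)
  } else {
    stopifnot(!is.null(clustSizes), length(clustSizes) == numClust)
    clustSizes                                 # user-specified sizes
  }
}

set.seed(1)
sizes <- chooseClusterSizes(numClust = 7)
nOut  <- resolveNumOutlier(0.05, n1 = sum(sizes))
```

     With the recommendation of at least 10*p points per cluster, the
     smallest element of 'sizes' can be compared against 10*p before
     generating data.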

_D_e_t_a_i_l_s:

     The function 'genRandomClust' is an implementation of the random
     cluster generation method proposed in Qiu and Joe (2006a), which
     improves the cluster generation method proposed in Milligan (1985)
     in two ways: the degree of separation between any cluster and its
     nearest neighboring cluster can be set to a specified value while
     the cluster covariance matrices can be arbitrary positive
     definite matrices, and the generated clusters might not be
     visible in pair-wise scatterplots of variables. The separation
     between a pair of clusters is measured by the separation index
     proposed in Qiu and Joe (2006b).
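
     To make the separation index concrete, the following base R
     sketch evaluates it along a projection direction a, assuming the
     form J(a) = (g - s)/(g + s) with g = |a^T (mu_2 - mu_1)| and
     s = z_{alpha/2} * (sqrt(a^T Sigma_1 a) + sqrt(a^T Sigma_2 a)).
     This formula is our reading of Qiu and Joe (2006b), and the
     function names are hypothetical. For two unit-variance univariate
     normal clusters whose means differ by A, it gives the values
     quoted for 'sepVal'. The sketch also computes the two initial
     projection directions described for 'iniProjDirMethod'; it does
     not perform the Newton-Raphson or fixed-point search.

```r
# Sketch only: separation index along a projection direction 'a'.
# The formula is an assumption based on Qiu and Joe (2006b).
sepIndexProj <- function(a, mu1, Sigma1, mu2, Sigma2, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  # Taking the absolute value is equivalent to flipping the sign of 'a'
  # so that the projected mean difference is non-negative.
  gap <- abs(sum(a * (mu2 - mu1)))
  spread <- z * (sqrt(drop(t(a) %*% Sigma1 %*% a)) +
                 sqrt(drop(t(a) %*% Sigma2 %*% a)))
  (gap - spread) / (gap + spread)
}

# Univariate case: unit-variance clusters with means 0 and A.
sepIndex1d <- function(A, alpha = 0.05) {
  sepIndexProj(1, 0, matrix(1), A, matrix(1), alpha)
}
sepVals <- round(sapply(c(4, 6, 8), sepIndex1d), 2)
sepVals   # 0.01 0.21 0.34

# Bivariate example with made-up means and covariance matrices.
mu1 <- c(0, 0)
mu2 <- c(4, 2)
Sigma1 <- diag(c(1, 4))
Sigma2 <- matrix(c(2, 0.5, 0.5, 1), 2, 2)

aSL    <- solve(Sigma1 + Sigma2, mu2 - mu1)  # "SL" initial direction
aNaive <- mu2 - mu1                          # "naive" initial direction

jSL    <- sepIndexProj(aSL, mu1, Sigma1, mu2, Sigma2)
jNaive <- sepIndexProj(aNaive, mu1, Sigma1, mu2, Sigma2)
```

     The index is unchanged when a is rescaled by a positive constant,
     so only the direction of a matters.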

     The current version of the function 'genRandomClust' implements
     two methods to generate covariance matrices for clusters. The
     first method, denoted by "eigen", first randomly generates
     eigenvalues (lambda_1,...,lambda_p) for the covariance matrix
     Sigma, then uses the columns of a randomly generated orthogonal
     matrix Q=(alpha_1,...,alpha_p) as eigenvectors. The covariance
     matrix Sigma is then constructed as
     Q*diag(lambda_1,...,lambda_p)*Q^T. The second method, denoted by
     "unifcorrmat", first generates a random correlation matrix R via
     the method proposed in Joe (2006), then randomly generates
     variances (sigma_1^2,...,sigma_p^2) from an interval specified by
     the argument 'rangeVar'. The covariance matrix Sigma is then
     constructed as
     diag(sigma_1,...,sigma_p)*R*diag(sigma_1,...,sigma_p).

     For each data set generated, the function 'genRandomClust' outputs
     four files: data file, log file, membership file, and noisy set
     file. The names of all four files have the same format,
     '[fileName]_[i].[extension]', where i indicates the replicate
     number, and 'extension' can be 'dat', 'log', 'mem', or 'noisy'.

     The data file with file extension 'dat' contains n+1 rows and p
     columns, where n is the number of data points and p is the number
     of variables. The first row contains the variable names. The log
     file with file extension 'log' contains information such as
     cluster sizes, mean vectors, covariance matrices, projection
     directions, separation index matrices, etc. The membership file
     with file extension 'mem' contains n rows and one column of
     cluster memberships for data points. The noisy set file with file
     extension 'noisy' contains a row of labels of noisy variables.
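
     As a small illustration of the naming scheme
     '[fileName]_[i].[extension]', the sketch below builds the four
     file names for one replicate. The helper 'outputFileNames' is
     hypothetical. The reading step assumes whitespace-separated
     values and no header row in the membership file, neither of which
     is stated above.

```r
# Hypothetical helper: names of the four output files for replicate i.
outputFileNames <- function(fileName = "test", i = 1) {
  ext <- c("dat", "log", "mem", "noisy")
  files <- paste0(fileName, "_", i, ".", ext)
  names(files) <- ext
  files
}

f <- outputFileNames("chk1", 2)
f

# Read the data and membership files only if they are present.
# Assumptions: whitespace-separated values; the 'dat' file has a header
# row of variable names; the 'mem' file has no header.
if (file.exists(f[["dat"]]) && file.exists(f[["mem"]])) {
  dat <- read.table(f[["dat"]], header = TRUE)
  mem <- scan(f[["mem"]], quiet = TRUE)
}
```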

     When generating clusters, population covariance matrices are all
     positive-definite. However, sample covariance matrices might be
     only positive semi-definite (singular) due to small cluster
     sizes. In this case, the function 'genRandomClust' will
     automatically use the "fixedpoint" method to search for the
     optimal projection direction.

     The current version of the function 'genPositiveDefMat' implements
     four methods to generate random covariance matrices. The first
     method, denoted by "eigen", first randomly generates eigenvalues
     (lambda_1,...,lambda_p) for the covariance matrix Sigma, then
     uses the columns of a randomly generated orthogonal matrix
     Q=(alpha_1,...,alpha_p) as eigenvectors. The covariance matrix
     Sigma is then constructed as
     Q*diag(lambda_1,...,lambda_p)*Q^T.
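
     The "eigen" construction can be sketched in base R as follows.
     This is an illustration under stated assumptions rather than the
     package's code: the eigenvalues are drawn uniformly from
     ['lambdaLow', 'lambdaLow'*'ratioLambda'], and the random
     orthogonal matrix is obtained from the QR decomposition of a
     Gaussian matrix, which is one common way to get such a matrix and
     may differ from what 'genPositiveDefMat' does.

```r
set.seed(123)
p <- 4              # number of variables (illustrative)
lambdaLow <- 1
ratioLambda <- 10

# Eigenvalues from [lambdaLow, lambdaLow * ratioLambda]
# (uniform sampling is an assumption).
lambda <- runif(p, min = lambdaLow, max = lambdaLow * ratioLambda)

# Random orthogonal matrix Q from the QR decomposition of a Gaussian
# matrix (an assumption; the package may generate Q differently).
Q <- qr.Q(qr(matrix(rnorm(p * p), p, p)))

# Covariance matrix Sigma = Q diag(lambda) Q^T.
Sigma <- Q %*% diag(lambda) %*% t(Q)

# Q is orthogonal, and the eigenvalues of Sigma are the chosen lambda.
isOrthogonal <- isTRUE(all.equal(crossprod(Q), diag(p)))
eigOK <- isTRUE(all.equal(sort(eigen(Sigma, symmetric = TRUE)$values),
                          sort(lambda)))
```

     Because every eigenvalue is at least 'lambdaLow' > 0, a matrix
     built this way is positive-definite.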

     The remaining methods, denoted by "onion", "c-vine", and
     "unifcorrmat" respectively, first generate a random correlation
     matrix R via the corresponding method mentioned or proposed in
     Joe (2006), then randomly generate variances
     (sigma_1^2,...,sigma_p^2) from an interval specified by the
     argument 'rangeVar'. The covariance matrix Sigma is then
     constructed as
     diag(sigma_1,...,sigma_p)*R*diag(sigma_1,...,sigma_p).
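
     The last step, turning a correlation matrix R and variances into
     a covariance matrix, can be sketched in base R as below. The
     correlation matrix here is only a stand-in obtained by rescaling
     a random positive-definite matrix with 'cov2cor'; it is not the
     "onion", "c-vine", or "unifcorrmat" method of Joe (2006). Drawing
     the variances uniformly from 'rangeVar' is likewise an
     assumption.

```r
set.seed(42)
p <- 3                  # number of variables (illustrative)
rangeVar <- c(1, 10)

# Stand-in random correlation matrix (NOT one of the methods in
# Joe, 2006): rescale a random positive-definite matrix to unit diagonal.
X <- matrix(rnorm(10 * p), 10, p)
R <- cov2cor(crossprod(X))

# Variances from the interval 'rangeVar' (uniform sampling is an
# assumption), and the corresponding standard deviations.
sigma2 <- runif(p, min = rangeVar[1], max = rangeVar[2])
sigma  <- sqrt(sigma2)

# Covariance matrix Sigma = diag(sigma) R diag(sigma).
Sigma <- diag(sigma) %*% R %*% diag(sigma)

# The diagonal of Sigma holds the variances, and rescaling recovers R.
varOK <- isTRUE(all.equal(diag(Sigma), sigma2))
corOK <- isTRUE(all.equal(cov2cor(Sigma), R))
```

     Note that the diagonal matrices hold standard deviations, so the
     variances drawn from 'rangeVar' appear on the diagonal of Sigma.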

_V_a_l_u_e:

     The function outputs four data files for each data set (see
     details).

     This function also returns separation information data frames 
     'infoFrameTheory' and 'infoFrameData' based on population  and
     empirical mean vectors and covariance matrices of clusters for all
      the data sets generated. Both 'infoFrameTheory' and
     'infoFrameData'  contain the following seven columns:

Column 1: Labels of clusters (1, 2, ..., numClust), where numClust is
          the number of clusters for the data set.

Column 2: Labels of the corresponding nearest neighbors.

Column 3: Separation indices of the clusters to their nearest
          neighboring clusters.

Column 4: Labels of the corresponding farthest neighboring clusters.

Column 5: Separation indices of the clusters to their farthest
          neighbors.

Column 6: Median separation indices of the clusters to their
          neighbors.

Column 7: Data file names with format '[fileName]_[i]', where i
          indicates the replicate number.

 datList: A list of data matrices for generated data sets.

 memList: A list of cluster memberships for data points for generated
          data sets.

noisyList: A list of sets of noisy variables for generated data sets.

_N_o_t_e:

     This function might take a while to complete.

_A_u_t_h_o_r(_s):

     Weiliang Qiu stwxq@channing.harvard.edu
      Harry Joe harry@stat.ubc.ca

_R_e_f_e_r_e_n_c_e_s:

     Joe, H. (2006) Generating Random Correlation Matrices Based on
     Partial Correlations.  _Journal of Multivariate Analysis_, *97*,
     2177-2189.

     Milligan, G. W. (1985) An Algorithm for Generating Artificial Test
     Clusters. _Psychometrika_, *50*, 123-127.

     Qiu, W.-L. and Joe, H. (2006a) Generation of Random Clusters with
     Specified Degree of Separation. _Journal of Classification_,
     *23*(2), 315-334.

     Qiu, W.-L. and Joe, H. (2006b) Separation Index and Partial
     Membership for Clustering. _Computational Statistics and Data
     Analysis_, *50*, 585-603.

     Su, J. Q. and Liu, J. S. (1993) Linear Combinations of Multiple
     Diagnostic Markers. _Journal of the American Statistical
     Association_, *88*, 1350-1355.

     Ghosh, S. and Henderson, S. G. (2003) Behavior of the NORTA Method
     for Correlated Random Vector Generation as the Dimension
     Increases. _ACM Transactions on Modeling and Computer Simulation
     (TOMACS)_, *13*(3), 276-294.

     Kurowicka, D. and Cooke, R. (2006) _Uncertainty Analysis with High
     Dimensional Dependence Modelling_. Wiley.

_E_x_a_m_p_l_e_s:

     ## Not run: 
     tmp1 <- genRandomClust(numClust=7, sepVal=0.3, numNonNoisy=5,  
                     numNoisy=3, numOutlier=5, numReplicate=2, fileName="chk1")
     ## End(Not run)
     ## Not run: 
     tmp2 <- genRandomClust(numClust=7, sepVal=0.3, numNonNoisy=5,  
                     numNoisy=3, numOutlier=5, numReplicate=2, 
                     covMethod="unifcorrmat", fileName="chk2")
     ## End(Not run)
     ## Not run: 
     tmp3 <- genRandomClust(numClust=2, sepVal=-0.1, numNonNoisy=2,  
                     numNoisy=6, numOutlier=30, numReplicate=1, 
                     clustszind=1, clustSizeEq=80, rangeVar=c(10, 20),
                     covMethod="unifcorrmat", iniProjDirMethod="naive",
                     projDirMethod="fixedpoint", fileName="chk3")
     ## End(Not run)

