pvclust               package:pvclust               R Documentation

_C_a_l_c_u_l_a_t_i_n_g _P-_v_a_l_u_e_s _f_o_r _H_i_e_r_c_h_i_c_a_l _C_l_u_s_t_e_r_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     Calculates p-values for hierarchical clustering via multiscale
     bootstrap resampling. Hierarchical clustering is performed on the
     given data and a p-value is computed for each cluster.

_U_s_a_g_e:

     pvclust(data, method.hclust="average",
             method.dist="correlation", use.cor="pairwise.complete.obs",
             nboot=1000, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE)

     parPvclust(cl, data, method.hclust="average",
                method.dist="correlation", use.cor="pairwise.complete.obs",
                nboot=1000, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE,
                init.rand=TRUE, seed=NULL)

_A_r_g_u_m_e_n_t_s:

    data: numeric data matrix or data frame.

method.hclust: the agglomerative method used in hierarchical
          clustering. This should be (an abbreviation of) one of
          '"average"', '"ward"', '"single"', '"complete"',
          '"mcquitty"', '"median"' or '"centroid"'. The default is
          '"average"'. See 'method' argument in 'hclust'. 

method.dist: the distance measure to be used. This should be (an
          abbreviation of) one of '"correlation"', '"uncentered"',
          '"abscor"' or those which are allowed for 'method' argument
          in 'dist' function. The default is '"correlation"'. See
          _details_ section in this help and 'method' argument in
          'dist'. 

 use.cor: character string which specifies the method for computing
          correlation with data including missing values. This should
          be (an abbreviation of) one of '"all.obs"', '"complete.obs"'
          or '"pairwise.complete.obs"'. See the 'use' argument in 'cor'
          function. 

   nboot: the number of bootstrap replications. The default is '1000'.

       r: numeric vector which specifies the relative sample sizes of
          bootstrap replications. For original sample size n and
          bootstrap sample size n', this is defined as r=n'/n.

   store: logical. If 'store=TRUE', all bootstrap replications are
          stored in the output object. The default is 'FALSE'.

      cl: 'snow' cluster object which may be generated by function
          'makeCluster'. See 'snow-startstop' in 'snow' package.

  weight: logical. If 'weight=TRUE', resampling is done with a weight
          vector instead of an index vector. Useful for large 'r'
          values ('r>10'). Currently available only for distances
          '"correlation"' and '"abscor"'.

init.rand: logical. If 'init.rand=TRUE', random number generators are
          initialized in the child processes. Random seeds can be set
          with the 'seed' argument.

    seed: integer vector of random seeds. It should have the same
          length as 'cl'. If 'NULL' is specified, '1:length(cl)' is
          used as seed vector. The default is 'NULL'.
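
     The relationship between 'r', index-vector resampling and
     weight-vector resampling can be illustrated in base R. This is
     only a sketch under stated assumptions: the rounding of n' = r*n,
     the use of 'cov.wt', and all variable names are illustrative and
     are not taken from the 'pvclust' source.

```r
set.seed(123)
n <- 20                      # original sample size (number of rows)
X <- matrix(rnorm(n * 3), nrow = n,
            dimnames = list(NULL, c("a", "b", "c")))

r <- 1.4                     # relative sample size
n_boot <- round(n * r)       # n' = r * n (rounding is an assumption)

# Index vector: which rows enter the bootstrap sample
idx <- sample(n, size = n_boot, replace = TRUE)

# Weight vector: how many times each original row was drawn
w <- tabulate(idx, nbins = n)

# Correlation from the resampled rows ...
cor_index <- cor(X[idx, ])
# ... and the same correlation from the original rows plus weights
cor_weight <- cov.wt(X, wt = w / sum(w), cor = TRUE)$cor

all.equal(cor_index, cor_weight, check.attributes = FALSE)
```

     The weight vector always has length n, whereas the index vector
     has length n', which may explain why the weighted form is
     suggested for large 'r'.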

_D_e_t_a_i_l_s:

     Function 'pvclust' conducts multiscale bootstrap resampling to
     calculate p-values for each cluster in the result of hierarchical
     clustering. 'parPvclust' is the parallel version of this procedure
     which depends on 'snow' package for parallel computation.
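
     The resampling idea can be sketched in base R as follows. For
     each relative sample size, rows are resampled with replacement,
     the columns are re-clustered, and the fraction of replicates in
     which a chosen cluster reappears is recorded. The toy data, the
     helper 'cluster_sets' and the other names are hypothetical; the
     curve fitting that 'pvclust' applies to these frequencies is not
     reproduced here.

```r
set.seed(1)
n <- 60
f1 <- rnorm(n); f2 <- rnorm(n)          # two latent factors
X <- cbind(a = f1 + rnorm(n, sd = 0.5), b = f1 + rnorm(n, sd = 0.5),
           c = f2 + rnorm(n, sd = 0.5), d = f2 + rnorm(n, sd = 0.5),
           e = f2 + rnorm(n, sd = 0.5))

# Hypothetical helper: cluster the columns of X and return every
# cluster of the dendrogram as a string of sorted column indices.
cluster_sets <- function(X) {
  hc <- hclust(as.dist(1 - cor(X)), method = "average")
  members <- vector("list", nrow(hc$merge))
  for (i in seq_len(nrow(hc$merge))) {
    parts <- lapply(hc$merge[i, ], function(k) {
      if (k < 0) -k else members[[k]]   # negative: single object
    })
    members[[i]] <- sort(unlist(parts))
  }
  vapply(members, paste, character(1), collapse = ",")
}

target <- "1,2"              # the cluster made of columns a and b
r <- c(0.5, 1.0, 1.4)        # relative sample sizes
nboot <- 50                  # small, for illustration only

# Bootstrap frequency of the target cluster at each scale
bp <- sapply(r, function(ri) {
  mean(replicate(nboot, {
    idx <- sample(n, size = round(n * ri), replace = TRUE)
    target %in% cluster_sets(X[idx, ])
  }))
})
names(bp) <- paste0("r=", r)
bp
```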

     For data expressed as an (n, p) matrix or data frame, we assume
     that the data are n observations of p objects, which are to be
     clustered. The i'th row vector corresponds to the i'th observation
     of these objects and the j'th column vector corresponds to a
     sample of the j'th object with size n.
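
     Because the objects are the columns, this layout is the transpose
     of what 'dist' expects (it measures distances between rows). A
     small base R illustration, in which the data and names are
     invented and the explicit transpose is an assumption about how
     one would reproduce the layout by hand:

```r
set.seed(7)
# n = 6 observations (rows) of p = 3 objects (columns)
X <- matrix(rnorm(18), nrow = 6,
            dimnames = list(NULL, c("obj1", "obj2", "obj3")))

# 'dist' compares rows, so transpose to compare the p objects
d <- dist(t(X), method = "euclidean")
hc <- hclust(d, method = "average")

attr(d, "Size")   # number of objects being clustered
hc$labels         # the column names of X
```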

     There are several methods to measure the dissimilarities between
     objects. For data matrix X, '"correlation"' method takes

                           1 - cor(X)[j,k]

     for dissimilarity between j'th and k'th object, where cor is
     function 'cor'.

     '"uncentered"' takes uncentered sample correlation

 1 - sum(X[,j] * X[,k]) / (sqrt(sum(X[,j]^2)) * sqrt(sum(X[,k]^2)))

     and '"abscor"' takes the absolute value of sample correlation

                        1 - abs(cor(X)[j,k]).
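
     The three dissimilarities above can be written out directly in
     base R. This is a sketch that follows the formulas as stated,
     with made-up data and variable names; it ignores missing values
     (the 'use.cor' argument) and is not the package's own code.

```r
set.seed(42)
X <- matrix(rnorm(40), nrow = 10,
            dimnames = list(NULL, c("a", "b", "c", "d")))

# "correlation": 1 - cor(X)[j,k]
d_cor <- 1 - cor(X)

# "uncentered": 1 - sum(X[,j]*X[,k]) / (|X[,j]| * |X[,k]|)
norms <- sqrt(colSums(X^2))
d_unc <- 1 - crossprod(X) / outer(norms, norms)

# "abscor": 1 - abs(cor(X)[j,k])
d_abs <- 1 - abs(cor(X))

# Any of these can be passed to hclust after conversion
hc <- hclust(as.dist(d_cor), method = "average")
```

     Note that '"abscor"' treats strongly negatively correlated
     objects as close, whereas '"correlation"' treats them as far
     apart.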

_V_a_l_u_e:

  hclust: hierarchical clustering for original data generated by
          function 'hclust'. See 'hclust' for details.

   edges: data frame object which contains p-values and supporting
          information such as standard errors.

   count: data frame object which contains primitive information about
          the result of multiscale bootstrap resampling.

   msfit: list whose elements are results of curve fitting for
          multiscale bootstrap resampling, of class 'msfit'. See
          'msfit' for details.

   nboot: numeric vector of number of bootstrap replications.

       r: numeric vector of the relative sample size for bootstrap
          replications.

   store: list containing the bootstrap replications, if 'store=TRUE'
          was given to function 'pvclust' or 'parPvclust'.

_A_u_t_h_o_r(_s):

     Ryota Suzuki ryota.suzuki@is.titech.ac.jp

_R_e_f_e_r_e_n_c_e_s:

     Shimodaira, H. (2004) "Approximately unbiased tests of regions
     using multistep-multiscale bootstrap resampling", _Annals of
     Statistics_, 32, 2616-2641.

     Shimodaira, H. (2002) "An approximately unbiased test of
     phylogenetic tree selection", _Systematic Biology_, 51, 492-508.

     Suzuki, R. and Shimodaira, H. (2004) "An application of multiscale
     bootstrap resampling to hierarchical clustering of microarray
     data: How accurate are these clusters?", _The Fifteenth
     International Conference on Genome Informatics 2004_, P034.

     <URL: http://www.is.titech.ac.jp/~shimo/prog/pvclust/>

_S_e_e _A_l_s_o:

     'lines.pvclust', 'print.pvclust', 'msfit', 'plot.pvclust',
     'text.pvclust', 'pvrect' and 'pvpick'.

_E_x_a_m_p_l_e_s:

     ## using Boston data in package MASS
     library(MASS)
     data(Boston)

     ## multiscale bootstrap resampling
     boston.pv <- pvclust(Boston, nboot=100)

     ## CAUTION: nboot=100 may be too small for actual use.
     ##          We suggest nboot=1000 or larger.
     ##          plot/print functions will be useful for diagnostics.

     ## plot dendrogram with p-values
     plot(boston.pv)

     ask.bak <- par()$ask
     par(ask=TRUE)

     ## highlight clusters with high au p-values
     pvrect(boston.pv)

     ## print the result of multiscale bootstrap resampling
     print(boston.pv, digits=3)

     ## plot diagnostic for curve fitting
     msplot(boston.pv, edges=c(2,4,6,7))

     par(ask=ask.bak)

     ## Print clusters with high p-values
     boston.pp <- pvpick(boston.pv)
     boston.pp

     ## Not run: 
     ## parallel computation via snow package
     library(snow)
     cl <- makeCluster(10, type="MPI")

     ## parallel version of pvclust
     boston.pv <- parPvclust(cl, Boston, nboot=1000)
     ## End(Not run)

