vabayelMix            package:vabayelMix            R Documentation

_V_a_r_i_a_t_i_o_n_a_l _B_a_y_e_s_i_a_n _G_a_u_s_s_i_a_n _M_i_x_t_u_r_e _M_o_d_e_l

_D_e_s_c_r_i_p_t_i_o_n:

     Learns a Gaussian mixture model from data using an optimal
     separable approximation to the posterior density. The optimisation
     uses a variational procedure and implements an iterative ensemble
     learning algorithm. The algorithm provides a framework in which to
     infer the number of clusters in the data set. Prior information
     may be incorporated through the specification of hyperparameters
     in a prior distribution. The current version implements a Gaussian
     mixture model in which the covariance matrices are diagonal.

_A_r_g_u_m_e_n_t_s:

    data: A matrix of dimension Ns x Ndim containing the data to be
          clustered. The algorithm clusters the rows of the matrix and
          treats the columns as dimensions.

   prior: A list containing prior information, as obtained for example
          by using 'UseBasicPrior'. The list elements are 'prior$mean',
          'prior$ivarm', 'prior$ivara', 'prior$ivarb' and 'prior$dapi'.
          The first four are matrices of dimension Ncat x Ndim;
          'prior$dapi' is a vector of length Ncat. 'prior$mean'
          contains the means of the Gaussian priors over the cluster
          means. 'prior$ivarm' contains the inverse variances of the
          Gaussian priors over the cluster means. 'prior$ivara' and
          'prior$ivarb' contain the parameters of the gamma prior
          distribution over the inverse variances of the clusters.
          'prior$dapi' is a weight vector specifying prior knowledge
          about the number of clusters. If 'prior' is unspecified, a
          completely uninformative prior is used, which assumes the
          rows to have been normalised to zero mean.
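          As a sketch of the list structure described above (the
          hyperparameter values here are arbitrary illustrative
          assumptions, not package defaults), a prior for Ncat = 4
          clusters in Ndim = 2 dimensions could be built by hand as:

          Ncat <- 4; Ndim <- 2;                       # assumed sizes
          myprior <- list(
            mean  = matrix(0, nrow=Ncat, ncol=Ndim), # means of cluster-mean priors
            ivarm = matrix(1, nrow=Ncat, ncol=Ndim), # inverse variances of mean priors
            ivara = matrix(1, nrow=Ncat, ncol=Ndim), # gamma prior parameters
            ivarb = matrix(1, nrow=Ncat, ncol=Ndim), # gamma prior parameters
            dapi  = rep(1, Ncat)                     # weight vector over clusters
          );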

    Ncat: The maximum number of clusters or categories to look for in
          the data set. The algorithm switches off clusters it does not
          need. See References.

   nruns: Number of ensemble learning optimisation runs to be
          performed. Each optimisation run uses a different (random)
          starting point.

   npick: The number of runs (out of 'nruns') that best optimise the
          cost function and whose results are returned. See References.

   MaxIt: Maximum number of iterations to be performed for a single
          optimisation run.

conv.tol: Threshold tolerance level for establishing convergence of
          iterations.

     nCV: Number of consecutive iterations to consider in establishing
          convergence of the run at level 'conv.tol'.

verbatim: Logical. If TRUE, prints the estimates and the cost function
          value at each iteration.

_V_a_l_u_e:

     A list with the following components: 

 estvals: A list with components:

     mean Means of the Gaussian posteriors. Matrix of dimension Ncat x
          Ndim. A row containing all zeros means that the component is
          absent.

     ivarm Inverse variances of the Gaussian posteriors. Matrix of
          dimension Ncat x Ndim.

     ivara, ivarb Parameters of the gamma posteriors. Matrices of
          dimension Ncat x Ndim.

     dapi Parameters of the Dirichlet posterior giving the weights of
          the components. A value of 1 means that the component is
          absent.

     wcl: A matrix of dimension npick x Ns. Each row gives the cluster
          assignment of each row of the data. Clusters are labelled by
          integers.

   probs: A list of length npick, each list element is a matrix of
          dimension Ns x Ncat containing the probabilities of
          membership to clusters.

   costs: A vector of length nruns giving the converged value of the
          cost function for each run.

    conv: A binary vector of length nruns indicating whether each run
          converged (0) or not (1).

_A_u_t_h_o_r(_s):

     Andrew Teschendorff <aet21@hutchison-mrc.cam.ac.uk>

_R_e_f_e_r_e_n_c_e_s:

_1 D.J.MacKay: Developments in probabilistic modelling with neural
     networks-ensemble learning. In Neural Networks: Artificial
     Intelligence and Industrial Applications. Proceedings of the 3rd
     Annual Symposium on Neural Networksm Nijmengen, Netherlands,
     Berlin Springer, 191-198 (1995).

_2 J.W.Miskin : Ensemble Learning for Independent Component Analysis,
     PhD thesis University of Cambridge December 2000.

_3 A. E. Teschendorff,...et al.: A variational bayesian mixture
     modelling framework for cluster analysis of gene expression data.
     Submitted to Bioinformatics.

_E_x_a_m_p_l_e_s:

     ## Simulate two groups of 50 samples in 2 dimensions; the second
     ## group is shifted along dimension deg.idx.
     NsTot <- 100;
     Nspg <- 50;
     Ng <- 2;
     deg.idx <- 1;
     data <- matrix(nrow=NsTot, ncol=Ng);
     for (s in 1:Nspg) {
       data[s,] <- rnorm(Ng, 0, 0.25);
     }
     for (s in (Nspg+1):NsTot) {
       data[s,] <- rnorm(Ng, 0, 0.25);
       data[s,deg.idx] <- rnorm(1, 2, 0.25);
     }
     types.idx <- c(rep(1, Nspg), rep(2, NsTot - Nspg));
     useprior.l <- UseBasicPrior(data, rep(1, 4));
     vbmix <- vabayelMix(data, prior=NA, Ncat=4, nruns=10, npick=2,
                         MaxIt=500, conv.tol=0.001, nCV=10);
     ## or, with the prior built above:
     ## vbmix <- vabayelMix(data, prior=useprior.l, Ncat=4, nruns=10,
     ##                     npick=2, MaxIt=500, conv.tol=0.001, nCV=10);
     plot(1:NsTot, vbmix$wcl[1,], type="h", col=types.idx);
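
     ## The returned components can then be inspected; a sketch,
     ## assuming the 'vbmix' object produced above, using only the
     ## 'wcl' and 'costs' components documented under Value:
     table(vbmix$wcl[1,]);  # size of each inferred cluster, best run
     vbmix$costs;           # converged cost function value per run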

