regmix                  package:fpc                  R Documentation

_M_i_x_t_u_r_e _M_o_d_e_l _M_L _f_o_r _C_l_u_s_t_e_r_w_i_s_e _L_i_n_e_a_r _R_e_g_r_e_s_s_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     Computes an ML-estimator for clusterwise linear regression under a
     regression mixture model with Normal errors. The parameters are
     cluster proportions, regression coefficients and error variances.
     None of them depends on the values of the independent variables,
     and all may differ between clusters. Computation is by the
     EM-algorithm.
     The number of clusters is estimated via the Bayesian Information
     Criterion (BIC).

_U_s_a_g_e:

     regmix(indep, dep, ir=1, nclust=1:7, icrit=1.e-5, minsig=1.e-6, warnings=FALSE)

     regem(indep, dep, m, cln, icrit=1.e-5, minsig=1.e-6, warnings=FALSE) 

_A_r_g_u_m_e_n_t_s:

   indep: numerical matrix or vector. Independent variables.

     dep: numerical vector. Dependent variable.

      ir: positive integer. Number of iteration runs for every number
          of clusters.

  nclust: vector of positive integers. Numbers of clusters.

   icrit: positive numerical. Stopping criterion for the iterations
          (difference of loglikelihoods).

  minsig: positive numerical. Minimum value for the variance parameters
          (likelihood is unbounded if variances are allowed to converge
          to 0).

warnings: logical. If 'TRUE', warnings are given during the EM
          iteration in case of collinear regressors, mixture components
          that are too small, and error variances smaller than the
          minimum 'minsig'. In the first two cases, the algorithm run
          is terminated without a result, but an optimal solution is
          still computed from the other algorithm runs (if there are
          any). In the last case, the corresponding variance is set to
          the minimum.

     cln: positive integer. (Single) number of clusters.

       m: matrix of positive numericals. Number of columns must be
          'cln'. Number of rows must be number of data points. Every
          row must add up to 1. Initial configuration for the EM iteration
          in terms of a probability vector for every point which gives
          its degree of membership to every cluster. As generated by
          'randcmatrix'.
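
     As a minimal sketch of what a valid 'm' looks like, the following
     base R code builds a random soft membership matrix with strictly
     positive entries whose rows add up to 1. The helper name
     'random_membership' is hypothetical and not part of the package,
     and the code does not claim to reproduce 'randcmatrix'.

```r
# Hypothetical helper (not part of fpc): random soft membership matrix
# with n rows (data points) and cln columns (clusters).
random_membership <- function(n, cln) {
  m <- matrix(runif(n * cln), nrow = n, ncol = cln)  # positive entries
  m / rowSums(m)                                     # normalize every row
}

set.seed(1)
m0 <- random_membership(n = 10, cln = 3)
dim(m0)       # 10 rows (points), 3 columns (clusters)
rowSums(m0)   # every row adds up to 1
```

     A matrix of this shape could then be passed as 'm', with 'cln'
     equal to its number of columns.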

_D_e_t_a_i_l_s:

     The result of the EM iteration depends on the initial
     configuration, which is generated randomly by 'randcmatrix' for
     'regmix'. 'regmix' calls 'regem'. To provide the initial
     configuration manually, use parameter 'm' of 'regem' directly.
     The example shows how to compute 'm' from initial parameter
     values.
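
     To illustrate how an EM iteration for a regression mixture can
     proceed from an initial configuration, here is a simplified
     sketch in base R. It is not the code of 'regem': the function
     name 'em_sketch' and the argument 'maxit' are assumptions made
     for this illustration, and the treatment of collinear regressors
     and of too small clusters is omitted. Only the variance floor
     'minsig' and the stopping rule on the loglikelihood difference
     'icrit' follow the argument descriptions above.

```r
# Illustrative EM for a mixture of linear regressions (base R only).
# em_sketch() and maxit are hypothetical; this is not fpc's regem().
em_sketch <- function(x, y, m, icrit = 1e-5, minsig = 1e-6, maxit = 200) {
  X <- cbind(1, x)                 # design matrix with intercept column
  cln <- ncol(m)
  z <- m                           # current membership probabilities
  loglik_old <- -Inf
  for (it in seq_len(maxit)) {
    # M-step: weighted least squares, variance and proportion per cluster
    coef <- matrix(NA_real_, nrow = ncol(X), ncol = cln)
    var <- numeric(cln)
    for (k in seq_len(cln)) {
      w <- z[, k]
      coef[, k] <- lm.wfit(X, y, w)$coefficients
      res <- y - as.vector(X %*% coef[, k])
      var[k] <- max(sum(w * res^2) / sum(w), minsig)  # variance floor
    }
    eps <- colMeans(z)
    # E-step: proportion times Normal density, then normalize the rows
    dens <- sapply(seq_len(cln), function(k)
      eps[k] * dnorm(y, mean = as.vector(X %*% coef[, k]),
                     sd = sqrt(var[k])))
    loglik <- sum(log(rowSums(dens)))
    z <- dens / rowSums(dens)
    if (loglik - loglik_old < icrit) break   # stop on small improvement
    loglik_old <- loglik
  }
  list(loglik = loglik, coef = coef, var = var, eps = eps, z = z,
       g = max.col(z, ties.method = "first"))
}

# Simulated data from two regression lines (made up for this sketch)
set.seed(42)
n <- 200
x <- runif(n)
grp <- rep(1:2, each = n / 2)
y <- ifelse(grp == 1, 1 + 2 * x, 4 - 3 * x) + rnorm(n, sd = 0.1)

# Random soft initial configuration; rows add up to 1
m <- matrix(runif(n * 2), nrow = n, ncol = 2)
m <- m / rowSums(m)

fit <- em_sketch(x, y, m)
fit$coef      # rows: intercept, slope; columns: clusters
table(fit$g)  # estimated cluster sizes
```

     Running the sketch with different seeds for the initial matrix
     is one way to see the dependence on the initial configuration,
     which is why several iteration runs ('ir') can be useful.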

     The original paper DeSarbo and Cron (1988) suggests the AIC for
     estimating the number of clusters. The use of the BIC is advocated
     by Wedel and DeSarbo (1995). The BIC is defined here as '2*loglik
     - log(n)*((p+3)*cln-1)', where 'n' is the number of data points
     and 'p' is the number of independent variables. With this sign
     convention, larger values are better.
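
     The BIC formula above can be written as a small helper. The
     function name 'bic_regmix' and the loglikelihood values are made
     up for illustration. The penalty term appears to count 'p+1'
     regression coefficients, one variance and one proportion per
     cluster, minus one because the proportions add up to 1.

```r
# Hypothetical helper: BIC in the "larger is better" form given above.
bic_regmix <- function(loglik, n, p, cln) {
  2 * loglik - log(n) * ((p + 3) * cln - 1)
}

# Made-up loglikelihoods for 1, 2 and 3 clusters with n = 150, p = 1
loglik <- c(-120, -60, -55)
bic <- bic_regmix(loglik, n = 150, p = 1, cln = 1:3)
bic
which.max(bic)   # position of the preferred number of clusters: 2
```

     In this toy example the gain in loglikelihood from 2 to 3
     clusters does not outweigh the larger penalty, so 2 clusters
     are preferred.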

     See the entry for the input parameter 'warnings' for the treatment
     of several numerical problems.

_V_a_l_u_e:

     'regmix' returns a list containing the components 'clnopt, loglik,
     bic, coef, var, eps, z, g'.

     'regem' returns a list containing the components 'loglik, coef,
     var, z, g, warn'.

  clnopt: optimal number of clusters according to the BIC.

  loglik: loglikelihood for the optimal model.

     bic: vector of BIC values for all numbers of clusters in 'nclust'.

    coef: matrix of regression coefficients. First row: intercept
          parameter. Second row: parameter of first independent
          variable and so on. Columns corresponding to clusters.

     var: vector of error variance estimators for the clusters.

     eps: vector of cluster proportion estimators.

       z: matrix of estimated a posteriori probabilities of the points
          (rows) to be generated by the clusters (columns). Compare
          input argument 'm'.

       g: integer vector of estimated cluster numbers for the points
          (via argmax over 'z').

    warn: logical. 'TRUE' if one of the estimated clusters has too few
          points and/or collinear regressors.
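
     The relation between 'z' and 'g' can be sketched in base R with a
     toy posterior matrix (the numbers are made up). How ties are
     resolved inside the package is not documented here; 'which.max'
     takes the first maximum, which is an assumption of this sketch.

```r
# Toy posterior matrix: 4 points (rows), 3 clusters (columns)
z <- rbind(c(0.7, 0.2, 0.1),
           c(0.1, 0.8, 0.1),
           c(0.3, 0.3, 0.4),
           c(0.5, 0.5, 0.0))

# Cluster number per point: column with the largest posterior probability
g <- apply(z, 1, which.max)
g          # 1 2 3 1 (the tie in the last row goes to the first cluster)
table(g)   # number of points assigned to each cluster
```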

_A_u_t_h_o_r(_s):

     Christian Hennig chrish@stats.ucl.ac.uk <URL:
     http://www.homepages.ucl.ac.uk/~ucakche/>

_R_e_f_e_r_e_n_c_e_s:

     DeSarbo, W. S. and Cron, W. L. (1988) A maximum likelihood
     methodology for clusterwise linear regression, _Journal of
     Classification_ 5, 249-282.

     Wedel, M. and DeSarbo, W. S. (1995) A mixture likelihood approach
     for generalized linear models, _Journal of Classification_ 12,
     21-56.

_S_e_e _A_l_s_o:

     'fixreg' for fixed point clusters for clusterwise linear
     regression.

     'EMclust' for Normal mixture model fitting (non-regression).

_E_x_a_m_p_l_e_s:

     set.seed(12234)
     data(tonedata)
     # Note: If you do not use the installed package, replace this by
     # tonedata <- read.table("(path/)tonedata.txt", header=TRUE)
     attach(tonedata)
     rmt1 <- regmix(stretchratio,tuned,nclust=1:2)
     # nclust=1:2 makes the example fast;
     # a more serious application would rather use the default.
     rmt1$g
     rmt1$bic
     # start with initial parameter values
     cln <- 3
     n <- 150
     initcoef <- cbind(c(2,0),c(0,1),c(0,2.5))
     initvar <- c(0.001,0.0001,0.5)
     initeps <- c(0.4,0.3,0.3)
     # computation of m from initial parameters
     m <- matrix(nrow=n, ncol=cln)
     stm <- numeric(n)
     # unnormalized weights: proportion times Normal density
     for (i in 1:cln)
       for (j in 1:n){
         m[j,i] <- initeps[i]*dnorm(tuned[j],mean=initcoef[1,i]+
                   initcoef[2,i]*stretchratio[j], sd=sqrt(initvar[i]))
       }
     # normalize every row of m so that it adds up to 1
     for (j in 1:n){
       stm[j] <- sum(m[j,])
       for (i in 1:cln)
         m[j,i] <- m[j,i]/stm[j]
     }
     rmt2 <- regem(stretchratio, tuned, m, cln)
     # BIC as defined in Details; 4 = p+3 with p = 1 independent variable
     rmt2bic <- 2*rmt2$loglik - log(n)*(4*cln-1)
     rmt2bic

