monomvn               package:monomvn               R Documentation

_M_a_x_i_m_u_m _L_i_k_e_l_i_h_o_o_d _E_s_t_i_m_a_t_i_o_n _f_o_r _M_u_l_t_i_v_a_r_i_a_t_e _N_o_r_m_a_l
_D_a_t_a _w_i_t_h _M_o_n_o_t_o_n_e _M_i_s_s_i_n_g_n_e_s_s

_D_e_s_c_r_i_p_t_i_o_n:

     Maximum likelihood estimation of the mean and covariance matrix of
     multivariate normal (MVN) distributed data with a monotone
     missingness pattern. Through the use of parsimonious/shrinkage
     regressions (e.g., plsr, pcr, ridge, lasso, etc.), where standard
     regressions fail, this function can handle an (almost) arbitrary
     amount of missing data

_U_s_a_g_e:

     monomvn(y, pre = TRUE, method = c("plsr", "pcr", "lasso", "lar",
             "forward.stagewise", "stepwise", "ridge", "factor"), p = 0.9,
             ncomp.max = Inf, batch = TRUE, validation = c("CV", "LOO", "Cp"),
             obs = FALSE, verb = 0, quiet = TRUE)

_A_r_g_u_m_e_n_t_s:

       y: data 'matrix'  were each row is interpreted as a random
          sample from a MVN distribution with missing values indicated
          by 'NA'

     pre: logical indicating whether pre-processing of the 'y' is to be
          performed.  This sorts the columns so that the number of
          'NA's is non-decreasing with the column index 

  method: describes the type of _parsimonious_ (or _shrinkage_)
          regression to be performed when standard least squares
          regression fails. From the 'pls' package we have '"plsr"'
          (plsr, the default) for  partial least squares and '"pcr"'
          (pcr) for standard principal component regression.  From the
          'lars' package (see the '"type"' argument to lars) we have
          '"lasso"' for L1-constrained regression, '"lar"' for least
          angle regression, '"forward.stagewise"' and '"stepwise"' for
          fast implementations of classical forward selection of
          covariates.  From the 'MASS' package we have '"ridge"' as
          implemented by the 'lm.ridge' function.  The '"factor"'
          method treats the first 'p' columns of 'y' as known factors 

       p: when performing regressions, 'p' is the proportion of the
          number of columns to rows in the design matrix before an
          alternative regression 'method' (those above) is performed as
          if least-squares regression has failed.  Least-squares
          regression is known to fail when the number of columns equals
          the number of rows, hence a default of 'p = 0.9 <= 1'.
          Alternatively, setting 'p = 0' forces 'method' to be used for
          _every_ regression. Intermediate settings of 'p' allow the
          user to control when least-squares regressions stop and the
          'method' ones start. When 'method = "factor"' the 'p'
          argument represents an integer (positive) number of initial
          columns of 'y' to treat as known factors 

ncomp.max: maximal number of (principal) components to include in a
          'method'-only meaningful for the '"plsr"' or '"pcr"' methods.
           Large settings can cause the execution to be slow as it
          drastically increases the cross-validation (CV) time

   batch: indicates whether the columns with equal missingness should
          be processed together using a multi-response regression. This
          is more efficient if many OLS regressions are used, but can
          lead to slightly poorer, even unstable, fits when
          parsimonious regressions are used

validation: method for cross validation when applying  a _parsimonious_
          regression method.  The default setting of '"CV"' (randomized
          10-fold cross-validation) is the faster method, but does not
          yield a deterministic result and does not apply for
          regressions on less than ten responses. '"LOO"'
          (leave-one-out cross-validation) is deterministic, always
          applicable, and applied automatically whenever  '"CV"' cannot
          be used.  When standard least squares is appropriate, the
          methods implemented the 'lars' package (e.g. lasso) support
          model choice via the '"Cp"' statistic, which defaults to the
          '"CV"' method when least squares fails.  This argument is
          ignored for the '"ridge"' method; see details below

     obs: logical indicating whether or not to (additionally) compute a
          mean vector and covariance matrix based only on the observed
          data, without regressions.  I.e., means are calculated as
          averages of each non-'NA' entry in the columns of 'y', and
          entries '(a,b)' of the covariance matrix are calculated by
          applying 'cov(ya,yb)' to the jointly non-'NA' entries of
          columns 'a' and 'b' of 'y'

    verb: whether or not to print progress indicators.  The default
          ('verb = 0') keeps quiet, while any positive number causes
          brief statement about dimensions of each regression to print
          to the screen as it happens.  'verb = 2' causes each of the
          ML regression estimators to be printed along with the
          corresponding new entries of the mean and columns of the
          covariance matrix. 'verb = 3' requires that the RETURN key be
          pressed between each print statement

   quiet: causes 'warning's about regressions to be silenced when
          'TRUE'

_D_e_t_a_i_l_s:

     If 'pre = TRUE' then 'monomvn' first re-arranges the columns of
     'y' into nondecreasing order with respect to the number of missing
     ('NA')  entries. Then (at least) the first column should be
     completely observed.  The mean components and covariances between
     the first set of complete columns are obtained through the
     standard 'mean' and 'cov' routines.

     Next each successive group of columns with the same missingness
     pattern is processed in sequence (assuming 'batch = TRUE').
     Suppose a total of 'j' columns have been processed this way
     already.  Let 'y2' represent the non-missing contingent of the
     next group of 'k' columns of 'y' with and identical missingness
     pattern, and let 'y1' be the previously processed 'j-1' columns of
     'y' containing only the rows corresponding to each non-'NA' entry
     in 'y2'.  I.e., 'nrow(y1) = nrow(y2)'.  Note that 'y1' contains no
     'NA' entries since the missing data pattern is monotone. The 'k'
     next entries (indices 'j:(j+k)') of the mean vector, and the
     'j:(j+k)' rows and columns of the covariance matrix are obtained
     by multivariate regression of 'y2' on 'y1'. The regression method
     used (except in the case of 'method = "factor"' depends on the
     number of rows and columns in 'y1' and on the 'p' parameter. 
     Whenever 'ncol(y1) < p*nrow(y1)' least-squares regression is used,
     otherwise 'method = c("pcr", "plsr")'.  If ever a least-squares
     regression fails due to co-linearity then one of the other
     'method's is tried.  The '"factor"' method always involves an OLS
     regression on (a subset of) the first 'p' columns of 'y'.

     All 'method's require a scheme for estimating the amount of
     variability explained by increasing the numbers of coefficients
     (or principal components) in the model. Towards this end, the
     'pls' and 'lars' packages support 10-fold cross validation (CV) or
     leave-one-out (LOO) CV estimates of root mean squared error.  See
     'pls' and 'lars' for more details.  'monomvn' uses CV in all cases
     except when 'nrow(y1) <= 10', in which case CV fails and LOO is
     used.  Whenever 'nrow(y1) <= 3' 'pcr' fails,  so 'plsr' is used
     instead. If 'quiet = FALSE' then a 'warning' is given whenever the
     first choice for a regression fails.

     For 'pls' methods, RMSEs are calculated for a number of components
     in '1:ncomp.max' where a 'NULL' value for 'ncomp.max' it is
     replaced with

     'ncomp.max <- min(ncomp.max, ncol(y2), nrow(y1)-1)'

     which is the max allowed by the 'pls' package.

     Simple heuristics are used to select a small number of components
     ('ncomp' for 'pls'), or number of coefficients (for 'lars'), which
     explains a large amount of the variability (RMSE). The 'lars'
     methods use a one-standard error rule outlined in Section 7.10,
     page 216 of HTF below.  The 'pls' package does not currently
     support the calculation of standard errors for CV estimates of
     RMSE, so a simple linear penalty for increasing 'ncomp' is used
     instead.  The ridge constant (lambda) for 'lm.ridge' is set using
     the 'optimize' function on the 'GCV' output.

     Based on the ML 'ncol(y1)+1' regression coefficients (including
     intercept) obtained for each of the columns of 'y2', and on the
     corresponding 'matrix' of residual sum of squares, and on the
     previous 'j-1' means and rows/cols of the covariance matrix, the
     'j:(j+k)' entries and rows/cols can be filled in as described by
     Little and Rubin, section 7.4.3.

     Once every column has been processed, the entries of the mean
     vector, and rows/cols of the covariance matrix are re-arranged
     into their original order.

_V_a_l_u_e:

     'monomvn' returns an object of class '"monomvn"', which is a
     'list' containing a subset of the components below.

   call : a copy of the function call as used

     mu : estimated mean vector with columns corresponding to the
          columns of 'y' 

      S : estimated covariance matrix with rows and columns
          corresponding to the columns of 'y' 

     na : when 'pre = TRUE' this is a vector containing number of 'NA'
          entries in each column of 'y'

      o : when 'pre = TRUE' this is a vector containing the index of
          each column in the sorting of the columns of 'y' obtained by
          'o <- order(na)'

 method : method of regression used on each column, or '"complete"'
          indicating that no regression was necessary

  ncomp : number of components in a 'plsr' or 'pcr' regression, or 'NA'
          if such a method was not used.  This field is used to record
          lambda when 'lm.ridge' is used 

 lambda : if 'method' is one of 'c("lasso", "forward.stagewise",
          "ridge")', then this field records the lambda penalty
          parameters used

 mu.obs : when 'obs = TRUE' this is the observed mean vector

  S.obs : when 'obs = TRUE' this is the observed covariance matrix,
          as described above.  Note that 'S.obs' is usually not
          positive definite 

_N_o_t_e:

     The CV in 'plsr' and 'lars' are random in nature, and so can be
     dependent on the random seed.  Use 'validation=LOO' for
     deterministic (but slower) result.

     When using 'method = "factor"' in the current version of the
     package, the factors in the first 'p' columns of 'y' must also
     obey the monotone pattern, and, have no more 'NA' entries than the
     other columns of 'y'.

     Be warned that the 'lars' implementation of '"forward.stagewise"'
     can sometimes get stuck in (what seems like) an infinite loop.
     This is not a bug in the 'monomvn' package; the bug has been
     reported to the authors of 'lars'

_A_u_t_h_o_r(_s):

     Robert B. Gramacy bobby@statslab.cam.ac.uk

_R_e_f_e_r_e_n_c_e_s:

     Robert B. Gramacy, Joo Hee Lee, and Ricardo Silva (2007). _On
     estimating covariances between many assets with histories  of
     highly variable length_. 
      Preprint available on arXiv:0710.5837: <URL:
     http://arxiv.org/abs/0710.5837>

     Roderick J.A. Little and Donald B. Rubin (2002). _Statistical
     Analysis with Missing Data_, Second Edition. Wilely.

     Bjorn-Helge Mevik and Ron Wehrens (2007). _The 'pls' Package:
     Principal Component and Partial Least Squares Regression in R._ 
     Journal of Statistical Software *18*(2)

     Bradley Efron, Trevor Hastie, Ian Johnstone and Robert Tibshirani
     (2003). _Least Angle Regression (with discussion)._ Annals of
     Statistics *32*(2); see also
      <URL:
     http://www-stat.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf>

     Trevor Hastie, Robert Tibshirani and Jerome Friedman (2002).
     _Elements of Statistical Learning._ Springer, NY. [HTF]

     Some of the code for 'monomvn', and its subroutines, was inspired
     by code found on the world wide web, written by Daniel Heitjan,
      <URL:
     http://www.cceb.upenn.edu/pages/heitjan/courses/bsta782/examples/fcn.q>

     <URL: http://www.statslab.cam.ac.uk/~bobby/monomvn.html>

_S_e_e _A_l_s_o:

     'bmonomvn', 'em.norm' in the now defunct 'norm' package, and
     'mlest' in the 'mvnmle' package

_E_x_a_m_p_l_e_s:

     ## standard usage, duplicating the results in
     ## Little and Rubin, section 7.4.3 -- try adding 
     ## verb=3 argument for a step-by-step breakdown
     data(cement.miss)
     out <- monomvn(cement.miss)
     out
     out$mu
     out$S

     ##
     ## A bigger example, comparing the various methods
     ##

     ## generate N=100 samples from a 10-d random MVN
     xmuS <- randmvn(100, 20)

     ## randomly impose monotone missingness
     xmiss <- rmono(xmuS$x)

     ## plsr
     oplsr <- monomvn(xmiss, obs=TRUE)
     oplsr
     Ellik.norm(oplsr$mu, oplsr$S, xmuS$mu, xmuS$S)

     ## calculate the complete and observed RMSEs
     n <- nrow(xmiss) - max(oplsr$na)
     x.c <- xmiss[1:n,]
     mu.c <- apply(x.c, 2, mean)
     S.c <- cov(x.c)*(n-1)/n
     Ellik.norm(mu.c, S.c, xmuS$mu, xmuS$S)
     Ellik.norm(oplsr$mu.obs, oplsr$S.obs, xmuS$mu, xmuS$S)

     ## plcr
     opcr <- monomvn(xmiss, method="pcr")
     Ellik.norm(opcr$mu, opcr$S, xmuS$mu, xmuS$S)

     ## ridge regression
     oridge <- monomvn(xmiss, method="ridge")
     Ellik.norm(oridge$mu, oridge$S, xmuS$mu, xmuS$S)

     ## lasso
     olasso <- monomvn(xmiss, method="lasso")
     Ellik.norm(olasso$mu, olasso$S, xmuS$mu, xmuS$S)

     ## lar
     olar <- monomvn(xmiss, method="lar")
     Ellik.norm(olar$mu, olar$S, xmuS$mu, xmuS$S)

     ## forward.stagewise
     ofs <- monomvn(xmiss, method="forward.stagewise")
     Ellik.norm(ofs$mu, ofs$S, xmuS$mu, xmuS$S)

     ## stepwise
     ostep <- monomvn(xmiss, method="stepwise")
     Ellik.norm(ostep$mu, ostep$S, xmuS$mu, xmuS$S)

