blasso                package:monomvn                R Documentation

_B_a_y_e_s_i_a_n _L_a_s_s_o/_N_G _a_n_d _R_i_d_g_e _R_e_g_r_e_s_s_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     Inference for ordinary least squares, lasso/NG, and ridge
     regression models by (Gibbs) sampling from the Bayesian posterior
     distribution, augmented with Reversible Jump for model selection.

_U_s_a_g_e:

     bridge(X, y, T = 1000, thin = NULL, RJ = TRUE, M = NULL,
            beta = NULL, lambda2 = 1, s2 = var(y-mean(y)), mprior = 0,
            rd = NULL, ab = NULL, theta = 0, rao.s2 = TRUE, icept = TRUE,
            normalize = TRUE, verb = 1)
     blasso(X, y, T = 1000, thin = NULL, RJ = TRUE, M = NULL,
            beta = NULL, lambda2 = 1, s2 = var(y-mean(y)),
            case = c("default", "ridge", "hs", "ng"), mprior = 0, rd = NULL,
            ab = NULL, theta = 0, rao.s2 = TRUE, icept = TRUE,
            normalize = TRUE, verb = 1)

_A_r_g_u_m_e_n_t_s:

       X: 'data.frame', 'matrix', or vector of inputs 'X' 

       y: vector of output responses 'y' of length equal to the leading
          dimension (rows) of 'X', i.e., 'length(y) == nrow(X)'

       T: total number of MCMC samples to be collected 

    thin: number of MCMC samples to skip before a sample is collected
          (via thinning).  If 'NULL' (default), then 'thin' is
          determined based on the regression model implied by 'RJ',
          'lambda2', and 'ncol(X)'; and also on the errors model
          implied by 'theta' and 'nrow(X)' 

      RJ: if 'TRUE' then model selection on the columns of the design
          matrix (and thus the parameter 'beta' in the model) is
          performed by Reversible Jump (RJ) MCMC.  The initial model is
          specified by the 'beta' input, described below, and the
          maximal number of covariates in the model is specified by 'M' 

       M: the maximal number of allowed covariates (columns of 'X') in
          the model.  If input 'lambda2 > 0' then any 'M <= ncol(X)' is
          allowed.  Otherwise it must be that 'M <= min(ncol(X),
          length(y)-1)', which is the default value when a 'NULL'
          argument is given 

    beta: initial setting of the regression coefficients.  Any
          zero-components will imply that the corresponding covariate
          (column of 'X') is not in the initial model.  When input 'RJ
          = FALSE' (no RJ) and 'lambda2 > 0' (use lasso) then no
          components are allowed to be exactly zero.  The default
          setting is therefore contextual; see below for details 

 lambda2: square of the initial lasso penalty parameter.  If zero, then
          least squares regressions are used 

      s2: initial variance parameter 

    case: specifies if ridge regression or the Normal-Gamma should be
          done instead of the lasso; only meaningful when 'lambda2 > 0' 

  mprior: prior on the number of non-zero regression coefficients (and
          therefore covariates) 'm' in the model. The default ('mprior
          = 0') encodes the uniform prior on '0 <= m <= M'. A scalar
          value '0 < mprior < 1' implies a Binomial prior
          'Bin(m|n=M,p=mprior)'. A 2-vector 'mprior=c(g,h)' of positive
          values 'g' and 'h' gives a 'Bin(m|n=M,p)' prior where
          'p~Beta(g,h)'; see the sketch at the end of this section 

      rd: '=c(r, delta)', the alpha (shape) parameter and beta (rate)
          parameter to the gamma distribution prior 'G(r,delta)' for
          the lambda2 parameter under the lasso model; or, the alpha
          (shape) parameter and beta (scale) parameter to the
          inverse-gamma distribution 'IG(r/2, delta/2)' prior for the
          lambda2 parameter under the ridge regression model. A default
          of 'NULL' generates appropriate non-informative values
          depending on the nature of the regression.  See the details
          below for information on the special settings for ridge
          regression 

      ab: '=c(a, b)', the alpha (shape) parameter and the beta (scale)
          parameter for the inverse-gamma distribution prior 'IG(a,b)'
          for the variance parameter 's2'.  A default of 'NULL'
          generates appropriate non-informative values depending on the
          nature of the regression 

   theta: the rate parameter ('> 0') to the exponential prior on the
          degrees of freedom parameter 'nu' under a model with Student-t
          errors implemented by a scale-mixture prior. The default
          setting of 'theta = 0' turns off this prior, defaulting to a
          normal errors prior 

  rao.s2: indicates whether Rao-Blackwellized samples for s^2 should be
          used (default 'TRUE'); see below for more details 

   icept: if 'TRUE', an implicit intercept term is fit in the model,
          otherwise the intercept is zero; default is 'TRUE' 

normalize: if 'TRUE', each variable is standardized to have unit
          L2-norm, otherwise it is left alone; default is 'TRUE' 

    verb: verbosity level; currently only 'verb = 0' and 'verb = 1' are
          supported 
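
     As a sketch of the three 'mprior' settings (assuming a design
     matrix 'X' and response 'y' are already in scope; the 'b1'-'b3'
     names are illustrative):

          b1 <- blasso(X, y, mprior = 0)        ## uniform on 0 <= m <= M
          b2 <- blasso(X, y, mprior = 0.25)     ## Bin(m|n=M, p=0.25)
          b3 <- blasso(X, y, mprior = c(2, 5))  ## Bin(m|n=M, p), p ~ Beta(2,5)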

_D_e_t_a_i_l_s:

     The Bayesian lasso model and Gibbs Sampling algorithm are
     described in detail in Park & Casella (2008).  The algorithm
     implemented by this function is identical to the one described
     therein, with the exception of an added option to use a
     Rao-Blackwellized sample of s^2 (with beta integrated out) for
     improved mixing, and the model selection by RJ described below.
     When input argument 'lambda2 = 0' is supplied, the model is a
     simple hierarchical linear model where (beta,s2) is given a
     Jeffreys prior.
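
     For instance, a minimal sketch of the unpenalized case (assuming
     'X' and 'y' are in scope, with 'ncol(X) < nrow(X)'):

          ## hierarchical linear model under the Jeffreys prior;
          ## RJ = FALSE keeps every column of X in the model
          fit.lm <- blasso(X, y, lambda2 = 0, RJ = FALSE)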

     Specifying 'RJ = TRUE' causes Bayesian model selection and
     averaging to commence for choosing which of the columns of the
     design matrix 'X' (and thus parameters 'beta') should be included
     in the model.  The non-zero components of the 'beta' input specify
     which columns are in the initial model, and 'M' specifies the
     maximal number of columns.
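
     For example, a sketch with illustrative settings ('X' and 'y' as
     above):

          ## start RJ from the empty model, allowing at most
          ## 10 covariates to enter
          fit.rj <- blasso(X, y, RJ = TRUE, beta = rep(0, ncol(X)),
                           M = 10)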

     The RJ mechanism implemented here for the Bayesian lasso model
     selection differs from the one described by Hans (2008), which is
     based on an idea from Geweke (1996).  Those methods require
     departing from the Park & Casella (2008) latent-variable model and
     sampling from each conditional beta[i] | beta[-i], ... for all i,
     since a mixture prior with a point-mass at zero is placed on each
     beta[i].  Our implementation here requires no such special prior
     and retains the joint sampling from the full vector of non-zero
     'beta' entries, which we believe yields better mixing in the
     Markov chain.  RJ proposals to increase/decrease the number of
     non-zero entries do proceed component-wise, but the acceptance
     rates are high due to marginalized between-model moves (Troughton
     & Godsill, 1997).

     When the lasso prior or RJ is used, the automatic thinning level
     (unless 'thin != NULL') is determined by the number of columns of
     'X', since this many latent variables are introduced.
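
     The automatic level can always be overridden, e.g. (illustrative
     setting):

          ## collect T = 1000 samples, skipping 10 MCMC rounds
          ## between each saved sample
          fit.thin <- blasso(X, y, T = 1000, thin = 10)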

     Bayesian ridge regression is implemented as a special case via the
     'bridge' function.  This essentially calls 'blasso' with 'case =
     "ridge"'.  A default setting of 'rd = c(0,0)' is implied by 'rd =
     NULL', giving the Jeffreys prior for the penalty parameter
     lambda^2, unless 'ncol(X) >= length(y)' in which case the proper
     specification 'rd = c(5,10)' is used instead.
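
     So, as a sketch, the following two calls specify essentially the
     same model:

          fit.br <- bridge(X, y)
          fit.rr <- blasso(X, y, case = "ridge")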

     The Normal-Gamma prior (Griffin & Brown, 2009) is implemented as
     an extension to the Bayesian lasso with 'case = "ng"'.
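
     For example:

          ## Normal-Gamma prior in place of the lasso; samples of the
          ## extra gamma parameter are returned in the output
          fit.ng <- blasso(X, y, case = "ng")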

     When 'theta > 0', Student-t errors via the scale mixtures (and
     thereby extra latent variables 'omega2') of Geweke (1993) are
     applied as an extension to the Bayesian lasso/ridge model.  If
     Student-t errors are used, the automatic thinning level is
     augmented (unless 'thin != NULL') by the number of rows in 'X',
     since this many latent variables are introduced.

_V_a_l_u_e:

     'blasso' returns an object of class '"blasso"', which is a 'list'
     containing a copy of all of the input arguments as well as of the
     components listed below.

    call: a copy of the function call as used

      mu: a vector of 'T' samples of the (un-penalized) intercept
          parameter 

    beta: a 'T*ncol(X)' 'matrix' of 'T' samples from the (penalized)
          regression coefficients

       m: the number of non-zero entries in each vector of 'T' samples
          of 'beta'

      s2: a vector of 'T' samples of the variance parameter

 lambda2: a vector of 'T' samples of the penalty parameter

   gamma: a vector of 'T' samples of the gamma parameter when 'case =
          "ng"' 

   tau2i: a 'T*ncol(X)' 'matrix' of 'T' samples from the (latent)
          inverse diagonal of the prior covariance matrix for 'beta',
          obtained for Lasso regressions 

  omega2: a 'T*nrow(X)' 'matrix' of 'T' samples from the (latent)
          diagonal of the covariance matrix of the response providing a
          scale-mixture implementation of Student-t errors with degrees
          of freedom 'nu' when active (input 'theta > 0') 

      nu: a vector of 'T' samples of the degrees of freedom parameter
          to the Student-t errors model when active (input 'theta > 0') 

      pi: a vector of 'T' samples of the Binomial proportion 'p' that
          was given a Beta prior, as described above for the 2-vector
          version of the 'mprior' input

   lpost: the log posterior probability of each (saved) sample of the
          joint parameters 

    llik: the log likelihood of each (saved) sample of the parameters 

llik.norm: the log likelihood of each (saved) sample of the parameters
          under the Normal errors model when sampling under the
          Student-t model; i.e., it is not present unless 'theta > 0' 

_N_o_t_e:

     Whenever 'ncol(X) >= nrow(X)' it must be that either 'RJ = TRUE'
     with 'M <= nrow(X)-1' (the default) or that the lasso is turned on
     with 'lambda2 > 0'.  Otherwise the regression problem is
     ill-posed.
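
     A minimal sketch of a well-posed big-p small-n call (simulated
     data, purely for illustration):

          ## p = 100 >= n = 20, so keep the lasso on ('lambda2 > 0')
          ## and/or cap the model size at 'M <= nrow(X)-1'
          X <- matrix(rnorm(20 * 100), nrow = 20)
          y <- drop(X %*% c(rep(1, 5), rep(0, 95))) + rnorm(20)
          fit <- blasso(X, y, RJ = TRUE, M = 19, verb = 0)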

     Since the starting values are considered to be the first sample
     (of 'T'), the total number of (new) samples obtained by Gibbs
     Sampling will be 'T-1'.

_A_u_t_h_o_r(_s):

     Robert B. Gramacy bobby@statslab.cam.ac.uk

_R_e_f_e_r_e_n_c_e_s:

     Park, T. and Casella, G. (2008). _The Bayesian Lasso._ Journal of
     the American Statistical Association, Volume 103, Number 482, pp.
     681-686.
      <URL: http://www.stat.ufl.edu/~casella/Papers/Lasso.pdf>

     Griffin, J.E. and Brown, P.J. (2009). _Inference with Normal-Gamma
     prior distributions in regression problems._ Tech. rep.,
     University of Kent.
      <URL:
     http://www.kent.ac.uk/ims/personal/jeg28/paper_ba10_final.pdf>

     Hans, C. (2008). _Bayesian Lasso regression._ Technical Report
     No. 810, Department of Statistics, The Ohio State University,
     Columbus, OH 43210.
      <URL: http://www.stat.osu.edu/~hans/Papers/blasso.pdf>

     Geweke, J. (1996). _Variable selection and model comparison in
     regression._ In Bayesian Statistics 5.  Editors: J.M. Bernardo,
     J.O. Berger, A.P. Dawid and A.F.M. Smith, 609-620. Oxford Press.

     Troughton, P.T. and Godsill, S.J. (1997). _A reversible jump
     sampler for autoregressive time series, employing full
     conditionals to achieve efficient model space moves._ Technical
     Report CUED/F-INFENG/TR.304, Cambridge University Engineering
     Department.

     Geweke, J. (1993). _Bayesian treatment of the independent
     Student-t linear model._ Journal of Applied Econometrics, Vol. 8,
     S19-S40.

     <URL: http://www.statslab.cam.ac.uk/~bobby/monomvn.html>

_S_e_e _A_l_s_o:

     'lm', 'lars' in the 'lars' package, 'regress', 'lm.ridge' in the
     'MASS' package

_E_x_a_m_p_l_e_s:

     ## following the lars diabetes example
     data(diabetes)
     attach(diabetes)

     ## Ordinary Least Squares regression
     reg.ols <- regress(x, y)

     ## Lasso regression
     reg.las <- regress(x, y, method="lasso")

     ## Bayesian Lasso regression
     reg.blas <- blasso(x, y)

     ## summarize the beta (regression coefficients) estimates
     plot(reg.blas, burnin=200)
     points(drop(reg.las$b), col=2, pch=20)
     points(drop(reg.ols$b), col=3, pch=18)
     legend("topleft", c("blasso-map", "lasso", "lsr"),
            col=c(2,2,3), pch=c(21,20,18))

     ## plot the size of different models visited
     plot(reg.blas, burnin=200, which="m")

     ## get the summary
     s <- summary(reg.blas, burnin=200)

     ## calculate the probability that each beta coef != zero
     s$bn0

     ## summarize s2
     plot(reg.blas, burnin=200, which="s2")
     s$s2

     ## summarize lambda2
     plot(reg.blas, burnin=200, which="lambda2")
     s$lambda2

     ## fit with Student-t errors
     ## (~400-times slower due to automatic thinning level)
     regt.blas <- blasso(x, y, theta=0.1)

     ## plotting some information about nu, and quantiles
     plot(regt.blas, "nu", burnin=200)
     quantile(regt.blas$nu[-(1:200)], c(0.05, 0.95))

     ## Bayes Factor shows strong evidence for Student-t model
     mean(exp(regt.blas$llik[-(1:200)] - regt.blas$llik.norm[-(1:200)]))

     ## clean up
     detach(diabetes)

     ##
     ## a big-p small-n example
     ##

     n <- 25; m <- 51
     xmuS <- randmvn(n, m)
     X <- xmuS$x[,1:(m-1)]
     Y <- drop(xmuS$x[,m])
     obl <- blasso(X, Y, verb=0)

     ## plot summary of the model order
     plot(obl, burnin=10, which="m")

     ## fit a standard lasso model
     oml <- regress(X, Y, method="lasso")

     ## compare via RMSE, most often blasso will win
     beta <- xmuS$S[m,-m] %*% solve(xmuS$S[-m,-m])
     sqrt(mean((apply(obl$beta, 2, mean) - beta)^2))
     sqrt(mean((oml$b[-1] - beta)^2))

     ## now try both Bayesian & ML ridge regression
     obr <- bridge(X, Y, verb=0)
     omr <- regress(X, Y, method="ridge")
     sqrt(mean((apply(obr$beta[-c(1:200),], 2, mean) - beta)^2))
     sqrt(mean((omr$b[-1] - beta)^2))

