bagging                package:ipred                R Documentation

_B_a_g_g_i_n_g _C_l_a_s_s_i_f_i_c_a_t_i_o_n, _R_e_g_r_e_s_s_i_o_n _a_n_d _S_u_r_v_i_v_a_l _T_r_e_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     Bagging for classification, regression and survival trees.

_U_s_a_g_e:

     ipredbagg.factor(y, X=NULL, nbagg=25, control=
                      rpart.control(minsplit=2, cp=0, xval=0), 
                      comb=NULL, coob=FALSE, ns=length(y), keepX = TRUE, ...)
     ipredbagg.numeric(y, X=NULL, nbagg=25, control=rpart.control(xval=0), 
                       comb=NULL, coob=FALSE, ns=length(y), keepX = TRUE, ...)
     ipredbagg.Surv(y, X=NULL, nbagg=25, control=rpart.control(xval=0), 
                    comb=NULL, coob=FALSE, ns=dim(y)[1], keepX = TRUE, ...)
     bagging(formula, data, subset, na.action=na.rpart, ...)

_A_r_g_u_m_e_n_t_s:

       y: the response variable: either a factor vector of class labels
          (bagging classification trees), a vector of numerical values 
          (bagging regression trees) or an object of class  `Surv'
          (bagging survival trees).

       X: a data frame of predictor variables.

   nbagg: an integer giving the number of bootstrap replications. 

    coob: a logical indicating whether an out-of-bag estimate of the
          error rate (misclassification error, root mean squared error
          or Brier score) should be computed.  See `predict.classbagg'
          for details.

 control: options that control details of the `rpart' algorithm, see
          `rpart.control'. It is wise to set `xval = 0' in order to
          save computing  time. Note that the  default values depend on
          the class of `y'.

    comb: a list of additional models for model combination; see below
          for some examples. Note that the former argument `method' for
          double-bagging has been removed; `comb' is more flexible.

      ns: the number of samples to draw from the learning sample. By
          default, the usual bootstrap, n out of n with replacement, is
          performed. If `ns' is smaller than `length(y)', subagging
          (Buehlmann and Yu, 2002), i.e. sampling `ns' out of
          `length(y)' without replacement, is performed.

   keepX: a logical indicating whether the data frame of predictors
          should be returned. Note that the computation of the 
          out-of-bag estimator requires  `keepX=TRUE'.

 formula: a formula of the form `lhs ~ rhs' where `lhs'  is the
          response variable and `rhs' a set of predictors.

    data: optional data frame containing the variables in the model
          formula.

  subset: optional vector specifying a subset of observations to be
          used.

na.action: function which indicates what should happen when the data
          contain `NA's.  Defaults to `na.rpart'.

     ...: additional parameters passed to `ipredbagg' or  `rpart',
          respectively.

_D_e_t_a_i_l_s:

     Bagging for classification and regression trees was suggested by
     Breiman (1996a, 1998) in order to stabilise trees. 

     The trees in this function are computed using the implementation
     in the  `rpart' package. The generic function `ipredbagg'
     implements methods for different responses. If `y' is a factor,
     classification trees are constructed. For numerical vectors `y',
     regression trees are aggregated and if `y' is a survival  object,
     bagging survival trees (Hothorn et al, 2003) is performed.  The
     function `bagging' offers a formula based interface to
     `ipredbagg'.

     `nbagg' bootstrap samples are drawn and a tree is constructed for
     each of them. There is no general rule for when to stop growing
     the trees. The size of the trees can be controlled by the
     `control' argument or by `prune.classbagg'. By default,
     classification trees are grown as large as possible, whereas
     regression and survival trees are built with the standard options
     of `rpart.control'. If `nbagg=1', a single tree is computed for
     the whole learning sample without bootstrapping.
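     For instance, a smaller ensemble of depth-limited trees can be
     requested through `control'; the settings below are illustrative
     only, not recommended defaults:

```r
## Illustrative sketch: 10 bootstrap samples, trees of depth at most 3
library("ipred")
library("rpart")
library("mlbench")
data(Ionosphere)
Ionosphere$V2 <- NULL   # constant within groups
mod <- bagging(Class ~ ., data = Ionosphere, nbagg = 10,
               control = rpart.control(maxdepth = 3, cp = 0, xval = 0))
print(mod)
```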

     If `coob' is TRUE, the out-of-bag sample (Breiman, 1996b) is used
     to estimate the prediction error corresponding to `class(y)'.
     Alternatively, the out-of-bag sample can be used for model
     combination; in that case, an out-of-bag error rate estimator is
     not available. Double-bagging (Hothorn and Lausen, 2003) computes
     an LDA on the out-of-bag sample and uses the discriminant
     variables as additional predictors for the classification trees.
     `comb' is an optional list of lists with two elements, `model' and
     `predict'. `model' is a function with arguments `formula' and
     `data'; `predict' is a function with arguments `object' and
     `newdata' only. If the estimation of the covariance matrix in
     `lda' fails due to a limited out-of-bag sample size, one can use
     `slda' instead. See the examples section for an example of
     double-bagging. The methodology is not limited to a combination
     with LDA: bundling (Hothorn and Lausen, 2002b) can be used with
     arbitrary classifiers.
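     The structure of `comb' can be sketched as follows, here with
     `slda' in place of `lda' as suggested above for small out-of-bag
     samples (this sketch assumes that `predict.slda' returns its
     discriminant variables in component `x', as `predict.lda' does):

```r
## Sketch of the `comb' argument: one list(model, predict) pair per
## additional model; here the stabilised LDA `slda' from ipred itself
library("ipred")
comb.slda <- list(list(model = slda,
                       predict = function(object, newdata)
                           predict(object, newdata)$x))
## usage (data set name is a placeholder):
## mod <- bagging(Class ~ ., data = some.data, comb = comb.slda)
```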

_V_a_l_u_e:

     The class of the object returned depends on `class(y)':
     `classbagg', `regbagg' or `survbagg'. Each is a list with elements 

       y: the vector of responses.

       X: the data frame of predictors.

  mtrees: multiple trees: a list of length `nbagg' containing the trees
          (and possibly additional objects) for each bootstrap sample.

     OOB: a logical indicating whether the out-of-bag estimate was
          requested.

     err: if `OOB=TRUE', the out-of-bag estimate of the
          misclassification rate, root mean squared error or Brier
          score for censored data.

    comb: a logical indicating whether a combination of models was
          requested.


     For each class, methods for the generics `prune', `print',
     `summary' and `predict' are available for inspection of the
     results and for prediction: for example, `print.classbagg',
     `summary.classbagg', `predict.classbagg' and `prune.classbagg'
     for classification problems.

_A_u_t_h_o_r(_s):

     Torsten.Hothorn <Torsten.Hothorn@rzmail.uni-erlangen.de>

_R_e_f_e_r_e_n_c_e_s:

     Leo Breiman (1996a), Bagging Predictors. Machine Learning 24(2),
     123-140.

     Leo Breiman (1996b), Out-Of-Bag Estimation. Technical Report <URL:
     ftp://ftp.stat.berkeley.edu/pub/users/breiman/OOBestimation.ps.Z>.

     Leo Breiman (1998), Arcing Classifiers. The Annals of Statistics
     26(3), 801-824.

     Peter Buehlmann and Bin Yu (2002), Analyzing Bagging. The Annals
     of Statistics 30(4), 927-961.

     Torsten Hothorn and Berthold Lausen (2003), Double-Bagging:
     Combining classifiers by bootstrap aggregation. Pattern
     Recognition, 36(6), 1303-1309. 

     Torsten Hothorn and Berthold Lausen (2002b), Bundling Classifiers
     by Bagging Trees. submitted. Preprint available from  <URL:
     http://www.mathpreprints.com/math/Preprint/blausen/20021016/1>.

     Torsten Hothorn, Berthold Lausen, Axel Benner and Martin
     Radespiel-Troeger (2003), Bagging Survival Trees. Statistics in
     Medicine (accepted). Preprint available from <URL:
     http://www.mathpreprints.com/math/Preprint/blausen/20020518/2>.

_E_x_a_m_p_l_e_s:

     # Classification: Breast Cancer data

     library("ipred")
     library("rpart")
     library("mlbench")
     library("MASS")      # for lda in the double-bagging example
     library("survival")  # for Surv in the survival example

     data(BreastCancer)

     # Test set error bagging (nbagg = 50): 3.7% (Breiman, 1998, Table 5)

     mod <- bagging(Class ~ Cl.thickness + Cell.size
                     + Cell.shape + Marg.adhesion   
                     + Epith.c.size + Bare.nuclei   
                     + Bl.cromatin + Normal.nucleoli
                     + Mitoses, data=BreastCancer, coob=TRUE)
     print(mod)

     # Test set error bagging (nbagg=50): 7.9% (Breiman, 1996a, Table 2)

     data(Ionosphere)
     Ionosphere$V2 <- NULL # constant within groups

     bagging(Class ~ ., data=Ionosphere, coob=TRUE)

     # Double-Bagging: combine LDA and classification trees

     # predict returns the linear discriminant values, i.e. linear combinations
     # of the original predictors

     comb.lda <- list(list(model=lda, predict=function(obj, newdata)
                                      predict(obj, newdata)$x))

     # Note: out-of-bag estimator is not available in this situation, use
     # errorest

     mod <- bagging(Class ~ ., data=Ionosphere, comb=comb.lda) 

     predict(mod, Ionosphere[1:10,])

     # Regression:

     data(BostonHousing)

     # Test set error (nbagg=25, trees pruned): 3.41 (Breiman, 1996a, Table 8)

     mod <- bagging(medv ~ ., data=BostonHousing, coob=TRUE)
     print(mod)

     learn <- as.data.frame(mlbench.friedman1(200))

     # Test set error (nbagg=25, trees pruned): 2.47 (Breiman, 1996a, Table 8)

     mod <- bagging(y ~ ., data=learn, coob=TRUE)
     print(mod)

     # Survival data

     # Brier score for censored data estimated by 
     # 10 times 10-fold cross-validation: 0.2 (Hothorn et al, 2003)

     data(DLBCL)
     mod <- bagging(Surv(time,cens) ~ MGEc.1 + MGEc.2 + MGEc.3 + MGEc.4 + MGEc.5 +
                                      MGEc.6 + MGEc.7 + MGEc.8 + MGEc.9 +
                                      MGEc.10 + IPI, data=DLBCL, coob=TRUE)

     print(mod)

