earth                 package:earth                 R Documentation

_M_u_l_t_i_v_a_r_i_a_t_e _A_d_a_p_t_i_v_e _R_e_g_r_e_s_s_i_o_n _S_p_l_i_n_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     Build a regression model using the techniques in Friedman's papers
     'Multivariate Adaptive Regression Splines' and 'Fast MARS'.

_U_s_a_g_e:

     ## S3 method for class 'formula':
     earth(formula, data, ...)

     ## Default S3 method:
     earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
           weights = NULL, subset = NULL, na.action = na.fail,
           penalty = if(degree > 1) 3 else 2, trace = 0, keepxy = FALSE,
           degree = 1, nk = max(21, 2 * NCOL(x) + 1),
           thresh = 0.001, minspan = 0, newvar.penalty = 0,
           fast.k = 20, fast.beta = 1, fast.h = NULL,
           pmethod = "backward", ppenalty = penalty, nprune = NULL,
           Object  = NULL, Get.crit = get.gcv,
           Eval.model.subsets = eval.model.subsets,
           Print.pruning.pass = print.pruning.pass,
           Force.xtx.prune = FALSE, ...)

_A_r_g_u_m_e_n_t_s:

     All arguments are optional except 'formula', or 'x' and 'y'. NAs
     are not allowed.

     To start off, look at the arguments 'formula', 'x', 'y', 'degree',
     'nk', and 'nprune'.

 formula: Model formula. 

    data: Data frame for 'formula'. 

       x: Matrix containing the independent variables. 

       y: Vector containing the response variable, or in the case of
          multiple responses, a matrix whose columns are the response
          values for each variable. If the 'y' values are very big or
          very small then you may get better results if you 'scale' 'y'
          first for better numerical stability in the forward pass. 

 weights: Weight vector (not yet supported). 

  subset: Index vector specifying which cases to use i.e. which rows in
          'x' to use. Default is NULL, meaning all. 

na.action: NA action. Default is 'na.fail', and only 'na.fail' is
          supported. 

 penalty: Generalized Cross Validation (GCV) penalty per knot. Default
          is 'if(degree>1) 3 else 2'. A value of 0 penalises only
          terms, not knots. The value -1 is a special case, meaning no
          penalty, so GCV=RSS/n. Theory suggests values in the range of
          about 2 to 3. In practice for big data sets, larger values
          can be useful to force a smaller model. See also 'ppenalty'. 

   trace: Trace earth's execution.  Default is 0.  Values:

          0 none 1 overview 2 forward 3 pruning 4 more pruning 5 ... 

  keepxy: Set to TRUE to retain 'x', 'y', and  'subset' in the returned
          value. Default is FALSE

          _The following arguments are for the forward pass_ 

  degree: Maximum degree of interaction (Friedman's mi). Default is 1,
          meaning build an additive model. 

      nk: Maximum number of model terms before pruning. Includes the
          intercept. Default is 'max(21,2*NCOL(x)+1)'. The number of
          terms created by the forward pass will be less than 'nk' if
          there are linearly dependent terms which must be discarded,
          or if a forward stopping condition is reached before 'nk'
          terms. See the section below on the forward pass. 

  thresh: Forward stepping threshold. Default is 0.001. This is one of
          the arguments used to decide when forward stepping should
          terminate. See the section below on the forward pass. 

 minspan: Minimum distance between knots. Set 'trace>=2' to see the
          calculated value.  Values:
           '<0' add to the internally calculated min span (i.e.
          decrease span).
           '0' (default) use internally calculated min span as per
          Friedman's MARS paper section 3.8 with alpha = 0.05. Intended
          to increase resistance to runs of noise in the input data.
           '>0' use instead of the internally calculated min span. Thus
          a value of 1 means consider all knots. 

newvar.penalty: Penalty for adding a new variable in the forward pass
          (Friedman's gamma, equation 74 in the MARS paper). This
          argument can mitigate the effects of collinearity or
          concurvity in the input data. Default is 0, meaning no new
          penalty for adding a new variable. Useful non-zero values
          range from about 0.01 to 0.2 - you will need to experiment. 

  fast.k: Maximum number of considered parent terms, as as described in
          Friedman's Fast MARS paper section 3.0. Default is 20. The
          special value -1 is equivalent to infinity, meaning no Fast
          MARS. Typical values, apart from -1, range from about 20 to
          5, in steps of 5 or 10. In general, with a lower 'fast.k'
          (say 5), 'earth' is faster; with a higher 'fast.k', or with
          'fast.k' disabled (set to -1), 'earth' builds a better model.
          However it is not unusual to get a slightly better model with
          a lower 'fast.k', and you may need to experiment. 

fast.beta: Fast MARS ageing coefficient, as described in the Fast MARS
          paper section 3.1. Default is 1. A value of 0 sometimes gives
          better results. 

  fast.h: Fast MARS h, as described in the Fast MARS paper section 4.0.
          (not yet implemented).

          _The following arguments are for the pruning pass. One
          strategy is to build a large model and then trim it back
          using 'update.earth' with the arguments below. _ 

 pmethod: Pruning method. Default is '"backward"'. One of: 'backward
          none exhaustive forward seqrep'. If 'y' has multiple columns,
          then only 'backward' or 'none' is allowed. Pruning can take a
          while if 'exhaustive' is chosen and the model is big (more
          than about 30 terms). The current version of the 'leaps'
          package used during pruning does not allow user interrupts
          (i.e. you have to kill your R session to interrupt). 

ppenalty: Like 'penalty' but for the pruning pass. Default is
          'penalty'. 

  nprune: Maximum number of terms (including intercept) in the pruned
          model. Default is NULL, meaning all terms. Use this to reduce
          exhaustive search time, or to enforce a maximum model size.

          _The following arguments are for internal or advanced use_ 

  Object: Earth object to be updated, for use by 'update.earth'. 

Get.crit: Criterion function for model selection during pruning. By
          default a function that returns the GCV. See the section
          below on the pruning pass. 

Eval.model.subsets: Function to evaluate model subsets - see notes in
          source code. 

Print.pruning.pass: Function to print pruning pass results - see notes
          in source code. 

Force.xtx.prune: Default is FALSE. This argument pertains to subset
          evaluation in the pruning pass. By default, if 'y' has a
          single column then 'earth' calls the 'leaps' routines; if 'y'
          has multiple columns then 'earth' calls
          'EvalSubsetsUsingXtx'. The 'leaps' routines are based on the
          QR decomposition; the 'EvalSubsetsUsingXtx' routine is based
          on the inverse of X'X. The 'leaps' routines are more accurate
          but do not support multiple responses. Setting
          'Force.xtx.prune=TRUE' forces use of 'EvalSubsetsUsingXtx',
          even if 'y' has a single column. 

     ...: 'earth.formula': arguments passed to 'earth.default'.

          'earth.default': unused, but provided for generic/method
          consistency. 

_V_a_l_u_e:

     An object of class 'earth' which is a list with the components
     listed below. _Term_ refers to a term created during the forward
     pass (each line of the output from 'format.earth' is a term). Term
     number 1 is always the intercept. 

      bx: Matrix of basis functions applied to 'x'. Each column
          corresponds to a selected term. Each row corresponds to a row
          in in the input matrix 'x', after taking 'subset'. See
          'model.matrix.earth' for an example of 'bx' handling. Example
          'bx': (Intercept) h(Girth-12.9) h(12.9-Girth)
          h(Girth-12.9)*h(...
          [1,]          1           0.0           4.6                  
          0
          [2,]          1           0.0           4.3                  
          0
          [3,]          1           0.0           4.1                  
          0
          ...

    dirs: Matrix with ij-th element equal to 1 if term i has a factor
          of the form x_j > c, equal to -1 if term i has a factor of
          the form x_j <= c, and to 0 if x_j is not in term i. This
          matrix includes all terms generated by the forward.pass,
          including those not in 'selected.terms'. Note that the terms
          may not be in pairs, because the forward pass deletes
          linearly dependent terms before handing control to the
          pruning pass. Example 'dirs': Girth Height
          (Intercept)                0      0  #no factors in intercept
          h(Girth-12.9)              1      0  #2nd term uses Girth
          h(12.9-Girth)             -1      0  #3rd term uses Girth
          h(Girth-12.9)*h(Height-76) 1      1  #4th term uses Girth and
          Height
          ...

    cuts: Matrix with ij-th element equal to the cut point for variable
          j in term i. This matrix includes all terms generated by the
          forward.pass, including those not in 'selected.terms'. Note
          that the terms may not be in pairs, because the forward pass
          deletes linearly dependent terms before handing control to
          the pruning pass. Example 'cuts': Girth Height
          (Intercept)                  0.0      0  #intercept, no cuts
          h(Girth-12.9)               12.9      0  #2nd term has cut at
          12.9
          h(12.9-Girth)               12.9      0  #3rd term has cut at
          12.9
          h(Girth-12.9)*h(Height-76)  12.9     76  #4th term has two
          cuts
          ...

selected.terms: Vector of term numbers in the best model. Can be used
          as a row index vector into 'cuts' and 'dirs'. The first
          element 'selected.terms[1]' is always 1, the intercept. 

prune.terms: A matrix specifying which terms appear in which subsets.
          The row index of 'prune.terms' is the model size (the model
          size is the number of terms in the model). Each row is a
          vector of term numbers for the best model of that size. An
          element is 0 if the term is not in the model, thus
          'prune.terms' is a lower triangular matrix, with dimensions
          'nprune x nprune'. The model selected by the pruning pass is
          at row 'length(selected.terms)'. Example 'prune.terms': [1,] 
            1  0  0  0  0  0  0  #intercept-only model
          [2,]    1  2  0  0  0  0  0  #best 2 term model uses terms
          1,2.
          [3,]    1  2  4  0  0  0  0  #best 3 term model uses terms
          1,2,4
          [4,]    1  2  9  8  0  0  0
          ...

     rss: Residual sum-of-squares (RSS) of the model (summed over all
          responses if 'y' has multiple columns). 

     rsq: '1-rss/rss.null'. R-Squared of the model (calculated over all
          responses if 'y' has multiple columns). A measure of how well
          the model fits the training data. 

     gcv: Generalised Cross Validation (GCV) of the model (summed over
          all responses if 'y' has multiple columns). The GCV is
          calculated using 'ppenalty' (as are all returned GCVs). For
          details of the GCV calculation, see equation 30 in Friedman's
          MARS paper and 'earth:::get.gcv'. 

    grsq: '1-gcv/gcv.null'. An estimate of the predictive power of the
          model (calculated over all responses if 'y' has multiple
          columns). Unlike 'rsq', 'grsq' can be negative. A negative
          'grsq' would indicate a severely over parameterised model - a
          model that would not generalise well even though it may be a
          good fit to the training data. Example of a negative 'grsq':
           'earth(mpg~., data=mtcars, pmethod="none", trace=4)' 

rss.per.response: A vector of the RSS for each response. Length is
          'ncol(y)'. The 'rss' component above is  equal to
          'sum(rss.per.response)'. 

rsq.per.response: A vector of the R-Squared for each response. Length
          is 'ncol(y)'. 

gcv.per.response: A vector of the GCV for each response. Length is
          'ncol(y)'. The 'gcv' component above is equal to
          'sum(gcv.per.response)'. 

grsq.per.response: A vector of the GRSq for each response.  Length is
          'ncol(y)'. 

rss.per.subset: A vector of the RSS for each model subset generated by
          the pruning pass. Length is 'nprune'. If 'y' has multiple
          columns, the RSS is summed over all responses for each
          subset. The null RSS (i.e. the RSS of an intercept
          only-model) is 'rss.per.subset[1]'. The 'rss' above is
           'rss.per.subset[length(selected.terms)]'. 

gcv.per.subset: A vector of the GCV for each model in 'prune.terms'.
          Length is is 'nprune'. If 'y' has multiple columns, the GCV
          is summed over all responses for each subset. The null GCV
          (i.e. the GCV of an intercept-only model) is
          'gcv.per.subset[1]'. The 'gcv' above is
          'gcv.per.subset[length(selected.terms)]'. 

fitted.values: Fitted values. A matrix with dimensions 'nrow(y) x
          ncol(y)'. 

residuals: Residuals. A matrix with dimensions 'nrow(y) x ncol(y)'. 

coefficients: Regression coefficients. A matrix with dimensions
          'length(selected.terms) x ncol(y)'. Each column holds the
          least squares coefficients from regressing that column of 'y'
          on 'bx'. The first row holds the intercept coefficients. 

ppenalty: The GCV penalty used during pruning. A copy of 'earth's'
          'ppenalty' argument. 

    call: The call used to invoke 'earth'. 

   terms: Model frame terms. This component exists only if the model
          was built using 'earth.formula'. 

       x: 

       y: 

  subset: Copy of input arguments 'x', 'y', and  'subset'. These
          components exist only if 'keepxy=TRUE'. 

_N_o_t_e:

     *Standard Model Functions*

     Standard model functions such as 'case.names' are provided for
     'earth' objects and are not explicitly documented.

     *Other Implementations*

     The results are similar to but not identical to other Multivariate
     Adaptive Regression Splines implementations. The differences stem
     from the forward pass where very small implementation differences
     (or perturbations of the input data) can cause rather different
     selection of terms and knots. The backward passes give identical
     or near identical results, given the same forward pass results.

     The source code of 'earth' is derived from 'mars' in the 'mda'
     package written by by Trevor Hastie and Robert Tibshirani. See
     also 'mars.to.earth'. Earth's principal advantages over mda::mars
     are that it is much faster and provides plotting and printing
     methods.

     The term 'MARS' is trademarked and licensed exclusively to Salford
     Systems <URL: http://www.salfordsystems.com>. Their implementation
     uses an engine written by Friedman and offers more features than
     'earth'.

     *Limitations*

     The following aspects of MARS are mentioned in Friedman's papers
     but not implemented in 'earth':

     i) Piecewise cubic models
      ii) Specifying which predictors must enter linearly
      iii) Specifying which predictors can interact
      iv) Model slicing ('plotmo' goes part way)
      v) Handling missing values
      vi) Logistic regression 
      vii) Special handling of categorical predictors
      viii) Fast MARS h parameter
      ix) Cross validation to determine 'penalty'.

     *Large Models and Execution Time*

     For a given set of input data, the following can increase the
     speed of the forward pass:

     i) decreasing 'fast.k'
      ii) decreasing 'nk'
      iii) decreasing 'degree'
      iv) increasing 'threshold'
      v) increasing 'min.span'.

     The backward pass is normally much faster than the forward pass,
     unless 'pmethod="exhaustive"'. Reducing 'nprune' reduces
     exhaustive search time. One strategy is to first build a large
     model and then adjust pruning parameters using 'update.earth'.

     *The Forward Pass*

     The forward pass adds terms in pairs until the first of the
     following conditions is met:

     i) reach maximum number of terms '(nterms>=nk)'
      ii) reach DeltaRSq threshold '(DeltaRSq<thresh)' where DeltaRSq
     is the difference in R-Squared caused by adding the current term
     pair
      iii) reach max RSq '(RSq>1-thresh)'
      iv) reach min GRSq '(GRSq< -10)'.

     Set 'trace>=2' to see the stopping condition.

     The result of the forward pass is the set of terms defined by
     '$dirs' and '$cuts'. As a final step, the forward pass deletes
     linearly dependent terms, if any, so all terms in '$dirs' and
     '$cuts' are independent.

     Note that GCVs (via GRSq) are used during the forward pass only as
     one of the stopping conditions and in trace prints. Changing the
     GCV penalty does not change the knot positions.

     *The Pruning Pass*

     The pruning pass is handed the sets of terms created by the
     forward pass and works like this: it determines the subset of
     terms (using 'pmethod') with the lowest RSS for each model size in
     '1:nprune' (see 'Force.xtx.prune' above for some details). It
     saves the RSS and term numbers for each such subset in
     'rss.per.subset' and 'prune.terms'. It then applies the 'Get.crit'
     function with 'ppenalty' to 'rss.per.subset' to yield
     'gcv.per.subset'. It chooses the model with the lowest value in
     'gcv.per.subset', and puts its term numbers into 'selected.terms'.
     Finally, it runs 'lm' to determine the 'fitted.values',
     'residuals', and 'coefficients', by regressing the response 'y' on
     the 'selected.terms' of 'bx'.

     Set 'trace>=3' to trace the pruning pass.

     By default 'Get.crit' is 'earth:::get.gcv'. Alternative 'Get.crit'
     functions can be defined. See the source code of 'get.gcv' for an
     example.

     *Testing on New Data*

     This example demonstrates one way to train on 80% of the data and
     test on the remaining 20%. In practice a dataset larger than the
     one below should be used for splitting. Also, remember that the
     test set should not be used for parameter tuning - use GCVs or
     separate validation sets for that.


     train.subset <- sample(1:nrow(trees), .8 * nrow(trees))
     test.subset <- (1:nrow(trees))[-train.subset]
     a <- earth(Volume ~ ., data = trees[train.subset, ])
     yhat <- predict(a, newdata = trees[test.subset, ])
     y <- trees$Volume[test.subset]
     print(1 - sum((y - yhat)^2) / sum((y - mean(y))^2)) # print
     R-Squared

     *Multiple Response Models*

     If 'y' has K columns then earth builds K simultaneous models. Each
     model has the same set of basis functions (i.e. same 'bx' and
     'selected.terms') but different coefficients (the returned
     'coefficients' will have K columns). The models are built and
     pruned as usual but with the GCVs and RSSs summed across all K
     responses.

     For more details see section 4.1 of  Hastie, Tibshirani, and Buja
     _Flexible Discriminant Analysis by Optimal Scoring_, JASA,
     December 1994 <URL:
     http://www-stat.stanford.edu/~hastie/Papers/fda.pdf>.

     *Using earth with fda and mda*

     'Earth' can be used with 'fda' and 'mda' in the 'mda' package. Use
     'keep.fitted=TRUE' if you want to call 'plot.earth' later
     (actually only necessary for large datasets, see the description
     of  'keep.fitted' in 'fda'). Use 'keepxy=TRUE' if you want to call
     'update' or 'plotmo' later. Use 'trace>=3' to see the call to
     'earth' generated by 'fda' or 'mda'. Example:


     library(mda)
     (a <- fda(Species ~ ., data=iris, keep.fitted=TRUE, method=earth,
     keepxy=TRUE))
     plot(a)
     summary(a$fit)  # examine earth model embedded in fda model
     plot(a$fit)
     plotmo(a$fit, ycolumn=1, ylim=c(-1.5,1.5), clip=FALSE)
     plotmo(a$fit, ycolumn=2, ylim=c(-1.5,1.5), clip=FALSE)

     *Warning and Error Messages*

     'Earth' prints most error and warning messages without printing
     the 'call'. If you are mystified by a warning message, try setting
     'options(warn=2)' and using 'traceback'.

_A_u_t_h_o_r(_s):

     Stephen Milborrow, derived from 'mda::mars' by Trevor Hastie and
     Robert Tibshirani.

     This is an early release and users are encouraged to send feedback
     - use milboATsonicPERIODnet.

_R_e_f_e_r_e_n_c_e_s:

     The primary references are the Friedman papers.  Readers may find
     the MARS section in Hastie, Tibshirani, and Friedman a more
     accessible introduction.  Faraway takes a hands-on approach, using
     the 'ozone' data to compare 'mda::mars' with other techniques. (If
     you use Faraway's examples with 'earth' instead of 'mars', use
     '$bx' instead of '$x'). Earth's pruning pass uses 'leaps' which is
     based on techniques in Miller.

     Faraway _Extending the Linear Model with R_ <URL:
     http://www.maths.bath.ac.uk/~jjf23>

     Friedman  (1991) _Multivariate Adaptive Regression Splines (with
     discussion)_ Annals of Statistics 19/1, 1-141

     Friedman  (1993) _Fast MARS_ Stanford University Department of
     Statistics, Technical Report 110 <URL:
     http://www-stat.stanford.edu/research/index.html>

     Hastie, Tibshirani, and Friedman (2001) _The Elements of
     Statistical Learning_ <URL:
     http://www-stat.stanford.edu/~hastie/pub.htm>

     Miller, Alan (1990, 2nd ed. 2002) _Subset Selection in Regression_

_S_e_e _A_l_s_o:

     'format.earth', 'get.nterms.per.degree',
     'get.nused.preds.per.subset', 'mars.to.earth',
     'model.matrix.earth', 'ozone1', 'plot.earth.models', 'plot.earth',
     'plotmo', 'predict.earth', 'reorder.earth', 'summary.earth',
     'update.earth'

_E_x_a_m_p_l_e_s:

     a <- earth(Volume ~ ., data = trees)
     summary(a, digits = 2)

     # yields:
     #    Call:
     #    earth(formula = Volume ~ ., data = trees)
     #
     #    Expression:
     #      23
     #      +  5.7 * pmax(0,  Girth -     13)
     #      -  2.9 * pmax(0,     13 -  Girth)
     #      + 0.72 * pmax(0, Height -     76)
     #
     #    Number of cases: 31
     #    Selected 4 of 5 terms, and 2 of 2 predictors
     #    Number of terms at each degree of interaction: 1 3 (additive model)
     #    GCV: 11     RSS: 213     GRSq: 0.96     RSq: 0.97

