logreg               package:LogicReg               R Documentation

_L_o_g_i_c _R_e_g_r_e_s_s_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     Fit one or a series of Logic Regression models,  carry out
     cross-validation or permutation tests for such models, or fit
     Monte Carlo Logic Regression models.

     Logic regression is a (generalized) regression methodology that is
     primarily applied when most of the covariates in the data to be
     analyzed are binary. The goal of logic regression is to find
     predictors that are Boolean (logical) combinations of the original
     predictors.  Currently the Logic Regression methodology has
     scoring functions for linear regression (residual sum of squares),
     logistic regression (deviance), classification
     (misclassification),  proportional hazards models (partial
     likelihood), and exponential survival models (log-likelihood). A
     feature of the Logic Regression methodology is that it is easily
     possible to extend the method to write ones own scoring function
     if you have a different scoring function.  'logreg.myown' contains
     information on how to do so.

_U_s_a_g_e:

     logreg(resp, bin, sep, wgt, cens, type, select, ntrees, nleaves, 
            penalty, seed, kfold, nrep, oldfit, anneal.control, tree.control,
            mc.control)

_A_r_g_u_m_e_n_t_s:

    resp: vector with the response variables. Let 'n1' be the length of
          this column.

     bin: matrix or data frame with binary data.  Let 'n2' be the
          number of columns of this object.  'bin' should have 'n1'
          rows.

     sep: (optional) matrix or data frame that is fitted additively in
          the logic regression model.  'sep' should have 'n1' rows.
          When exponential survival models ('type = 5') are used, the
          additive predictors have to be binary. When logistic
          regression models ('type = 3') are used 'logreg' is much
          faster when all additive predictors are binary.

     wgt: (optional) vector of length 'n1' with case weights; default
          is 'rep(1,n1)'.

    cens: (optional) an indicator variable with censoring indicators if
          'type' equals 4 (proportional hazards model) or 5
          (exponential survival model); default is 'rep(1,n1)'.

    type: type of model to be fit: (1) classification, (2) regression,
          (3) logistic regression, (4) proportional hazards model (Cox
          regression), (5) exponential survival model, or (0) your own
          scoring function. If 'type = 0', the code needs to be
          recompiled, uncompiled 'type = 0' results in a constant score
          of 0, which may be useful to generate a sample from the prior
          when 'select = 7' (Monte Carlo Logic Regression).

  select: type of model selection to be carried out: (1) fit a single
          model, (2) fit multiple models, (3) cross-validation, (4)
          null-model permutation test, (5) conditional permutation
          test, (6) a greedy stepwise algorithm, or (7) Monte Carlo
          Logic Regression (using MCMC). See details below.

  ntrees: number of logic trees to be fit. A single number if you
          select to fit a single model ('select = 1'), carry out the
          null-model permutation test ('select = 4'), carry out greedy
          stepwise selection  ('select = 6') or, select using MCMC
          ('select = 7'), or a range (e.g. 'c(ntreeslow,ntreeshigh)')
          for any of the other selection options.  In our applications,
          we usually ended up with models having between one and four
          trees. In general, fitting one and two trees in the initial
          exploratory analysis is a good idea.

 nleaves: maximum number of leaves to be fit in all trees combined.  A
          single number if you select to fit a single model ('select =
          1') carry out the null-model permutation test ('select = 4'),
          carry out greedy stepwise selection  ('select = 6') or,
          select using MCMC ('select = 7'), or a range (e.g.
          'c(nleaveslow,nleaveshigh)') for any of the other selection
          options.  If 'select' is 1, 4, 6, or 7, the default is '-1',
          which is expanded to become 'ntrees * tree.control\$treesize
          '.

 penalty: specifying the penalty parameter allows you to penalize the
          score of larger models. The penalty takes the form 'penalty'
          times the number of leaves in the model. (For some score
          functions, we compute average scores: the penalty is
          naturally adjusted.)  Thus 'penalty = 2' is somewhat
          comparable to AIC. Note, however, that there is no relation
          between the size of the model and the number of parameters
          (dimension) of the model, as is usual the case for AIC like
          penalties.  'penalty' is only relevant when 'select = 1'.

    seed: a seed for the random number generator.  The random seed is
          taken to be 'abs(seed)'.  For the cross-validation version,
          if  'seed < 0' the sequence of the cases is not permuted for
          division in training-test sets. This is useful if you already
          permuted the sequence, and wish to compare results with other
          approaches, or if there is a relation between the sequence of
          the cases, for example for a matched case-control study.

   kfold: the number of groups the cases are randomly assigned to. In
          turn, the model is trained on '(kfold - 1)' of those groups,
          and scored on the group left out. Common choices are 'kfold =
          5' and 'kfold = 10'. Only relevant for cross-validation
          ('select = 3').  

    nrep: the number of runs on permuted data for each model size. We
          recommend first running this program with a small number of
          repetitions (e.g. 10 or 25) before sending off a big job. 
          Only relevant for the null-model test ('select = 4') or the
          permutation test ('select = 5').

  oldfit: object of class 'logreg', typically the result of a previous
          call to 'logreg'.  All options that are not specified default
          to the value used in 'oldfit'. For the permutation test
          ('select = 5') an 'oldfit' object obtained with 'select = 2'
          (fit multiple models) is mandatory, as the best models of
          each size need to be in place.

anneal.control: simulated annealing parameters - best set using the
          function 'logreg.anneal.control'.

tree.control: several secondary parameters - best set using the
          function 'logreg.tree.control'.

mc.control: Markov chain Monte Carlo parameters - best set using the
          function 'logreg.mc.control'.

_D_e_t_a_i_l_s:

     Logic Regression is an adaptive regression methodology that
     attempts to construct predictors as Boolean combinations of binary
     covariates.

     In most regression problems a model is developed that only relates
     the main effects (the predictors or transformations thereof) to
     the response.  Although interactions between predictors are
     considered sometimes as well, those interactions are usually kept
     simple (two- to three-way interactions at most).  But often,
     especially when all predictors are binary, the interaction between
     many predictors is what causes the differences in response. This
     issue often arises in the analysis of SNP microarray data or in
     data mining problems.  Given a set of binary predictors X, we try
     to create new, better predictors for the response by considering
     combinations of those binary predictors. For example, if the
     response is binary as well (which is not required in general), we
     attempt to find decision rules such as ``if X1, X2, X3 and X4 are
     true'', or ``X5 or X6 but not X7 are true'', then the response is
     more likely to be in class 0. In other words, we try to find
     Boolean statements involving the binary predictors that enhance
     the prediction for the response.  In more specific terms: Let
     X1,...,Xk be binary predictors, and let Y be a response variable. 
     We try to fit regression models of the form g(E[Y]) = b0 + b1 L1+
     ...+ bn Ln, where Lj is a Boolean expression of the predictors X,
     such as Lj=[(X2 or X4c) and X7].  The above framework includes
     many forms of regression, such as linear regression (g(E[Y])=E[Y])
     and logistic regression (g(E[Y])=log(E[Y]/(1-E[Y]))). For every
     model type, we define a score function that reflects the
     ``quality'' of the model under consideration. For example, for
     linear regression the score could be the residual sum of squares
     and for logistic regression the score could be the deviance. We
     try to find the Boolean expressions in the regression model that
     minimize the scoring function associated with this model type,
     estimating the parameters bj simultaneously with the Boolean
     expressions Lj. In general, any type of model can be considered,
     as long as a scoring function can be defined. For example, we also
     implemented the Cox proportional hazards model, using the partial
     likelihood as the score.

     Since the number of possible Logic Models we can construct for a
     given set of predictors is huge, we have to rely on some search
     algorithms to help us find the best scoring models.  We define the
     move set by a set of standard operations such as splitting and
     pruning the tree (similar to the terminology introduced by Breiman
     et al (1984) for CART).  We investigated two types of algorithms:
     a greedy and a simulated annealing algorithm. While the greedy
     algorithm is very fast, it does not always find a good scoring
     model. The simulated annealing algorithm usually does, but
     computationally it is more expensive.  Since we have to be certain
     to find good scoring models, we usually carry out simulated
     annealing for our case studies. However, as usual, the best
     scoring model generally over-fits the data, and methods to
     separate signal and noise are needed.  To assess the over-fitting
     of large models, we offer the option to fit a model of a specific
     size. For the model selection itself we developed and implemented
     permutation tests and tests using cross-validation. If sufficient
     data is available, an analysis using a training and a test set can
     also be carried out.  These tests are rather complicated, so we
     will not go into detail here and refer you to Ruczinski I,
     Kooperberg C, LeBlanc ML (2003), cited below. 

     There are two alternatives to the simulated annealing algorithm.
     One is a stepwise greedy selection of models. This is when setting
     'select = 6', and yields a sequence of models from size 1 through
     a maximum size. At each time among all the models that are one
     larger than the current model the best model is selected, yielding
     a sequence of models of different sizes. Usually these models are
     not the best  possible, and, if the simulated annealing chain is
     long enough, you should expect that the models selected using
     'select = 2' are better.

     The second alternative is to run a Markov Chain Monte Carlo (MCMC)
     algorithm. This is what is done in Monte Carlo Logic Regression.
     The algorithm used is a reversible jump MCMC algorithm, due to
     Green (1995). Other than the length of the Markov chain, the only
     parameter that needs to be set is a parameter for the geometric
     prior on model size. Other than in many MCMC problems, the goal in
     Monte Carlo Logic Regression is not to yield one single best
     predicting model, but rather to provide summaries of all models.
     These are exactly the elements that are shown above as the output
     when 'select = 7'.

     *MONITORING*

     The help file for 'logreg.anneal.control', contains more
     information on how to monitor the simulated annealing optimization
     for logreg. Here is some general information.

     *Find the best scoring model of any size* '(select = 1)'

     During the iterations the following information is printed out:

       log-temp  current score  best score     acc /  rej /  sing    current parameters
         -1.000          1.494       1.494    0(  0)      0     0  2.88   -1.99    0.00
         -1.120          1.150       1.043  655( 54)    220    71  3.63    0.15   -1.82
         -1.240          1.226       1.043  555( 49)    316    80  3.83    0.05   -1.71
            ...                                                                        
         -2.320          0.988       0.980  147( 36)    759    58  3.00   -2.11    1.11
         -2.440          0.982       0.980   25( 31)    884    60  2.89   -2.12    1.24
         -2.560          0.988       0.979   35( 61)    850    51  3.00   -2.11    1.11
            ...                                                                        
         -3.760          0.964       0.964    2( 22)    961    15  2.57   -2.15    1.55
         -3.880          0.964       0.964    0( 17)    961    22  2.57   -2.15    1.55
         -4.000          0.964       0.964    0( 13)    970    17  2.57   -2.15    1.55

     _log-temp:_
       logarithm (base 10) of the temperature at the last iteration
     before the print out.
      _current score:_
       the score after the last iterations.
      _best score:_
       the single lowest score seen at any iteration.
      _acc:_
       the number of proposed moves that were accepted since the last
     print out for which the model changed, within parenthesis, the
     number of those that were identical in score to the move before
     acceptance.
      _rej:_
       the number of proposed moves that gave numerically acceptable
     results, but were rejected by the simulated annealing algorithm
     since the last print out.
      _sing:_
       the number of proposed moves that were rejected because they
     gave numerically unacceptable results, for example because they
     yielded a singular system.
      _current parameters:_
       the values of the coefficients (first for the intercept, then
     for the linear (separate) components, then for the logic trees).

     This information can be used to judge the convergence of the
     simulated annealing algorithm, as described in the help file of
     'logreg.anneal.control'.  Typically we want (i) the number of
     acceptances to be high in the beginning, (ii) the number of
     acceptances with different scores to be low at the end, and (iii)
     the number of iterations when the fraction of acceptances is
     moderate to be as large as possible.

     *Find the best scoring models for various sizes* '(select = 2)'

     During the iterations the same information as for _find the best
     scoring model of any size_, for each size model considered.

     *Carry out cross-validation for model selection* '(select = 3)'

     Information about the simulated annealing as described above can
     be printed out. Otherwise, during the cross-validation process
     information like

     ' Step  5 of 10 [ 2 trees; 4 leaves] The CV score is   1.127 
     1.120  1.052  1.122 '

     The first of the four scores is the _training_-set score on the
     current validation sample, the second score is the average of all
     the _training_-set scores that have been processed for this model
     size, the third is the _test_-set score on the current validation
     sample, and the fourth score is the average of all the _test_-set
     scores that have been processed for this model size. Typically we
     would prefer the model with the lowest test-set score average over
     all cross-validation steps.

     *Carry out a permutation test to check for signal in the data*
     '(select = 4)'

     Information about the simulated annealing as described above can
     be printed out. Otherwise, first the score of the model of size 0
     (typically only fitting an intercept) and the score of the best
     model are printed out. Then during the permutation lines like

     ' Permutation number   21  out of   50  has score       1.47777 '

     are printed. Each score is the result of fitting a logic tree
     model, on data where the response has been permuted.  Typically we
     would believe that there is signal in the data if most
     permutations have worse (higher) scores than the best model, but
     we may believe that there is substantial over-fitting going on if
     these permutation scores are much better (lower) than the score of
     the model of size 0.

     *Carry out a permutation test for model selection* '(select = 5)'

     To be able to run this option, an object of class 'logreg' that
     was run with '(select = 2)' needs to be in place.  Information
     about the simulated annealing as described above can be printed
     out. Otherwise, lines like

     ' Permutation number    8  out of   25  has score       1.00767   
     model size  3 with 2 tree(s)'

     are printed.  We can compare these scores to the tree of the same
     size and the best tree.  If the scores are about the same as the
     one for the best tree, we think that the ``true'' model size may
     be the one that is being tested, while if the scores are much
     worse, the true model size may be larger. The comparison with the
     model of the same size suggests us again how much over-fitting may
     be going on. 'plot.logreg' generates informative histograms.

     *Greedy stepwise selection of Logic Regression models* '(select =
     6)'

     The scores of the best greedy models of each size are printed. 

     *Monte Carlo Logic Regression* '(select = 7)'

     A status line is printed every so many iterations. This
     information is probably not very useful, other than that it helps
     you figure out how far the code is.

     *PARAMETERS*

     As Logic Regression is written in Fortran 77 some parameters had
     to be hard coded in. The default values of these parameters are

     maximum sample size: 10000
      maximum number of columns in the input file: 1000
      maximum number of leaves in a logic tree: 128
      maximum number of logic trees: 5
      maximum number of separate parameters: 50
      maximum number of total parameters(separate + trees): 55

     If these parameters are not large enough (an error message will
     let you know this), you need to reset them in *slogic.f* and
     recompile. In particular, the statements defining these parameters
     are 

     'PARAMETER (LGCn1MAX   = 10000)'
      'PARAMETER (LGCn2MAX   =  1000)'
      'PARAMETER (LGCnknMAX  =   128)'
       'PARAMETER (LGCntrMAX  =     5)'
       'PARAMETER (LGCnsepMAX =    50)'
       'PARAMETER (LGCbetaMAX =    55)'


     The unfortunate thing is that you will have to change these
     parameter definitions several times in the code. So search until
     you have found them all.

_V_a_l_u_e:

     An object of the class 'logreg'. This contains virtually all input
     parameters, and in addition

     If 'select = 1:'

     an object of class 'logregmodel': the Logic Regression model. This
     model contains a list of 'ntrees' objects of class 'logregtree'.

     If 'select = 2' or 'select = 6':

     'nmodels:'  the number of models fitted.
      'allscores:'  a matrix with 3 columns, containing the scores of
     all models. Column 1 contains the score, column 2 the number of
     leaves and column 3 the number of trees.
      'alltrees:'  a list with 'nmodels' objects of class
     'logregmodel'.

     If 'select = 3:'

     'cvscores:' a matrix with the results of cross validation. The
     'train.ave' and 'test.ave' columns for train and test contain
     running averages of the scores for individual validation sets. As
     such these scores are of most interest for the rows where
     'k=kfold'.

     If 'select = 4:'

     'nullscore:' score of the null-model.
      'bestscore:'  score of the best model.
      'randscores:' scores of the permutations; vector of length
     'nrep'.

     If 'select = 5:'

     'bestscore:'  score of the best model.
      'randscores:' scores of the permutations; each column corresponds
     to one model size.

     If 'select = 7:'

     'size:' a matrix with two columns, indicating which size models
     were fit how often.
      'single:' a vector with as many elements as there are binary
     predictors. 'single[i]' shows how often predictor  'i' is in any
     of the MCMC models. Note that when a predictor is twice in the
     same model, it is only counted once, thus, in particular,
     'sum(size[,1]*size[,2]' will typically be slightly larger than
     'sum(single)'.
      'double:' square matrix with as size the number of binary
     predictors. 'double[i,j]' shows how often predictors 'i' and 'j'
     are in the same tree of the same MCMC model if 'i>j', if 'i<=j'
     'double[i,j]' equals zero. Note that for models with several logic
     trees two predictors can both be in the model but not be in the
     same tree.
      'triple:' square 3D array with as size the number of binary
     predictors. See 'double', but here 'triple[i,j,k]' shows how often
      three predictors are jointly in one logic tree.
      In addition, the file 'slogiclisting.tmp'  in the current working
     directory can be created. This file contains a compact listing of
     all models visited. Column 1: proportional to the log posterior
     probability of the model; column 2: score (log-likelihood); column
     3 and 4: how often was this model visited (this column is here
     twice for historical reasons), column 5 through 4 + maximum number
     of leaves: summary of the first tree, if there are two trees,
     column 5 + maximum number of leaves through 4 + twice the maximum
     number of leaves contains the second tree, and so on.
      In this compact notation, leaves are in the same sequence as the
     rows in a 'logregtree' object; a zero means that the leave is
     empty, a 1000 means an ``and'' and a 2000 an ``or'', any other
     positive number means a predictor and a negative number means
     ``not'' that predictor.
      The 'mc.control' element 'output' can be used to surppress the
     creation of 'double', 'triple', and/or 'slogiclisting.tmp'. There
     doesn't seem to be too much use in surppressing 'double'.
     Surpressing 'triple' speeds up computations a bit (in particular
     on machines with limited memory when there are many binary
     predictors), and reduces the size of both the code and the object,
     surppressing 'slogiclisting.tmp' saves the creation of a possibly
     very large file, which can slow down the code considerably. See
     'logreg.mc.control' for details.

_A_u_t_h_o_r(_s):

     Ingo Ruczinski ingo@jhu.edu and Charles Kooperberg clk@fhcrc.org.

_R_e_f_e_r_e_n_c_e_s:

     Ruczinski I, Kooperberg C, LeBlanc ML (2003).  Logic Regression,
     _Journal of Computational and Graphical Statistics_, *12*,
     475-511.

     Ruczinski I, Kooperberg C, LeBlanc ML (2002).  Logic Regression -
     methods and software.  _Proceedings of the MSRI workshop on
     Nonlinear Estimation and Classification_ (Eds: D. Denison, M.
     Hansen, C. Holmes, B. Mallick, B. Yu), Springer: New York,
     333-344.

     Kooperberg C, Ruczinski I, LeBlanc ML, Hsu L (2001). Sequence
     Analysis using Logic Regression, _Genetic Epidemiology_, *21*,
     S626-S631.

     Kooperberg C, Ruczinki I (2005). Identifying interacting SNPs
     using Monte Carlo Logic Regression, Genetic Epidemiology, in
     press.

     Selected chapters from the dissertation of Ingo Ruczinski,
     available from <URL:
     http://bear.fhcrc.org/~ingor/logic/documents/myphd-logic.pdf>

_S_e_e _A_l_s_o:

     'eval.logreg', 'frame.logreg', 'plot.logreg', 'print.logreg',
     'predict.logreg', 'logregtree', 'plot.logregtree',
     'print.logregtree', 'logregmodel', 'plot.logregtree',
     'print.logregtree', 'logreg.myown', 'logreg.anneal.control',
     'logreg.tree.control',  'logreg.mc.control', 'logreg.testdat'

_E_x_a_m_p_l_e_s:

     data(logreg.savefit1,logreg.savefit2,logreg.savefit3,logreg.savefit4,
          logreg.savefit5,logreg.savefit6,logreg.savefit7,logreg.testdat)

     myanneal <- logreg.anneal.control(start = -1, end = -4, iter = 2500, update = 100)
     # in practie we would use 25000 iterations or more - the use of 2500 is only
     # to have the examples run fast
     ## Not run: myanneal <- logreg.anneal.control(start = -1, end = -4, iter = 25000, update = 1000)
     fit1 <- logreg(resp = logreg.testdat[,1], bin=logreg.testdat[, 2:21], type = 2,
                    select = 1, ntrees = 2, anneal.control = myanneal)
     # the best score should be in the 0.97-0.98 range
     plot(fit1)
     # you'll probably see X1-X4 as well as a few noise predictors
     # use logreg.savefit1 for the results with 25000 iterations
     plot(logreg.savefit1)
     print(logreg.savefit1)
     z <- predict(logreg.savefit1)
     plot(z, logreg.testdat[,1]-z, xlab="fitted values", ylab="residuals")
     # there are some streaks, thanks to the very discrete predictions
     #
     # a bit less output
     myanneal2 <- logreg.anneal.control(start = -1, end = -4, iter = 1000, update = 0)
     # in practie we would use 25000 iterations or more - the use of 1000 is only 
     # to have the examples run fast
     ## Not run: myanneal2 <- logreg.anneal.control(start = -1, end = -4, iter = 25000, update = 0)
     #
     # fit multiple models
     fit2 <- logreg(resp = logreg.testdat[,1], bin=logreg.testdat[, 2:21], type = 2,
                   select = 2, ntrees = c(1,2), nleaves =c(1,7), anneal.control = myanneal2)
     # equivalent
     fit2 <- logreg(select = 2, ntrees = c(1,2), nleaves =c(1,7), oldfit = fit1,
                    anneal.control = myanneal2)
     plot(fit2)
     print(fit2)
     # use logreg.savefit2 for the results with 25000 iterations
     plot(logreg.savefit2)
     print(logreg.savefit2)
     # After an initial steep decline, the scores only get slightly better
     # for models with more than four leaves and two trees. 
     #
     # cross validation
     fit3 <- logreg(resp = logreg.testdat[,1], bin=logreg.testdat[, 2:21], type = 2,
                    select = 3, ntrees = c(1,2), nleaves=c(1,7), anneal.control = myanneal2)
     # equivalent
     fit3 <- logreg(select = 3, oldfit = fit2)
     plot(fit3)
     # use logreg.savefit3 for the results with 25000 iterations
     plot(logreg.savefit3)
     # 4 leaves, 2 trees should top
     # null model test
     fit4 <- logreg(resp = logreg.testdat[,1], bin=logreg.testdat[, 2:21], type = 2,
                    select = 4, ntrees = 2, anneal.control = myanneal2)
     # equivalent
     fit4 <- logreg(select = 4, anneal.control = myanneal2, oldfit = fit1)
     plot(fit4)
     # use logreg.savefit4 for the results with 25000 iterations
     plot(logreg.savefit4)
     # A histogram of the 25 scores obtained from the permutation test. Also shown
     # are the scores for the best scoring model with one logic tree, and the null
     # model (no tree). Since the permutation scores are not even close to the score
     # of the best model with one tree (fit on the original data), there is overwhelming
     # evidence against the null hypothesis that there was no signal in the data. 
     fit5 <- logreg(resp = logreg.testdat[,1], bin=logreg.testdat[, 2:21], type = 2,
                    select = 5, ntrees = c(1,2), nleaves=c(1,7), anneal.control = myanneal2,
                    nrep = 10, oldfit = fit2)
     # equivalent
     fit5 <- logreg(select = 5, nrep = 10, oldfit = fit2)
     plot(fit5)
     # use logreg.savefit5 for the results with 25000 iterations and 25 permutations
     plot(logreg.savefit5)
     # The permutation scores improve until we condition on a model with two trees and
     # four leaves, and then do not change very much anymore. This indicates that the
     # best model has indeed four leaves.
     #
     # greedy selection
     fit6 <- logreg(select = 6, ntrees = 2, nleaves =c(1,12), oldfit = fit1)
     plot(fit6)
     # use logreg.savefit6 for the results with 25000 iterations
     plot(logreg.savefit6)
     #
     # Monte Carlo Logic Regression
     fit7 <- logreg(select = 7, oldfit = fit1, mc.control=
                    logreg.mc.control(nburn=1000, niter=10000, hyperpars=log(2)))
     # we need many more iterations for reasonable results
     ## Not run: 
     logreg.savefit7 <- logreg(select = 7, oldfit = fit1, mc.control=
                    logreg.mc.control(nburn=1000, niter=100000, hyperpars=log(2)))
     ## End(Not run)
     #
     plot(fit7)
     # use logreg.savefit7 for the results with 25000 iterations
     plot(logreg.savefit7)

