simulator               package:boost               R Documentation

_s_i_m_u_l_a_t_o_r

_D_e_s_c_r_i_p_t_i_o_n:

     Simulation of (microarray) data according to correlation and mean
     structures from real datasets.

_U_s_a_g_e:

     simulator(x, y, respmod = c("none", "resp1", "resp2", "resp3"),
     nos = 1200, gene = NULL, signs = NULL)

_A_r_g_u_m_e_n_t_s:

       x: A (n x p)-matrix, whose correlation and mean structure is to
          be used for simulating data. Its rows correspond to training
          instances and columns contain the predictor variables.

       y: A vector of length n containing the class labels, which need
          to be coded by 0 and 1.

 respmod: A character string. With "none", the class labels of the
          simulated instances are determined model-free, according to
          the class whose mean and correlation structure was used to
          generate each instance. With "resp1", "resp2" or "resp3", a
          response model is applied instead. For "resp1", 10 genes are
          selected and determine the conditional probabilities p via
          a logistic model with equal weights; the class labels are
          then drawn from a Bernoulli distribution with probability p.
          For "resp2", 25 genes enter the logistic model with unequal
          weights. For "resp3", 25 genes are chosen for a logistic
          model with second- and third-order interactions.

     nos: An integer, giving the number of instances which are
          simulated.

    gene: A vector giving the indices of the genes to be used for
          model-based class label simulation. Defaults to NULL. This
          argument should only be used in specially designed
          simulation studies, where it is important that the same
          predictor variables are used repeatedly for simulating the
          class labels.

   signs: A vector containing entries of +1 and -1. Defaults to NULL.
          Like gene, this argument only matters in specially designed
          simulation studies, where it is important that the same
          predictor variables are used repeatedly for simulating the
          class labels.
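
     The "resp1" response model described under respmod can be
     sketched in base R as follows. This is only an illustration, not
     the package's code: the toy predictor matrix, the unit weight
     magnitude and the random choice of gene and signs are
     assumptions made for the example.

```r
set.seed(2)

nos <- 500   # number of simulated instances
p   <- 40    # number of predictor variables (toy size)
x_sim <- matrix(rnorm(nos * p), nrow = nos)    # toy simulated predictors

gene  <- sample(p, 10)                         # indices of the 10 selected genes
signs <- sample(c(-1, 1), 10, replace = TRUE)  # polarity of each selected gene

# Equal weights: every selected gene contributes with the same magnitude
# (a magnitude of 1 is assumed here), only the sign differs.
eta    <- drop(x_sim[, gene] %*% signs)         # linear predictor
probab <- plogis(eta)                           # P(Y = 1 | x), logistic function
y      <- rbinom(nos, size = 1, prob = probab)  # Bernoulli labels coded 0/1
```

     Passing the same gene and signs to repeated runs keeps the
     response model fixed while the simulated predictors change.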

_D_e_t_a_i_l_s:

     The new instances are simulated from a multivariate normal
     distribution with means and correlation structure taken from a
     real (gene expression) dataset. This structure is obtained by
     transforming a standard multivariate normal distribution, which
     requires an eigenvalue decomposition based on the provided real
     dataset. For datasets with many predictors (>500), this can be
     fairly time consuming. Simulating data without a response model
     (respmod = "none") is fine for most purposes; usually only
     special analysis tasks require a response model.
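
     The transformation of standard normal draws described above can
     be sketched in base R. This is a minimal illustration under
     stated assumptions, not the package's implementation: it uses a
     small toy matrix in place of a real dataset, a single class, and
     the empirical covariance matrix as the structure to reproduce.

```r
set.seed(1)

# Toy stand-in for a real expression matrix: 30 samples, 5 variables.
x_real <- matrix(rnorm(30 * 5), nrow = 30, ncol = 5)
x_real[, 2] <- x_real[, 1] + 0.5 * x_real[, 2]   # induce some correlation

mu    <- colMeans(x_real)   # empirical mean vector
sigma <- cov(x_real)        # empirical covariance matrix

# Eigenvalue decomposition: sigma = V diag(lambda) t(V).
eig    <- eigen(sigma, symmetric = TRUE)
lambda <- pmax(eig$values, 0)                   # guard against tiny negatives
A      <- eig$vectors %*% diag(sqrt(lambda))    # A %*% t(A) equals sigma

nos   <- 2000
z     <- matrix(rnorm(nos * ncol(x_real)), nrow = nos)  # standard normal draws
x_sim <- sweep(z %*% t(A), 2, mu, "+")                  # rows ~ N(mu, sigma)
```

     The eigen() call is the expensive step; its cost grows quickly
     with the number of columns, which is consistent with the remark
     about datasets with more than 500 predictors.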

_V_a_l_u_e:

     Returns a list containing:

       x: An (nos x p)-matrix, containing the simulated data.

       y: A vector of length nos, containing the class labels of the
          simulated data.

  probab: A vector of length nos, containing the conditional
          probabilities of the simulated data. Empty if
          respmod="none".

   bayes: An integer, giving the Bayes error (theoretically minimal
          misclassification risk) for the simulated data. Empty if
          respmod="none".

    gene: A vector, containing the indices of the variables which
          were used in the logistic model for "resp1", "resp2" or
          "resp3". Empty if respmod="none".

   signs: A vector, containing -1 and +1. Indicates the sign with
          which each predictor variable was used in the logistic
          model. Empty if respmod="none".
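
     To illustrate how conditional probabilities such as probab relate
     to a Bayes error, the sketch below uses the standard identity
     that the minimal misclassification risk is the mean of
     min(p, 1 - p). Whether the package computes bayes in exactly this
     way is not stated here, so treat it as an assumption; the
     probabilities are toy values, not output of simulator().

```r
set.seed(3)

probab <- runif(1000)                         # toy conditional probabilities
y      <- rbinom(length(probab), 1, probab)   # labels drawn from them

# The Bayes rule predicts class 1 whenever P(Y = 1 | x) exceeds 0.5.
bayes_pred <- as.integer(probab > 0.5)

# Theoretical minimal risk: average probability of the less likely class.
bayes_risk <- mean(pmin(probab, 1 - probab))

# Observed error of the Bayes rule on the drawn labels, for comparison.
empirical_err <- mean(bayes_pred != y)
```

     No classifier can be expected to beat bayes_risk on such data,
     which makes it a useful baseline when comparing test errors.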

_R_e_f_e_r_e_n_c_e_s:

_o "BagBoosting for Tumor Classification with Gene Expression Data",
     Marcel Dettling. To appear in Bioinformatics (2005).

_o Further information is available from the webpage <URL:
     http://stat.ethz.ch/~dettling>

_E_x_a_m_p_l_e_s:

     set.seed(21)
     data(leukemia)

     ## Simulation of gene expression data
     simu <- simulator(leukemia.x, leukemia.y, nos=200)

     ## Defining training and test data
     xlearn <- simu$x[1:150,]
     ylearn <- simu$y[1:150]
     xtest  <- simu$x[151:200,]
     ytest  <- simu$y[151:200]

     ## Classification with logitboost
     fit <- logitboost(xlearn, ylearn, xtest, mfinal=20, presel=50)
     summarize(fit, ytest)

