Robust Multinomial Regression  package:multinomRob  R Documentation

_M_u_l_t_i_n_o_m_i_a_l _R_o_b_u_s_t _E_s_t_i_m_a_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     'multinomRob' fits the overdispersed multinomial regression model
     for grouped count data using the hyperbolic tangent (tanh) and
     least quartile difference (LQD) robust estimators.

_U_s_a_g_e:

       multinomRob(model, data, starting.values=NULL, equality=NULL,
                   genoud.parms=NULL, print.level=0, iter = FALSE,
                   maxiter = 10, multinom.t=1, multinom.t.df=NA)

_A_r_g_u_m_e_n_t_s:

   model: The regression model specification.  This is a list of
          formulas, with one formula for each category of outcomes for
          which counts have been measured for each observation.  For
          example, in the following,

          'model=list(y1 ~ x1, y2 ~ x2, y3 ~ 0)'

          the outcome variables containing counts are 'y1', 'y2' and
          'y3', and the linear predictor for 'y1' is a coefficient
          times 'x1' plus a constant, the linear predictor for 'y2' is
          a coefficient times 'x2' plus a constant, and the linear
          predictor for 'y3' is zero.  Each formula has the format
          'countvar ~ RHS', where 'countvar' is the name of a vector,
          in the dataframe referenced by the 'data' argument, that
          gives the counts for all observations for one category. 
          'RHS' denotes the righthand side of a formula using the usual
          syntax for formulas, where each variable in the formula is
          the name of a vector in the dataframe referenced by the
          'data' argument. For example, a 'RHS' specification of 'var1
          + var2*var3' would specify that the regressors are to be
          'var1', 'var2', 'var3', the terms generated by the
          interaction 'var2:var3', and the constant.
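
          As a minimal sketch of how a 'RHS' expands, the base R
          function 'terms' can list the terms a formula generates.
          The variable names are the placeholders from the text above,
          and this only illustrates standard formula syntax, not the
          package's internal processing.

```r
# Expand a right-hand side with base R to see which terms it generates.
rhs <- terms(~ var1 + var2 * var3)
term_labels <- attr(rhs, "term.labels")      # regressors and interactions
has_constant <- attr(rhs, "intercept") == 1  # TRUE when a constant is included

print(term_labels)
print(has_constant)

# A formula such as 'y3 ~ 0' generates no terms and no constant.
ref <- terms(y3 ~ 0)
ref_labels <- attr(ref, "term.labels")
ref_constant <- attr(ref, "intercept") == 1
```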

          The set of outcome alternatives may be specified to vary over
          observations, by putting in a negative value for alternatives
          that do not exist for particular observations.  If the value
          of an outcome variable is negative for an observation, then
          that outcome is considered not available for that
          observation.  The predicted counts for that observation are
          defined only for the available outcomes and are based on
          the linear predictors for the available outcomes.  The
          same set of coefficient parameter values is used for all
          observations.  Any observation for which fewer than two
          outcomes are available is omitted.
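
          The availability rule can be sketched in base R as follows.
          The counts and linear predictors are made-up values, and the
          logit-type probability formula is an assumption chosen to
          match the data-generating code in the Examples section; the
          package's own computations may differ in detail.

```r
# Hypothetical counts for 4 observations and 3 outcomes; negative = unavailable.
counts <- cbind(y1 = c(10, 5, -1, -1),
                y2 = c(20, -1, 7, -1),
                y3 = c(30, 15, 3, 9))
available <- counts >= 0

# Observations with fewer than two available outcomes would be omitted
# (row 4 here has only one available outcome).
keep <- rowSums(available) >= 2

# Hypothetical linear predictors, one column per outcome (y3 is the reference).
eta <- cbind(y1 = c(0.5, 0.2, -0.1, 0.3),
             y2 = c(1.0, 0.4, 0.6, -0.2),
             y3 = 0)

# Logit-type probabilities, renormalized over the available outcomes only.
w <- exp(eta) * available
prob <- w / rowSums(w)
prob[!available] <- NA

# Predicted counts use each observation's total over its available outcomes.
total <- rowSums(counts * available)
pred <- prob * total
```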

          Observations with missing data ('NA') in any outcome variable
          or regressor are omitted (listwise deletion).
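
          The effect of listwise deletion can be previewed with base
          R's 'complete.cases'.  The dataframe below is hypothetical,
          and the sketch assumes every column is referenced by the
          model.

```r
# Hypothetical data with missing values in a regressor and in an outcome.
dtf_na <- data.frame(x1 = c(0.1, NA, 0.3),
                     y1 = c(4, 5, 6),
                     y2 = c(1, 2, NA),
                     y3 = c(7, 8, 9))

# Rows with any NA are dropped; only the first row survives here.
dtf_used <- dtf_na[complete.cases(dtf_na), ]
print(nrow(dtf_used))
```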

          In a model that has the same regressors for every category,
          except for one category for which there are no regressors in
          order to identify the model (the reference category), the
          'RHS' specification must be given for all the categories
          except the reference category.  The formula for the reference
          category must include a 'RHS' specification that explicitly
          omits the constant, e.g., 'countvar ~ -1' or 'countvar ~ 0'. 
          The number of coefficient parameters to be estimated equals
          the number of terms generated by all the formulas, subject to
          equality constraints that may be specified using the
          'equality' argument.
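
          A rough way to count the coefficient parameters, assuming
          all regressors are numeric (so each term contributes one
          coefficient), is sketched below.  The helper 'count_terms'
          is hypothetical and not part of the package.

```r
model <- list(y1 ~ x1, y2 ~ x2, y3 ~ 0)

# Number of terms in one formula: regressor terms plus the constant, if any.
# This equals the number of coefficients only when no term is a factor.
count_terms <- function(f) {
  tt <- terms(f)
  length(attr(tt, "term.labels")) + attr(tt, "intercept")
}

n_terms <- vapply(model, count_terms, numeric(1))
n_unconstrained <- sum(n_terms)

# One equality constraint tying two coefficients together removes one
# free parameter.
n_free <- n_unconstrained - (2 - 1)
print(c(unconstrained = n_unconstrained, free = n_free))
```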

    data: The dataframe that contains all the variables referenced in
          the 'model' argument, which are the data to be analyzed.

starting.values: Starting values for the regression coefficient
          parameters, as a vector. The parameter ordering matches the
          ordering of the formulas in the 'model' argument:  parameters
          for the terms in the first formula appear first, then come
          parameters for the terms in the second formula, etc.  In
          practice it will usually be better to start by letting
          multinomRob find starting values by using the 'multinom.t'
          option, then using the results from one run as starting
          values for a subsequent run done with, perhaps, a larger
          population of operators for rgenoud.
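
          The ordering described above can be sketched by listing the
          terms formula by formula.  The labels and the helper
          'label_terms' are hypothetical, and placing the constant
          before the regressors within a formula is an assumption
          based on the usual model-matrix convention.

```r
model <- list(y1 ~ x1, y2 ~ x2, y3 ~ 0)

# Label the terms of one formula as "outcome:term", constant first.
label_terms <- function(f) {
  tt <- terms(f)
  outcome <- as.character(f[[2]])
  labs <- c(if (attr(tt, "intercept") == 1) "(Intercept)",
            attr(tt, "term.labels"))
  if (length(labs) == 0) character(0) else paste(outcome, labs, sep = ":")
}

# Parameters for the first formula come first, then the second, and so on.
param_order <- unlist(lapply(model, label_terms))

# A starting-value vector laid out in that order (illustrative values).
starting <- setNames(c(0, 1, 0, 1), param_order)
print(starting)
```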

equality: List of equality constraints.  This is a list of lists of
          formulas.  Each formula has the same format as in the model
          specification, and must include only a subset of the outcomes
          and regressors used in the model specification formulas.  All
          the coefficients specified by the formulas in each list will
          be constrained to have the same value during estimation.  For
          example, in the following,

          'multinomRob(model=list(y1 ~ x1, y2 ~ x2, y3 ~ 0), data=dtf,
          equality=list(list(y1 ~ x1 + 0, y2 ~ x2 + 0)) );'

          the model to be estimated is

          'list(y1 ~ x1, y2 ~ x2, y3 ~ 0)'

          and the coefficients of x1 and x2 are constrained equal by

          'equality=list(list(y1 ~ x1 + 0, y2 ~ x2 + 0))'

          In the equality formulas it is necessary to say '+ 0' so the
          intercepts are not involved in the constraints.  If a
          parameter occurs in two different lists in the 'equality='
          argument, then all the parameters in the two lists are
          constrained to be equal to one another.  In the output this
          is described as consolidating the lists.
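
          The consolidation rule can be illustrated with a small base
          R sketch that merges any groups sharing a parameter.  The
          function 'consolidate' and the parameter labels are
          hypothetical; the package's own bookkeeping is not shown
          here.

```r
# Merge groups of parameter labels that share at least one member.
consolidate <- function(groups) {
  merged <- list()
  for (g in groups) {
    # Which already-merged groups overlap with the new group?
    overlaps <- vapply(merged, function(m) any(g %in% m), logical(1))
    # Absorb every overlapping group into the new one.
    g <- unique(c(g, unlist(merged[overlaps])))
    merged <- c(merged[!overlaps], list(g))
  }
  merged
}

# "y2:x2" appears in the first two groups, so those two are combined.
groups <- list(c("y1:x1", "y2:x2"),
               c("y2:x2", "y3:x3"),
               c("y1:z", "y2:z"))
result <- consolidate(groups)
print(result)
```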

genoud.parms: List of named arguments used to control the rgenoud
          optimizer, which is used to compute the LQD estimator.

print.level: Specify 0 for minimal printing, 1 to print more detailed
          information about LQD and other intermediate computations, 2
          to print details about the tanh computations, or 3 to print
          details about starting values computations.

    iter: 'TRUE' means to iterate between LQD and tanh estimation steps
          until the algorithm converges, the number of iterations
          specified by the 'maxiter' argument is reached, or an LQD
          step produces a larger value for the overdispersion scale
          parameter than the previous step did.  This option often
          improves the fit of the model.

 maxiter: The maximum number of iterations to be done between LQD and
          tanh estimation steps.
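
          The three stopping rules for 'iter=TRUE' can be mimicked by
          the toy function below.  It does no estimation: the sequence
          of scale values and the iteration at which convergence would
          be declared are supplied by hand, and 'stop_reason' is a
          hypothetical name used only for illustration.

```r
# sigma2_steps: scale values that successive LQD steps would produce.
# converged_at: iteration at which a convergence test would first succeed.
stop_reason <- function(sigma2_steps, converged_at = Inf, maxiter = 10) {
  stopifnot(length(sigma2_steps) >= maxiter)
  previous <- Inf
  for (k in seq_len(maxiter)) {
    # Stop if this LQD step gives a larger scale than the previous step.
    if (sigma2_steps[k] > previous) {
      return(list(iter = k, reason = "LQD scale increased"))
    }
    if (k >= converged_at) {
      return(list(iter = k, reason = "converged"))
    }
    previous <- sigma2_steps[k]
  }
  list(iter = maxiter, reason = "maxiter reached")
}

r_increase <- stop_reason(c(4, 3, 3.5, 2), maxiter = 4)
r_converge <- stop_reason(c(4, 3, 2, 1), converged_at = 2, maxiter = 4)
r_maxiter  <- stop_reason(c(5, 4, 3, 2, 1), maxiter = 3)
```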

multinom.t: '1' means use the multinomial multivariate-t model to
          compute starting values for the coefficient parameters,
          unless the nonrobust maximum likelihood (MNL) results are
          better (as judged by the LQD fit), in which case the MNL
          values are used instead.  '0' means use nonrobust maximum
          likelihood estimates for a multinomial regression model.  '2'
          forces the use of the multivariate-t model for starting
          values even if the MNL estimates provide better starting
          values for the LQD.  Note that with 'multinom.t=1' or
          'multinom.t=2', multivariate-t starting values will not be
          used if the model cannot generate valid standard errors.  To
          force the use of multivariate-t estimates even in this
          circumstance, see the 'multinom.t.df' argument.

          If the 'starting.values' argument is not 'NULL', the starting
          values given in that argument are used and the 'multinom.t'
          argument is ignored.  Multinomial multivariate-t starting
          values are not available when the number of outcome
          alternatives varies over the observations.

multinom.t.df: 'NA' means that the degrees of freedom (DF) for the
          multivariate-t model (when used) should be estimated.  If
          'multinom.t.df' is a number, that number will be used for the
          degrees of freedom and the DF will not be estimated.  Only a
          positive number should be used. Setting 'multinom.t.df' to a
          number also implies that, if 'multinom.t=1' or
          'multinom.t=2', the multivariate-t starting values will be
          used (depending on the comparison with the MNL estimates if
          'multinom.t=1' is set) even if the standard errors are not
          defined. 

_D_e_t_a_i_l_s:

     The tanh estimator is a redescending M-estimator, and the LQD
     estimator is a generalized S-estimator.  The LQD is used to
     estimate the scale of the overdispersion.  Given that scale
     estimate, the tanh estimator is used to estimate the coefficient
     parameters of the linear predictors of the multinomial regression
     model. 
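
     To give a feel for why an LQD-type scale resists outliers, here is
     a minimal sketch for a plain vector of residuals.  It assumes the
     usual definition from the robust regression literature (an order
     statistic of the pairwise absolute differences of the residuals),
     omits any consistency constant, and is not the package's
     implementation; 'lqd_scale' and its argument 'p' (the number of
     regression parameters) are hypothetical names.

```r
# Unscaled least quartile difference of a residual vector r.
lqd_scale <- function(r, p = 1) {
  n <- length(r)
  h <- floor((n + p + 1) / 2)
  k <- choose(h, 2)                # index of the order statistic to take
  d <- sort(as.numeric(dist(r)))   # all pairwise absolute differences
  d[k]
}

r_clean   <- c(0, 1, 2, 3, 4)
r_outlier <- c(0, 1, 2, 3, 100)    # one gross outlier

# The LQD-type scale is unchanged by the outlier, while the standard
# deviation is inflated by it.
print(c(lqd_clean = lqd_scale(r_clean), lqd_outlier = lqd_scale(r_outlier)))
print(c(sd_clean = sd(r_clean), sd_outlier = sd(r_outlier)))
```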

     If starting values are not supplied, they are computed by default
     using a multinomial multivariate-t model (see the 'multinom.t'
     argument).  The program also computes and reports nonrobust
     maximum likelihood estimates for the multinomial regression
     model, reporting sandwich estimates for the standard errors that
     are adjusted for a nonrobust estimate of the error dispersion.

_V_a_l_u_e:

     multinomRob returns a list.  The 16 objects documented below
     are:

coefficients: The tanh coefficient estimates in matrix format.  The
          matrix has one column for each formula specified in the
          'model' argument.  The name of each column is the name used
          for the count variable in the corresponding formula.  The
          label for each row of the matrix gives the names of the
          regressors to which the coefficient values in the row apply. 
          The regressor names in each label are separated by a forward
          slash (/), and 'NA' is used to denote that no regressor is
          associated with the corresponding value in the matrix.  The
          value 0 is used in the matrix to fill in for values that do
          not correspond to a 'model' formula regressor.

      se: The tanh coefficient estimate standard errors in matrix
          format.  The format and labelling used for the matrix are
          the same as those used for the 'coefficients'.  The standard
          errors are derived from the estimated asymptotic sandwich
          covariance matrix.

LQDsigma2: The LQD dispersion (variance) parameter estimate.  This is
          the LQD estimate of the scale value, squared.

TANHsigma2: The tanh dispersion parameter estimate.

 weights: The matrix of tanh weights for the orthogonalized residuals. 
          The matrix has one row for each observation in the data and
          as many columns as there are formulas specified in the
          'model' argument.  The first column of the matrix has names
          for the observations, and the remaining columns contain the
          weights.  Each of the latter columns has a name derived from
          the name of one of the count variables named in the 'model'
          argument.  If 'count1' is the name of the count variable used
          in the first formula, then the second column in the matrix is
          named 'weights:count1', etc.

          If an observation has negative values specified for some
          outcome variables, indicating that those outcome alternatives
          are not available for that observation, then values of 'NA'
          appear in the weights matrix for that observation, as many
          'NA' values as there are unavailable alternatives.  The 'NA'
          values will be the last values in the affected row of the
          weights matrix, regardless of which outcome alternatives were
          unavailable for the observation.

   Hdiag: Weights used to fully studentize the orthogonalized
          residuals.  The matrix has one row for each observation in
          the data and as many columns as there are formulas specified
          in the 'model' argument.  The first column of the matrix has
          names for the observations, and the remaining columns contain
          the weights.  Each of the latter columns has a name derived
          from the name of one of the count variables named in the
          'model' argument.  If 'count1' is the name of the count
          variable used in the first formula, then the second column in
          the matrix is named 'Hdiag:count1', etc.

          If an observation has negative values specified for some
          outcome variables, indicating that those outcome alternatives
          are not available for that observation, then values of 0
          appear in the weights matrix for that observation, as many 0
          values as there are unavailable alternatives.  Values of 0
          that are created for this reason will be the last values in
          the affected row of the weights matrix, regardless of which
          outcome alternatives were unavailable for the observation.

    prob: The matrix of predicted probabilities for each category for
          each observation based on the tanh coefficient estimates.

residuals.rotate: Matrix of studentized residuals which have been made
          comparable by rotating each choice category to the first
          position.  These residuals, unlike the student and standard
          residuals below, are no longer orthogonalized because of the
          rotation.  These are the residuals displayed in Table 6 of
          the reference article.

residuals.student: Matrix of fully studentized orthogonalized
          residuals.

residuals.standard: Matrix of orthogonalized residuals, standardized by
          dividing by the overdispersion scale.

     mnl: List of nonrobust maximum likelihood estimation results from
          function 'multinomMLE'.

multinomT: List of multinomial multivariate-t estimation results from
          function 'multinomT'.

  genoud: List of LQD estimation results obtained by rgenoud
          optimization, from function 'genoudRob'.

   mtanh: List of tanh estimation results from function 'mGNtanh'.

   error: Exit error code, usually from function 'mGNtanh'.

    iter: Number of LQD-tanh iterations.

_A_u_t_h_o_r(_s):

     Walter R. Mebane, Jr., Cornell University, wrm1@cornell.edu, <URL:
     http://macht.arts.cornell.edu/wrm1/> 

     Jasjeet S. Sekhon, UC Berkeley, sekhon@berkeley.edu, <URL:
     http://sekhon.polisci.berkeley.edu/>

_R_e_f_e_r_e_n_c_e_s:

     Walter R. Mebane, Jr. and Jasjeet Singh Sekhon. 2004.  "Robust
     Estimation and Outlier Detection for Overdispersed Multinomial
     Models of Count Data."  _American Journal of Political Science_
     48 (April): 391-410. <URL:
     http://macht.arts.cornell.edu/wrm1/multinom.pdf>

     For the most current code and related material see <URL:
     http://sekhon.polisci.berkeley.edu/robust/>

_E_x_a_m_p_l_e_s:

     # load the package, which provides multinomRob() and rmultinomial()
     library(multinomRob)

     # make some multinomial data
     x1 <- rnorm(50);
     x2 <- rnorm(50);
     p1 <- exp(x1)/(1+exp(x1)+exp(x2));
     p2 <- exp(x2)/(1+exp(x1)+exp(x2));
     p3 <- 1 - (p1 + p2);
     y <- matrix(0, 50, 3);
     for (i in 1:50) {
       y[i,] <- rmultinomial(1000, c(p1[i], p2[i], p3[i]));
     }

     # perturb the first 5 observations
     y[1:5,c(1,2,3)] <- y[1:5,c(3,1,2)];
     y1 <- y[,1];
     y2 <- y[,2];
     y3 <- y[,3];

     # put data into a dataframe
     dtf <- data.frame(x1, x2, y1, y2, y3);

     ## Set parameters for Genoud
     zz.genoud.parms <- list( pop.size             = 1000,
                             wait.generations      = 10,
                             max.generations       = 100,
                             scale.domains         = 5,
                             print.level = 0
                             )

     # estimate a model, with "y3" being the reference category
     # true coefficient values are:  (Intercept) = 0, x = 1
     # impose an equality constraint
     # equality constraint:  coefficients of x1 and x2 are equal
     mulrobE <- multinomRob(list(y1 ~ x1, y2 ~ x2, y3 ~ 0),
                           dtf,
                           equality = list(list(y1 ~ x1 + 0, y2 ~ x2 + 0)),
                           genoud.parms = zz.genoud.parms,
                           print.level = 3, iter=FALSE);
     summary(mulrobE, weights=TRUE);

