reclassify              package:perturb              R Documentation

_C_a_l_l_e_d _b_y _p_e_r_t_u_r_b _t_o _c_a_l_c_u_l_a_t_e _r_e_c_l_a_s_s_i_f_i_c_a_t_i_o_n _t_a_b_l_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     'reclassify' is called by 'perturb' to calculate reclassification
     probabilities for categorical variables. Use separately to
     experiment with reclassification probabilities.

_U_s_a_g_e:

     reclassify(varname, pcnt = NULL, adjust = TRUE, bestmod = TRUE,
     min.val = .1, diag = NULL, unif = NULL, dist = NULL, assoc = NULL)

     ## S3 method for class 'reclassify':
     print(x, dec.places = 3, full = FALSE, ...)

_A_r_g_u_m_e_n_t_s:

 varname: a factor to be reclassified

    pcnt: initial reclassification percentages

  adjust: makes the expected frequency distribution of the reclassified
          variable equal to that of the original

 bestmod: imposes an appropriate pattern of association between the
          original and the reclassified variable

 min.val: value to add to empty cells of the initial expected table
          when estimating the best model

    diag: the log-odds of same versus different category
          reclassification

    unif: controls short distance versus long distance reclassification
          for ordered variables

    dist: alternative parameter for short versus long distance
          reclassification

   assoc: a matrix defining a loglinear pattern of association

       x: a 'reclassify' object to be printed

dec.places: number of decimal places to use when printing

    full: if TRUE, some extra information is printed

     ...: arguments to be passed on to or from other methods. Print
          options for class 'matrix' may be used, e.g. 'print.gap'

_D_e_t_a_i_l_s:

     'reclassify' creates a table of reclassification probabilities for
     _varname_. By default, the reclassification probabilities are
     defined so that the expected frequency distribution of the
     reclassified variable is identical to that of the original. In
     addition, a meaningful pattern of association is imposed between
     the original and the reclassified variable. 'reclassify' is called
     by 'perturb' to calculate reclassification probabilities for
     categorical variables. 'reclassify' can also be used separately to
     find suitable reclassification probabilities.

     'reclassify' has several options, but the most relevant will
     generally be the 'pcnt' option. The argument for 'pcnt' can be:


        *  a scalar

        *  a vector of length n

        *  a vector of length n^2, where n is the number of categories
           of the variable to be reclassified.

     If the argument for 'pcnt' is a scalar, its value is taken to be
     the percentage of cases to be reclassified to the same category,
     which is the same for all categories. A table of initial
     reclassification probabilities for the original by the
     reclassified variable is created with this value divided by 100 on
     the diagonal and equal values on off-diagonal cells.
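
     As a rough illustration of the scalar case, the following base R
     sketch builds such an initial table by hand. It does not call
     'reclassify' and is not its internal code; the names 'n', 'pcnt'
     and 'init' are illustrative, and it assumes that the off-diagonal
     cells share the remaining probability equally so that each row
     sums to 1.

```r
# Illustrative sketch only; not the internal code of reclassify()
n    <- 3      # hypothetical number of categories
pcnt <- 95     # percentage reclassified to the same category

# Assumption: the remaining probability is split equally over the
# off-diagonal cells, so that each row sums to 1
init <- matrix((1 - pcnt / 100) / (n - 1), nrow = n, ncol = n)
diag(init) <- pcnt / 100

print(init)
```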

     If the argument for 'pcnt' is a vector of length n, its values
     indicate the percentage to be reclassified to the same category
     for each category separately. These values divided by 100 form the
     diagonal of the table of initial reclassification probabilities.
     Off-diagonal cells have the same values for rows so that the row
     sum is equal to 1.
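
     The vector case can be sketched in the same way. The percentages
     below are made up for illustration, and the code is a hand-built
     approximation of the description above rather than the code used
     by 'reclassify'.

```r
# Illustrative sketch only; not the internal code of reclassify()
pcnt   <- c(95, 90, 80)   # hypothetical percentage per category
n      <- length(pcnt)
p.same <- pcnt / 100

# matrix() fills column by column, so row i is constant and holds the
# equal share of 1 - p.same[i] for each off-diagonal cell
init <- matrix((1 - p.same) / (n - 1), nrow = n, ncol = n)
diag(init) <- p.same

print(init)
print(rowSums(init))      # each row sums to 1
```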

     If the argument for 'pcnt' is a vector of length n^2, its values
     form the table of initial reclassification probabilities.
     'prop.table' is used to ensure that these values sum to 1 over the
     columns. Specifying a complete table of initial reclassification
     probabilities will be primarily useful when an ordered variable is
     being reclassified.
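
     A complete table for an ordered variable might be prepared as in
     the sketch below. The percentages are invented, and both the
     row-by-row layout and the normalisation within rows are
     assumptions made for this illustration; check the printed initial
     table from 'reclassify' to see how a given vector is interpreted.

```r
# Illustrative sketch only; layout and margin are assumptions
# Hypothetical percentages, written row by row (original category in
# rows, reclassified category in columns); adjacent categories get
# more weight than distant ones
pcnt <- c(90,  8,  2,
           5, 90,  5,
           2,  8, 90)
n <- sqrt(length(pcnt))

# prop.table() with margin = 1 rescales each row to sum to 1
init <- prop.table(matrix(pcnt, nrow = n, byrow = TRUE), margin = 1)

print(init)
```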

     'Reclassify' prints an initial table of reclassification
     probabilities based on the 'pcnt' option. This table is not used
     directly but is _adjusted_ to make the expected frequencies of
     the reclassified variable identical to those of the original. In
     addition, a meaningful pattern of association is imposed between
     the original and the reclassified variable. Details are given in
     the section _Adjusting the reclassification probabilities_.

     Knowledgeable users can specify a suitable pattern of association
     directly, bypassing the 'pcnt' option. Details are given in the
     section _Specifying a pattern of association directly_.

_V_a_l_u_e:

     An object of class 'reclassify'. By default, 'print.reclassify'
     prints the variable name and the 'reclass.prob'. If the 'full'
     option is used with 'print.reclassify', additional information
     such as the initial reclassification probabilities, the initial
     expected table and the best model is printed as well.

variable: The variable specified

reclass.prob: Row-wise proportions of 'fitted.table'

cum.reclass.prob: Cumulative row-wise proportions

exptab$init.pcnt: initial reclassification probabilities  (option
          'pcnt')

exptab$init.tbl: initial expected frequencies (option 'pcnt')

 bestmod: The best model found for the table of initial expected
          frequencies (option 'pcnt')

   assoc: The log pattern of association specified using 'pcnt' and
          'bestmod=FALSE'

    coef: The coefficients of a fitted loglinear model

fitted.table: The adjusted table of expected frequencies

_A_d_j_u_s_t_i_n_g _t_h_e _r_e_c_l_a_s_s_i_f_i_c_a_t_i_o_n _p_r_o_b_a_b_i_l_i_t_i_e_s:

     A problem with the initial reclassification probabilities created
     using 'pcnt' is that the expected frequencies of the reclassified
     variable will not be the same as those of the original. Smaller
     categories will become larger in the expected frequencies and
     larger categories will become smaller. This can be seen in the
     column marginal of the initial table of expected frequencies in
     the 'reclassify' output. It could have a strong impact on the
     standard errors of reclassified variables, particularly when
     categories differ strongly in size.
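
     The shift in the column margin can be reproduced by hand. The
     sketch below uses made-up category counts and a scalar 'pcnt' of
     95; the object names are illustrative and the code is independent
     of 'reclassify'.

```r
# Illustrative sketch only; counts are hypothetical
freq <- c(60, 30, 10)              # original category frequencies
n    <- length(freq)

# initial reclassification probabilities for a scalar pcnt of 95
init <- matrix(0.025, nrow = n, ncol = n)
diag(init) <- 0.95

# initial expected table: row i of init multiplied by freq[i]
exp.tbl <- freq * init

print(rowSums(exp.tbl))   # 60, 30, 10: the original distribution
print(colSums(exp.tbl))   # 58, 30.25, 11.75: the reclassified one
```

     The largest category shrinks from 60 to 58 and the smallest grows
     from 10 to 11.75, which is the effect described above.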

     To avoid this, the initial expected table is _adjusted_ so that
     the column margin is the same as the row margin, i.e. the expected
     frequencies of the reclassified variable are the same as those of
     the original. Use 'adjust=FALSE' to skip this step. In that case
     the initial reclassification probabilities are also the final
     reclassification probabilities.

     A second objection to the initial reclassification probabilities
     is that the pattern of association between the original and the
     reclassified variable is arbitrary. The association between some
     combinations of categories is higher than for others. 'reclassify'
     therefore derives an appropriate pattern of association for the
     initial expected table of the original by reclassified variable.
     This pattern of association is used when adjusting the marginals
     to make the frequency distribution of the reclassified variable
     identical to that of the original. Use the option 'bestmod=FALSE'
     to skip this step.

     The patterns of association used by 'reclassify' are drawn from
     loglinear models for square tables, also known as mobility
     models (Goodman 1984, Hout 1983). Many texts on loglinear
     modelling contain a brief discussion of such models as well. For
     unordered variables, a quasi-independent pattern of association
     would be appropriate. Under quasi-independent association, the row
     variable is independent of the column variable if the diagonal
     cells are ignored.

     If the argument for 'pcnt' was a scalar, 'reclassify' fits a
     quasi-independent (constrained) model. This model has a single
     parameter 'diag' which indicates the log-odds of same versus
     different reclassification. This log-odds is the same for all
     categories. If the argument was a vector of length n, then a
     regular quasi-independence model is fitted with parameters 'diag1'
     to 'diag_n_'. These parameters indicate the log-odds of same
     versus different category reclassification, which is different for
     each category. For both models, the reclassified category is
     independent of the original category if the diagonal cells are
     ignored.

     If the argument for 'pcnt' was a vector of length n^2,
     'reclassify' fits two models, a quasi-distance model and a
     quasi-uniform association model, and selects the one with the
     best fit to the initial expected table. Both have the 'diag'
     parameter of the quasi-independence (constrained) model. An
     additional parameter is added to make short distance
     reclassification more likely than long distance reclassification.
     The quasi-uniform model is stricter: it makes reclassification
     less likely proportionately to the squared difference between the
     two categories. The distance model makes reclassification less
     likely proportionately to the absolute difference between the two
     categories.
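
     The two distance patterns can be written down directly in base R.
     This is only a sketch of the design patterns described above for
     a hypothetical variable with four ordered categories; the sign
     and size of the parameter that 'reclassify' estimates for them
     are not shown.

```r
# Illustrative sketch only; n is a hypothetical number of categories
n   <- 4
wrk <- diag(n)                     # any n x n matrix will do here

# absolute difference between row and column category (distance)
dist.pattern <- abs(row(wrk) - col(wrk))

# squared difference between row and column category (uniform)
unif.pattern <- (row(wrk) - col(wrk))^2

print(dist.pattern)
print(unif.pattern)
```

     For categories three steps apart the distance pattern has the
     value 3 and the squared pattern the value 9, which is why the
     second pattern penalises long distance reclassification more
     heavily.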

     In some cases, the initial expected table based on the 'pcnt'
     option contains empty cells. To avoid problems when estimating the
     best model for this table, a value of .1 is added to these cells.
     Use the 'min.val' option to specify a different value.
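
     The treatment of empty cells amounts to something like the
     following sketch. The table is invented and the object names are
     illustrative; only the idea of adding 'min.val' to the empty
     cells is taken from the description above.

```r
# Illustrative sketch only; the table below is hypothetical
init.tbl <- matrix(c(57,  3, 0,
                      2, 26, 2,
                      0,  1, 9), nrow = 3, byrow = TRUE)
min.val <- 0.1

# add min.val to the empty cells, leaving all other cells unchanged
work.tbl <- init.tbl
work.tbl[work.tbl == 0] <- min.val

print(work.tbl)
```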

_S_p_e_c_i_f_y_i_n_g _a _p_a_t_t_e_r_n _o_f _a_s_s_o_c_i_a_t_i_o_n _d_i_r_e_c_t_l_y:

     If the 'pcnt' option is used, 'reclassify' automatically
     determines a suitable pattern of association between the original
     and the reclassified variable. Knowledgeable users can also
     specify a pattern of association directly. The final
     reclassification probabilities will then be based on these values.
     Built-in options for specifying the loglinear parameters of
     selected mobility models are:


     _d_i_a_g quasi-independence constrained  (same versus different
          category reclassification)

     _u_n_i_f uniform association (long versus short distance
          reclassification for ordered categories)

     _d_i_s_t linear distance model (allows more long distance
          reclassification than uniform association)

     The 'assoc' option can be used to specify an association pattern
     of one's own choice. The elements of 'assoc' should refer to
     matrices with an appropriate loglinear pattern of association.
     Such matrices can be created in many ways. An efficient method is:

          wrk <- diag(table(_factor_))
          myassoc <- abs(row(wrk) - col(wrk)) * -log(5)

     This creates a square diagonal matrix called 'wrk' with as many
     rows and columns as _factor_ has levels. 'row(wrk)' and
     'col(wrk)' can now be used to define a loglinear pattern of
     association, in this case a distance model with parameter 5.
     'reclassify' checks that the length of the matrix equals n^2,
     where _n_ is the number of categories of 'varname', and ensures
     that the pattern of association is symmetric.
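
     A concrete version of this recipe, using a small made-up factor
     'f' in place of _factor_, might look as follows. The final call
     to 'reclassify' is shown as a comment because it requires the
     perturb package to be loaded.

```r
# Illustrative sketch only; 'f' is a hypothetical ordered variable
f <- factor(c("low", "mid", "mid", "high", "low", "mid"),
            levels = c("low", "mid", "high"))

# diagonal matrix with one row and column per level of f
wrk <- diag(table(f))

# distance pattern: each step away from the diagonal adds -log(5)
myassoc <- abs(row(wrk) - col(wrk)) * -log(5)

print(dim(myassoc))           # one row and column per level of f
print(isSymmetric(myassoc))   # the pattern is symmetric

# With perturb loaded, the pattern could then be passed on:
# reclassify(f, assoc = myassoc)
```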

_I_m_p_o_s_i_n_g _a _p_a_t_t_e_r_n _o_f _a_s_s_o_c_i_a_t_i_o_n:

     A table with given margins and a given pattern of association can
     be created by

        *  estimating a loglinear model of independence for a table with
           the desired margins

        *  while specifying the log pattern of association as an offset
           variable (cf. Kaufman & Schervish (1986), Hendrickx (2004)).

     The body of the table is unimportant as long as it has the
     appropriate margins. The predicted values of the model form a
     table with the desired properties.

     The expected table of the original by the reclassified variable
     is adjusted by creating a table with the frequency distribution
     of the original variable on the diagonal cells. This table then
     has the same marginals for the row and column variables. The
     pattern of association is determined by the 'reclassify' options.
     If 'pcnt' is used and 'bestmod=TRUE', then the predicted values
     of the best model are used as the offset variable. If
     'bestmod=FALSE', the log values of the initial expected table are
     made symmetric and used as the offset variable. If a loglinear
     model was specified directly, a variable is created in the manner
     of the 'assoc' example.

     A small modification in procedure is that 'reclassify' uses a
     model of equal main effects rather than independence. Since the
     pattern of association is always symmetric, the created table
     will then also be exactly symmetric with the frequency
     distribution of the original variable as row and column
     marginal.
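
     The general technique can be sketched with 'glm' from base R.
     The example below is a minimal illustration under stated
     assumptions, not the code used by 'reclassify': it uses made-up
     frequencies, a quasi-independence (constrained) pattern with an
     arbitrary log-odds of log(20), and a plain independence model
     rather than the model of equal main effects mentioned above. All
     object names are illustrative.

```r
# Illustrative sketch only; not the internal code of reclassify()
freq <- c(60, 30, 10)           # hypothetical original frequencies
n    <- length(freq)

# table with the desired margins: frequencies on the diagonal
target <- diag(freq)

# log pattern of association: log-odds of log(20) on the diagonal
logassoc <- diag(log(20), n)

dat <- data.frame(
  count = as.vector(target),
  orig  = factor(as.vector(row(target))),   # original category
  recl  = factor(as.vector(col(target))),   # reclassified category
  off   = as.vector(logassoc)
)

# independence model with the association pattern as an offset
fit <- glm(count ~ orig + recl + offset(off),
           family = poisson, data = dat)

# predicted values: desired margins plus the imposed association
fitted.tbl <- matrix(fitted(fit), nrow = n, ncol = n)
print(round(rowSums(fitted.tbl), 3))
print(round(colSums(fitted.tbl), 3))

# row-wise proportions give reclassification probabilities
reclass.prob <- prop.table(fitted.tbl, margin = 1)
print(round(reclass.prob, 3))
```

     Both margins of the fitted table reproduce the original
     frequencies, because a Poisson loglinear model with row and
     column main effects fits the observed margins exactly.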

_C_h_a_n_g_e_s _f_r_o_m _v_e_r_s_i_o_n _1:

     Version 1 was not made available from CRAN and so I felt justified
     in not making version 2 entirely backward compatible. The
     'misclass' option has been dropped; replace 'misclass=5' by
     'pcnt=95,adjust=FALSE'. Replace 'q=2' by 'diag=log(2)' and 'u=3'
     by 'unif=log(3)'.

_A_u_t_h_o_r(_s):

     John Hendrickx John_Hendrickx@yahoo.com

_R_e_f_e_r_e_n_c_e_s:

     Goodman, L.A. (1984). The analysis of cross-classified data
     having ordered categories. Cambridge, Mass.: Harvard University
     Press.

     Hendrickx, J. (2004). Using standardised tables for interpreting
     loglinear models. Quality & Quantity 38: 603-620.

     Hendrickx, J., & Pelzer, B. (2004). Collinearity involving
     ordered and unordered categorical variables. Paper presented at
     the RC33 conference in Amsterdam, August 17-20 2004. Available at
     <URL: http://www.xs4all.nl/~jhckx/perturb/>

     Hout, M. (1983). Mobility tables. Beverly Hills: Sage
     Publications.

     Kaufman, R.L., & Schervish, P.G. (1986). Using adjusted
     crosstabulations to interpret log-linear relationships. American
     Sociological Review 51: 717-733.

_S_e_e _A_l_s_o:

     'perturb', 'colldiag', 'vif' (package 'car'), 'vif' (package
     'Design')

_E_x_a_m_p_l_e_s:

     library(perturb)
     library(car)
     data(Duncan)
     attach(Duncan)

     # keep 95 percent of the cases of 'type' in their original category
     reclassify(type, pcnt = 95)

