whatif                package:WhatIf                R Documentation

_C_o_u_n_t_e_r_f_a_c_t_u_a_l _E_v_a_l_u_a_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     Implements the methods described in King and Zeng (2006a, 2006b)
     for evaluating counterfactuals.

_U_s_a_g_e:

     whatif(formula = NULL, data, cfact, range = NULL, freq = NULL, nearby = NULL, 
     miss = "list", return.inputs = FALSE, ...)

_A_r_g_u_m_e_n_t_s:

 formula: An optional formula without a dependent variable that is of
          class 'formula' and that follows standard 'R' conventions for
          formulas, e.g. ~ x1 + x2.  Allows you to transform or
          otherwise re-specify combinations of the variables in both
          'data' and 'cfact'.  To use this parameter, both 'data' and
          'cfact' must be coercible to data frames; the variables of
          both 'data' and 'cfact' must be labeled; and all variables
          appearing in 'formula' must also appear in both 'data' and
          'cfact'.  Otherwise, errors are returned.  The intercept is
          automatically dropped.  Default is NULL.
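
          As a rough illustration of what a one-sided formula does to
          the covariates, the base-R sketch below expands
          ~ x1 + I(x2^2) with 'model.matrix' and drops the intercept,
          applying the same formula to both data sets.  It does not
          call 'whatif'; the data frames 'obs' and 'cf' and the helper
          'expand_covariates' are hypothetical names, and the package's
          internal processing may differ.

```r
# Hypothetical observed data and counterfactuals with labeled columns
obs <- data.frame(x1 = c(1, 2, 3, 4),
                  x2 = c(0.5, 1.5, 2.5, 3.5),
                  x3 = c(9, 8, 7, 6))
cf  <- data.frame(x1 = c(2.5, 10), x2 = c(1, 4), x3 = c(8, 0))

# Illustrative helper (not part of WhatIf): expand a one-sided formula
# into a numeric matrix and drop the intercept column
expand_covariates <- function(formula, df) {
  mm <- model.matrix(formula, data = df)
  mm[, colnames(mm) != "(Intercept)", drop = FALSE]
}

f <- ~ x1 + I(x2^2)
obs_mat <- expand_covariates(f, obs)  # 4-by-2, x3 is not selected
cf_mat  <- expand_covariates(f, cf)   # 2-by-2, same columns as obs_mat
```

          Both matrices end up with the same two columns in the same
          order, which is the property the 'formula' argument is meant
          to guarantee.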

    data: May take one of the following forms:  

             1.  An R model output object, such as the output from calls
                to 'lm', 'glm', and 'zelig'.  Such an output object
                must be a list.  It must additionally have either a
                formula or terms component and either a data or model
                component; if it does not, an error is returned.  Of
                the latter, 'whatif' first looks for 'data', which
                should contain either the original data set supplied as
                part of the model call (as in 'glm') or the name of
                this data set (as in 'zelig'), which is assumed to
                reside in the global environment.  If 'data' does not
                exist, 'whatif' then looks for 'model', which should
                contain the model frame (as in 'lm').  The intercept is
                automatically dropped from the extracted _observed
                covariate data_ set if the original model included one.                  

             2.  An n-by-k non-character (logical or numeric) matrix or
                data frame of _observed covariate data_ with n data
                points or units and k covariates.  All desired variable
                transformations and interaction terms should be
                included in this set of k covariates unless 'formula'
                is alternatively used to produce them.  However, an
                intercept should not be.  Such a matrix may be obtained
                by passing model output (e.g., output from a call to
                'lm') to 'model.matrix' and excluding the intercept
                from the resulting matrix if one was fit.  Note that
                'whatif' will attempt to coerce data frames to their
                internal numeric values.  Hence, data frames should
                only contain logical, numeric, and factor columns;
                character columns will lead to an error being returned.

             3.  A string.  Either the complete path (including file
                name) of the file containing the data or the path
                relative to your working directory.  This file should
                be a white space delimited text file. If it contains a
                header, you must include a column of row names as
                discussed in the help file for the 'R' function
                'read.table'.  The data in the file should be as
                otherwise described in (2).
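
          The route described in (2), from model output to a covariate
          matrix without an intercept, can be sketched in base R as
          follows.  The built-in 'mtcars' data set and the model
          specification are chosen only for illustration.  The call to
          'data.matrix' shows one way factor columns map to internal
          numeric codes; whether 'whatif' uses this exact function is
          an assumption.

```r
# Built-in data; 'am' is turned into a factor to show the coercion
cars <- mtcars[, c("mpg", "wt", "hp", "am")]
cars$am <- factor(cars$am, labels = c("automatic", "manual"))
fit <- lm(mpg ~ wt + hp + am, data = cars)

# Form (2) via model.matrix: design matrix without the intercept column
X <- model.matrix(fit)
X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

# data.matrix replaces factor columns by their internal integer codes
codes <- data.matrix(cars[, c("wt", "hp", "am")])
```

          Note the difference between the two results: 'model.matrix'
          expands the factor into a 0/1 dummy column, whereas
          'data.matrix' keeps a single column of integer codes (1, 2).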

   cfact: An 'R' object or a string.  If an 'R' object, an m-by-k
          non-character matrix or data frame of _counterfactuals_ with
          m counterfactuals and the same k covariates (in the same
          order) as in 'data'.  However, if 'formula' is used to select
          a subset of the k covariates, then 'cfact' may contain either
          only these j <= k covariates or the complete set of k
          covariates.  An intercept should not be included as one of
          the covariates.  It will be automatically dropped from the
          counterfactuals generated by 'Zelig' if the original model
          contained one.  Data frames will again be coerced to their
          internal numeric values if possible. If a string, either the
          complete path (including file name) of the file containing
          the counterfactuals or the path relative to your working
          directory.  This file should be a white space delimited text
          file.  See the discussion under 'data' for instructions on
          dealing with a header.  All counterfactuals should be fully
          observed: if you supply counterfactuals with missing data,
          they will be list-wise deleted and a warning message will be
          printed to the screen.

   range: An optional numeric vector of length k, where k is the
          number of covariates.  Each element represents the range of
          the corresponding covariate for use in calculating Gower
          distances.  Use this argument when the covariate data do not
          represent the population of interest, for example because of
          selection by stratification or experimental manipulation.  By
          default, the range of each covariate is calculated from the
          data (the difference between its maximum and minimum values
          in the sample), which is appropriate when a simple random
          sampling design was used.  To supply your own range for a
          particular covariate, set the corresponding element of the
          vector equal to the desired range and all other elements
          equal to NA.  Default is NULL.
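
          The NA convention can be pictured with a short base-R sketch:
          elements left as NA fall back to the sample range, while
          supplied elements override it.  The matrix 'X' and the vector
          'user_range' are hypothetical, and this is only a guess at
          the logic, not the package's code.

```r
# Hypothetical observed covariates: 3 units, 3 covariates
X <- cbind(age = c(20, 35, 50), income = c(10, 40, 25), treated = c(0, 1, 1))

# User-supplied ranges: override only the second covariate
user_range <- c(NA, 100, NA)

# Sample ranges (maximum minus minimum), used wherever the user left NA
sample_range <- apply(X, 2, function(col) diff(range(col)))
r <- ifelse(is.na(user_range), sample_range, user_range)
```

          Here the resulting ranges are 30 for 'age' (from the sample),
          100 for 'income' (user-supplied), and 1 for 'treated' (from
          the sample).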

    freq: An optional numeric vector of any positive length, the
          elements of which comprise a set of Gower distances.  Used in
          calculating cumulative frequency distributions for the
          distances of the data points from each counterfactual.  For
          each such Gower distance and counterfactual, the cumulative
          frequency is the fraction of observed covariate data points
          with Gower distance to the counterfactual less than or equal
          to the supplied Gower distance value.  By default,
          frequencies are calculated for the sequence of Gower
          distances from 0 to 1 in increments of 0.05.  Default is
          NULL.
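
          A minimal sketch of the cumulative frequency calculation
          described above, assuming the Gower distances are already
          available as a matrix.  The distance values are made up for
          illustration and the object names are hypothetical.

```r
# Hypothetical Gower distances: 2 counterfactuals (rows) by 4 data points
dist <- rbind(c(0.02, 0.12, 0.33, 0.81),
              c(0.41, 0.47, 0.62, 0.97))

# Default grid described above: 0 to 1 in steps of 0.05 (21 values)
freq <- seq(0, 1, by = 0.05)

# For each counterfactual and cutoff, the share of data points whose
# distance is at or below the cutoff
cum_freq <- t(apply(dist, 1, function(d) {
  sapply(freq, function(cutoff) mean(d <= cutoff))
}))
colnames(cum_freq) <- freq
```

          Each row of 'cum_freq' is non-decreasing, starting at the
          share of data points at distance 0 and ending at 1 once the
          cutoff reaches the largest distance.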

  nearby: An optional scalar; the cutoff Gower distance value
          indicating which observed data points are considered to be
          nearby the counterfactuals.  Used to calculate the summary
          statistic returned by the function: the fraction of the
          observed data nearby each counterfactual.  By default, the
          geometric variance of the covariate data is used.  For
          example, setting 'nearby' to 0.11 will identify the
          proportion of data points within 0.11 of a counterfactual. 
          Default is NULL.
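
          The 0.11 example can be sketched in base R as below, again
          with made-up distances.  Treating "within 0.11" as less than
          or equal to the cutoff is an assumption of this sketch.

```r
# Hypothetical Gower distances: 2 counterfactuals (rows) by 6 data points
dist <- rbind(c(0.03, 0.08, 0.12, 0.20, 0.55, 0.09),
              c(0.30, 0.42, 0.15, 0.61, 0.70, 0.25))

# Cutoff supplied by the user, as in the example above
nearby <- 0.11

# Fraction of data points within the cutoff, one value per counterfactual
frac_nearby <- rowMeans(dist <= nearby)
```

          The first counterfactual has three of six data points within
          the cutoff; the second has none.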

    miss: An optional string indicating the strategy for dealing with
          missing data in the observed covariate data set. 'whatif'
          supports two possible missing data strategies: "list",
          list-wise deletion of missing cases; and "case", ignoring
          missing data case-by-case.  Note that if "case" is selected,
          cases with missing values are deleted listwise for the convex
          hull test but simply ignored in computing the Gower
          distances.  Default is "list".
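
          The difference between the two strategies can be sketched as
          follows.  The data are hypothetical, and averaging over only
          the observed covariates under "case" is an assumption about
          how missing values are "ignored"; the package may normalize
          differently.

```r
# Hypothetical observed data with one missing value, and one counterfactual
X  <- rbind(c(1, 10, 0),
            c(2, NA, 1),
            c(4, 30, 1))
cf <- c(2, 20, 1)
r  <- c(3, 20, 1)  # assumed covariate ranges

# "list": drop every row with a missing value before computing anything
X_list <- X[complete.cases(X), , drop = FALSE]

# "case": keep all rows; average the range-scaled absolute differences
# over the covariates that are actually observed in each row
gower_case <- apply(X, 1, function(x) mean(abs(x - cf) / r, na.rm = TRUE))
```

          Under "list" only two rows remain, whereas under "case" all
          three rows receive a distance, with the second row's distance
          based on two covariates rather than three.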

return.inputs: A Boolean; should the processed observed covariate and
          counterfactual data matrices on which all 'whatif'
          computations are performed be returned?  Processing refers to
          internal 'whatif' operations such as the subsetting of
          covariates via 'formula', the deletion of cases with missing
          values, and the coercion of data frames to numeric matrices.
          Primarily intended for diagnostic purposes.  If TRUE, these
          matrices are returned as a list.  Default is FALSE.

     ...: Further arguments passed to and from other methods.

_D_e_t_a_i_l_s:

     This function is the primary tool for evaluating your
     counterfactuals.   Specifically, it:

        1.  Determines whether or not your counterfactuals are in the
           convex hull of the observed covariate data.  

        2.  Computes the distance of your counterfactuals from each of
           the n observed covariate data points.  The distance function
           used is Gower's nonparametric measure.

        3.  Computes a summary statistic for each counterfactual based
           on the distances in (2): the fraction of observed covariate
           data points with Gower distances to your counterfactual less
           than a value you supply.  By default, this value is taken to
           be the geometric variance of the observed data.

        4.  Computes the cumulative frequency distribution of each
           counterfactual for the distances in (2) using Gower
           distances you supply.  By default, Gower distances from 0 to
           1 in increments of 0.05 are used.
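
     Step (2) can be sketched in base R, taking the Gower distance
     between two points to be the average over covariates of the
     absolute difference divided by the covariate's range.  The helper
     'gower_matrix' is a hypothetical name and is not the package's
     implementation; it assumes complete data and covariates with
     nonzero range.

```r
# Illustrative helper: Gower distance between each counterfactual
# (rows of cf) and each data point (rows of X); r holds covariate ranges
gower_matrix <- function(cf, X,
                         r = apply(X, 2, function(col) diff(range(col)))) {
  out <- matrix(NA_real_, nrow = nrow(cf), ncol = nrow(X))
  for (i in seq_len(nrow(cf))) {
    # Absolute differences to counterfactual i, scaled column-wise by range
    scaled <- sweep(abs(sweep(X, 2, cf[i, ])), 2, r, "/")
    out[i, ] <- rowMeans(scaled)  # average over the k covariates
  }
  out
}

set.seed(1)
X  <- matrix(rnorm(100 * 5), ncol = 5)  # 100 observed data points
cf <- matrix(rnorm(3 * 5), ncol = 5)    # 3 counterfactuals
G  <- gower_matrix(cf, X)               # 3-by-100 distance matrix
```

     The result has one row per counterfactual and one column per data
     point, matching the layout of the 'gowers.dist' element described
     under Value.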

_V_a_l_u_e:

     An object of class "whatif", a list consisting of the following 
     six or seven elements: 

    call: The original call to 'whatif'.

  inputs: A list with two elements, 'data' and 'cfact'.  Only present
          if 'return.inputs' was set equal to TRUE in the call to
          'whatif'.  The first element is the processed observed
          covariate data matrix on which all 'whatif' computations were
          performed.  The second element is the processed
          counterfactual data matrix.

 in.hull: A logical vector of length m, where m is the number of
          counterfactuals.  Each element of the vector is TRUE if the
          corresponding counterfactual is in the convex hull and FALSE
          otherwise.

gowers.dist: An m-by-n numeric matrix, where m is the number of
          counterfactuals and n is the number of data points (units).
          The [i, j]th entry of the matrix contains the Gower distance
          between the ith counterfactual and the jth data point.

geom.var: A scalar.  The geometric variance of the observed covariate
          data.

sum.stat: A numeric vector of length m, where m is the number of
          counterfactuals.  The ith element contains the summary
          statistic for the ith counterfactual.  This summary statistic
          is the fraction of data points with Gower distances to the
          counterfactual less than the argument 'nearby', which by
          default is the geometric variance of the covariates.

cum.freq: A numeric matrix.  By default, the matrix has dimension
          m-by-21, where m is the number of counterfactuals; however,
          if you supplied your own frequencies via the argument 'freq',
          the matrix has dimension m-by-f, where f is the length of
          'freq'.  Each row of the matrix contains the cumulative
          frequency distribution for the corresponding counterfactual
          calculated using either the default set of Gower distance
          values or the set you supplied (see the discussion under the
          argument 'freq').  Hence, the [i, j]th entry of the matrix is
          the fraction of data points with Gower distances to the ith
          counterfactual less than or equal to the value represented by
          the jth column.  The column names contain these values.

_N_o_t_e:

     This function requires the 'lpSolve' package.

_A_u_t_h_o_r(_s):

     Stoll, Heather hstoll@polsci.ucsb.edu, King, Gary king@harvard.edu
     and Zeng, Langche zeng@ucsd.edu

_R_e_f_e_r_e_n_c_e_s:

     King, Gary and Langche Zeng.  2006a.  "The Dangers of Extreme
     Counterfactuals."  _Political Analysis,_ forthcoming. Preprint
     available from <URL: http://gking.harvard.edu>.

     King, Gary and Langche Zeng.  2006b.  "When Can History Be Our
     Guide? The Pitfalls of Counterfactual Inference."  _International
     Studies Quarterly,_ forthcoming.  Preprint available from <URL:
     http://gking.harvard.edu>.

_S_e_e _A_l_s_o:

     'plot.whatif', 'summary.whatif', 'print.whatif',
     'print.summary.whatif'

_E_x_a_m_p_l_e_s:

     ##  Create example data sets and counterfactuals
     my.cfact <- matrix(rnorm(3*5), ncol = 5)
     my.data <- matrix(rnorm(100*5), ncol = 5)

     ##  Evaluate counterfactuals
     my.result <- whatif(data = my.data, cfact = my.cfact)

     ##  Evaluate counterfactuals and supply own Gower distances for 
     ##  cumulative frequency distributions
     my.result <- whatif(cfact = my.cfact, data = my.data, freq = c(0, .25, .5, 1, 1.25, 1.5))

