transcan                package:Hmisc                R Documentation

_T_r_a_n_s_f_o_r_m_a_t_i_o_n_s/_I_m_p_u_t_a_t_i_o_n_s _u_s_i_n_g _C_a_n_o_n_i_c_a_l _V_a_r_i_a_t_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     'transcan' is a nonlinear additive transformation and imputation
     function, and there are several functions for using and operating
     on its results.  'transcan' automatically transforms continuous
     and categorical variables to have maximum correlation with the
     best linear combination of the other variables.  There is also an
     option to use a substitute criterion - maximum correlation with
     the first principal component of the other variables.  Continuous
     variables are expanded as restricted cubic splines and categorical
     variables are expanded as contrasts (e.g., dummy variables).  By
     default, the first canonical variate is used to find optimum
     linear combinations of component columns.  This function is
     similar to 'ace' except that transformations for continuous
     variables are fitted using restricted cubic splines, monotonicity
     restrictions are not allowed, and NAs are allowed.  When a
     variable has any NAs, transformed scores for that variable are
     imputed using least squares multiple regression incorporating
     optimum transformations, or NAs are optionally set to constants. 
     Shrinkage can be used to safeguard against overfitting when
     imputing.  Optionally, imputed values on the original scale are
     also computed and returned.  For this purpose, recursive
     partitioning or multinomial logistic models can optionally be used
     to impute categorical variables, using what is predicted to be the
     most probable category.
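
     A minimal sketch of this default single-imputation behavior,
     assuming simulated data (the data frame 'd' and the variables
     'age', 'sbp', and 'race' are hypothetical, not from this page):

     ```r
     # Hypothetical data: 'age' has 5 missing values; 'race' is categorical
     library(Hmisc)
     set.seed(1)
     d <- data.frame(age  = c(rnorm(95, 50, 10), rep(NA, 5)),
                     sbp  = rnorm(100, 120, 15),
                     race = factor(sample(c('a','b','c'), 100, TRUE)))
     # Formula interface: 'race' is detected as categorical automatically
     f <- transcan(~ age + sbp + race, data=d, imputed=TRUE,
                   pl=FALSE, pr=FALSE)
     f$imputed$age   # "best guess" imputations for the 5 missing ages
     ```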

     By default, 'transcan' imputes NAs with "best guess" expected
     values of transformed variables, back transformed to the original
     scale. Values thus imputed are most like conditional medians
     assuming the transformations make variables' distributions
     symmetric (imputed values are similar to conditional modes for
     categorical variables).  By instead specifying 'n.impute',
     'transcan' does approximate multiple imputation from the
     distribution of each variable conditional on all other variables. 
     This is done by sampling 'n.impute' residuals from the transformed
     variable, with replacement (a la bootstrapping), or by default,
     using Rubin's approximate Bayesian bootstrap, where a sample of
     size n with replacement is selected from the residuals on n
     non-missing values of the target variable, and then a sample of
     size m with replacement is chosen from this sample, where m is the
     number of missing values needing imputation for the current
     multiple imputation repetition.  Neither of these bootstrap
     procedures assumes normality or even symmetry of residuals.  For
     sometimes-missing categorical variables, optimal scores are
     computed by adding the "best guess" predicted mean score to random
     residuals off this score.  Then categories having scores closest
     to these predicted scores are taken as the random multiple
     imputations ('impcat="tree"' or '"rpart"' are not currently
     allowed with 'n.impute').  The literature recommends using
     'n.impute=5' or greater. 'transcan' provides only an approximation
     to multiple imputation, especially since it "freezes" the
     imputation model before drawing the multiple imputations rather
     than using different estimates of regression coefficients for each
     imputation.  For multiple imputation, the 'aregImpute' function
     provides a much better approximation to the full Bayesian approach
     while still not requiring linearity assumptions.

     When you specify 'n.impute' to 'transcan' you can use
     'fit.mult.impute' to re-fit any model 'n.impute' times based on
     'n.impute' completed datasets (if there are any sometimes missing
     variables not specified to 'transcan', some observations will
     still be dropped from these fits).  After fitting 'n.impute'
     models, 'fit.mult.impute' will return the fit object from the last
     imputation, with 'coefficients' replaced by the average of the
     'n.impute' coefficient vectors and with a component 'var' equal to
     the imputation-corrected variance-covariance matrix. 
     'fit.mult.impute' can also use the object created by the 'mice'
     function in the MICE library to draw the multiple imputations, as
     well as objects created by 'aregImpute'.
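
     A sketch of this multiple-imputation workflow, under simulated
     data (the names 'y', 'x1', and 'x2' are hypothetical):

     ```r
     library(Hmisc)
     set.seed(2)
     d <- data.frame(y  = rnorm(100),
                     x1 = c(rnorm(90), rep(NA, 10)),
                     x2 = rnorm(100))
     # Five approximate multiple imputations of x1 given x2
     f <- transcan(~ x1 + x2, data=d, n.impute=5, imputed=TRUE,
                   pl=FALSE, pr=FALSE)
     # Fit the model once per completed dataset; coefficients are
     # averaged, and g$var is the imputation-corrected covariance matrix
     g <- fit.mult.impute(y ~ x1 + x2, lm, xtrans=f, data=d, pr=FALSE)
     sqrt(diag(g$var))   # imputation-corrected standard errors
     ```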

     The 'summary' method for 'transcan' prints the function call,
     R-squares achieved in transforming each variable, and for each
     variable the coefficients of all other transformed variables that
     are used to estimate the transformation of the initial variable. 
     If 'imputed=TRUE' was used in the call to 'transcan', 'summary'
     also uses the 'describe' function to print a summary of imputed
     values.  If
     'long=TRUE', also prints all imputed values with observation
     identifiers.  There is also a simple function 'print.transcan'
     which merely prints the transformation matrix and the function
     call.  It has an optional argument 'long', which if set to 'TRUE'
     causes detailed parameters to be printed.  Instead of plotting
     while 'transcan()' is running, you can plot the final
     transformations after the fact using 'plot.transcan', if the
     option 'trantab=TRUE' was specified to 'transcan'.  If in addition
     the option 'imputed=TRUE' was specified to 'transcan',
     'plot.transcan' will show the location of imputed values
     (including multiples) along the axes.

     'impute' does imputations for a selected original data variable,
     on the original scale (if 'imputed=TRUE' was given to 'transcan').
      If you do not specify a variable to 'impute', it will do
     imputations for all variables given to 'transcan' which had at
     least one missing value.  This assumes that the original variables
     are accessible (i.e., they have been 'attach'ed) and that you want
     the imputed variables to have the same names as the original
     variables. If 'n.impute' was specified to 'transcan' you must tell
     'impute' which 'imputation' to use.
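
     As an alternative to relying on 'attach'ed variables, the 'data'
     and 'list.out' arguments return the completed variables directly.
     A self-contained sketch (simulated data; names are hypothetical):

     ```r
     library(Hmisc)
     set.seed(3)
     d <- data.frame(x1 = c(rnorm(45), rep(NA, 5)), x2 = rnorm(50))
     f <- transcan(~ x1 + x2, data=d, imputed=TRUE, pl=FALSE, pr=FALSE)
     # Fill in NAs for all variables at once; returns a list of
     # completed variables rather than assigning into the search list
     completed <- impute(f, data=d, list.out=TRUE, pr=FALSE)
     sum(is.na(completed$x1))   # 0 after imputation
     ```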

     'predict' computes predicted variables and imputed values from a
     matrix of new data.  This matrix should have the same column
     variables as the original matrix used with 'transcan', and in the
     same order (unless a formula was used with 'transcan').

     'Function' is a generic function generator. 'Function.transcan'
     creates S functions to transform variables using transformations
     created by 'transcan'.  These functions are useful for getting
     predicted values with predictors set to values on the original
     scale.

     'Varcov' methods are defined here so that imputation-corrected
     variance-covariance matrices are readily extracted from
     'fit.mult.impute' objects, and so that 'fit.mult.impute' can
     easily compute traditional covariance matrices for individual
     completed datasets.  Specific 'Varcov' methods are defined for
     'lm', 'glm', and 'multinom' fits.

     The subscript function preserves attributes.

     The 'invertTabulated' function does either inverse linear
     interpolation or uses sampling to sample qualifying x-values
     having y-values near the desired values.  The latter is used to
     get inverse values having a reasonable distribution (e.g., no
     floor or ceiling effects) when the transformation has a flat or
     nearly flat segment, resulting in a many-to-one transformation in
     that region.  Sampling weights are a combination of the frequency
     of occurrence of x-values that are within 'tolInverse' times the
     range of 'y' and the squared distance between the associated
     y-values and the target y-value ('aty').
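
     For instance, with a monotone tabulated transformation (the values
     here are made up for illustration), the default inverse linear
     interpolation recovers the original scale:

     ```r
     library(Hmisc)
     x <- seq(0, 10, length=101)   # original-scale values
     y <- x^2                      # tabulated "transformed" values
     # aty gives the transformed values whose inverses are wanted;
     # default inverse='linearInterp'
     invertTabulated(x, y, aty=c(4, 25))
     ```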

_U_s_a_g_e:

     transcan(x, method=c("canonical","pc"),
              categorical=NULL, asis=NULL, nk, imputed=FALSE, n.impute,
              boot.method=c('approximate bayesian', 'simple'),
              trantab=FALSE, transformed=FALSE, 
              impcat=c("score", "multinom", "rpart", "tree"),
              mincut=40, 
              inverse=c('linearInterp','sample'), tolInverse=.05,
              pr=TRUE, pl=TRUE, allpl=FALSE, show.na=TRUE, 
              imputed.actual=c('none','datadensity','hist','qq','ecdf'),
              iter.max=50, eps=.1, curtail=TRUE, 
              imp.con=FALSE, shrink=FALSE, init.cat="mode", 
              nres=if(boot.method=='simple')200 else 400,
              data, subset, na.action, treeinfo=FALSE, 
              rhsImp=c('mean','random'), details.impcat='', ...)

     ## S3 method for class 'transcan':
     summary(object, long=FALSE, ...)

     ## S3 method for class 'transcan':
     print(x, long=FALSE, ...)

     ## S3 method for class 'transcan':
     plot(x, ...)

     ## S3 method for class 'transcan':
     impute(x, var, imputation, name, where.in, data, 
            where.out=1, frame.out, list.out=FALSE, pr=TRUE, check=TRUE, ...)

     fit.mult.impute(formula, fitter, xtrans, data, n.impute, fit.reps=FALSE,
                     derived, pr=TRUE, subset, ...)

     ## S3 method for class 'transcan':
     predict(object, newdata, iter.max=50, eps=0.01, curtail=TRUE, 
             type=c("transformed","original"),
             inverse, tolInverse, ...)

     Function(object, ...)

     ## S3 method for class 'transcan':
     Function(object, prefix=".", suffix="", where=1, ...)

     invertTabulated(x, y, freq=rep(1,length(x)), 
                     aty, name='value',
                     inverse=c('linearInterp','sample'),
                     tolInverse=0.05, rule=2)

     Varcov(object, ...)

     ## Default S3 method:
     Varcov(object, regcoef.only=FALSE, ...)

     ## S3 method for class 'lm':
     Varcov(object, ...)

     ## S3 method for class 'glm':
     Varcov(object, ...)

     ## S3 method for class 'multinom':
     Varcov(object, ...)

     ## S3 method for class 'fit.mult.impute':
     Varcov(object, ...)

_A_r_g_u_m_e_n_t_s:

       x: a matrix containing continuous variable values and codes for
          categorical variables.  The matrix must have column names
          ('dimnames').  If row names are present, they are used in
          forming the 'names' attribute of imputed values if
          'imputed=TRUE'.  'x' may also be a formula, in which case the
          model matrix is created automatically, using data in the
          calling frame.  Advantages of using a formula are that
          'categorical' variables can be determined automatically by a
          variable being a 'factor' variable, and variables with two
          unique levels are modeled 'asis'. Variables with 3 unique
          values are considered to be 'categorical' if a formula is
          specified.  For a formula you may also specify that a
          variable is to remain untransformed by enclosing its name
           in the identity function, e.g. 'I(x3)'.  The user may add
          other variable names to the 'asis' and 'categorical' vectors.
           For 'invertTabulated', 'x' is a vector or a list with three
          components: the x vector, the corresponding vector of
          transformed values, and the corresponding vector of
          frequencies of the pair of original and transformed
          variables. For 'print', 'plot', 'impute', and 'predict', 'x'
          is an object created by 'transcan'. 

 formula: any S model formula 

  fitter: any S or Design modeling function (not in quotes) that
          computes a vector of 'coefficients' and for which 'Varcov'
          will return a variance-covariance matrix.  E.g., 'fitter=lm,
          glm, ols'.  At present models involving non-regression
          parameters (e.g., scale parameters in parametric survival
          models) are not handled fully. 

  xtrans: an object created by 'transcan', 'aregImpute', or 'Mice' 

  method: use 'method="canonical"' or any abbreviation thereof, to use
          canonical variates (the default).   'method="pc"' transforms
          a variable instead so as to maximize the correlation with the
          first principal component of the other variables. 

categorical: a character vector of names of variables in 'x' which are
          categorical, for which the ordering of re-scored values is
          not necessarily preserved. If 'categorical' is omitted, it is
          assumed that all variables are continuous (or binary).  Set
          'categorical="*"' to treat all variables as categorical. 

    asis: a character vector of names of variables that are not to be
          transformed. For these variables, the guts of 'lm.fit.qr' is
          used to impute missing values. You may want to treat binary
          variables 'asis' (this is automatic if using a formula).  If
          imputed=TRUE, you may want to use '"categorical"' for binary
          variables if you want to force imputed values to be one of
          the original data values. Set 'asis="*"' to treat all
          variables 'asis'. 

       nk: number of knots to use in expanding each continuous variable
          (not listed in 'asis') in a restricted cubic spline function.
          The default is 3 (yielding 2 parameters for a variable) if
          'n < 30', 4 (3 parameters) if '30 <= n < 100', and 5 (4
          parameters) if 'n >= 100'.

 imputed: Set to 'TRUE' to return a list containing imputed values on
          the original scale. If the transformation for a variable is
          non-monotonic, imputed values are not unique.  'transcan'
          uses the 'approx' function, which returns the highest value
          of the variable with the transformed score equalling the
          imputed score. 'imputed=TRUE' also causes original-scale
          imputed values to be shown as tick marks on the top margin of
          each graph when 'show.na=TRUE' (for the final iteration
          only). For categorical predictors, these imputed values are
          'jitter'ed so that their frequencies can be visualized.  When
          'n.impute' is used, each NA will have 'n.impute' tick marks. 

n.impute: number of multiple imputations.  If omitted, single predicted
          expected value imputation is used.  'n.impute=5' is
          frequently recommended. 

boot.method: default is to use the approximate Bayesian bootstrap
          (sample with replacement from sample with replacement of the
          vector of residuals). You can also specify
          'boot.method="simple"' to use the usual bootstrap one-stage
          sampling with replacement. 

 trantab: Set to 'TRUE' to add an attribute 'trantab' to the returned
          matrix.  This contains a vector of lists each with components
          'x' and 'y' containing the unique values and corresponding
          transformed values for the columns of 'x'.  This is set up to
          be used easily with the 'approx' function.  You must specify
          'trantab=TRUE' if you want to later use the
          'predict.transcan' function with 'type="original"'. 

transformed: set to 'TRUE' to cause 'transcan' to return an object
          'transformed' containing the matrix of transformed variables 

  impcat: This argument tells how to impute categorical variables on
          the original scale.  The default is 'impcat="score"' to
          impute the category whose canonical variate score is closest
          to the predicted score. Use 'impcat="tree"' to impute
          categorical variables using the 'tree()' function, using the
          values of all other transformed predictors.  'impcat="rpart"'
          will use 'rpart'.  A better but somewhat slower approach is
          to use 'impcat="multinom"' to fit a multinomial logistic
          model to the categorical variable at the last iteration of
          the 'transcan' algorithm.  This uses the 'multinom' function
          in the 'nnet' library of the 'MASS' package (which is assumed
          to have been installed by the user) to fit a polytomous
          logistic model to the current working transformations of all
          the other variables (using conditional mean imputation for
          missing predictors).  Multiple imputations are made by
          drawing multinomial values from the vector of predicted
          probabilities of category membership for the missing
          categorical values. 

  mincut: If 'imputed=TRUE', there are categorical variables, and
          'impcat="tree"', 'mincut' specifies the lowest node size that
          will be allowed to be split by 'tree'.  The default is 40. 

 inverse: By default, imputed values are back-solved on the original
          scale using inverse linear interpolation on the fitted
          tabulated transformed values. This will cause distorted
          distributions of imputed values (e.g., floor and ceiling
          effects) when the estimated transformation has a flat or
          nearly flat section.  To instead use the 'invertTabulated'
          function (see above) with the '"sample"' option, specify
          'inverse="sample"'. 

tolInverse: the multiplier of the range of transformed values, weighted
          by 'freq' and by the distance measure, for determining the
          set of x values having y values within a tolerance of the
          value of 'aty' in 'invertTabulated'.  For 'predict.transcan',
          'inverse' and 'tolInverse' are by default obtained from the
          options that were specified to 'transcan'.  Otherwise, if not
          specified by the user, these default to the defaults used by
          'invertTabulated'. 

      pr: For 'transcan', set to 'FALSE' to suppress printing r-squares
          and shrinkage factors.  For 'impute.transcan' set to 'FALSE'
          to suppress messages concerning the number of NAs imputed, or
          for 'fit.mult.impute' set to 'FALSE' to suppress printing
          variance inflation factors accounting for imputation, rate of
          missing information, and degrees of freedom. 

      pl: Set to 'FALSE' to suppress plotting the final transformations
          with  distribution of scores for imputed values (if
          'show.na=TRUE'). 

   allpl: Set to 'TRUE' to plot transformations for intermediate
          iterations. 

 show.na: Set to 'FALSE' to suppress the distribution of scores
          assigned to missing values (as tick marks on the right margin
          of each graph). See also 'imputed'. 

imputed.actual: The default is '"none"' to suppress plotting of actual
          vs. imputed values for all variables having any NAs.   Other
          choices are '"datadensity"' to use 'datadensity' to make a
          single plot, '"hist"' to make a series of back-to-back
          histograms, '"qq"' to make a series of q-q plots, or '"ecdf"'
          to make a series of empirical cdfs.  For
          'imputed.actual="datadensity"', for example, you get a rug
          plot of the non-missing values for the variable, with a rug
          plot of the imputed values beneath it.  When 'imputed.actual' is
          not '"none"', 'imputed' is automatically set to 'TRUE'. 

iter.max: maximum number of iterations to perform for 'transcan' or
          'predict'. For 'predict', only one iteration is used if there
          are no NAs in the data or if 'imp.con' was used. 

     eps: convergence criterion for 'transcan' and 'predict'.  'eps' is
          the maximum change in transformed values from one iteration
          to the next.  If for a given iteration all new
          transformations of variables differ by less than 'eps' (with
          or without negating the transformation to allow for
          "flipping") from the transformations in the previous
          iteration, one more iteration is done for 'transcan'.  
          During this last iteration, individual transformations are
          not updated but coefficients of transformations are.  This
          improves stability of coefficients of canonical variates on
          the right-hand-side. 'eps' is ignored when 'rhsImp="random"'. 

 curtail: for 'transcan', causes imputed values on the transformed
          scale to be truncated so that their ranges are within the
          ranges of  non-imputed transformed values. For 'predict',
          'curtail' defaults to 'TRUE' to truncate predicted
          transformed values to their ranges in the original fit
          ('xt'). 

 imp.con: for 'transcan', set to 'TRUE' to impute NAs on the original
          scales with constants (medians or most frequent category
          codes).  Set to a vector of constants to instead always use
          these constants for imputation. These imputed values are
          ignored when fitting the current working transformation for a
          single variable. 

  shrink: default is 'FALSE' to use ordinary least squares or canonical
          variate estimates. For the purposes of imputing NAs, you may
          want to set 'shrink=TRUE' to avoid overfitting when
          developing a prediction equation to predict each variable
          from all the others (see details below). 

init.cat: method for initializing scorings of categorical variables. 
          The default, '"mode"', uses a dummy variable set to 1 when
          the value is the most frequent value.  Use '"random"' to use
          a random 0-1 variable.  Set to '"asis"' to use the original
          integer codes as starting scores. 

    nres: number of residuals to store if 'n.impute' is specified.  If
          the dataset has fewer than 'nres' observations, all residuals
          are saved. Otherwise a random sample of the residuals of
          length 'nres' without replacement is saved.  The default for
          'nres' is higher if 'boot.method="approximate bayesian"'. 

     data: a data frame in which to find the variables when 'x' is a
          formula.  For 'impute.transcan', 'data' is a data frame to
          use as the source of variables to be imputed, rather than
          using 'where.in'.  For 'fit.mult.impute', 'data' is mandatory
          and is a data frame containing the data to be used in fitting
          the model but before imputations are applied.  Variables
          omitted from 'data' are assumed to be available from frame 1
          and do not need to be imputed. 

   subset: an integer or logical vector specifying the subset of
          observations to fit

na.action: may be used if 'x' is a formula.  The default 'na.action'
          is 'na.retain' (defined by 'transcan'), which keeps all
          observations with any 'NA's. 

treeinfo: Set to 'TRUE' to get additional information printed when
          'impcat="tree"', such as the predicted probabilities of
          category membership. 

  rhsImp: Set to '"random"' to use random draw imputation when a
          sometimes missing variable is moved to be a predictor of
          other sometimes missing variables.  Default is
          'rhsImp="mean"', which uses conditional mean imputation on
          the transformed scale.  Residuals used are residuals from the
          transformed scale.  When '"random"' is used, 'transcan' runs
          5 iterations and ignores 'eps'. 

details.impcat: the name (a character scalar) of a categorical
          variable.  The resulting 'transcan' object will then contain
          an element 'details.impcat' giving details of how that
          categorical variable was multiply imputed.

     ...: arguments passed to 'scat1d' or to the 'fitter' function (for
          'fit.mult.impute') 

    long: for 'summary', set to 'TRUE' to print all imputed values. For
          'print', set to 'TRUE' to print details of
          transformations/imputations. 

      var: for 'impute', a variable that was originally a column in
          'x', for which imputed values are to be filled in.
          'imputed=TRUE' must have been used in 'transcan'.  Omit 'var'
          to impute all variables, creating new variables in 'search'
          position 'where'. 

imputation: specifies which of the multiple imputations to use for
          filling in NAs 

    name: name of variable to impute, for 'impute()'.  Default is
          character string version of the second argument ('var') in
          the call to 'impute'. For 'invertTabulated', is the name of
          variable being transformed (used only for warning messages). 

where.in: location in 'search' list to find variables that need to be
          imputed, when all variables are to be imputed automatically
          by 'impute.transcan' (i.e., when no input variable name is
          specified). Default is first 'search' position that contains
          the first variable to be imputed. 

where.out: location in the 'search' list for storing variables with
          missing values set to imputed values, for 'impute.transcan'
          when all variables with missing values are being imputed
          automatically. 

frame.out: Instead of specifying 'where.out' you can specify an S frame
          number into which individual new imputed variables will be
          written. For example, 'frame.out=1' is useful for putting new
          variables into a temporary local frame when 'impute' is
          called within another function (see 'fit.mult.impute').  See
          'assign' for details about frames. 

list.out: If 'var' is not specified, you can set 'list.out=TRUE' to
          have 'impute.transcan' return a list containing variables
          with needed values imputed.  This list will contain a single
          imputation. 

   check: set to 'FALSE' to suppress certain warning messages 

 newdata: a new data matrix for which to compute transformed variables.
          Categorical variables must use the same integer codes as were
          used in the call to 'transcan'.  If a formula was originally
          specified to 'transcan' (instead of a data matrix), 'newdata'
          is optional and if given must be a data frame; a model frame
          is generated automatically from the previous formula.  The
          'na.action' is handled automatically, and the levels for
          factor variables must be the same and in the same order as
          were used in the original variables specified in the formula
          given to 'transcan'. 

fit.reps: set to 'TRUE' to save all fit objects from the fit for each
          imputation in 'fit.mult.impute'.  Then the object returned
          will have a component 'fits' which is a list whose 'i'th
          element is the 'i'th fit object. 

 derived: an expression containing S expressions for computing derived
          variables that are used in the model formula.  This is useful
          when multiple imputations are done for component variables
          but the actual model uses combinations of these (e.g., ratios
          or other derivations).  For a single derived variable you can
          specify, for example, 'derived=expression(ratio <-
          weight/height)'.  For multiple derived variables use the form
          'derived=expression({ratio <- weight/height; product <-
          weight*height})' or put the expression on separate input
          lines.   To monitor the multiply-imputed derived variables
          you can add to the 'expression' a command such as
          'print(describe(ratio))'.  See the example below. 

    type: By default, the matrix of transformed variables is returned,
          with imputed values on the transformed scale.  If you had
          specified 'trantab=TRUE' to 'transcan', specifying
          'type="original"' does the table look-ups with linear
          interpolation to return the input matrix 'x' but with imputed
          values on the original scale inserted for NAs.  For
          categorical variables, the method used here is to select  the
          category code having a corresponding scaled value closest to
          the predicted transformed value.  This corresponds to the
          default 'impcat'; a problem in getting predicted values for
          'tree' objects prevented using 'tree' for this.  Note:
          imputed values thus returned when 'type="original"' are
          single expected-value imputations even if 'n.impute' is
          given. 

  object: an object created by  'transcan', or an object to be
          converted to S function code, typically a model fit object of
          some sort

prefix, suffix: when creating separate S functions for each variable
          in 'x', the name of the new function will be 'prefix' placed
          in front of the variable name, and 'suffix' placed in back of
          the name.  The default is to use names of the form
          '.varname', where 'varname' is the variable name. 

   where: position in 'search' list at which to store new functions
          (for 'Function'). Default is position 1 in the search list. 
          See the 'assign' function for more documentation on the 'where'
          argument. 

       y: a vector corresponding to 'x' for 'invertTabulated', if its
          first argument 'x' is not a list 

    freq: a vector of frequencies corresponding to cross-classified 'x'
          and 'y' if 'x' is not a list.  Default is a vector of ones. 

     aty: vector of transformed values at which inverses are desired 

    rule: see 'approx'.  'transcan' assumes 'rule' is always '2' 

regcoef.only: set to 'TRUE' to make 'Varcov.default' delete positions
          in the covariance matrix for any non-regression coefficients
          (e.g., log scale parameter from 'psm' or 'survreg')

_D_e_t_a_i_l_s:

     The starting approximation to the transformation for each variable
     is taken to be the original coding of the variable.  The initial
     approximation for each missing value is taken to be the median of
     the non-missing values for the variable (for continuous ones) or
     the most frequent category (for categorical ones).  Instead, if
     'imp.con' is a vector, its values are used for imputing NAs.  When
     using each variable as a dependent variable, NAs on that variable
     cause all observations to be temporarily deleted.  Once a new
     working transformation is found for the variable, along with a
     model to predict that transformation from all the other variables,
     that latter model is used to impute NAs in the selected dependent
     variable if 'imp.con' is not specified.   When that variable is
     used to predict a new dependent variable, the current working
     imputed values are inserted.  Transformations are updated after
     each variable becomes a dependent variable, so the order of
     variables on 'x' could conceivably make a difference in the final
     estimates.  For obtaining out-of-sample
     predictions/transformations, 'predict' uses the same iterative
     procedure as 'transcan' for imputation, with the same starting
     values for fill-ins as were used by 'transcan'.  It also (by
     default) uses a conservative approach of curtailing transformed
     variables to be within the range of the original ones. Even when
     'method="pc"' is specified, canonical variables are used for
     imputing missing values.

     Note that fitted transformations, when evaluated at imputed
     variable values (on the original scale), will not precisely match
     the transformed imputed values returned in 'xt'.  This is because
     'transcan' uses an approximate method based on linear
     interpolation to back-solve for imputed values on the original
     scale.

     Shrinkage uses the method of Van Houwelingen and Le Cessie (1990)
     (similar to  Copas, 1983).  The shrinkage factor is
     '[1-(1-R2)(n-1)/(n-k-1)]/R2', where 'R2' is the apparent R-squared
     for predicting the variable, 'n' is the number of non-missing
     values, and 'k' is the effective number of degrees of freedom
     (aside from intercepts).  A heuristic estimate is used for 'k': 'A
     - 1 + sum(max(0,Bi-1))/m + m', where  'A' is the number of d.f.
     required to represent the variable being predicted, the 'Bi' are
     the number of columns required to represent all the other
     variables, and 'm' is the number of all other variables.  Division
     by 'm' is done because the transformations for the other variables
     are fixed at their current transformations the last time they were
     being predicted.  The '+ m' term comes from the number of
     coefficients estimated on the right hand side, whether by least
     squares or canonical variates.  If a shrinkage factor is negative,
     it is set to 0.  The shrinkage factor is the ratio of the adjusted
     R-squared to the ordinary R-squared. The adjusted R-squared is '1
     - (1 - R2)(n-1)/(n-k-1)', which is also set to zero if it is
     negative.  If 'shrink=FALSE' and the adjusted R-squares are much 
     smaller than the ordinary R-squares, you may want to run
     'transcan' with 'shrink=TRUE'.
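     As a small numeric illustration of the formulas above, with
     made-up values for the apparent R-squared, sample size, and
     effective degrees of freedom:

        R2 <- 0.30; n <- 200; k <- 12    # hypothetical values
        R2.adj <- max(0, 1 - (1 - R2)*(n - 1)/(n - k - 1))
        shrink <- R2.adj/R2    # shrinkage factor, about 0.85 here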

     Canonical variates are scaled to have variance of 1.0, by
     multiplying canonical coefficients from 'cancor' by 'sqrt(n-1)'.

     When specifying a non-Design library fitting function to
     'fit.mult.impute' (e.g., 'lm', 'glm'), running the result of
     'fit.mult.impute' through that fit's 'summary' method will not use
     the imputation-adjusted variances.  You may obtain the new
     variances using 'fit$var' or 'Varcov(fit)'.  

     When you specify a Design function to 'fit.mult.impute' (e.g.,
     'lrm, ols, cph, psm, bj'), automatically computed transformation
     parameters (e.g., knot locations for 'rcs') that are estimated for
     the first imputation are used for all other imputations.  This
     ensures that knot locations will not vary, which would change the
     meaning of the regression coefficients.

     Warning: even though 'fit.mult.impute' takes imputation into
     account when estimating variances of regression coefficients, it
     does not take into account the variation that results from
     estimation of the shapes and regression coefficients of the
     customized imputation equations. Specifying 'shrink=TRUE' solves a
     small part of this problem.  To fully account for all sources of
     variation you should consider putting the 'transcan' invocation
     inside a bootstrap or loop, if execution time allows.  Better
     still, use 'aregImpute' or one of the libraries such as MICE that
     uses real Bayesian posterior realizations to multiply impute
     missing values correctly.

     It is strongly recommended that you use the Hmisc 'naclus'
     function to determine if there is a good basis for imputation.
     'naclus' will tell you, for example, if systolic blood pressure is
     missing whenever diastolic blood pressure is missing.  If the only
     variable that is well correlated with diastolic bp is systolic bp,
     there is no basis for imputing diastolic bp in this case.

     At present, 'predict' does not work with multiple imputation.

     When calling 'fit.mult.impute' with 'glm' as the 'fitter'
     argument, if you need to pass a 'family' argument to 'glm' do it
     by quoting the family, e.g., 'family="binomial"'.
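     For example, assuming a 'transcan' object 'f', a binary response
     'y', and predictors in a data frame 'd' (variable names are
     illustrative):

        g <- fit.mult.impute(y ~ x1 + x2, glm, f,
                             family="binomial", data=d)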

     You should be able to use a variable in the formula given to
     'fit.mult.impute' as a numeric variable in the regression model
     even though it was a factor variable in the invocation of
     'transcan'.  Use for example 'fit.mult.impute(y ~ codes(x), lrm,
     trans)' (thanks to Trevor Thompson trevor@hp5.eushc.org).

_V_a_l_u_e:

     For 'transcan', a list of class 'transcan' with elements 'call'
     (with the function call), 'iter' (number of iterations done) and
     'rsq' and 'rsq.adj' containing the R-squares and adjusted
     R-squares achieved in predicting each variable from all the
     others.  It also has elements 'categorical', 'asis', 'coef',
     'xcoef', 'parms', 'fillin', 'ranges', 'scale', and 'formula'
     containing respectively the values supplied for 'categorical' and
     'asis', the within-variable coefficients used to compute the first
     canonical variate, the (possibly shrunk) across-variables
     coefficients of the first canonical variate that predicts each
     variable in turn, the parameters of the transformation (knots for
     splines, contrast matrix for categorical variables), the initial
     estimates for missing values (NA if variable never missing), the
     matrix of ranges of the transformed variables (min and max in
     first and second row), a vector of scales used to determine
     convergence for a transformation, the formula (if 'x' was a
     formula), and optionally a vector of shrinkage factors used for
     predicting each variable from the others.  For '"asis"' variables,
     the scale is the average absolute difference about the median. 
     For other variables it is unity, since canonical variables are
     standardized.  For 'xcoef', row 'i' has the coefficients to
     predict transformed variable 'i', with the column for the
     coefficient of variable 'i' set to NA.  If 'imputed=TRUE' was
     given, an optional element 'imputed' also appears.  This is a list
     with the vector of imputed values (on the original scale) for each
     variable containing NAs.  Matrices rather than vectors are
     returned if 'n.impute' is given.  If 'trantab=TRUE', the 'trantab'
     element also appears, as described above.  If 'n.impute > 0',
     'transcan' also returns a list 'residuals' that can be used for
     future multiple imputation.

     'impute' returns a vector (the same length as 'var') of class
     '"impute"' with NAs imputed.  'predict' returns a matrix with the
     same number of columns or variables as were in 'x'.

     'fit.mult.impute' returns a fit object that is a modification of
     the fit object created by fitting the completed dataset for the
     final imputation.  The 'var' matrix in the fit object has the
     imputation-corrected variance-covariance matrix.  'coefficients'
     is the average (over imputations) of the coefficient vectors,
     'variance.inflation.impute' is a vector containing the ratios of
     the diagonals of the between-imputation variance matrix to the
     diagonals of the average apparent (within-imputation) variance
     matrix. 'missingInfo' is Rubin's "rate of missing information" and
     'dfmi' is Rubin's degrees of freedom for a t-statistic for testing
     a single parameter.  The last two objects are vectors
     corresponding to the diagonal of the variance matrix.

_S_i_d_e _E_f_f_e_c_t_s:

     'transcan' and related functions print and plot; 'impute.transcan'
     creates new variables.

_A_u_t_h_o_r(_s):

     Frank Harrell 
      Department of Biostatistics 
      Vanderbilt University 
      f.harrell@vanderbilt.edu

_R_e_f_e_r_e_n_c_e_s:

     Kuhfeld, Warren F: The PRINQUAL Procedure.  SAS/STAT User's Guide,
     Fourth Edition, Volume 2, pp. 1265-1323, 1990.

     Van Houwelingen JC, Le Cessie S: Predictive value of statistical
     models. Statistics in Medicine 8:1303-1325, 1990.

     Copas JB: Regression, prediction and shrinkage. JRSS B 45:311-354,
     1983.

     He X, Shen L: Linear regression after spline transformation.
     Biometrika 84:474-481, 1997.

     Little RJA, Rubin DB: Statistical Analysis with Missing Data.  New
     York: Wiley, 1987.

     Rubin DB, Schenker N: Multiple imputation in health-care
     databases: An overview and some applications.  Stat in Med
     10:585-598, 1991.

     Faris PD, Ghali WA, et al: Multiple imputation versus data
     enhancement for dealing with missing data in observational health
     care outcome analyses.  J Clin Epidem 55:184-191, 2002.

_S_e_e _A_l_s_o:

     'aregImpute', 'impute', 'naclus', 'naplot', 'ace', 'avas',
     'cancor', 'prcomp', 'rcspline.eval',  'lsfit', 'approx',
     'datadensity', 'mice'

_E_x_a_m_p_l_e_s:

     ## Not run: 
     x <- cbind(age, disease, blood.pressure, pH)  
     #cbind will convert factor object `disease' to integer
     par(mfrow=c(2,2))
     x.trans <- transcan(x, categorical="disease", asis="pH",
                         transformed=TRUE, imputed=TRUE)
     summary(x.trans)  #Summary distribution of imputed values, and R-squares
     f <- lm(y ~ x.trans$transformed)   #use transformed values in a regression
     #Now replace NAs in original variables with imputed values, if not
     #using transformations
     age            <- impute(x.trans, age)
     disease        <- impute(x.trans, disease)
     blood.pressure <- impute(x.trans, blood.pressure)
     pH             <- impute(x.trans, pH)
     #Do impute(x.trans) to impute all variables, storing new variables under
     #the old names
     summary(pH)       #uses summary.impute to tell about imputations
                       #and summary.default to tell about pH overall
     # Get transformed and imputed values on some new data frame xnew
     newx.trans     <- predict(x.trans, xnew)
     w              <- predict(x.trans, xnew, type="original")
     age            <- w[,"age"]            #inserts imputed values
     blood.pressure <- w[,"blood.pressure"]
     Function(x.trans)  #creates .age, .disease, .blood.pressure, .pH()
     #Repeat first fit using a formula
     x.trans <- transcan(~ age + disease + blood.pressure + I(pH), 
                         imputed=TRUE)
     age <- impute(x.trans, age)
     predict(x.trans, expand.grid(age=50, disease="pneumonia",
             blood.pressure=60:260, pH=7.4))
     z <- transcan(~ age + factor(disease.code),  # disease.code categorical
                   transformed=TRUE, trantab=TRUE, imputed=TRUE, pl=FALSE)
     plot(z$transformed)
     ## End(Not run)

     # Multiple imputation and estimation of variances and covariances of
     # regression coefficient estimates accounting for imputation
     set.seed(1)
     x1 <- factor(sample(c('a','b','c'),100,TRUE))
     x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100)
     y  <- x2 + 1*(x1=='c') + rnorm(100)
     x1[1:20] <- NA
     x2[18:23] <- NA
     d <- data.frame(x1,x2,y)
     n <- naclus(d)
     plot(n); naplot(n)  # Show patterns of NAs
     f  <- transcan(~y + x1 + x2, n.impute=10, shrink=FALSE, data=d)
     options(digits=3)
     summary(f)

     f  <- transcan(~y + x1 + x2, n.impute=10, shrink=TRUE, data=d)
     summary(f)

     h <- fit.mult.impute(y ~ x1 + x2, lm, f, data=d)
     # Add ,fit.reps=TRUE to save all fit objects in h, then do something like:
     # for(i in 1:length(h$fits)) print(summary(h$fits[[i]]))

     diag(Varcov(h))

     h.complete <- lm(y ~ x1 + x2, na.action=na.omit)
     h.complete
     diag(Varcov(h.complete))

     # Note: had Design's ols function been used in place of lm, any
     # function run on h (anova, summary, etc.) would have automatically
     # used imputation-corrected variances and covariances

     # Example demonstrating how using the multinomial logistic model
     # to impute a categorical variable results in a frequency
     # distribution of imputed values that matches the distribution
     # of non-missing values of the categorical variable

     ## Not run: 
     set.seed(11)
     x1 <- factor(sample(letters[1:4], 1000,TRUE))
     x1[1:200] <- NA
     table(x1)/sum(table(x1))
     x2 <- runif(1000)
     z  <- transcan(~ x1 + I(x2), n.impute=20, impcat='multinom')
     table(z$imputed$x1)/sum(table(z$imputed$x1))
     ## End(Not run)

     # Example where multiple imputations are for basic variables and
     # modeling is done on variables derived from these

     set.seed(137)
     n <- 400
     x1 <- runif(n)
     x2 <- runif(n)
     y  <- x1*x2 + x1/(1+x2) + rnorm(n)/3
     x1[1:5] <- NA
     d <- data.frame(x1,x2,y)
     w <- transcan(~ x1 + x2 + y, n.impute=5, data=d)
     # Add ,show.imputed.actual for graphical diagnostics
     ## Not run: 
     g <- fit.mult.impute(y ~ product + ratio, ols, w,
                          data=data.frame(x1,x2,y),
                          derived=expression({
                            product <- x1*x2
                            ratio   <- x1/(1+x2)
                            print(cbind(x1,x2,x1*x2,product)[1:6,])}))
     ## End(Not run)

     # Here's a method for creating a permanent data frame containing
     # one set of imputed values for each variable specified to transcan
     # that had at least one NA, and also containing all the variables
     # in an original data frame.  The following is based on the fact
     # that the default output location for impute.transcan is
     # given by where.out=1 (search position 1)

     ## Not run: 
     xt <- transcan(~. , data=mine,
                    imputed=TRUE, shrink=TRUE, n.impute=10, trantab=TRUE)
     attach(mine, pos=1, use.names=FALSE)
     impute(xt, imputation=1) # use first imputation
     # omit imputation= if using single imputation
     detach(1, 'mine2')
     ## End(Not run)

     # Example of using invertTabulated outside transcan
     x    <- c(1,2,3,4,5,6,7,8,9,10)
     y    <- c(1,2,3,4,5,5,5,5,9,10)
     freq <- c(1,1,1,1,1,2,3,4,1,1)
     # x=5,6,7,8 with prob. .1 .2 .3 .4 when y=5
     # Within a tolerance of .05*(10-1) all y's match exactly
     # so the distance measure does not play a role
     set.seed(1)      # so can reproduce
     for(inverse in c('linearInterp','sample'))
      print(table(invertTabulated(x, y, freq, rep(5,1000), inverse=inverse)))

     # Test inverse='sample' when the estimated transformation is
     # flat on the right.  First show default imputations
     set.seed(3)
     x <- rnorm(1000)
     y <- pmin(x, 0)
     x[1:500] <- NA
     for(inverse in c('linearInterp','sample')) {
       par(mfrow=c(2,2))
       w <- transcan(~ x + y, imputed.actual='hist',
                     inverse=inverse, curtail=FALSE,
                     data=data.frame(x,y))
       if(inverse=='sample') next
       # cat('Click mouse on graph to proceed\n')
       # locator(1)
     }

