| yai {yaImpute} | R Documentation |
Given a set of observations, yai 1) separates the observations
into reference and target observations, 2) applies the
specified method to project the X-variables into a Euclidean space (not
always; see argument method), and 3) finds the k-nearest
neighbors within the reference observations and between the reference
and target observations. An alternative method using randomForest
classification and regression trees is provided for steps 2 and 3.
Target observations are those with values for the X-variables but
not for the Y-variables, while reference observations are those
with no missing values for the X- and Y-variables (see Details for the
exception).
yai(x=NULL,y=NULL,data=NULL,k=1,noTrgs=FALSE,noRefs=FALSE,
nVec=NULL,pVal=.05,method="msn",ann=TRUE,mtry=NULL,ntree=500,
rfMode="buildClasses")
x |
1) a matrix or data frame containing the X-variables for all
observations, with row names serving as observation identifiers, or 2) a
one-sided formula defining the X-variables as a linear formula. If
a formula is used for x, one must also be used for y (when y is
needed). |
y |
1) a matrix or data frame containing the Y-variables for the reference observations, or 2) a one-sided formula defining the Y-variables as a linear formula. |
data |
when x and y are formulas, data is a data frame or
matrix that contains all the variables; yai splits the observations
into the two sets itself (see the formula-based sketch following these
arguments). |
k |
the number of nearest neighbors; default is 1. |
noTrgs |
when TRUE, skip finding neighbors for target observations. |
noRefs |
when TRUE, skip finding neighbors for reference observations. |
nVec |
number of canonical vectors to use (methods msn and msn2),
or number of independent X-variables in the reference data when method
is mahalanobis. When NULL, the number is set by the function. |
pVal |
significance level for the canonical vectors, used when method is
msn or msn2. |
method |
is the strategy used for finding neighbors; the
options are the quoted key words (see Details):
euclidean - distance is computed in a normalized X space.
raw - like euclidean, except no normalization is done.
mahalanobis - distance is computed in its namesake space.
ica - like mahalanobis, but based on Independent Component Analysis using
package fastICA.
msn - distance is computed in a projected canonical space.
msn2 - like msn, but with variance weighting (canonical regression
rather than correlation).
gnn - distance is computed using a projected ordination of the
Xs found using canonical correspondence analysis
(cca from package vegan).
randomForest - distance is one minus the
proportion of randomForest trees in which a target observation falls in
the same terminal node as a reference observation (see randomForest).
|
ann |
TRUE if ann is used to find neighbors, FALSE if a slow search is used. |
mtry |
the number of X-variables picked at random when method is randomForest
(see randomForest); the default is sqrt(number of X-variables). |
ntree |
the number of classification and regression trees when method is randomForest.
When more than one Y-variable is used, the trees are divided among the variables.
Alternatively, ntree can be a vector of values corresponding to each Y-variable. |
rfMode |
when rfMode is buildClasses and method is randomForest, continuous variables
are internally converted to classes, forcing randomForest to build classification trees for
them. Otherwise, regression trees are built, provided your version of
randomForest is newer than 4.5-18. |
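To illustrate the formula interface described for x, y, and data above, here is a hedged sketch (irisNA and frm are invented names; per the Description, rows of data with missing Y-values are expected to become targets):
data(iris)
irisNA <- iris[, 1:4]
irisNA[101:150, 3:4] <- NA                      # last 50 rows lack Y-values
frm <- yai(x = ~ Sepal.Length + Sepal.Width,    # one-sided formula for the X-variables
           y = ~ Petal.Length + Petal.Width,    # one-sided formula for the Y-variables
           data = irisNA, method = "mahalanobis")
frm$trgRows                                     # rows expected to be treated as targets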
See the paper at http://www.jstatsoft.org/v23/i10/paper (it includes examples).
The following information is in addition to the content of the paper.
You need not have any Y-variables to run yai for the following methods:
euclidean, raw, mahalanobis, ica, and
randomForest (in which case unsupervised classification is
performed). However, yai normally classifies reference
observations as those with no missing values for the X- and Y-variables, and
target observations as those with values for the X-variables and
missing data for the Y-variables. When y is NULL (there are no Y-variables),
all the observations are considered references. See
newtargets for an example of how to use yai in this
situation.
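For instance, a minimal sketch of the Y-less (unsupervised) case just described, in which every observation is a reference and neighbors are sought among the references themselves (uns is an invented name):
data(iris)
uns <- yai(x = iris[, 1:4], method = "randomForest")  # no y: one unsupervised forest
head(uns$neiIdsRefs)                                  # nearest reference for each observation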
An object of class yai, which is a list with
the following tags:
call |
the call. |
yRefs, xRefs |
matrices of the Y- and X-variables (respectively) for just the reference observations (unscaled). The scale factors are attached as attributes. |
obsDropped |
a list of the row names for observations dropped for various reasons (missing data). |
trgRows |
a list of the row names for target observations as a subset of all observations. |
xall |
the X-variables for all observations. |
cancor |
returned from cancor function when method msn or
msn2 is used (NULL otherwise). |
ccaVegan |
an object of class cca (from package vegan) when method gnn is used. |
ftest |
a list containing partial F statistics and a vector of Pr>F (pgf) corresponding to the canonical correlation coefficients when method msn or msn2 is used (NULL otherwise). |
yScale, xScale |
scale data used on yRefs and xRefs as needed. |
k |
the value of k. |
pVal |
as input; only used when method msn or msn2 is used. |
projector |
NULL when not used. For methods msn, msn2, gnn and mahalanobis, this is a matrix that projects the normalized X-variables into a space suitable for computing Euclidean distances. |
nVec |
number of canonical vectors used (methods msn and msn2),
or number of independent X-variables in the reference data when method
mahalanobis is used. |
method |
as input, the method used. |
ranForest |
a list of the forests if method randomForest is used. There is
one forest for each Y-variable, or just one forest when there are no
Y-variables. |
ICA |
a list of information from fastICA
when method ica is used. |
ann |
the value of ann, TRUE when ann is used, FALSE otherwise. |
xlevels |
NULL if no factors are used as predictors; otherwise a list
of predictors that have factors and their levels (see lm). |
neiDstTrgs |
a data frame of distances between a target (identified by its row name) and the k references. There are k columns. |
neiIdsTrgs |
a data frame of reference identifications that correspond to neiDstTrgs. |
neiDstRefs, neiIdsRefs |
counterparts for references. |
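A brief sketch of pulling a few of these components from a fitted object; it assumes an object like the msn object built in the Examples below:
# assuming msn has been created as in the Examples, e.g. msn <- yai(x = x, y = y)
msn$method             # the method that was used
msn$k                  # the number of neighbors requested
head(msn$neiDstTrgs)   # distances from each target to its k nearest references
head(msn$neiIdsTrgs)   # identifications of those reference neighbors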
Nicholas L. Crookston ncrookston@fs.fed.us
Andrew O. Finley finleya@msu.edu
require(yaImpute)
data(iris)
# form some test data, y's are defined only for reference
# observations.
refs <- sample(rownames(iris), 50)
x <- iris[,1:2] # Sepal.Length Sepal.Width
y <- iris[refs,3:4] # Petal.Length Petal.Width
# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")
# running the following examples will load packages vegan
# and randomForest; they are somewhat more involved.
data(MoscowMtStJoe)
# convert polar slope and aspect measurements to cartesian
# (which is the same as Stage's (1976) transformation).
polar <- MoscowMtStJoe[,40:41]
polar[,1] <- polar[,1]*.01 # slope proportion
polar[,2] <- polar[,2]*(pi/180) # aspect radians
cartesian <- t(apply(polar,1,function (x)
{return (c(x[1]*cos(x[2]),x[1]*sin(x[2]))) }))
colnames(cartesian) <- c("xSlAsp","ySlAsp")
x <- cbind(MoscowMtStJoe[,37:39],cartesian,MoscowMtStJoe[,42:64])
y <- MoscowMtStJoe[,1:35]
mal <- yai(x=x, y=y, method="mahalanobis", k=1)
gnn <- yai(x=x, y=y, method="gnn", k=1)
msn <- yai(x=x, y=y, method="msn", k=1)
plot(mal,vars=yvars(mal)[1:16])
# reduce the plant community data for randomForest.
yba <- MoscowMtStJoe[,1:17]
ybaB <- whatsMax(yba,nbig=7) # see help on whatsMax
rf <- yai(x=x, y=ybaB, method="randomForest", k=1)
# build the imputations for the original y's
rforig <- impute(rf,ancillaryData=y)
# compare the results
compare.yai(mal,gnn,msn,rforig)
plot(compare.yai(mal,gnn,msn,rforig))
# build another randomForest case forcing regression
# to be used for continuous variables. The answers differ,
# but neither is clearly better than the other.
rf2 <- yai(x=x, y=ybaB, method="randomForest", rfMode="regression")
rforig2 <- impute(rf2,ancillaryData=y)
compare.yai(rforig2,rforig)