
\name{coding}
\alias{coding}
\title{Combines two or more surrogate/auxiliary variables into a vector}


\description{

   Recodes a matrix of categorical variables into a vector that takes 
   a unique value for each combination of their values. \cr


\bold{BACKGROUND}

From the matrix Z of first-stage covariates, this function creates 
a vector which takes a unique value for each combination as follows:

	\tabular{rrrr}{
	z1 \tab z2 \tab z3 \tab new.z \cr
	0 \tab 0 \tab 0 \tab 1 \cr
	1 \tab 0 \tab 0 \tab 2 \cr
	0 \tab 1 \tab 0 \tab 3 \cr
	1 \tab 1 \tab 0 \tab 4 \cr
	0 \tab 0 \tab 1 \tab 5 \cr
	1 \tab 0 \tab 1 \tab 6 \cr
	0 \tab 1 \tab 1 \tab 7 \cr
	1 \tab 1 \tab 1 \tab 8 \cr
	}

If some of the combinations do not exist, the function will adjust
accordingly: for example if the combination (0,1,1) is absent above,
then (1,1,1) will be coded as 7. \cr

The recoded values are reported as \code{new.z} in the printed output 
(see \bold{Value} below). \cr

This function should be run on the second-stage data before calling
the \code{\link[meanscore]{ms.nprev}} function, because it shows the order 
in which \code{ms.nprev} expects the first-stage sample sizes to be supplied.
}


\usage{
coding(x=x,y=y,z=z,return=FALSE)
}

\arguments{
REQUIRED ARGUMENTS

\item{y}{response variable (should be binary 0-1)}
\item{x}{matrix of predictor variables for regression model} 
\item{z}{matrix of any surrogate or auxiliary variables \cr


OPTIONAL ARGUMENTS}

\item{return}{logical value; if \code{TRUE}, the original surrogate
	  or auxiliary variables and the recoded auxiliary 
	  variables are returned.   
	  The default is \code{FALSE}. 
}
}
\value{
This function does not return a value unless \code{return=TRUE}. \cr

If used with only second stage (i.e. complete) data, it will print the 
following:
\item{ylevel}{the distinct values (or levels) of y}
\item{\eqn{\bold{z}1 \dots \bold{z}i}}{the distinct values of first stage variables 
\eqn{\bold{z}1 \dots \bold{z}i}}
\item{new.z}{recoded first stage variables. Each value represents a unique combination of 
first stage variable values.}
\item{n2}{second stage sample sizes in each (\code{ylevel},\code{new.z}) stratum. \cr

If used with combined first and second stage data (i.e. with NA for 
missing values), in addition to the above items, the function will also print the following:}

\item{n1}{first-stage sample sizes in each (\code{ylevel},\code{new.z}) stratum.}

}

\examples{

\dontrun{The ectopic data set has 3 categorical first-stage variables in columns 
3 to 5, which together with column 2 are the predictor variables of the
dichotomous outcome in column 1 (see help(ectopic) for further details). Typing
}
data(ectopic)
coding(x=ectopic[,2:5],y=ectopic[,1], z=ectopic[,3:5])

\dontrun{gives the following coding scheme and first-stage and second-stage 
sample sizes (n1 and n2 respectively)
}

\dontrun{
 ylevel gonnorhoea contracept sexpatr new.z  n1 n2
      0          0          0       0     1  56 13
      0          0          1       0     2 146 36
      0          0          0       1     3 119 33
      0          1          0       1     4  19  8
      0          0          1       1     5 344 93
      0          1          1       1     6  31  9
      1          0          0       0     1  26 11
      1          0          1       0     2   9  5
      1          0          0       1     3 160 79
      1          1          0       1     4  29 18
      1          0          1       1     5  35 20
      1          1          1       1     6   5  2
}
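
# A minimal sketch, not the package's own code, of the recoding rule described
# in the Background section for the multi-column case, assuming every column
# of z is coded 0/1. recode_z() and z_sketch are hypothetical names used only
# for this illustration.
recode_z <- function(z) {
  z <- as.matrix(z)
  # the first column varies fastest, as in the table: weights 1, 2, 4, ...
  weights <- 2^(seq_len(ncol(z)) - 1)
  # code every row as if all combinations were present
  full_code <- colSums(t(z) * weights) + 1
  # renumber so that combinations absent from the data leave no gaps
  match(full_code, sort(unique(full_code)))
}
# all combinations of three binary variables except (0,1,1)
z_sketch <- cbind(z1 = c(0, 1, 0, 1, 0, 1, 1),
                  z2 = c(0, 0, 1, 1, 0, 0, 1),
                  z3 = c(0, 0, 0, 0, 1, 1, 1))
cbind(z_sketch, new.z = recode_z(z_sketch))
# the last row, (1,1,1), is coded 7 rather than 8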
}

\seealso{
\code{\link[meanscore]{meanscore}},\code{\link[meanscore]{ms.nprev}},
\code{\link[meanscore]{ectopic}},\code{\link[meanscore]{simNA}},\code{\link{glm}}.
}

\keyword{utilities}

\eof
\name{ectopic}
\alias{ectopic}
\title{The Ectopic Pregnancy Dataset}

\description{

This dataset, which was analysed in Table 3 of Reilly and
Pepe (1995), is from a case-control study of the association
between ectopic pregnancy and sexually transmitted diseases (STDs). 
The total sample size is 979: 264 cases and 715 controls.
The variables collected from the beginning of
the study included gonorrhoea, contraceptive use and sexual
partners (see \bold{Format}).

One year after the study began, the investigators started
collecting serum samples for determining chlamydia antibody
status in all cases and in a 50 percent subsample of controls. 
As a result, only 327 out of the 979 patients have measurements 
for chlamydia antibody.
}

\usage{
data(ectopic)
}

\format{

The dataset has 979 observations with 5 variables arranged in the
following columns: \cr

Column 1 (Pregnancy) \cr
The ectopic pregnancy status of patients at the time of interview \cr
(0 = No, 1 = Yes) \cr

Column 2 (Chlamydia) \cr
The chlamydia antibody status of patients (0 = No, 1 = Yes). \cr
There are some observations with missing values, indicating that
at the time these patients were enrolled, the investigators
had not yet started to record chlamydia antibody status.

Column 3 (Gonnorhoea) \cr
(0 = No, 1 = Yes) \cr

Column 4 (Contracept) \cr
The use of contraceptives \cr 
(0 = No, 1 = Yes) \cr

Column 5 (Sexpatr) \cr
Multiple sex partners (0 = No, 1 = Yes) \cr
}

\source{
Sherman \emph{et al.} (1990)
}

\references{
	Reilly, M. and Pepe, M.S. 1995. A mean score method for 
		missing and auxiliary covariate data in 
		regression models. \emph{Biometrika} \bold{82}: 299-314 \cr
	Sherman, K.J., \emph{et al.} 1990. Sexually transmitted diseases
		and tubal pregnancy. \emph{Sex. Transm. Dis.} \bold{17}: 115-21
}

\keyword{datasets}



\eof
\name{meanscore}
\alias{meanscore}
\title{Mean Score Method for Missing Covariate Data in Logistic Regression Models}
\description{
Weighted logistic regression using the Mean Score method}

\usage{
	meanscore(x=x,y=y,z=z,factor=NULL,print.all=FALSE)
}



\arguments{

\item{x}{matrix of predictor variables, one column
	  of which contains some missing values (NA)}
\item{y}{response variable (binary 0-1)}
\item{z}{matrix of the surrogate or auxiliary variables 
          which must be categorical \cr
	  
OPTIONAL ARGUMENTS}

\item{print.all}{logical value; if \code{TRUE}, all output is printed. 
		 The default is \code{FALSE}.} 
\item{factor}{factor variables; if the columns of the matrix of
	  predictor variables have names, supply these names, 
	  otherwise supply the column numbers. \code{meanscore} will fit 
	  separate coefficients for each level of the factor variables.}
}
\value{

A list called "parameters" containing the following 
	will be returned:

\item{est}{the vector of estimates of the regression coefficients}
\item{se}{the vector of standard errors of the estimates}
\item{z}{Wald statistic for each coefficient}
\item{pvalue}{2-sided p-value (H0: coeff=0) \cr

When \code{print.all=TRUE}, it will also return the following lists:}

\item{Ihat}{the Fisher information matrix} 
\item{varsi}{variance of the score for each (ylevel,zlevel) stratum}
}

\details{
	The response, predictor and surrogate variables 
	must be numeric. The function automatically
	calls the \code{coding} function to recode the z matrix 
      into a \code{new.z} vector which takes a unique value
      for each combination (see \code{\link[meanscore]{coding}} for further
      details), as follows:
\tabular{rrrr}{
	z1 \tab z2 \tab z3 \tab new.z \cr
	0 \tab 0 \tab 0 \tab 1 \cr
	1 \tab	0 \tab 0 \tab 2 \cr
	0 \tab 1 \tab 0 \tab 3 \cr
	1 \tab 1 \tab 0 \tab 4 \cr
	0 \tab	0 \tab 1 \tab 5 \cr
	1 \tab 0 \tab 1 \tab 6 \cr
	0 \tab	1 \tab 1 \tab 7 \cr
	1 \tab 1 \tab 1 \tab 8 \cr
	}

	The recoded values are reported as \code{new.z}; see 
	\code{\link[meanscore]{coding}}.
}
\examples{
\dontrun{
THE SIMULATED DATASET EXAMPLE}

\dontrun{We use the simulated dataset which is stored in the matrix simNA.
You can load the dataset using:}

data(simNA) 

help(simNA)
# gives a detailed description of the data.
      
\dontrun{To analyze this data using the meanscore function:}

meanscore(y=simNA[,1],z=simNA[,2],x=simNA[,3])

\dontrun{This will give the following:

[1] "For calls to ms.nprev, input n1 or prev in the following order!!"
     ylevel z new.z  n1  n2
[1,]      0 0     0 310 150
[2,]      0 1     1 166  85
[3,]      1 0     0 177  86
[4,]      1 1     1 347 179

$parameters
                  est         se          z    pvalue
(Intercept) 0.0493998 0.07155138  0.6904103 0.4899362
x           1.0188437 0.10187094 10.0013188 0.0000000
}
\dontrun{If you extract the complete cases (n=500) to a matrix called
"complete", using}

complete=simNA[!is.na(simNA[,3]),]

\dontrun{then} 
summary(glm(complete[,1]~complete[,3], family="binomial"))

\dontrun{gives the following results:}

\dontrun{Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.05258    0.09879   0.532    0.595    
complete[, 3]  1.01942    0.12050   8.460   <2e-16 ***
}

\dontrun{
Notice that the Mean Score estimates above had smaller 
standard errors, reflecting the additional information
in the incomplete observations used in the analysis.
Also note that since z is a surrogate for x, it is not 
used in the complete case analysis.
}
 

\dontrun{THE ECTOPIC DATASET EXAMPLE}

\dontrun{This is a real-data example of an application of Mean Score
to a case-control study of the association between ectopic 
pregnancy and sexually transmitted diseases (see Reilly and 
Pepe, 1995). To learn more about the dataset, type help(ectopic). 

The data frame called "ectopic" is in the data subfolder
of the meanscore package. You can load the data by typing:
}
data(ectopic)

\dontrun{The following lines will reproduce the results presented in Table 3 
of Reilly & Pepe (1995)}

# use gonnorhoea, contracept and sexpatr as auxiliary variables
ectopic.z=ectopic[,3:5]

# the auxiliary variables defined above and the chlamydia antibody status 
# are the predictor variables in the logistic regression model		
ectopic.x=ectopic[,2:5]    

meanscore(x=ectopic.x,z=ectopic.z,y=ectopic[,1])

}

\seealso{
\code{\link[meanscore]{ms.nprev}},\code{\link[meanscore]{coding}},
\code{\link[meanscore]{ectopic}},\code{\link[meanscore]{simNA}},\code{\link{glm}}.
}

\references{

Reilly, M. and Pepe, M.S. 1995. A mean score method for missing and auxiliary
             covariate data in regression models. \emph{Biometrika} \bold{82}: 299-314
}

\keyword{regression}
	

\eof

\name{ms.nprev}
\alias{ms.nprev}
\title{Logistic regression of two-stage data using second stage sample 
       and first stage sample sizes or proportions (prevalences) as input}
	   

\description{Weighted logistic regression using the Mean Score method \cr

\bold{BACKGROUND}

This algorithm will analyse the second stage data from a two-stage
design, incorporating as appropriate weights the first stage sample
sizes in each of the strata defined by the first-stage variables.
If the first-stage sample sizes are unknown, you can still get
estimates (but not standard errors) using estimated relative 
frequencies (prevalences) of the strata. To ensure that the sample
sizes or prevalences are provided in the correct order, it is 
advisable to first run the \code{\link[meanscore]{coding}} function.
}

\usage{

ms.nprev(x=x,y=y,z=z,n1="option",prev="option",factor=NULL,print.all=FALSE)
}

\arguments{

REQUIRED ARGUMENTS
\item{x}{matrix of predictor variables for regression model} 
\item{y}{response variable (should be binary 0-1)}
\item{z}{matrix of any surrogate or auxiliary variables, which must be categorical, \cr 

and one of the following:}
\item{n1}{vector of the first stage sample sizes 
 for each (y,z) stratum: must be provided
 in the correct order (see \code{\link[meanscore]{coding}} function) \cr
OR}

\item{prev}{vector of the first-stage or population
 	  proportions (prevalences) for each (y,z) stratum:
          must be provided in the correct order 
          (see \code{\link[meanscore]{coding}} function) \cr 
	  

OPTIONAL ARGUMENTS}

\item{print.all}{logical value; if \code{TRUE}, all output is printed. 
		 The default is \code{FALSE}.} 
\item{factor}{factor variables; if the columns of the matrix of
	  predictor variables have names, supply these names, 
	  otherwise supply the column numbers. \code{ms.nprev} will fit 
	  separate coefficients for each level of the factor variables.}

}

\value{

If called with \code{prev}, it will return only:

	  A list called "table" containing the following:

\item{ylevel}{the distinct values (or levels) of y}
\item{zlevel}{the distinct values (or levels) of z}
\item{prev}{the prevalences for each \code{(ylevel,zlevel)} stratum}
\item{n2}{the sample sizes at the second stage in each stratum 
	  defined by \code{(ylevel,zlevel)} \cr

	  and a list called "parameters" containing:}

\item{est}{the Mean Score estimates of the coefficients in the
	  logistic regression model \cr \cr
	
If called with \code{n1} it will return:

	  a list called "table" containing:}

\item{ylevel}{the distinct values (or levels) of y}
\item{zlevel}{the distinct values (or levels) of z}
\item{n1}{the sample size at the first stage in each \code{(ylevel,zlevel)} stratum}
\item{n2}{the sample sizes at the second stage in each stratum 
	  defined by \code{(ylevel,zlevel)} \cr

	  and a list called "parameters" containing:}

\item{est}{the Mean Score estimates of the coefficients in the
	  logistic regression model}	
\item{se}{the standard errors of the Mean Score estimates}
\item{z}{Wald statistic for each coefficient}
\item{pvalue}{2-sided p-value (H0: coeff=0) \cr \cr

If print.all=TRUE, the following lists will also be returned:}

\item{Wzy}{the weight matrix used by the mean score algorithm,
	   for each \code{(ylevel,zlevel)} stratum: this will be in the same order 
	   as n1 and prev} 	
\item{varsi}{the variance of the score in each \code{(ylevel,zlevel)} stratum}
\item{Ihat}{the Fisher information matrix} 
}		   
               
\details{

	The response, predictor and surrogate variables 
	must be numeric. If z has multiple columns, 
	say (z1, z2, ..., zn), these are recoded into
      a single vector \code{new.z} as follows:

\tabular{rrrr}{
	z1 \tab z2 \tab z3 \tab new.z \cr
	0 \tab 0 \tab 0 \tab 1 \cr
	1 \tab	0 \tab 0 \tab 2 \cr
	0 \tab 1 \tab 0 \tab 3 \cr
	1 \tab 1 \tab 0 \tab 4 \cr
	0 \tab	0 \tab 1 \tab 5 \cr
	1 \tab 0 \tab 1 \tab 6 \cr
	0 \tab	1 \tab 1 \tab 7 \cr
	1 \tab 1 \tab 1 \tab 8 \cr
	}

	If some of the value combinations do not exist 
	in your data, the function will adjust accordingly. 
	For example if the combination (0,1,1) is absent,
	then (1,1,1) will be coded as 7.
}

\examples{

\dontrun{As an illustrative example, we use a simulated data set, simNA.
Use} 

data(simNA)        #to load the data
\dontrun{and}
help(simNA)        #for details


\dontrun{The "complete cases" (i.e. second-stage data) can be extracted by:}

complete=simNA[!is.na(simNA[,3]),]

\dontrun{Running a logistic regression analysis on the complete data:}

summary(glm(complete[,1]~complete[,3], family="binomial"))


\dontrun{gives the following result

Call:
glm(formula = complete[, 1] ~ complete[, 3], family = "binomial")

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.05258    0.09879   0.532    0.595    
complete[, 3]  1.01942    0.12050   8.460   <2e-16 ***
}

\dontrun{The first and second stage sample sizes can be viewed by running
the "coding" function (see help(coding) for details)
}

coding(x=simNA[,3], y=simNA[,1], z=simNA[,2])
\dontrun{which gives the following:

 [1] "For calls to ms.nprev, input n1 or prev in the following order!!"
     ylevel z new.z  n1  n2
[1,]      0 0     0 310 150
[2,]      0 1     1 166  85
[3,]      1 0     0 177  86
[4,]      1 1     1 347 179
}
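
# A sketch of how stratum counts in the order printed above could be tabulated
# by hand. This is an illustration with made-up toy vectors, not output from
# the package; y_toy, z_toy and x_toy are hypothetical names, and NA in x_toy
# marks an observation seen at the first stage only. It assumes every (y,z)
# stratum contains at least one second-stage observation.
y_toy <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0)
z_toy <- c(0, 0, 1, 0, 0, 1, 1, 1, 0, 1)
x_toy <- c(0.1, NA, 0.8, -0.5, -0.2, NA, 1.3, 0.4, NA, NA)
seen_toy <- !is.na(x_toy)
# table(z, y) unrolled column by column lists z within y, matching the
# printed order: (y=0,z=0), (y=0,z=1), (y=1,z=0), (y=1,z=1)
n1_toy <- as.vector(table(z_toy, y_toy))
n2_toy <- as.vector(table(z_toy[seen_toy], y_toy[seen_toy]))
# prevalences are the first-stage counts rescaled to sum to one
prev_toy <- n1_toy / sum(n1_toy)
cbind(n1 = n1_toy, n2 = n2_toy, prev = prev_toy)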

\dontrun{An analysis of all first- and second-stage data using Mean Score:}

# supply the first stage sample sizes in the correct order
n1=c(310,166,177,347)
ms.nprev(x=complete[,3],z=complete[,2],y=complete[,1],n1=n1)

\dontrun{gives the results:
[1] "please run coding function to see the order in which you"
[1] "must supply the first-stage sample sizes or prevalences"
[1] " Type ?coding for details!"
[1] "For calls to ms.nprev,input n1 or prev in the following order!!"
     ylevel z new.z  n2
[1,]      0 0     0 150
[2,]      0 1     1  85
[3,]      1 0     0  86
[4,]      1 1     1 179
[1] "Check sample sizes/prevalences"
$table
     ylevel zlevel  n1  n2
[1,]      0      0 310 150
[2,]      0      1 166  85
[3,]      1      0 177  86
[4,]      1      1 347 179

$parameters
                  est         se          z    pvalue
(Intercept) 0.0493998 0.07155138  0.6904103 0.4899362
x           1.0188437 0.10187094 10.0013188 0.0000000
}

\dontrun{If we supply the prevalences instead of the first-stage sample sizes}
p1=c(310,166,177,347)/1000
ms.nprev(x=complete[,3],z=complete[,2],y=complete[,1],prev=p1)

\dontrun{we get the output:

      ylevel zlevel  prev  n2
[1,]      0      0 0.310 150
[2,]      0      1 0.166  85
[3,]      1      0 0.177  86
[4,]      1      1 0.347 179

$parameters
                   est
(Intercept) 0.04939797
x           1.01885599
}


\dontrun{Note that the Mean Score algorithm produces smaller 
standard errors of estimates than the complete-case
analysis, due to the additional information in the
incomplete cases.}
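
# A hedged sketch of the weighting idea, on freshly simulated data rather than
# simNA. It assumes, which this help page does not state explicitly, that each
# second-stage observation is weighted by n1/n2 for its (y,z) stratum. It is
# not the package's implementation and does not give Mean Score standard
# errors. All names ending in _sim or _obs are hypothetical.
set.seed(42)
n_sim <- 400
x_sim <- rnorm(n_sim)
y_sim <- rbinom(n_sim, size = 1, prob = plogis(x_sim))
z_sim <- as.numeric(x_sim > 0)
x_sim[sample(n_sim, 200)] <- NA
seen_sim <- !is.na(x_sim)
# per-observation stratum sizes at the first and second stage
n1_obs <- ave(rep(1, n_sim), y_sim, z_sim, FUN = sum)
n2_obs <- ave(as.numeric(seen_sim), y_sim, z_sim, FUN = sum)
w_sim <- (n1_obs / n2_obs)[seen_sim]
# quasibinomial avoids the warning about non-integer weights; only the point
# estimates are of interest here
fit_sim <- glm(y_sim[seen_sim] ~ x_sim[seen_sim], family = quasibinomial,
               weights = w_sim)
coef(fit_sim)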
}

\seealso{
\code{\link[meanscore]{meanscore}},\code{\link[meanscore]{coding}},
\code{\link[meanscore]{ectopic}},\code{\link[meanscore]{simNA}},\code{\link{glm}}.
}

\references{

	Reilly, M. and Pepe, M.S. 1995. A mean score method for 
		missing and auxiliary covariate data in 
		regression models. \emph{Biometrika} \bold{82}: 299-314
}
	
\keyword{regression}


	

\eof
\name{simNA}
\alias{simNA}
\title{Simulated dataset for illustrating the meanscore function}


\description{

For this dataset, we generated 1000 observations of the 
predictor variable (X) from the standard normal distribution.
The response variable (Y) was then generated as a Bernoulli
random variable with \eqn{p = \exp(x)/(1+\exp(x))}{p = exp(x)/(1+exp(x))}.

A dichotomous surrogate variable for X, called Z, was 
generated as follows: 

	\eqn{Z = 1} if \eqn{X > 0}, \cr
	\eqn{Z = 0} otherwise.

We randomly deleted 500 of the X values (replacing them with NA),
and stored the data in the matrix simNA, described below.
}

\usage{data(simNA)}

\format{

There are 3 columns in the dataset. \cr

Column 1 is the response variable (Y). \cr

Column 2 is the surrogate variable (Z). \cr

Column 3 is the predictor variable (X). \cr

}
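
\examples{
# A sketch of the data-generating steps given in the description. The seed and
# the use of sample() to choose the deleted values are assumptions, so this
# will not reproduce the values stored in simNA. All names ending in _sketch
# are hypothetical.
set.seed(1)
n_sketch <- 1000
x_sketch <- rnorm(n_sketch)
y_sketch <- rbinom(n_sketch, size = 1,
                   prob = exp(x_sketch) / (1 + exp(x_sketch)))
# the surrogate is derived from X before any values are deleted
z_sketch <- as.numeric(x_sketch > 0)
x_sketch[sample(n_sketch, 500)] <- NA
# same column order as simNA: response, surrogate, predictor
sim_sketch <- cbind(y = y_sketch, z = z_sketch, x = x_sketch)
colSums(is.na(sim_sketch))
}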

\keyword{datasets}

\eof
