\name{budget}
\alias{budget}
\title{Optimal sampling design for 2-stage studies with fixed budget}

\description{
Optimal design for a two-stage study with a budget constraint,
using the Mean Score method \cr

\bold{BACKGROUND} \cr

This function calculates the total number of study observations
and the second-stage sampling fractions that maximise precision 
subject to an available budget. The user must also supply the unit
cost of observations at the first and second stage, and the vector
of prevalences in each of the strata defined by the levels 
of the dependent variable and the first-stage covariates.

Before running the \code{budget} function you should run the \code{coding}
function to see the order in which you must supply the vector of
prevalences (see \code{\link[twostage]{coding}} for details).

}


\usage{

budget (x=x,y=y,z=z,factor=NULL,prev=prev,var="var",b=b,c1=c1,c2=c2)
}

\arguments{

REQUIRED ARGUMENTS
\item{x}{matrix of predictor variables} 
\item{y}{response variable (binary 0-1)}
\item{z}{matrix of the first stage variables which must be categorical
	 (can be more than one column)}
\item{prev}{vector of estimated prevalences for each (y,z) stratum}
\item{var}{The name of the predictor variable whose coefficient is to be optimised. 
	    See \bold{DETAILS} if this is a factor variable}	 
\item{b}{the total budget available}
\item{c1}{the cost per first stage observation}
\item{c2}{the cost per second stage observation \cr

OPTIONAL ARGUMENTS}
\item{factor}{the names of any factor variables in the predictor matrix}
}

\value{
The following lists will be returned:

\item{\bold{n}}{the optimal number of observations (first stage sample size)}

\item{\bold{se}}{the standard error of estimates achieved by the optimal design \cr

\bold{and a list called \code{design} consisting of the following items:}}

\item{ylevel}{the different levels of the response variable}
\item{zlevel}{the different levels of first stage covariates z.}
\item{prev}{the prevalence of each (\code{ylevel},\code{zlevel}) stratum}
\item{n2}{the sample size of pilot observations for each (\code{ylevel},\code{zlevel}) stratum}
\item{prop}{optimal 2nd stage sampling proportion for each (\code{ylevel},\code{zlevel}) stratum}
\item{samp.2nd}{optimal 2nd stage sample size for each (\code{ylevel},\code{zlevel}) stratum}
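
The columns of \code{design} are linked by simple arithmetic. The sketch below re-derives \code{samp.2nd} and the total spend from the output printed in the Examples section. It assumes that \code{samp.2nd} is \code{n * prev * prop} rounded to the nearest integer and that the total cost is \code{c1 * n + c2 * sum(samp.2nd)}. Both relations are inferred from the printed output, not taken from the package source, so treat this as a plausibility check only.

```r
# Values copied from the example output of budget() (b = 10000, c1 = 1, c2 = 0.5)
n    <- 9166
prev <- c(0.0197823937, 0.1339020772, 0.6698813056, 0.0544015826,
          0.0503214639, 0.0467359050, 0.0009891197, 0.0040801187,
          0.0127349159, 0.0022255193, 0.0032146390, 0.0017309594)
prop <- c(0.5230, 0.2841, 0.0726, 0.4488, 0.2480, 0.4922, 1, 1, 1, 1, 1, 1)

# Assumed relation: second-stage sample size in each (y,z) stratum
samp.2nd <- round(n * prev * prop)

# Assumed relation: first-stage cost plus second-stage cost
total.cost <- 1 * n + 0.5 * sum(samp.2nd)

print(samp.2nd)
print(total.cost)
```

Under these assumptions the total comes to 9999.5, just inside the budget of 10,000.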

}

\details{
	The response, predictor and first stage variables 
	have to be numeric. If you have multiple columns of 
	z, say (z1,z2,..zn), these will be recoded into
        a single vector \code{new.z}

	\tabular{rrrr}{
	z1 \tab z2 \tab z3 \tab new.z \cr
	0 \tab 0 \tab 0 \tab 1 \cr
	1 \tab	0 \tab 0 \tab 2 \cr
	0 \tab 1 \tab 0 \tab 3 \cr
	1 \tab 1 \tab 0 \tab 4 \cr
	0 \tab	0 \tab 1 \tab 5 \cr
	1 \tab 0 \tab 1 \tab 6 \cr
	0 \tab	1 \tab 1 \tab 7 \cr
	1 \tab 1 \tab 1 \tab 8 \cr
	}


	If some of the value combinations do not exist 
	in your data, the function will adjust accordingly. 
	For example if the combination (0,1,1) is absent,
	then (1,1,1) will be coded as 7. \cr

	If you wish to optimise the coefficient of a factor variable, 
      you need to specify which level of the variable to optimise. 
	For example, if "weight" is a factor variable with 3 categories
	1,2 and 3 then var="weight2" will optimise the estimation of the
	coefficient which measures the difference between weight=2 and
	the baseline (weight=1). By default the baseline is always 
	the category with the smallest value. \cr
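
	The \code{var="weight2"} convention mirrors the way R itself labels treatment contrasts. The sketch below uses \code{model.matrix} from base R on a made-up factor to show where such names come from. Whether \code{budget} builds its design matrix this way internally is an assumption.

```r
# Hypothetical factor with three categories; 1 is the smallest value and
# therefore the baseline under R's default treatment contrasts
d <- data.frame(weight = factor(c(1, 2, 3, 2, 1, 3)))

# Column names of the design matrix: intercept plus one column per
# non-baseline level
coef.names <- colnames(model.matrix(~ weight, data = d))
print(coef.names)
```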
	}

\examples{

\dontrun{We give an example using the pilot subsample from the CASS data 
discussed in Reilly (1996). The data are in the cass2 matrix, which can be loaded using}

data(cass2)

\dontrun{and a description of the dataset can be seen using}

help(cass2)

\dontrun{In our examples below, we use sex and weight as auxiliary variables.

Given an available budget of 10,000, a first-stage cost of 1/unit
and a second-stage cost of 0.5/unit, the code below will calculate
the sampling strategy that optimises the precision of the 
coefficient for SEX (see output below).}

data(cass2)
y=cass2[,1]            #response variable
z=cass2[,10]           #auxiliary variable
x=cass2[,c(2,4:9)]     #predictor variables

# run CODING function to see in which order we should enter prevalences
coding(x=x,y=y,z=z)	
# supply the prevalences (from Table 5, Reilly 1996)

prev=c(0.0197823937,0.1339020772,0.6698813056,0.0544015826,
       0.0503214639,0.0467359050,0.0009891197,0.0040801187,0.0127349159,
       0.0022255193,0.0032146390,0.0017309594)

# optimise sex coefficient

budget(x=x,y=y,z=z,var="sex",prev=prev,b=10000,c1=1,c2=0.5)


\dontrun{OUTPUT

[1] "please run coding function to see the order in which you"
[1] "must supply the first-stage sample sizes or prevalences"
[1] " Type ?coding for details!"
[1] "For calls requiring n1 or prev as input, use the following order"
      ylevel z new.z n2
 [1,]      0 1     1 10
 [2,]      0 2     2 10
 [3,]      0 3     3 10
 [4,]      0 4     4 10
 [5,]      0 5     5 10
 [6,]      0 6     6 10
 [7,]      1 1     1  8
 [8,]      1 2     2 10
 [9,]      1 3     3 10
[10,]      1 4     4 10
[11,]      1 5     5 10
[12,]      1 6     6 10
[1] "Check sample sizes/prevalences"
$n
[1] 9166

$design
      ylevel zlevel         prev n2   prop samp.2nd
 [1,]      0      1 0.0197823937 10 0.5230       95
 [2,]      0      2 0.1339020772 10 0.2841      349
 [3,]      0      3 0.6698813056 10 0.0726      446
 [4,]      0      4 0.0544015826 10 0.4488      224
 [5,]      0      5 0.0503214639 10 0.2480      114
 [6,]      0      6 0.0467359050 10 0.4922      211
 [7,]      1      1 0.0009891197  8 1.0000        9
 [8,]      1      2 0.0040801187 10 1.0000       37
 [9,]      1      3 0.0127349159 10 1.0000      117
[10,]      1      4 0.0022255193 10 1.0000       20
[11,]      1      5 0.0032146390 10 1.0000       29
[12,]      1      6 0.0017309594 10 1.0000       16

$se
                   [,1]
(Intercept) 1.193504705
sex         0.217235702
weight      0.006718422
age         0.014588813
angina      0.245831383
chf         0.077039239
lve         0.010071151
surg        0.179887419}
}
\seealso{
\code{\link[twostage]{ms.nprev}},\code{\link[twostage]{fixed.n}},
\code{\link[twostage]{precision}},\code{\link[twostage]{cass1}},
\code{\link[twostage]{cass2}},\code{\link[twostage]{coding}}
}

\references{
	Reilly,M and M.S. Pepe. 1995. A mean score method for 
	missing and auxiliary covariate data in 
	regression models. \emph{Biometrika} \bold{82:}299-314 \cr

	Reilly,M. 1996. Optimal sampling strategies for 
		two-stage studies. \emph{Amer. J. Epidemiol.} 
		\bold{143:}92-100

}

\keyword{design}

\eof
\name{cass1}
\alias{cass1}
\title{The CASS pilot dataset with sex as auxiliary covariate}

\description{ 

This is a pilot dataset from the Coronary Artery Surgery Study (CASS) 
and is discussed in Reilly (1996). It consists of a random sample
of 25 subjects from each stratum defined by different levels of
mortality and sex. The "first-stage" data from which this pilot
sample was chosen had 8096 observations. There are three variables
in the pilot dataset (see FORMAT).
}

\format{
The dataset has 100 observations with 3 variables arranged in the
following columns: \cr

Column 1 (Mortality) \cr
The operative mortality of the patients \cr
(0 = Alive, 1 = Dead) \cr

Column 2 (age) \cr

Column 3 (sex) \cr
(0 = Male, 1 = Female) \cr

}

\source{
	Vlietstra et al. (1980)
}

\seealso{
\code{\link[twostage]{ms.nprev}},\code{\link[twostage]{fixed.n}},
\code{\link[twostage]{budget}},\code{\link[twostage]{precision}},
\code{\link[twostage]{cass2}},\code{\link[twostage]{coding}}
}

\references{

Reilly,M. 1996. Optimal sampling strategies for two-stage studies. 
\emph{Amer. J. Epidemiol.} \bold{143:}92-100


Vlietstra, R.E., Frye, R.L., Kronmal, R.A., et al. 1980. Risk factors and angiographic
coronary artery disease: a report from the Coronary Artery Surgery Study (CASS).
\emph{Circulation} \bold{62:}254-61

}

\keyword{datasets}







\eof
\name{cass2}
\alias{cass2}
\title{The CASS pilot dataset with sex and categorical weight 
	as auxiliary variables}

\description{

This is a pilot dataset from the Coronary Artery Surgery Study (CASS) 
register (Reilly, 1996). It consists of a random sample of 10
observations from each of the strata defined by different levels
of mortality, sex and weight category, except for one stratum
(mortality = 1, Z = 1) that contributes only 8, giving 118 observations
in total. The "first-stage" data from which this pilot sample was chosen
had 8096 observations. There are ten variables in the pilot dataset
(see FORMAT).
}

\format{
The dataset has 118 observations with 10 variables arranged in the
following columns: \cr

Column 1 (Mortality) \cr
The operative mortality of the patients \cr
(0 = Alive, 1 = Dead) \cr

Column 2 (sex) \cr
(0 = Male, 1 = Female) \cr

Column 3 (Categorical Weight) \cr
1 = weight less than 60 kg \cr
2 = weight between 60-70 kg \cr
3 = weight 70 kg or more \cr


Column 4 (weight) \cr
The actual weight (in kg) of the patients at the time of bypass surgery \cr

Column 5 (age) \cr
Age of patients at the time of bypass surgery \cr

Column 6 (Unstable angina) \cr
The angina status of the patients at the time of bypass surgery \cr
(0 = Stable, 1 = Unstable) \cr

Column 7 (Congestive Heart Failure (CHF) score) \cr

Column 8 (Left ventricular end diastolic blood pressure (LVEDBP)) \cr

Column 9 (Urgency of Surgery) \cr
(0 = Not urgent, 1 = Urgent) \cr

Column 10 (the auxiliary variables, Z) \cr
Different values describe the different levels of sex and weight 
category \cr

1 = Male; weight less than 60 kg \cr
2 = Male; weight between 60-70 kg \cr
3 = Male; weight 70 kg or more\cr
4 = Female; weight less than 60 kg \cr
5 = Female; weight between 60-70 kg \cr
6 = Female; weight 70 kg or more\cr

}

\source{
	Vlietstra et al. (1980)
}

\seealso{
\code{\link[twostage]{ms.nprev}},\code{\link[twostage]{fixed.n}},
\code{\link[twostage]{budget}},\code{\link[twostage]{precision}},
\code{\link[twostage]{cass1}},\code{\link[twostage]{coding}}
}

\references{

Reilly,M. 1996. Optimal sampling strategies for two-stage studies. 
\emph{Amer. J. Epidemiol.} \bold{143:}92-100

Vlietstra, R.E., Frye, R.L., Kronmal, R.A., et al. 1980. Risk factors and angiographic
coronary artery disease: a report from the Coronary Artery Surgery Study (CASS).
\emph{Circulation} \bold{62:}254-61

}

\keyword{datasets}




\eof

\name{coding}
\alias{coding}
\title{Combine two or more surrogate/auxiliary variables into a vector} 


\description{

   Recodes a matrix of categorical variables into a vector which takes 
   a unique value for each combination \cr


\bold{BACKGROUND}

From the matrix Z of first-stage covariates, this function creates 
a vector which takes a unique value for each combination as follows:

	\tabular{rrrr}{
	z1 \tab z2 \tab z3 \tab new.z \cr
	0 \tab 0 \tab 0 \tab 1 \cr
	1 \tab 0 \tab 0 \tab 2 \cr
	0 \tab 1 \tab 0 \tab 3 \cr
	1 \tab 1 \tab 0 \tab 4 \cr
	0 \tab 0 \tab 1 \tab 5 \cr
	1 \tab 0 \tab 1 \tab 6 \cr
	0 \tab 1 \tab 1 \tab 7 \cr
	1 \tab 1 \tab 1 \tab 8 \cr
	}

If some of the combinations do not exist, the function will adjust
accordingly: for example if the combination (0,1,1) is absent above,
then (1,1,1) will be coded as 7. \cr
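
The recoding rule in the table can be sketched in a few lines of base R. The helper \code{recode.z} below is hypothetical and is not the package's own code: it numbers the observed combinations consecutively, with the first column varying fastest, which reproduces the table and the (0,1,1) example for a multi-column Z.

```r
# recode.z is a hypothetical helper written for illustration only
recode.z <- function(z) {
  z <- as.data.frame(z)
  combos <- unique(z)
  # Sort the observed combinations so that the first column varies fastest
  # (the last column is the primary sort key)
  ord <- do.call(order, unname(rev(as.list(combos))))
  combos <- combos[ord, , drop = FALSE]
  # Build one text key per row and look it up among the sorted combinations
  key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "_"))
  match(key(z), key(combos))
}

# All eight combinations of three binary variables, as in the table
full <- expand.grid(z1 = 0:1, z2 = 0:1, z3 = 0:1)
new.z.full <- recode.z(full)

# Drop the combination (0,1,1): (1,1,1) now takes the value 7
reduced <- full[!(full$z1 == 0 & full$z2 == 1 & full$z3 == 1), ]
new.z.reduced <- recode.z(reduced)

print(cbind(full, new.z = new.z.full))
print(cbind(reduced, new.z = new.z.reduced))
```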

The recoded values are reported as \code{new.z} in the printed output 
(see \code{value} below) \cr

This function should be run on second stage data prior to using
the ms.nprev function, as it illustrates the order in which the call
to ms.nprev expects the first-stage sample sizes to be provided.
}


\usage{
coding(x=x,y=y,z=z,return=FALSE)
}

\arguments{
REQUIRED ARGUMENTS
\item{x}{matrix of predictor variables for regression model} 
\item{y}{response variable (should be binary 0-1)}
\item{z}{matrix of any surrogate or auxiliary variables which must be categorical \cr


OPTIONAL ARGUMENTS}

\item{return}{logical value; if \code{TRUE}, the original surrogate
	  or auxiliary variables and the recoded auxiliary 
	  variables will be returned.   
	  The default is \code{FALSE}. 
}
}
\value{
This function does not return any values unless \code{return=TRUE}. \cr

If used with only second stage (i.e. complete) data, it will print the 
following:
\item{ylevel}{the distinct values (or levels) of response variable}
\item{\eqn{\bold{z}1 \dots \bold{z}i}}{the distinct values of first stage variables 
\eqn{\bold{z}1 \dots \bold{z}i}}
\item{new.z}{recoded first stage variables. Each value represents a unique combination of 
first stage variable values.}
\item{n2}{second stage sample sizes in each (\code{ylevel},\code{new.z}) stratum. \cr

If used with combined first and second stage data (i.e. with NA for 
missing values), in addition to the above items, the function will also print the following:}

\item{n1}{first-stage sample sizes in each (\code{ylevel},\code{new.z}) stratum.}

}

\examples{

\dontrun{The CASS2 data set in Reilly (1996) has 2 categorical first-stage 
variables in columns 2 (sex) and 3 (categorical weight). The predictor 
variables are in column 2 (sex) and columns 4-9, and the response variable 
is in column 1 (mort). See help(cass2) for further details. 

The commands}
data(cass2)
coding(x=cass2[,c(2,4:9)],y=cass2[,1], z=cass2[,2:3])

\dontrun{give the following coding scheme and second-stage 
sample sizes (n2)

[1] "For calls requiring n1 or prev as input, use the following order"
      ylevel sex wtcat new.z n2
 [1,]      0   0     1     1 10
 [2,]      0   1     1     2 10
 [3,]      0   0     2     3 10
 [4,]      0   1     2     4 10
 [5,]      0   0     3     5 10
 [6,]      0   1     3     6 10
 [7,]      1   0     1     1  8
 [8,]      1   1     1     2 10
 [9,]      1   0     2     3 10
[10,]      1   1     2     4 10
[11,]      1   0     3     5 10
[12,]      1   1     3     6 10
}
}
\references{
	Reilly,M. 1996. Optimal sampling strategies for 
		        two-stage studies. \emph{Amer. J. Epidemiol.} 
		        \bold{143:}92-100
}
\seealso{
\code{\link[twostage]{ms.nprev}},\code{\link[twostage]{fixed.n}},\code{\link[twostage]{budget}},
\code{\link[twostage]{precision}},\code{\link[twostage]{cass1}},\code{\link[twostage]{cass2}}
}

\keyword{utilities}

\eof
\name{fixed.n}
\alias{fixed.n}
\title{Optimal second stage sampling fractions, subject to
	fixed sample sizes at the first and second stage}



\description{
Optimal second stage sampling fractions (and sample sizes) using
mean score method in \bold{logistic regression setting}, 
based on first-stage sample sizes and pilot second-stage data as input.

Optimality is with respect to the standard error of a coefficient 
of interest, specified in the call to the function. \cr

\bold{BACKGROUND} \cr

This function gives the optimal second stage sampling fractions
(and sample sizes) for applications where a first-stage sample
(of size n) has already been gathered and the size of sample to
be gathered at the second stage is also fixed. Such a situation
might arise where outcome data (Y) and some covariates (Z) are
available on a database, and it is decided to pursue additional
variables on a subsample of subjects, where the size of the 
subsample is determined by time/cost considerations (an example 
would be the testing of stored bloods for a new marker which has 
been discovered since an initial case-control study was done). \cr
Since the first-stage data is available, the count (or proportion)
of first-stage observations in each (Z,Y) stratum can be computed,
and one of these vectors must be provided in the call to the "fixed.n" 
function. \cr
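
If the first-stage data are held as plain vectors, the stratum counts can be tabulated directly. The sketch below uses made-up vectors \code{y1} and \code{z1} (they are not part of the package) and assumes the ordering shown in the Examples, with z varying fastest within each level of y. Always confirm the required order with \code{coding} before passing the result on.

```r
# Hypothetical first-stage data: binary outcome y1 and binary covariate z1
y1 <- c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0)
z1 <- c(0, 1, 0, 0, 0, 1, 1, 0, 0, 0)

# Rows index z and columns index y, so unrolling column by column
# gives z varying fastest within each level of y
counts <- table(z = z1, y = y1)
n1 <- as.vector(counts)

# The same information expressed as proportions
prev <- n1 / sum(n1)

print(counts)
print(n1)
print(prev)
```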
The optimal second-stage sampling fractions can also be found
for the situation where the first-stage data is NOT available
provided we specify the ratio of second stage sample size 
to first stage sample size (i.e. the overall sampling fraction
at the second stage), and estimates of prevalences of the
(Z,Y) strata in the population. However, this situation is 
likely to be rare compared to the first scenario above.\cr
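
For this second scenario, both inputs can be illustrated with the CASS stratum sizes quoted in the Examples. The sketch only prepares values suitable for \code{prev} and \code{frac}; it does not call \code{fixed.n}, and in practice the prevalences would be estimates and not known counts.

```r
# Stratum sizes quoted in the Examples, in the order (Y,Z) = (0,0), (0,1), (1,0), (1,1)
stratum.size <- c(6666, 1228, 144, 58)
n.first <- sum(stratum.size)

# Prevalence of each (Y,Z) stratum
prev <- stratum.size / n.first

# Overall second-stage sampling fraction for a second-stage sample of 1000
frac <- 1000 / n.first

print(round(prev, 4))
print(round(frac, 4))
```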

Before running the \code{fixed.n} function you should run the \code{coding}
function to see the order in which you must supply the vector of
first-stage sample sizes or prevalences (see \code{\link[twostage]{coding}} for details).
}

\usage{

fixed.n (x=x,y=y,z=z,factor=NULL,n2=n2,var="var",n1="option",prev="option",frac="option")

}

\arguments{

REQUIRED ARGUMENTS

\item{x}{matrix of predictor variables} 
\item{y}{response variable (binary 0-1)}
\item{z}{matrix of the first stage variables which must be categorical
	  (can be more than one column)}
\item{n2}{size of second stage sample}
\item{var}{The name of the predictor variable whose coefficient is to be optimised. 
	    See \bold{DETAILS} if this is a factor variable \cr	 
\bold{and one of the following:}}
\item{n1}{vector of the first stage sample sizes for each (y,z) stratum \cr

OR}

\item{prev}{vector of estimated prevalences for each 
    	  (y,z) stratum, AND}
\item{frac}{the second stage sampling fraction i.e., the ratio of second stage sample 
		size to first stage sample size 
(NOTE: if \code{prev} is given, \code{frac} will also be required) \cr


OPTIONAL ARGUMENTS}
\item{factor}{the names of any factor variables in the predictor matrix}
}

\value{

\bold{A list called \code{design} consisting of the following items:}

\item{ylevel}{the different levels of response variable}
\item{zlevel}{the different levels of first stage variables z.}
\item{n1}{the first stage sample size for each (\code{ylevel},\code{zlevel}) stratum}
\item{n2}{the sample size of pilot observations for each (\code{ylevel},\code{zlevel}) stratum}
\item{prop}{optimal 2nd stage sampling proportion for each (\code{ylevel},\code{zlevel}) stratum}
\item{samp.2nd}{optimal 2nd stage sample size for each (\code{ylevel},\code{zlevel}) stratum \cr

\bold{and a list called \code{se} containing:}}
	
\item{se}{the standard errors of estimates achieved by the optimal design.}
}

\details{

	The response, predictor and first stage variables 
	have to be numeric. If you have multiple columns of 
	z, say (z1,z2,..zn), these will be recoded into
      a single vector \code{new.z}. These \code{new.z} values are
	reported as \code{zlevel} in the output (see \code{value}).


	\tabular{rrrr}{
	z1 \tab z2 \tab z3 \tab new.z \cr
	0 \tab 0 \tab 0 \tab 1 \cr
	1 \tab	0 \tab 0 \tab 2 \cr
	0 \tab 1 \tab 0 \tab 3 \cr
	1 \tab 1 \tab 0 \tab 4 \cr
	0 \tab	0 \tab 1 \tab 5 \cr
	1 \tab 0 \tab 1 \tab 6 \cr
	0 \tab	1 \tab 1 \tab 7 \cr
	1 \tab 1 \tab 1 \tab 8 \cr
	}


	If some of the value combinations do not exist 
	in your data, the function will adjust accordingly. 
	For example if the combination (0,1,1) is absent,
	then (1,1,1) will be coded as 7. \cr

	If you wish to optimise the coefficient of a factor variable, 
      you need to specify which level of the variable to optimise. 
	For example, if "weight" is a factor variable with 3 categories
	1,2 and 3 then var="weight2" will optimise the estimation of the
	coefficient which measures the difference between weight=2 and
	the baseline (weight=1). By default the baseline is always 
	the category with the smallest value. \cr
}

\examples{
\dontrun{This example of computing second stage sampling fractions subject
to a fixed total second-stage sample size uses the CASS data 
(Reilly, 1996). Once the TWOSTAGE library has been attached,
this data can be made available by:
}
data(cass1)

\dontrun{and a detailed description of the data can be obtained by} 

help (cass1) 

\dontrun{In this example, we suppose that the CASS registry only has available
the mortality(Y) and sex(Z) for the 8096 "first-stage" subjects. The pilot
data consists of 25 observations from each (Y,Z) stratum, where the sizes of
the strata are (see Reilly 1996):
	Y	Z	N
	0	0	6666
	0	1	1228
	1	0	144
	1	1	58
We wish to use this pilot information to compute the optimal design to 
minimise the variance of the sex coefficient in a logistic model 
with Sex and Age as predictors. Assume that we wish to sample a total
of 1000 subjects at the second stage.

The following commands give the output below:
}

data(cass1)
y=cass1[,1]     #--- the response variable is mortality
z=cass1[,3]     #--- the auxiliary variable is sex
x=cass1[,2:3]   #--- the variables in the model are age and sex

# run CODING function to see in which order we should enter n1
coding(x=x,y=y,z=z)	
#supplying the first stage sample sizes
n1=c(6666, 1228, 144, 58)
 
# variable to be optimised (in our case sex)
fixed.n(x=x,y=y,z=z,n1=n1,var="sex",n2=1000)

\dontrun{
will give us the following output 
[1] "please run coding function to see the order in which you"
[1] "must supply the first-stage sample sizes or prevalences"
[1] " Type ?coding for details!"
[1] "For calls requiring n1 or prev as input, use the following order"
     ylevel z new.z n2
[1,]      0 0     0 25
[2,]      0 1     1 25
[3,]      1 0     0 25
[4,]      1 1     1 25
[1] "Check sample sizes/prevalences"
$design
     ylevel zlevel   n1 n2   prop samp.2nd
[1,]      0      0 6666 25 0.1128      752
[2,]      0      1 1228 25 0.0375       46
[3,]      1      0  144 25 1.0000      144
[4,]      1      1   58 25 1.0000       58

$se
                  [,1]
(Intercept) 0.55496070
age         0.00956422
sex         0.16472156
}
}

\seealso{
\code{\link[twostage]{ms.nprev}},\code{\link[twostage]{budget}},
\code{\link[twostage]{precision}},\code{\link[twostage]{cass1}},
\code{\link[twostage]{cass2}},\code{\link[twostage]{coding}}
}


\references{
	Reilly,M and M.S. Pepe. 1995. A mean score method for 
	missing and auxiliary covariate data in 
	regression models. \emph{Biometrika} \bold{82:}299-314 \cr

	Reilly,M. 1996. Optimal sampling strategies for 
		two-stage studies. \emph{Amer. J. Epidemiol.} 
		\bold{143:}92-100

}

\keyword{design}

\eof
\name{ms.nprev}
\alias{ms.nprev}
\title{Logistic regression of two-stage data using second stage sample 
       and first stage sample sizes or proportions (prevalences) as input}
	   

\description{Weighted logistic regression using the Mean Score method \cr

\bold{BACKGROUND}

This algorithm analyses the second-stage data from a two-stage
design, incorporating the first-stage sample sizes in each of the
strata defined by the first-stage variables as appropriate weights.
If the first-stage sample sizes are unknown, you can still get
estimates (but not standard errors) using estimated relative 
frequencies (prevalences) of the strata. To ensure that the sample
sizes or prevalences are provided in the correct order, it is 
advisable to first run the \code{\link[twostage]{coding}} function.
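
In the Mean Score method of Reilly and Pepe (1995) with categorical first-stage variables, each second-stage observation is weighted by the ratio of first-stage to second-stage sample size in its stratum. The sketch below computes those ratios for the CASS pilot data used in the Examples. That \code{ms.nprev} forms its weights in exactly this way is an assumption, not something taken from the package source.

```r
# First-stage and second-stage (pilot) sample sizes per (Y,Z) stratum,
# as listed in the Examples
n1 <- c(6666, 1228, 144, 58)
n2 <- c(25, 25, 25, 25)

# Assumed weight carried by each pilot observation in the stratum
w <- n1 / n2
print(w)
```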
}

\usage{

ms.nprev(x=x,y=y,z=z,n1="option",prev="option",factor=NULL,print.all=FALSE)
}

\arguments{

REQUIRED ARGUMENTS
\item{x}{matrix of predictor variables for regression model} 
\item{y}{response variable (should be binary 0-1)}
\item{z}{matrix of any surrogate or auxiliary variables which must be categorical, \cr 

and one of the following:}
\item{n1}{vector of the first stage sample sizes 
 for each (y,z) stratum: must be provided
 in the correct order (see \code{\link[twostage]{coding}} function) \cr
OR}

\item{prev}{vector of the first-stage or population
 	  proportions (prevalences) for each (y,z) stratum:
          must be provided in the correct order 
          (see \code{\link[twostage]{coding}} function) \cr 
	  

OPTIONAL ARGUMENTS}

\item{print.all}{logical value; if \code{TRUE}, the additional output
		 listed under \bold{VALUE} is also returned. 
		 The default is \code{FALSE}.} 
\item{factor}{factor variables; if the columns of the matrix of
	  predictor variables have names, supply these names, 
	  otherwise supply the column numbers. MS.NPREV will fit 
	  separate coefficients for each level of the factor variables.}

}

\value{

If called with \code{prev} will return only:

	  \bold{A list called \code{table} containing the following:}

\item{ylevel}{the distinct values (or levels) of y}
\item{zlevel}{the distinct values (or levels) of z}
\item{prev}{the prevalences for each (y,z) stratum}
\item{n2}{the sample sizes at the second stage in each stratum 
	  defined by (y,z) \cr

	  \bold{and a list called \code{parameters} containing:}}

\item{est}{the Mean score estimates of the coefficients in the
	  logistic regression model \cr \cr
	
If called with \code{n1} it will return:

	  \bold{a list called \code{table} containing:}}

\item{ylevel}{the distinct values (or levels) of y}
\item{zlevel}{the distinct values (or levels) of z}
\item{n1}{the sample size at the first stage in each (y,z) stratum}
\item{n2}{the sample sizes at the second stage in each stratum 
	  defined by (y,z) \cr

	  \bold{and a list called \code{parameters} containing:}}

\item{est}{the Mean score estimates of the coefficients in the
	  logistic regression model}	
\item{se}{the standard errors of the Mean Score estimates}
\item{z}{Wald statistic for each coefficient}
\item{pvalue}{2-sided p-value (H0: coeff=0) \cr \cr

If \code{print.all=TRUE}, the following lists will also be returned:}

\item{Wzy}{the weight matrix used by the mean score algorithm,
	   for each Y,Z stratum: this will be in the same order 
	   as n1 and prev} 	
\item{varsi}{the variance of the score in each Y,Z stratum}
\item{Ihat}{the Fisher information matrix} 
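
The \code{z} and \code{pvalue} entries follow the usual Wald construction. The sketch below reproduces them from the \code{est} and \code{se} columns printed in the Examples, assuming a standard normal reference distribution. That assumption is an inference from the printed numbers, not a statement about the package internals.

```r
# Estimates and standard errors copied from the example output
est <- c(-5.06286163, 0.02166536, 0.67381300)
se  <- c(1.46495235, 0.02584049, 0.21807878)
names(est) <- names(se) <- c("(Intercept)", "age", "sex")

# Wald statistic and two-sided p-value under a standard normal reference
z <- est / se
pvalue <- 2 * pnorm(-abs(z))

print(cbind(est, se, z, pvalue))
```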
}		   
               
\details{

	The response, predictor and surrogate variables 
	have to be numeric. If you have multiple columns of 
	z, say (z1,z2,..zn), these will be recoded into
      a single vector \code{new.z}

\tabular{rrrr}{
	z1 \tab z2 \tab z3 \tab new.z \cr
	0 \tab 0 \tab 0 \tab 1 \cr
	1 \tab	0 \tab 0 \tab 2 \cr
	0 \tab 1 \tab 0 \tab 3 \cr
	1 \tab 1 \tab 0 \tab 4 \cr
	0 \tab	0 \tab 1 \tab 5 \cr
	1 \tab 0 \tab 1 \tab 6 \cr
	0 \tab	1 \tab 1 \tab 7 \cr
	1 \tab 1 \tab 1 \tab 8 \cr
	}

	If some of the value combinations do not exist 
	in your data, the function will adjust accordingly. 
	For example if the combination (0,1,1) is absent,
	then (1,1,1) will be coded as 7.
}

\examples{
\dontrun{As an illustrative example, we use the CASS pilot data, "cass1",
from Reilly (1996).

Use }
data(cass1)        #to load the data
\dontrun{and}
help(cass1)        #for details


\dontrun{The first-stage sample sizes are:

	Y	Z	n
	0	0	6666
	0	1	1228
	1	0	144
	1	1	58	

An analysis of the pilot data using Mean Score}

# supply the first stage sample sizes in the correct order
n1=c(6666, 1228, 144, 58)
ms.nprev(y=cass1[,1], x=cass1[,2:3],z=cass1[,3],n1=n1)

\dontrun{gives the results:

[1] "please run coding function to see the order in which you"
[1] "must supply the first-stage sample sizes or prevalences"
[1] " Type ?coding for details!"
[1] "For calls requiring n1 or prev as input, use the following order"
     ylevel z new.z n2
[1,]      0 0     0 25
[2,]      0 1     1 25
[3,]      1 0     0 25
[4,]      1 1     1 25
[1] "Check sample sizes/prevalences"
$table
     ylevel zlevel   n1 n2
[1,]      0      0 6666 25
[2,]      0      1 1228 25
[3,]      1      0  144 25
[4,]      1      1   58 25

$parameters
                    est         se         z       pvalue
(Intercept) -5.06286163 1.46495235 -3.455991 0.0005482743
age          0.02166536 0.02584049  0.838427 0.4017909402
sex          0.67381300 0.21807878  3.089769 0.0020031236
}

\dontrun{Note that the Mean Score algorithm produces smaller 
standard errors of estimates than the complete-case
analysis, due to the additional information in the
incomplete cases.}
}

\seealso{
\code{\link[twostage]{fixed.n}},\code{\link[twostage]{budget}},\code{\link[twostage]{precision}},
\code{\link[twostage]{coding}},\code{\link[twostage]{cass1}},\code{\link[twostage]{cass2}}
}

\references{
	Reilly,M and M.S. Pepe. 1995. A mean score method for 
	missing and auxiliary covariate data in 
	regression models. \emph{Biometrika} \bold{82:}299-314 \cr

	Reilly,M. 1996. Optimal sampling strategies for 
		two-stage studies. \emph{Amer. J. Epidemiol.} 
		\bold{143:}92-100

}
	
\keyword{regression}

	

\eof
\name{precision}
\alias{precision}
\title{Optimal sampling design for 2-stage studies with fixed precision}

\description{
Optimal design for a two-stage study with a fixed variance
of estimates, using the Mean Score method \cr

\bold{BACKGROUND} \cr

This function calculates the total number of study observations
and the second-stage sampling fractions that will minimise the
study cost subject to a fixed variance for a specified coefficient. 
The user must also supply the unit cost of observations at the
first and second stage, and the vector of prevalences in each 
of the strata defined by the levels of the dependent variable 
and the first-stage covariates.
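
Because \code{prc} is a variance, a target expressed as a standard error has to be squared first. As a small illustration, the value 0.0472 used in the Examples is approximately the square of the standard error for \code{sex} reported in the example output of \code{\link[twostage]{budget}}.

```r
# Standard error for sex copied from the example output of budget()
se.target <- 0.217235702

# The corresponding variance, suitable as a value for prc
prc <- se.target^2
print(round(prc, 4))
```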

Before running this function you should run the \code{coding} function, 
to see in which order you must supply the vector of
prevalences. For details, type help(\code{\link[twostage]{coding}})
}

\usage{

precision (x=x,y=y,z=z,factor=NULL,var="var",prev=prev,prc=prc,c1=c1,c2=c2)
}

\arguments{

REQUIRED ARGUMENTS
\item{x}{matrix of predictor variables} 
\item{y}{response variable (binary 0-1)}
\item{z}{matrix of the first stage variables which must be categorical
	 (can be more than one column)}
\item{prev}{vector of estimated prevalences for each (y,z) stratum}
\item{var}{the name of the predictor variable whose coefficient is to be optimised. 
	    See \bold{DETAILS} if this is a factor variable}	 
\item{prc}{the fixed variance of \code{var} coefficient}
\item{c1}{the cost per first stage observation}
\item{c2}{the cost per second stage observation \cr

OPTIONAL ARGUMENTS}
\item{factor}{the names of any factor variables in the predictor matrix}
}

\value{

\bold{The following lists will be returned:}

\item{n}{the optimal number of observations (first stage sample size)}

\item{var}{the variance of estimates achieved by the optimal design}

\item{cost}{the minimum study cost \cr


\bold{and a list called \code{design} consisting of the following items:}}

\item{ylevel}{the different levels of response variable}
\item{zlevel}{the different levels of first stage covariates z.}
\item{prev}{the prevalence of each (\code{ylevel},\code{zlevel}) stratum}
\item{n2}{the sample size of pilot observations for each (\code{ylevel},\code{zlevel}) stratum}
\item{prop}{optimal 2nd stage sampling proportion for each (\code{ylevel},\code{zlevel}) stratum}
\item{samp.2nd}{optimal 2nd stage sample size for each (\code{ylevel},\code{zlevel}) stratum}

}

\details{
	The response, predictor and first stage variables 
	have to be numeric. If you have multiple columns of 
	z, say (z1,z2,..zn), these will be recoded into
        a single vector \code{new.z} 

	\tabular{rrrr}{
	z1 \tab z2 \tab z3 \tab new.z \cr
	0 \tab 0 \tab 0 \tab 1 \cr
	1 \tab	0 \tab 0 \tab 2 \cr
	0 \tab 1 \tab 0 \tab 3 \cr
	1 \tab 1 \tab 0 \tab 4 \cr
	0 \tab	0 \tab 1 \tab 5 \cr
	1 \tab 0 \tab 1 \tab 6 \cr
	0 \tab	1 \tab 1 \tab 7 \cr
	1 \tab 1 \tab 1 \tab 8 \cr
	}


	If some of the value combinations do not exist 
	in your data, the function will adjust accordingly. 
	For example if the combination (0,1,1) is absent,
	then (1,1,1) will be coded as 7. \cr

	If you wish to optimise the coefficient of a factor variable, 
      you need to specify which level of the variable to optimise. 
	For example, if "weight" is a factor variable with 3 categories
	1,2 and 3 then var="weight2" will optimise the estimation of the
	coefficient which measures the difference between weight=2 and
	the baseline (weight=1). By default the baseline is always 
	the category with the smallest value. \cr
	}

	
\examples{

\dontrun{This example uses the same CASS dataset (cass2) which is used
in the example of the "budget" function. The data are in the
cass2 matrix, which can be loaded using}

data(cass2)
\dontrun{and a description of the dataset can be seen using}

help(cass2)

\dontrun{In our example below, we use sex and weight as auxiliary variables. 
The commands below will calculate the sampling design that minimises
the study cost subject to achieving a variance of 0.0472 for the
coefficient of SEX. We assume a first-stage cost of 1/unit
and a second-stage cost of 0.5/unit.}

data(cass2) 
y=cass2[,1]	             #response variable
z=cass2[,10]             #auxiliary variable
x=cass2[,c(2,4:9)]       #predictor variables in the model


# run CODING function to see in which order we should enter prevalences
coding(x=x,y=y,z=z)	

# supply the prevalences (from Table 5, Reilly 1996)
prev=c(0.0197823937,0.1339020772,0.6698813056,0.0544015826,
       0.0503214639,0.0467359050,0.0009891197,0.0040801187,0.0127349159,
       0.0022255193,0.0032146390,0.0017309594)

# optimise SEX coefficient
precision(x=x,y=y,z=z,var="sex",prev=prev,prc=0.0472,c1=1,c2=0.5)

\dontrun{This will give us the following output:

[1] "please run coding function to see the order in which you"
[1] "must supply the first-stage sample sizes or prevalences"
[1] " Type ?coding for details!"
[1] "For calls requiring n1 or prev as input, use the following order"
      ylevel z new.z n2
 [1,]      0 1     1 10
 [2,]      0 2     2 10
 [3,]      0 3     3 10
 [4,]      0 4     4 10
 [5,]      0 5     5 10
 [6,]      0 6     6 10
 [7,]      1 1     1  8
 [8,]      1 2     2 10
 [9,]      1 3     3 10
[10,]      1 4     4 10
[11,]      1 5     5 10
[12,]      1 6     6 10
[1] "Check sample sizes/prevalences"
$n
[1] 9165

$design
      ylevel zlevel         prev n2   prop samp.2nd
 [1,]      0      1 0.0197823937 10 0.5230       95
 [2,]      0      2 0.1339020772 10 0.2841      349
 [3,]      0      3 0.6698813056 10 0.0726      446
 [4,]      0      4 0.0544015826 10 0.4488      224
 [5,]      0      5 0.0503214639 10 0.2480      114
 [6,]      0      6 0.0467359050 10 0.4922      211
 [7,]      1      1 0.0009891197  8 1.0000        9
 [8,]      1      2 0.0040801187 10 1.0000       37
 [9,]      1      3 0.0127349159 10 1.0000      117
[10,]      1      4 0.0022255193 10 1.0000       20
[11,]      1      5 0.0032146390 10 1.0000       29
[12,]      1      6 0.0017309594 10 1.0000       16

$cost
[1] 9998

$var
                    [,1]
(Intercept) 1.424664e+00
sex         4.719827e-02
weight      4.514397e-05
age         2.128650e-04
angina      6.044365e-02
chf         5.935923e-03
lve         1.014436e-04
surg        3.236426e-02 

CHECK: 
Note that the minimum cost obtained (9998) is essentially the same as our
budget in the fixed budget problem (10,000), and all the solutions are 
the same except for rounding error. 
}
}

\seealso{
\code{\link[twostage]{ms.nprev}},\code{\link[twostage]{fixed.n}},
\code{\link[twostage]{budget}},\code{\link[twostage]{cass1}},
\code{\link[twostage]{cass2}},\code{\link[twostage]{coding}}
}


\references{
	Reilly,M and M.S. Pepe. 1995. A mean score method for 
	missing and auxiliary covariate data in 
	regression models. \emph{Biometrika} \bold{82:}299-314 \cr

	Reilly,M. 1996. Optimal sampling strategies for 
		two-stage studies. \emph{Amer. J. Epidemiol.} 
		\bold{143:}92-100

}

\keyword{design}

\eof
