MasterBayes           package:MasterBayes           R Documentation

_M_a_x_i_m_u_m _L_i_k_e_l_i_h_o_o_d _a_n_d _M_a_r_k_o_v _c_h_a_i_n _M_o_n_t_e _C_a_r_l_o _m_e_t_h_o_d_s _f_o_r _P_e_d_i_g_r_e_e _R_e_c_o_n_s_t_r_u_c_t_i_o_n, _A_n_a_l_y_s_i_s _a_n_d _S_i_m_u_l_a_t_i_o_n.

_D_e_s_c_r_i_p_t_i_o_n:

     The primary aim of MasterBayes is to use MCMC techniques to
     integrate over uncertainty in pedigree configurations estimated
     from molecular markers and phenotypic data.  Emphasis is put on
     the marginal distribution of parameters that relate the phenotypic
     data to the pedigree.  All simulation is done in compiled C++
     using the Scythe Statistical Library. More detailed information
     can be found in 'vignette("MasterBayes.Tutorial")'.

_D_e_t_a_i_l_s:

     The motivation behind the package is to approximate the following
     probability distribution using Markov chain Monte Carlo
     techniques:


                            p(beta | G, y)


     where beta is the vector of parameters of primary interest, G are
     the genetic data and y are the phenotypic data. Generally, it is not
     possible to simulate from the posterior distribution of beta when
     the problem is in this form and so I augment the parameter space
     with the pedigree, P:


                      int_P  p(beta, P | G, y)dP


     This simplifies the problem because the likelihood can be
     expressed more simply:


              L(G, y | beta, P) = L(G | P)L(y | P, beta)


     This simplification rests on the assumption that the genetic and
     non-genetic data are independent after conditioning on the
     pedigree.  This will generally be true when markers are not linked
     to QTLs.  The first likelihood, L(G | P), is easily calculated
     for arbitrary pedigrees using the Elston-Stewart algorithm
     (Elston & Stewart, 1971), and is based around the Mendelian
     transition probability. The second likelihood is obtained by
     fitting the multinomial log-linear model:


                 L(y | P, beta) = p(P |y, beta)p(P).


     Assuming that all possible pedigrees have equal prior
     probability, and that offspring are independently distributed
     after conditioning on the predictor variables:


     L(y | P, beta) = prod_i^no [ e^(X_i,p_i beta) / sum_j^np e^(X_i,j beta) ]


     where X_i,j denotes the jth row of offspring i's design matrix
     formed from the phenotypic data y. Each row of the design matrix
     corresponds to a parental combination. no and np denote the number
     of offspring and the number of potential parental combinations,
     respectively. p_i denotes the actual parents of individual i
     (Smouse _et al._, 1999).
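
     As a rough illustration of the multinomial log-linear likelihood
     above, the following sketch evaluates its logarithm for a toy data
     set. The function name 'parentage_loglik' and its arguments are
     hypothetical and are not part of the MasterBayes interface; the
     sketch also assumes that the pedigree (the vector 'parents') is
     known, whereas the package integrates over it.

```r
# Log-likelihood of the multinomial log-linear model for parentage.
# X_list:  list with one design matrix per offspring (rows = candidate
#          parental combinations, columns = predictors); hypothetical input.
# parents: integer vector; parents[i] is the row of X_list[[i]] holding
#          the assigned parental combination p_i.
# beta:    numeric vector of regression parameters.
parentage_loglik <- function(X_list, parents, beta) {
  stopifnot(length(X_list) == length(parents))
  per_offspring <- vapply(seq_along(X_list), function(i) {
    eta <- drop(X_list[[i]] %*% beta)   # linear predictor for each row
    m <- max(eta)                       # stabilise the log-sum-exp
    eta[parents[i]] - (m + log(sum(exp(eta - m))))
  }, numeric(1))
  sum(per_offspring)                    # offspring are treated as independent
}

# Toy data: two offspring, three candidate parental combinations each,
# and a single predictor.
X_list <- list(matrix(c(0, 1, 2), ncol = 1),
               matrix(c(1, 1, 3), ncol = 1))
ll_zero <- parentage_loglik(X_list, parents = c(1, 2), beta = 0)
ll_neg  <- parentage_loglik(X_list, parents = c(1, 2), beta = -1)
```

     With beta = 0 every parental combination is equally likely, so
     each offspring contributes log(1/np) to the total.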

     The likelihood is taken after averaging over the probability
     distribution of the pedigree, P:


                            p(P|G,y,beta).


     Most other techniques approximate this distribution as p(P|G), and
     even then tend to use the mode rather than the complete
     distribution, leading to inferential problems (see the information
     boxes in Hadfield et al. 2006).

     Unfortunately, genotype data are rarely observed without error
     and the parents of some offspring may not be sampled.  I model
     allelic dropout and stochastic genotyping errors according to a
     model proposed by Wang (2004) when the genetic markers are
     codominant. When the markers are dominant I model the
     probabilities of a dominant allele being mis-scored as a recessive
     and _vice versa_. Denoting the parameters associated with these
     two forms of genotyping error as E1 and E2, and the vector of
     parental allele frequencies as w, two solutions are implemented.

     An exact solution:


 int_P int_G int_E1 int_E2 int_w p(beta, P, G, E1, E2, w | G_obs, y) dP dG dE1 dE2 dw


     where the posterior probability distributions of the error rates,
     the allele frequencies and the true unobserved genotypes, G, are
     estimated and integrated out.  The conditional distribution of the
     true genotypes in the exact form is given by:


                    p(G_obs | G,E1,E2)p(G | P, w).


     The second solution is an approximation to the above equation, and
     uses point estimates for w, E1 and E2. The conditional
     distribution of G is derived ignoring the information present in
     P:


                      p(G_obs | G,E1,E2)p(G | w)
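
     To make the approximate form concrete, the sketch below normalises
     p(G_obs | G, e) p(G | w) over the three genotypes of a single
     biallelic codominant locus. It rests on two assumptions that are
     not taken from the package: p(G | w) is given by Hardy-Weinberg
     proportions, and the error model is a deliberately simple
     placeholder with one error rate 'e' rather than the E1 and E2 model
     described above. The name 'genotype_posterior' is hypothetical.

```r
# Conditional distribution of the true genotype at one biallelic codominant
# locus, ignoring the pedigree: proportional to p(G_obs | G, e) p(G | w).
# Placeholder error model (NOT the model used by the package): the observed
# genotype is correct with probability 1 - e and is otherwise one of the
# two other genotypes, chosen with equal probability.
genotype_posterior <- function(g_obs, w, e) {
  genotypes <- c("AA", "AB", "BB")
  stopifnot(g_obs %in% genotypes, length(w) == 2, abs(sum(w) - 1) < 1e-8)
  prior <- c(w[1]^2, 2 * w[1] * w[2], w[2]^2)      # assumed Hardy-Weinberg p(G | w)
  lik <- ifelse(genotypes == g_obs, 1 - e, e / 2)  # placeholder p(G_obs | G, e)
  post <- prior * lik
  setNames(post / sum(post), genotypes)
}

# Observed heterozygote at a locus where allele A has frequency 0.9
post <- genotype_posterior("AB", w = c(0.9, 0.1), e = 0.05)
```

     Even with a 5% error rate the observed heterozygote remains the
     most probable true genotype here, but some probability shifts to
     the common homozygote because its prior frequency is high.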


     The approximation can be derived analytically, whereas the exact
     solution requires the Markov chain to be augmented with the true
     genotypes of all individuals.  The exact solution is therefore
     computationally intensive, but the approximation breaks down for
     dominant markers and for models in which the number of unsampled
     males and/or females is to be estimated.

     Unsampled parents are dealt with, and their number estimated,
     using an approximation originally due to Nielsen _et al._ (2001).
     An exact solution to the problem has been proposed by Emery
     _et al._ (2001) but becomes impractical as the number of
     unsampled parents gets large. Nielsen's approximation is based
     around the Mendelian transition probability when a parental
     genotype is unknown.  This probability is derived using estimates
     of the allele frequencies at that locus and the assumption of
     Hardy-Weinberg equilibrium.
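
     The transition probability with one unknown parent can be sketched
     as follows for a single codominant locus. This is only the
     standard Mendelian calculation with the paternal allele drawn from
     the population allele frequencies; it ignores genotyping error and
     is not the full approximation used to estimate the number of
     unsampled parents. The function name and arguments are
     hypothetical and not part of the MasterBayes interface.

```r
# Probability of an offspring genotype at one codominant locus when the
# mother's genotype is known and the father is unsampled. Under
# Hardy-Weinberg equilibrium the paternal allele is a random draw from
# the population allele frequencies 'freq' (a named numeric vector).
transition_unknown_father <- function(offspring, mother, freq) {
  stopifnot(length(offspring) == 2, length(mother) == 2)
  a <- offspring[1]
  b <- offspring[2]
  prob <- 0
  for (m in mother) {   # each maternal allele is transmitted with probability 1/2
    if (a == b) {
      if (m == a) prob <- prob + 0.5 * freq[[a]]
    } else {
      if (m == a) prob <- prob + 0.5 * freq[[b]]
      if (m == b) prob <- prob + 0.5 * freq[[a]]
    }
  }
  prob
}

freq <- c(A = 0.5, B = 0.3, C = 0.2)
p_AB <- transition_unknown_father(c("A", "B"), mother = c("A", "B"), freq)

# The probabilities over all unordered offspring genotypes sum to one.
all_genotypes <- list(c("A", "A"), c("A", "B"), c("A", "C"),
                      c("B", "B"), c("B", "C"), c("C", "C"))
p_all <- vapply(all_genotypes, transition_unknown_father, numeric(1),
                mother = c("A", "B"), freq = freq)
```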

     I deal with the fact that unsampled individuals have missing
     phenotype data by approximating the distribution of the sum of
     linear predictors across unsampled parents.  This approximation
     relies on the assumption that the unsampled individuals come from
     the same statistical population as sampled individuals, and that
     population sizes are large enough for the distribution of the
     sum to tend to a normal distribution under the central limit
     theorem.

     Taking n and N as the number of sampled individuals and the total
     number of individuals in the population, respectively:


     p(sum(p_miss) | p_obs) = N((N-n) mean(p_obs), N (N-n) S^2 / n)


     where p_miss and p_obs are the vectors of linear predictors for
     the unsampled and sampled individuals, respectively (Gelman
     _et al._, 2004). S^2 is the sample variance of the observed linear
     predictors.
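
     A minimal sketch of this normal approximation is given below. It
     reads the mean as (N - n) mean(p_obs) and the variance as
     N (N - n) S^2 / n, which is an interpretation of the expression
     above that agrees with the standard finite-population result. The
     function name 'approx_missing_sum' and the toy values are
     hypothetical.

```r
# Normal approximation to the sum of linear predictors over unsampled
# individuals, given the linear predictors of the sampled ones.
# eta_obs: linear predictors of the n sampled individuals (hypothetical input).
# N:       assumed total number of individuals in the population.
approx_missing_sum <- function(eta_obs, N) {
  n <- length(eta_obs)
  stopifnot(n >= 2, N > n)
  list(mean = (N - n) * mean(eta_obs),          # (N - n) mean(p_obs)
       var  = N * (N - n) * var(eta_obs) / n)   # N (N - n) S^2 / n
}

eta_obs <- c(-0.5, 0.2, 0.1, 0.8, -0.1)
res <- approx_missing_sum(eta_obs, N = 20)

# One draw of the summed linear predictor for the 15 unsampled individuals
set.seed(1)
draw <- rnorm(1, mean = res$mean, sd = sqrt(res$var))
```

     The variance grows with the number of unsampled individuals, N - n,
     and shrinks as the sample size n increases.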

_A_u_t_h_o_r(_s):

     Jarrod Hadfield j.hadfield@sheffield.ac.uk

_R_e_f_e_r_e_n_c_e_s:

     Elston, R. C. & Stewart, J. (1971) Human Heredity 21 523-542

     Emery, A. M. _et al._ (2001) Molecular Ecology 10 1265-1278

     Gelman, A. _et al._ (2004) Bayesian Data Analysis _Edition II_
     Chapman and Hall

     Hadfield, J. D. _et al._ (2006) Molecular Ecology 15 3715-31

     Nielsen, R. _et al._ (2001) Genetics 157 4 1673-1682

     Smouse, P. E. _et al._ (1999) Journal of Evolutionary Biology 12
     1069-1077

     Wang, J. L. (2004) Genetics 166 4 1963-1979

_S_e_e _A_l_s_o:

     'MCMCped'

