PreEM                package:hapassoc                R Documentation

_P_r_e-_p_r_o_c_e_s_s _t_h_e _d_a_t_a _b_e_f_o_r_e _f_i_t_t_i_n_g _i_t _w_i_t_h _E_M

_D_e_s_c_r_i_p_t_i_o_n:

     This function takes as an argument the original dataframe with
     non-SNP and SNP data and converts the genotype data at single SNPs
     (the single-locus genotypes) into haplotype data. The rows of the
     original data frame should correspond to subjects and each SNP
     should have two columns, one for each allele of the single-locus
     genotype. The SNP data should comprise the last 2*numSNPs columns.
     If the haplotypes for a subject cannot be inferred from his or
     her genotype data, "pseudo-individuals" representing all possible
     haplotype combinations consistent with the single-locus genotypes
     are considered. Missing single-locus genotypes, up to a maximum of
     maxMissingGenos (see below), are allowed, but subjects with more
     than maxMissingGenos missing single-locus genotypes, or with
     missing non-SNP data, are removed. Initial estimates of haplotype
     frequencies are then obtained using the EM algorithm applied to
     the multilocus genotype data. Haplotypes with frequencies below a
     user-specified tolerance (zero.tol) are assumed not to exist and
     are removed from further consideration. (Pseudo-individuals having
     haplotypes of negligible frequency are deleted, and the column in
     the design matrix corresponding to each such haplotype is
     deleted.) Of the remaining haplotypes, those with frequency below
     a user-defined pooling tolerance (pooling.tol) are pooled into a
     single category called "pooled" in the design matrix for the risk
     model. However, the frequency of each pooled haplotype is still
     calculated separately.
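
     The construction of pseudo-individuals can be illustrated with a
     minimal base-R sketch. It is not the code used inside PreEM: the
     helper name enumerate_pairs, the 0/1 allele coding, and the
     one-row-per-SNP input layout are assumptions made for this
     illustration only. It lists every haplotype pair consistent with
     one subject's unphased single-locus genotypes.

```r
# Enumerate the haplotype pairs consistent with one subject's unphased
# genotypes. 'geno' is a 2-column matrix with one row per SNP and
# alleles coded 0/1 (an assumed coding, chosen for this sketch).
enumerate_pairs <- function(geno) {
  het <- which(geno[, 1] != geno[, 2])   # heterozygous SNPs
  h1 <- geno[, 1]
  h2 <- geno[, 2]
  if (length(het) <= 1) {
    # Zero or one heterozygous SNP: the phase is unambiguous
    return(data.frame(hap1 = paste(h1, collapse = ""),
                      hap2 = paste(h2, collapse = ""),
                      stringsAsFactors = FALSE))
  }
  # Fix the phase at the first heterozygous SNP, then try both
  # orientations at each of the remaining heterozygous SNPs.
  flips <- expand.grid(rep(list(c(FALSE, TRUE)), length(het) - 1))
  pairs <- lapply(seq_len(nrow(flips)), function(i) {
    a <- h1
    b <- h2
    swap <- het[-1][unlist(flips[i, ])]
    tmp <- a[swap]
    a[swap] <- b[swap]
    b[swap] <- tmp
    data.frame(hap1 = paste(a, collapse = ""),
               hap2 = paste(b, collapse = ""),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

# A subject heterozygous at SNPs 1 and 2 and homozygous at SNP 3
# yields two candidate pairs: 001/111 and 011/101.
enumerate_pairs(cbind(c(0, 0, 1), c(1, 1, 1)))
```

     A subject with k heterozygous SNPs has 2^(k-1) candidate pairs for
     k >= 1, which is why only subjects with two or more heterozygous
     SNPs (or with missing genotypes) need more than one row.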

_U_s_a_g_e:

     PreEM(dat, numSNPs, maxMissingGenos = 1, pooling.tol = 0.05,
           zero.tol = 1/(2 * nrow(dat) * 10))

_A_r_g_u_m_e_n_t_s:

     dat: the non-SNP and SNP data as a data frame. The SNP data should
          comprise the last 2*numSNPs columns.

 numSNPs: number of SNPs per haplotype

maxMissingGenos: maximum number of single-locus genotypes with missing
          data to allow for each subject. (Subjects with more missing
          data, or with missing non-SNP data, are removed.) The default
          is 1.

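pooling.tol: pooling tolerance; haplotypes with non-negligible
          frequency below this value are pooled into a single category
          in the design matrix. The default is 0.05.

zero.tol: tolerance for haplotype frequencies below which haplotypes
          are assumed not to exist. The default is 1/(2*N*10), where N
          is the number of subjects.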
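
     How the two tolerances partition a set of haplotype frequencies
     can be sketched in base R. The frequencies are copied from the
     Examples output on this page; N = 100 is a hypothetical number of
     subjects chosen for illustration, and the variable status is not
     part of the package.

```r
# Initial frequencies copied from the Examples output on this page
freqs <- c(h000 = 0.25179111, h001 = 0.26050418, h010 = 0.23606001,
           h011 = 0.09164470, h100 = 0.10133627, h101 = 0.02636844,
           h110 = 0.01081260, h111 = 0.02148268)

N <- 100                       # hypothetical number of subjects
zero.tol <- 1 / (2 * N * 10)   # documented default, here 0.0005
pooling.tol <- 0.05            # documented default

# Below zero.tol: assumed not to exist; otherwise below pooling.tol:
# pooled; everything else keeps its own design-matrix column.
status <- ifelse(freqs < zero.tol, "removed",
                 ifelse(freqs < pooling.tol, "pooled", "kept"))

names(freqs)[status == "pooled"]
# "h101" "h110" "h111"
```

     Under these assumptions no haplotype falls below zero.tol, and the
     pooled set agrees with the pooledHaplos shown in the Examples.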

_V_a_l_u_e:

haplotest: logical; TRUE if some haplotypes were pooled in the risk
          model

initGamma: initial estimates of haplotype frequencies

zeroFreqHaplos: list of haplotypes assumed not to exist

pooledHaplos: list of haplotypes pooled into a single category in the
          design matrix

nonHaploDM: non-haplotype portion of the AUGMENTED data frame

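 haploDM: data frame with up to 2^numSNPs columns scoring number of
          copies of each haplotype for each pseudo-individual. There
          are fewer columns when haplotypes are removed or pooled (the
          pooled haplotypes share a single "pooled" column)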

haploMat: matrix with 2 columns giving haplotypes for each
          pseudo-individual

      wt: vector giving initial weights for each pseudo-individual for 
          the EM algorithm

      ID: index for each individual in the original data frame. Note
          that all pseudo-individuals derived from the same individual
          share the same ID value

 unknown: vector indicating whether the haplotype information was
          missing for each row in the augmented data
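
     The relationship between a two-column haplotype matrix and a
     copy-count design matrix can be sketched in base R. The toy data,
     the column names haplo1 and haplo2, and the object names are
     hypothetical; they are not taken from the objects PreEM actually
     returns.

```r
# Toy haplotype pairs for three pseudo-individuals (hypothetical data)
haploMat <- data.frame(haplo1 = c("h000", "h001", "h011"),
                       haplo2 = c("h001", "h001", "h101"),
                       stringsAsFactors = FALSE)

# One column per distinct haplotype, scoring 0, 1 or 2 copies per row
haplos <- sort(unique(unlist(haploMat)))
haploDM <- sapply(haplos, function(h)
  (haploMat$haplo1 == h) + (haploMat$haplo2 == h))

haploDM
rowSums(haploDM)   # every pseudo-individual carries two haplotypes
```

     The second row scores 2 in the h001 column because that
     pseudo-individual carries two copies of the same haplotype.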

_S_e_e _A_l_s_o:

     'EM','summary.EM'.

_E_x_a_m_p_l_e_s:

     data(hypoDat)
     example.preEM<-PreEM(hypoDat, numSNPs=3)

     # To get the initial haplotype frequencies:
     example.preEM$initGamma
     #      h000       h001       h010       h011       h100       h101       h110 
     #0.25179111 0.26050418 0.23606001 0.09164470 0.10133627 0.02636844 0.01081260 
     #      h111 
     #0.02148268 
     # The '001' haplotype is estimated to be the most frequent

     example.preEM$pooledHaplos
     # "h101" "h110" "h111"
     # These haplotypes are to be pooled in the design matrix for the risk model

     names(example.preEM$haploDM)
     # "h000"   "h001"   "h010"   "h011"   "h100"   "pooled"

