| keremRand {SHARE} | R Documentation |
This dataset contains the pseudo-subjects created from the cystic fibrosis data in Kerem et al. (1989).
data(keremRand)
Here is the list of the 23 loci:
locus_01, locus_02, locus_03, locus_04, locus_05, locus_06, locus_07, locus_08, locus_09, locus_10, locus_11, locus_12, locus_13, locus_14, locus_15, locus_16, locus_17, locus_18, locus_19, locus_20, locus_21, locus_22, locus_23
The SHARE algorithm requires subject-level information, i.e., it needs to know the haplotype/genotype sequences of every subject in both the case and control groups. However, the original data in Kerem et al. (1989) provide only sequence-level information, meaning that we know only which group (case/control) each haplotype sequence belongs to. To demonstrate the SHARE algorithm, we therefore simulate subject-level information: two haplotypes with the same clinical status (having cystic fibrosis or not) are randomly paired to form a pseudo-subject with that status.
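The pairing idea can be illustrated on a small made-up example. This is only a sketch: the objects toySeq, toyStatus, pairedSeq, and pairedStatus are hypothetical names and the data are random, not the Kerem data. Haplotypes are shuffled within each status group, and consecutive rows are then treated as the two sequences of one pseudo-subject.

```r
# Toy data (made up for illustration): 8 haplotypes over 3 loci, coded 1/2.
set.seed(1)
toySeq <- matrix(sample(1:2, 8 * 3, replace = TRUE), nrow = 8,
                 dimnames = list(NULL, paste0("locus_", 1:3)))
# TRUE = haplotype from a case, FALSE = haplotype from a control.
toyStatus <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

# Shuffle haplotypes within each status group: cases first, then controls.
caseIdx <- which(toyStatus)
ctrlIdx <- which(!toyStatus)
shuffled <- c(caseIdx[sample.int(length(caseIdx))],
              ctrlIdx[sample.int(length(ctrlIdx))])
pairedSeq <- toySeq[shuffled, , drop = FALSE]

# Consecutive rows (1-2, 3-4, ...) now form one pseudo-subject each.
subjId <- rep(seq_len(nrow(pairedSeq) / 2), each = 2)
rownames(pairedSeq) <- paste0("subj_", subjId, "_seq_", 1:2)

# One status value per pseudo-subject: 1 = case, 0 = control.
pairedStatus <- c(rep(1, length(caseIdx) / 2), rep(0, length(ctrlIdx) / 2))
```

Because haplotypes are only ever paired within a status group, every pseudo-subject has a well-defined case/control label. The pairing itself is arbitrary and depends on the random seed.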
Three objects will be attached after loading the dataset keremRand:
The data.frame object keremRandSeq contains 186 sequences with
23 SNPs. The row names show the subject id and the sequence id within
this subject. The SNPs are coded 1 for the larger allele
of the RFLP and 2 for the smaller allele.
The vector object keremRandStatus provides the CF/control
status of each subject: 1 indicates a subject in the case group (i.e.,
CF), and 0 indicates a subject in the control group. There are 47
subjects in the CF group and 46 in the control group.
The data.frame object keremRandAllele contains allelic data for
23 SNPs, with one row per subject, coded as 0, 1, or 2: the number of
copies of the smaller allele (coded 2 in keremRandSeq) that the subject carries.
How these three objects were created is shown in the example section.
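The relation between the sequence-level and allele-count codings can be shown on a small made-up example. This is a sketch under assumptions: toySeq and toyAllele are hypothetical names, and rowsum() is used here as a compact alternative for illustration, not as the method used to build the dataset. For each subject and locus, the allele count is the number of that subject's two sequences that carry allele 2.

```r
# Toy data (made up): 2 pseudo-subjects, 2 haplotypes each, 3 loci, coded 1/2.
toySeq <- matrix(c(1, 2, 2,
                   2, 2, 1,
                   1, 1, 1,
                   2, 1, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("subj_1_seq_1", "subj_1_seq_2",
                                   "subj_2_seq_1", "subj_2_seq_2"),
                                 paste0("locus_", 1:3)))

# Subject label for each row: strip the trailing "_seq_<k>".
subj <- sub("_seq_[0-9]+$", "", rownames(toySeq))

# Sum the indicator (allele == 2) over the two haplotypes of each subject.
toyAllele <- rowsum((toySeq == 2) * 1, group = subj)
toyAllele
```

Each row of toyAllele corresponds to one subject, and each entry is 0, 1, or 2.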
This dataset was originally released in Kerem et al. (1989) and was converted to R objects in Browning (2006). Browning's dataset can be found in the HapVLMC package (http://www.stat.auckland.ac.nz/~browning/HapVLMC/index.htm).
S. R. Browning. Multilocus association mapping using variable-length Markov chains. American Journal of Human Genetics, 78(6):903-913, Jun 2006.
B. Kerem, J. M. Rommens, J. A. Buchanan, D. Markiewicz, T. K. Cox, A. Chakravarti, M. Buchwald, and L. C. Tsui. Identification of the cystic fibrosis gene: genetic analysis. Science (New York, N.Y.), 245(4922):1073-1080, Sep 8 1989.
## Not run:
## Here is how the pseudo-subjects are simulated
#### loading HapVLMC package and the dataset
library(HapVLMC)
data(Kerem)
set.seed(20090313)
randOrder <- runif(nrow(kerem.snps.data))
keremRandSeq <- rbind(## randomly order the TRUE part
  kerem.snps.data[kerem.status, ][order(randOrder[kerem.status]), ],
  ## randomly order the FALSE part
  kerem.snps.data[!kerem.status, ][order(randOrder[!kerem.status]), ]
)
nLoci <- ncol(keremRandSeq)
lociNum <- unlist(sapply(1:nLoci,
  function(x){
    ## left-pad the locus number with zeros to a common width
    paste(paste(
      rep("0", ceiling(log10(nLoci)) - nchar(as.character(x))), collapse=""),
      x, sep="", collapse="")
  })
)
colnames(keremRandSeq) <- paste("locus_", lociNum, sep="")
nSubj <- nrow(keremRandSeq)/2
subjNum <- unlist(sapply(1:nSubj,
  function(x){
    ## left-pad the subject number with zeros to a common width
    paste(paste(
      rep("0", ceiling(log10(nSubj)) - nchar(as.character(x))), collapse=""),
      x, sep="", collapse="")
  })
)
subjLabel <- paste("subj_", subjNum, sep="")
seqLabel <- paste("seq", 1:2, sep="_")
rownames(keremRandSeq) <- paste(rep(subjLabel, each=2), seqLabel, sep="_")
keremRandStatus <- c(rep(1, sum(kerem.status)/2), rep(0, sum(!kerem.status)/2))
keremRandAllele <- NULL
for(i in seq(1, nrow(keremRandSeq), by=2)){
  keremRandAllele <- rbind(keremRandAllele,
    apply(keremRandSeq[c(i, i+1), ], 2,
      function(x){
        ## counting how many small alleles
        sum(x==2)
      }
    )
  )
}
## keep only the "subj_XX" part of each row name (note the doubled backslash)
rownames(keremRandAllele) <- unique(gsub("^(subj_.*)_seq_(.*)$", "\\1", rownames(keremRandSeq)))
## End(Not run)
## load keremRand
data(keremRand)
## check which objects are attached
ls()
## dimension of the pseudo-subject data
dim(keremRandSeq)
## number of CF (1) and control (0) subjects
table(keremRandStatus)