| Adult {arules} | R Documentation |
The AdultUCI data set contains the questionnaire data of the
“Adult” database (originally called the “Census Income”
Database) formatted as a data.frame. The Adult data set contains the
data already prepared and coerced to transactions for
use with arules.
data("Adult")
data("AdultUCI")
The AdultUCI data set contains a data frame with 48842
observations on the following 15 variables.
age, fnlwgt, education-num, capital-gain, capital-loss and
hours-per-week: numeric vectors.
workclass: a factor with levels Federal-gov,
Local-gov, Never-worked, Private,
Self-emp-inc, Self-emp-not-inc, State-gov,
and Without-pay.
education: an ordered factor with levels Preschool <
1st-4th < 5th-6th < 7th-8th < 9th <
10th < 11th < 12th < HS-grad <
Prof-school < Assoc-acdm < Assoc-voc <
Some-college < Bachelors < Masters <
Doctorate.
marital-status: a factor with levels Divorced,
Married-AF-spouse, Married-civ-spouse,
Married-spouse-absent, Never-married,
Separated, and Widowed.
occupation: a factor with levels Adm-clerical,
Armed-Forces, Craft-repair, Exec-managerial,
Farming-fishing, Handlers-cleaners,
Machine-op-inspct, Other-service,
Priv-house-serv, Prof-specialty,
Protective-serv, Sales, Tech-support, and
Transport-moving.
relationship: a factor with levels Husband,
Not-in-family, Other-relative, Own-child,
Unmarried, and Wife.
race: a factor with levels Amer-Indian-Eskimo,
Asian-Pac-Islander, Black, Other, and
White.
sex: a factor with levels Female and Male.
native-country: a factor with levels Cambodia,
Canada, China, Columbia, Cuba,
Dominican-Republic, Ecuador, El-Salvador,
England, France, Germany, Greece,
Guatemala, Haiti, Holand-Netherlands,
Honduras, Hong, Hungary, India,
Iran, Ireland, Italy, Jamaica,
Japan, Laos, Mexico, Nicaragua,
Outlying-US(Guam-USVI-etc), Peru,
Philippines, Poland, Portugal,
Puerto-Rico, Scotland, South, Taiwan,
Thailand, Trinadad&Tobago, United-States,
Vietnam, and Yugoslavia.
income: an ordered factor with levels small <
large.
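In the list above, levels joined by < belong to ordered factors, while comma-separated levels belong to plain (unordered) factors. The following is a minimal base R sketch of that difference. It uses short made-up vectors rather than the real data, and the object names are illustrative only.

```r
# Made-up values for illustration only (not rows from AdultUCI)
income <- factor(c("small", "large", "small"),
                 levels = c("small", "large"), ordered = TRUE)
sex <- factor(c("Female", "Male", "Female"))

# Ordered factors carry a level ordering, so comparisons are defined
print(is.ordered(income))
income_is_small <- income < "large"
print(income_is_small)

# Plain factors have levels but no ordering between them
print(is.ordered(sex))
print(levels(sex))
```

For an unordered factor such as sex, a comparison like sex < "Male" is not meaningful; R returns NA with a warning.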
The “Adult” database was extracted from the census bureau database
found at http://www.census.gov/ftp/pub/DES/www/welcome.html in 1994 by
Ronny Kohavi and Barry Becker, Data Mining and Visualization, Silicon
Graphics. It was originally used to predict whether income exceeds USD 50K/yr
based on census data. We added the attribute income with levels
small and large (>50K).
We prepared the data set for association mining as shown in the
section Examples. We removed the
continuous attribute fnlwgt (final weight).
We also eliminated education-num because it is just a
numeric representation of the attribute education.
We mapped the other four continuous attributes to ordinal attributes as
follows:
age: Young (0-25),
Middle-aged (26-45),
Senior (46-65) and
Old (66+).
hours-per-week: Part-time (0-25),
Full-time (25-40),
Over-time (40-60) and
Workaholic (60+); the upper bound of each range is inclusive.
capital-gain and capital-loss: None (0),
Low (greater than 0 and at most the median of the values greater than zero) and
High (above that median).
http://www.ics.uci.edu/~mlearn/MLRepository.html
Blake, C.L. & Merz, C.J. (1998): UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.
The data set was first cited in Kohavi, R. (1996): Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
library("arules")
data("AdultUCI")
dim(AdultUCI)
AdultUCI[1:2,]
### remove attributes
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL
### map metric attributes
AdultUCI[["age"]] <- ordered(
  cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
  labels = c("Young", "Middle-aged", "Senior", "Old"))
AdultUCI[["hours-per-week"]] <- ordered(
  cut(AdultUCI[["hours-per-week"]], c(0, 25, 40, 60, 168)),
  labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
AdultUCI[["capital-gain"]] <- ordered(
  cut(AdultUCI[["capital-gain"]],
      c(-Inf, 0,
        median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]),
        Inf)),
  labels = c("None", "Low", "High"))
AdultUCI[["capital-loss"]] <- ordered(
  cut(AdultUCI[["capital-loss"]],
      c(-Inf, 0,
        median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]),
        Inf)),
  labels = c("None", "Low", "High"))
### create transactions
Adult <- as(AdultUCI, "transactions")
Adult
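The discretization in the example relies on base R's cut(), which by default builds right-closed intervals. The sketch below applies the same breaks to small made-up vectors (not the real data) to show where boundary values land. The names age, gain and pos_median are illustrative assumptions and do not appear in the package.

```r
# Toy ages, made up for illustration (not taken from AdultUCI)
age <- c(17, 25, 26, 45, 46, 65, 66, 90)

# cut() yields the intervals (15,25], (25,45], (45,65], (65,100]
age_bins <- cut(age, breaks = c(15, 25, 45, 65, 100))

# Relabel the four interval levels, mirroring the example
age_cat <- ordered(age_bins,
                   labels = c("Young", "Middle-aged", "Senior", "Old"))
print(data.frame(age, age_bins, age_cat))

# Toy capital gains: zeros become "None"; positive values are split
# at the median of the positive values only
gain <- c(0, 0, 100, 500, 2000, 0, 7000)
pos_median <- median(gain[gain > 0])
gain_cat <- ordered(cut(gain, c(-Inf, 0, pos_median, Inf)),
                    labels = c("None", "Low", "High"))
print(data.frame(gain, gain_cat))
```

A value that sits exactly on a break, such as an age of 25, falls into the lower bin. Values at or below the first break (here 15) would become NA. Relabeling with ordered(..., labels = ) in this way assumes every interval contains at least one observation, which holds for these toy vectors.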