Adult                 package:arules                 R Documentation

_A_d_u_l_t _D_a_t_a _S_e_t

_D_e_s_c_r_i_p_t_i_o_n:

     The 'AdultUCI' data set contains the questionnaire data of the
     "Adult" database (originally called the "Census Income" Database)
     formatted as a data.frame.  The 'Adult' data set contains the data
     already prepared and coerced to 'transactions' for use with
     'arules'.

_U_s_a_g_e:

     data("Adult")
     data("AdultUCI")

_F_o_r_m_a_t:

     The 'AdultUCI' data set contains a data frame with 48842
     observations on the following 15 variables.

     _a_g_e a numeric vector.

     _w_o_r_k_c_l_a_s_s a factor with levels 'Federal-gov', 'Local-gov',
          'Never-worked', 'Private', 'Self-emp-inc',
          'Self-emp-not-inc', 'State-gov', and 'Without-pay'.

     _e_d_u_c_a_t_i_o_n an ordered factor with levels 'Preschool' < '1st-4th' <
          '5th-6th' < '7th-8th' < '9th' < '10th' < '11th' < '12th' <
          'HS-grad' < 'Prof-school' < 'Assoc-acdm' < 'Assoc-voc' <
          'Some-college' < 'Bachelors' < 'Masters' < 'Doctorate'. 

     _e_d_u_c_a_t_i_o_n-_n_u_m a numeric vector.

     _m_a_r_i_t_a_l-_s_t_a_t_u_s a factor with levels 'Divorced',
          'Married-AF-spouse', 'Married-civ-spouse',
          'Married-spouse-absent', 'Never-married', 'Separated', and
          'Widowed'.

     _o_c_c_u_p_a_t_i_o_n a factor with levels 'Adm-clerical', 'Armed-Forces',
          'Craft-repair', 'Exec-managerial', 'Farming-fishing',
          'Handlers-cleaners', 'Machine-op-inspct', 'Other-service',
          'Priv-house-serv', 'Prof-specialty', 'Protective-serv',
          'Sales', 'Tech-support', and 'Transport-moving'.

     _r_e_l_a_t_i_o_n_s_h_i_p a factor with levels 'Husband', 'Not-in-family',
          'Other-relative', 'Own-child', 'Unmarried', and 'Wife'.

     _r_a_c_e a factor with levels 'Amer-Indian-Eskimo',
          'Asian-Pac-Islander', 'Black', 'Other', and 'White'. 

     _s_e_x a factor with levels 'Female' and 'Male'.

     _c_a_p_i_t_a_l-_g_a_i_n a numeric vector.

     _c_a_p_i_t_a_l-_l_o_s_s a numeric vector.

     _f_n_l_w_g_t a numeric vector.

     _h_o_u_r_s-_p_e_r-_w_e_e_k a numeric vector.

     _n_a_t_i_v_e-_c_o_u_n_t_r_y a factor with levels 'Cambodia', 'Canada', 'China',
          'Columbia', 'Cuba', 'Dominican-Republic', 'Ecuador',
          'El-Salvador', 'England', 'France', 'Germany', 'Greece',
          'Guatemala', 'Haiti', 'Holand-Netherlands', 'Honduras',
          'Hong', 'Hungary', 'India', 'Iran', 'Ireland', 'Italy',
          'Jamaica', 'Japan', 'Laos', 'Mexico', 'Nicaragua',
          'Outlying-US(Guam-USVI-etc)', 'Peru', 'Philippines',
          'Poland', 'Portugal', 'Puerto-Rico', 'Scotland', 'South',
          'Taiwan', 'Thailand', 'Trinadad&Tobago', 'United-States',
          'Vietnam', and 'Yugoslavia'.

     _i_n_c_o_m_e an ordered factor with levels 'small' < 'large'.

_D_e_t_a_i_l_s:

     The "Adult" database was extracted from the census bureau database
     found at <URL: http://www.census.gov/ftp/pub/DES/www/welcome.html>
     in 1994 by Ronny Kohavi and Barry Becker, Data Mining and
     Visualization, Silicon Graphics. It was originally used to predict
     whether income exceeds USD 50K/yr based on census data. We added
     the attribute 'income' with levels 'small' and 'large' (>50K).

     We prepared the data set for association mining as shown in the
     section Examples. We removed the continuous attribute 'fnlwgt'
     (final weight). We also eliminated 'education-num' because it is
     just a numeric representation of the attribute 'education'. We
     mapped the other four continuous attributes to ordinal attributes
     as follows:

     _a_g_e cut into levels  'Young' (0-25), 'Middle-aged' (26-45),
          'Senior' (46-65) and 'Old' (66+).

     _h_o_u_r_s-_p_e_r-_w_e_e_k cut into levels 'Part-time' (0-25), 'Full-time'
          (25-40), 'Over-time' (40-60) and 'Too-much' (60+).

     _c_a_p_i_t_a_l-_g_a_i_n _a_n_d _c_a_p_i_t_a_l-_l_o_s_s each cut into levels 'None' (0),
          'Low' (0 < median of the values greater zero < max) and
          'High' (>=max).
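
     How 'cut' assigns values that fall exactly on a break can be
     checked with a minimal sketch. It uses toy values, not the
     'AdultUCI' data; the names 'hours', 'gain' and 'split_point' are
     hypothetical. It also passes 'labels' and 'ordered_result' to
     'cut' directly, which is assumed here to be an equivalent
     alternative to wrapping the result in 'ordered'.

```r
## Toy values, not taken from AdultUCI (hypothetical illustration).
hours <- c(10, 25, 26, 40, 41, 60, 61, 99)

## cut() uses right-closed intervals by default:
## (0,25], (25,40], (40,60], (60,168]
hours_cat <- cut(hours, breaks = c(0, 25, 40, 60, 168),
                 labels = c("Part-time", "Full-time", "Over-time", "Workaholic"),
                 ordered_result = TRUE)
data.frame(hours, hours_cat)

## Three-level split around the median of the positive values.
gain <- c(0, 0, 0, 100, 500, 2000, 9000)
split_point <- median(gain[gain > 0])
gain_cat <- cut(gain, breaks = c(-Inf, 0, split_point, Inf),
                labels = c("None", "Low", "High"),
                ordered_result = TRUE)
table(gain_cat)
```

     A value equal to a break, such as 25 or 40 hours, falls into the
     lower of the two adjacent levels because the intervals are closed
     on the right.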

_S_o_u_r_c_e:

     <URL: http://www.ics.uci.edu/~mlearn/MLRepository.html>

_R_e_f_e_r_e_n_c_e_s:

     Blake, C.L. & Merz, C.J. (1998): UCI Repository of Machine
     Learning Databases. Irvine, CA: University of California,
     Department of Information and Computer Science.

     The data set was first cited in Kohavi, R. (1996):  Scaling Up the
     Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. 
     _Proceedings of the Second International Conference on Knowledge
     Discovery and Data Mining_.

_E_x_a_m_p_l_e_s:

     library("arules")
     data("AdultUCI")
     dim(AdultUCI)
     AdultUCI[1:2,]

     ## remove attributes
     AdultUCI[["fnlwgt"]] <- NULL
     AdultUCI[["education-num"]] <- NULL

     ## map metric attributes to ordinal ones (cut() intervals are right-closed)
     AdultUCI[["age"]] <- ordered(
       cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
       labels = c("Young", "Middle-aged", "Senior", "Old"))

     AdultUCI[["hours-per-week"]] <- ordered(
       cut(AdultUCI[["hours-per-week"]], c(0, 25, 40, 60, 168)),
       labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))

     ## 'Low' and 'High' are split at the median of the positive values
     AdultUCI[["capital-gain"]] <- ordered(
       cut(AdultUCI[["capital-gain"]],
         c(-Inf, 0,
           median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]),
           Inf)),
       labels = c("None", "Low", "High"))

     AdultUCI[["capital-loss"]] <- ordered(
       cut(AdultUCI[["capital-loss"]],
         c(-Inf, 0,
           median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]),
           Inf)),
       labels = c("None", "Low", "High"))

     ## create transactions
     Adult <- as(AdultUCI, "transactions")
     Adult

