spam                 package:kernlab                 R Documentation

_S_p_a_m _E-_m_a_i_l _D_a_t_a_b_a_s_e

_D_e_s_c_r_i_p_t_i_o_n:

     A data set collected at Hewlett-Packard Labs that classifies 4601
     e-mails as spam or non-spam. In addition to this class label there
     are 57 variables indicating the frequency of certain words and
     characters in the e-mail.

_U_s_a_g_e:

     data(spam)

_F_o_r_m_a_t:

     A data frame with 4601 observations and 58 variables.

     The first 48 variables contain the frequency of the variable name
     (e.g., business) in the e-mail. If the variable name starts with
     num (e.g., num650) then it indicates the frequency of the
     corresponding number (e.g., 650). The variables 49-54 indicate the
     frequency of the characters `;', `(', `[', `!', `$', and `#'. The
     variables 55-57 contain the average, longest, and total run-length
     of capital letters. Variable 58 indicates the type of the mail and
     is either '"nonspam"' or '"spam"', i.e. unsolicited commercial
     e-mail.
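
     The layout described above can be inspected with a short sketch
     like the following. It assumes the kernlab package is installed;
     the helper names (word_vars, char_vars, cap_vars, num_vars) are
     illustrative only and are not part of the data set.

```r
# Requires the kernlab package, which ships the spam data set.
library(kernlab)
data(spam)

dim(spam)  # documented above as 4601 observations and 58 variables

# Column groups, split by the positions given in the Format section.
word_vars <- names(spam)[1:48]   # word and number frequencies
char_vars <- names(spam)[49:54]  # character frequencies
cap_vars  <- names(spam)[55:57]  # capital-letter run-length statistics

# Number-frequency variables are the ones whose names start with "num".
num_vars <- grep("^num", word_vars, value = TRUE)

# Variable 58 holds the class label.
names(spam)[58]
```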

_D_e_t_a_i_l_s:

     The data set contains 2788 e-mails classified as '"nonspam"' and
     1813 classified as '"spam"'.
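
     As a quick check of the class balance stated above, the labels
     can be tabulated. This is a minimal sketch that assumes the
     kernlab package is installed; class_counts is an illustrative
     name.

```r
library(kernlab)
data(spam)

# Count e-mails per class; the text above gives 2788 nonspam and 1813 spam.
class_counts <- table(spam$type)
class_counts

# The same counts as proportions of all e-mails.
prop.table(class_counts)
```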

     The ``spam'' concept is diverse: advertisements for products/web
     sites, make-money-fast schemes, chain letters, pornography... This
     collection of spam e-mails came from the collectors' postmaster
     and individuals who had filed spam.  The collection of non-spam
     e-mails came from filed work and personal e-mails, and hence the
     word 'george' and the area code '650' are indicators of non-spam. 
     These are useful when constructing a personalized spam filter. 
     One would either have to blind such non-spam indicators or get a
     very wide collection of non-spam to generate a general purpose
     spam filter.
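
     One possible way to "blind" the personalized indicators is to
     drop the corresponding columns before fitting a model. The sketch
     below assumes the kernlab package is installed and removes only
     the two indicators named above (george and num650); which other
     columns are specific to the collectors is a judgment call that
     this example does not settle. The names personal and spam_blind
     are illustrative.

```r
library(kernlab)
data(spam)

# Assumption: only the two indicators named in the text are blinded.
personal <- c("george", "num650")

# Mean frequency of each indicator by class, to see how strongly
# they separate nonspam from spam in this collection.
aggregate(spam[, personal], by = list(type = spam$type), FUN = mean)

# Blinded copy of the data without the personalized indicators.
spam_blind <- spam[, !(names(spam) %in% personal)]
dim(spam_blind)
```

     The blinded data frame keeps the class label in the type column,
     so it can be passed to a classifier in place of the full data.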

_S_o_u_r_c_e:

        *  Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap
           Suermondt at Hewlett-Packard Labs, 1501 Page Mill Rd., Palo
           Alto, CA 94304

        *  Donor: George Forman (gforman at nospam hpl.hp.com) 
           650-857-7835

     These data have been taken from the UCI Repository Of Machine
     Learning Databases at <URL:
     http://www.ics.uci.edu/~mlearn/MLRepository.html>

_R_e_f_e_r_e_n_c_e_s:

     T. Hastie, R. Tibshirani, J.H. Friedman. _The Elements of
     Statistical Learning._ Springer, 2001.

