spam              package:ElemStatLearn              R Documentation

_E_m_a_i_l _S_p_a_m _D_a_t_a

_D_e_s_c_r_i_p_t_i_o_n:

     SPAM E-mail Database. See Details below.

_U_s_a_g_e:

     data(spam)

_F_o_r_m_a_t:

     A data frame with 4601 observations on the following 58 variables.

     _A._1 a numeric vector

     _A._2 a numeric vector

     _A._3 a numeric vector

     _A._4 a numeric vector

     _A._5 a numeric vector

     _A._6 a numeric vector

     _A._7 a numeric vector

     _A._8 a numeric vector

     _A._9 a numeric vector

     _A._1_0 a numeric vector

     _A._1_1 a numeric vector

     _A._1_2 a numeric vector

     _A._1_3 a numeric vector

     _A._1_4 a numeric vector

     _A._1_5 a numeric vector

     _A._1_6 a numeric vector

     _A._1_7 a numeric vector

     _A._1_8 a numeric vector

     _A._1_9 a numeric vector

     _A._2_0 a numeric vector

     _A._2_1 a numeric vector

     _A._2_2 a numeric vector

     _A._2_3 a numeric vector

     _A._2_4 a numeric vector

     _A._2_5 a numeric vector

     _A._2_6 a numeric vector

     _A._2_7 a numeric vector

     _A._2_8 a numeric vector

     _A._2_9 a numeric vector

     _A._3_0 a numeric vector

     _A._3_1 a numeric vector

     _A._3_2 a numeric vector

     _A._3_3 a numeric vector

     _A._3_4 a numeric vector

     _A._3_5 a numeric vector

     _A._3_6 a numeric vector

     _A._3_7 a numeric vector

     _A._3_8 a numeric vector

     _A._3_9 a numeric vector

     _A._4_0 a numeric vector

     _A._4_1 a numeric vector

     _A._4_2 a numeric vector

     _A._4_3 a numeric vector

     _A._4_4 a numeric vector

     _A._4_5 a numeric vector

     _A._4_6 a numeric vector

     _A._4_7 a numeric vector

     _A._4_8 a numeric vector

     _A._4_9 a numeric vector

     _A._5_0 a numeric vector

     _A._5_1 a numeric vector

     _A._5_2 a numeric vector

     _A._5_3 a numeric vector

     _A._5_4 a numeric vector

     _A._5_5 a numeric vector

     _A._5_6 a numeric vector

     _A._5_7 a numeric vector

     _s_p_a_m Factor w/ 2 levels "email", "spam"

_D_e_t_a_i_l_s:

     The "spam" concept is diverse: advertisements for products/web
     sites, make-money-fast schemes, chain letters, pornography... Our
     collection of spam e-mails came from our postmaster and
     individuals who had filed spam.  Our collection of non-spam
     e-mails came from filed work and personal e-mails, and hence the
     word 'george' and the area code '650' are indicators of non-spam.
     These are useful when constructing a personalized spam filter.
     One would either have to blind such non-spam indicators or get a
     very wide collection of non-spam to generate a general-purpose
     spam filter.
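
     As a minimal sketch of "blinding" such indicators, the columns
     tied to them can simply be dropped before fitting a model.  The
     toy data frame and its column names ('word_freq_george',
     'word_freq_650', 'word_freq_free') are hypothetical stand-ins
     chosen for illustration; they are not the names used in this
     data frame.

```r
# Toy data with hypothetical column names (assumption, for illustration only).
toy <- data.frame(
  word_freq_free   = c(0.5, 0.0, 1.2),
  word_freq_george = c(0.0, 2.1, 0.0),
  word_freq_650    = c(0.0, 0.8, 0.0),
  spam = factor(c("spam", "email", "spam"), levels = c("email", "spam"))
)

# Columns that act as personalized non-spam indicators.
blind <- c("word_freq_george", "word_freq_650")

# Keep every column that is not in the blinded set.
toy_blinded <- toy[, !(names(toy) %in% blind), drop = FALSE]
names(toy_blinded)
```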

     For background on spam: Cranor, Lorrie F. and LaMacchia, Brian A.
     Spam!  Communications of the ACM, 41(8):74-83, 1998.

     Attribute Information: The last column of 'spambase.data' denotes
     whether the e-mail was considered spam, i.e. unsolicited
     commercial e-mail (1), or not (0).  Most of the attributes
     indicate whether a particular word or character was frequently
     occurring in the e-mail.  The run-length attributes (55-57)
     measure the length of sequences of consecutive capital letters.
     For the statistical measures of each attribute, see the original
     documentation distributed with 'spambase.data'.  Here are the
     definitions of the attributes:

     48 continuous real [0,100] attributes of type word_freq_WORD =
     percentage of words in the e-mail that match WORD, i.e. 100 *
     (number of times the WORD appears in the e-mail) / total number
     of words in e-mail.  A "word" in this case is any string of
     alphanumeric characters bounded by non-alphanumeric characters or
     end-of-string.
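
     The definition above can be sketched in R as follows.  The helper
     'word_freq' is hypothetical, and matching words case-insensitively
     is an assumption; the description does not say how case was
     handled when the data were generated.

```r
# Hypothetical helper: percentage of words in `text` that equal `word`.
word_freq <- function(text, word) {
  # A "word" is a run of alphanumeric characters, so split on everything else.
  tokens <- strsplit(text, "[^[:alnum:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]  # drop the empty token a leading separator creates
  if (length(tokens) == 0) return(0)
  # Assumption: case-insensitive comparison.
  100 * sum(tolower(tokens) == tolower(word)) / length(tokens)
}

msg <- "Free money! Claim your free prize now."
word_freq(msg, "free")  # 2 of the 7 words match
```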

     6 continuous real [0,100] attributes of type char_freq_CHAR =
     percentage of characters in the e-mail that match CHAR, i.e. 100 *
     (number of CHAR occurrences) / total characters in e-mail
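
     A corresponding sketch for the character frequencies.  The helper
     'char_freq' is hypothetical, and counting whitespace towards the
     total number of characters is an assumption not stated in the
     description.

```r
# Hypothetical helper: percentage of characters in `text` that equal `char`.
char_freq <- function(text, char) {
  chars <- strsplit(text, "")[[1]]  # one element per character
  if (length(chars) == 0) return(0)
  # Assumption: every character, including spaces, counts towards the total.
  100 * sum(chars == char) / length(chars)
}

msg <- "Win $$$ now!"
char_freq(msg, "$")  # 3 of the 12 characters are "$"
```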

     1 continuous real [1,...] attribute of type
     capital_run_length_average = average length of uninterrupted
     sequences of capital letters

     1 continuous integer [1,...] attribute of type
     capital_run_length_longest = length of longest uninterrupted
     sequence of capital letters

     1 continuous integer [1,...] attribute of type
     capital_run_length_total = sum of the lengths of uninterrupted
     sequences of capital letters = total number of capital letters in
     the e-mail
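
     The three run-length measures can be sketched together.  The
     helper 'capital_runs' is hypothetical; returning zeros for a text
     without capital letters is an assumption, since the description
     does not cover that case.

```r
# Hypothetical helper: average, longest and total length of capital-letter runs.
capital_runs <- function(text) {
  # Locate every maximal run of upper-case letters.
  m <- gregexpr("[[:upper:]]+", text)[[1]]
  if (m[1] == -1) {
    # Assumption: no capitals gives zeros.
    return(c(average = 0, longest = 0, total = 0))
  }
  len <- attr(m, "match.length")
  c(average = mean(len), longest = max(len), total = sum(len))
}

# Runs are "BUY" (3), "NOW" (3), "SAVE" (4) and "B" (1).
capital_runs("BUY NOW and SAVE Big")
```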

     1 nominal {0,1} class attribute of type spam = denotes whether the
     e-mail was considered spam, i.e. unsolicited commercial e-mail
     (1), or not (0).  In this data frame the class is stored as the
     factor 'spam' with levels "email" and "spam".
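
     A sketch of how a 0/1 class column could be recoded into the
     two-level factor listed under Format.  The vector 'raw_class' is
     made-up example data, and mapping 0 to "email" and 1 to "spam" is
     inferred from the descriptions above rather than stated
     explicitly.

```r
# Made-up 0/1 labels standing in for the last column of the raw file.
raw_class <- c(1, 0, 0, 1, 0)

# Assumed mapping: 0 -> "email" (non-spam), 1 -> "spam".
spam_factor <- factor(raw_class, levels = c(0, 1), labels = c("email", "spam"))
table(spam_factor)
```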

_S_o_u_r_c_e:

     (a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap
         Suermondt, Hewlett-Packard Labs, 1501 Page Mill Rd., Palo
         Alto, CA 94304

     (b) Donor: George Forman (gforman at nospam hpl.hp.com),
         650-857-7835

     (c) Generated: June-July 1999

_R_e_f_e_r_e_n_c_e_s:

     <URL: http://www.ics.uci.edu/~mlearn/MLRepository.html>

_E_x_a_m_p_l_e_s:

     str(spam)   # compact summary of all 58 variables
     head(spam)  # first six observations

