textcat_profile_db          package:textcat          R Documentation

_T_e_x_t_c_a_t _P_r_o_f_i_l_e _D_b_s

_D_e_s_c_r_i_p_t_i_o_n:

     Create n-gram profile dbs for text categorization.

_U_s_a_g_e:

     textcat_profile_db(x, id, ...)

_A_r_g_u_m_e_n_t_s:

       x: a character vector of text documents, or an R object of text
          documents extractable via 'as.character'. 

      id: a character vector giving the categories of the texts.
          Recycled to the length of 'x'. 

     ...: further arguments specifying the options used for creating
          the n-gram profiles, see 'textcat_options' for the (current)
          default options.  The argument names are partially matched
          against the names of the default options; where the match is
          unique, the supplied value is used instead of the default.
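
     As an illustration of passing options through '...', the
     following minimal sketch builds a small profile db and uses it
     for categorization.  It assumes that package 'textcat' is
     installed; the sample texts, the category labels and the option
     values are made up for the example and are not recommendations.

```r
library(textcat)

# Made-up sample documents and their categories (one id per text).
texts <- c("This is a short English sentence.",
           "Here is another piece of English text.",
           "Das ist ein kurzer deutscher Satz.",
           "Hier ist noch ein deutscher Text.")
ids <- c("english", "english", "german", "german")

# Options 'n' and 'size' are given via '...' and override the
# defaults reported by textcat_options().
db <- textcat_profile_db(texts, id = ids, n = 3L, size = 200L)

# One profile per category.
names(db)

# Use the db to categorize a new document.
textcat("Ein weiterer deutscher Satz.", p = db)
```

     Options that are not supplied keep their current defaults.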

_D_e_t_a_i_l_s:

     The text documents are split according to the given categories,
     and n-gram profiles are computed via 'textcnt' in package 'tau',
     with options 'n', 'split' and 'useBytes' corresponding to the
     respective arguments, and option 'reduce' setting argument
     'marker' as needed.  N-grams listed in option 'ignore' are
     removed, and only the most frequent of the remaining ones are
     retained, up to the maximal number given by option 'size'.  The
     options employed for building the db are stored in the db.
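
     The steps above (split by category, count n-grams, drop ignored
     ones, keep the most frequent) can be sketched in base R.  This is
     a simplified illustration only: it does not call 'textcnt', does
     not add word markers, and the helper names 'char_ngrams' and
     'build_profile' are hypothetical and not part of 'textcat' or
     'tau'.

```r
# Hypothetical helper: all character n-grams of lengths 1..n in one string.
char_ngrams <- function(text, n = 3L) {
    chars <- strsplit(text, "")[[1L]]
    grams <- character()
    for (k in seq_len(n)) {
        if (length(chars) < k) break
        starts <- seq_len(length(chars) - k + 1L)
        grams <- c(grams,
                   vapply(starts,
                          function(i) paste(chars[i:(i + k - 1L)],
                                            collapse = ""),
                          character(1L)))
    }
    grams
}

# Hypothetical helper: frequency profile for all texts of one category.
build_profile <- function(texts, n = 3L, ignore = character(), size = 5L) {
    grams <- unlist(lapply(texts, char_ngrams, n = n))
    grams <- grams[!(grams %in% ignore)]          # drop ignored n-grams
    counts <- sort(table(grams), decreasing = TRUE)
    counts[seq_len(min(size, length(counts)))]    # keep the top 'size'
}

x  <- c("aaa", "aab", "zzz")
id <- c("A", "A", "B")

# Split the documents by category and build one profile per category.
profiles <- lapply(split(x, id), build_profile, size = 2L)
profiles
```

     The real profiles additionally depend on the options 'split',
     'useBytes' and 'reduce', which this sketch leaves out.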

     There is a 'c' method for combining profile dbs provided that
     these have identical options.
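
     A minimal sketch of combining two dbs with 'c', assuming package
     'textcat' is installed.  Both dbs are built with the default
     options, so their options are identical; the texts and category
     labels are invented for the example.

```r
library(textcat)

# Two separately built dbs, both using the default options.
en <- textcat_profile_db("The quick brown fox jumps over the lazy dog.",
                         id = "english")
de <- textcat_profile_db("Der schnelle braune Fuchs springt in den Wald.",
                         id = "german")

# Combine them into a single db holding both profiles.
db <- c(en, de)
names(db)
```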

     Unless the profile db uses bytes rather than characters (i.e.,
     option 'useBytes' is 'TRUE'), the text documents in 'x' should be
     encoded in UTF-8.
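
     One possible way to check and convert the encoding of input text
     before building a db, using base R only.  The sketch assumes the
     input is known to be Latin-1; the sample bytes are constructed
     purely for illustration.

```r
# A Latin-1 string built from raw bytes: 0xfc is u-umlaut in Latin-1
# but does not form a valid UTF-8 sequence here.
x_latin1 <- rawToChar(as.raw(c(0x66, 0xfc, 0x72)))

# Check whether the bytes are valid UTF-8.
is_valid_before <- validUTF8(x_latin1)

# Convert from the (assumed) source encoding to UTF-8.
x_utf8 <- iconv(x_latin1, from = "latin1", to = "UTF-8")
is_valid_after <- validUTF8(x_utf8)

c(before = is_valid_before, after = is_valid_after)
```

     The converted vector can then be passed as 'x'.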

