textcat               package:textcat               R Documentation

_N-_G_r_a_m _B_a_s_e_d _T_e_x_t _C_a_t_e_g_o_r_i_z_a_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     Categorize texts by finding the closest n-gram reference profile.

_U_s_a_g_e:

     textcat(x, p = ECIMCI_profiles, method = "CT")

_A_r_g_u_m_e_n_t_s:

       x: a character vector, or an object coercible to this using
          'as.character'.

       p: a textcat profile db (see 'textcat_profile_db').

  method: a character string specifying a built-in method, or a
          user-defined function for computing distances between n-gram
          profiles.  See *Details* for available built-in methods.

_D_e_t_a_i_l_s:

     Currently, the following built-in distance methods are available:

     '"_C_T"': the out-of-place measure of Cavnar and Trenkle.

     '"_r_a_n_k_s"': a variant of the Cavnar/Trenkle measure based on the
          aggregated absolute difference of the ranks of the combined
          n-grams in the given text and the reference profile.

     '"_A_L_P_D"': the sum of the absolute differences in n-gram log
          frequencies.

     '"_K_L_I"': the Kullback-Leibler I-divergence I(p, q) = sum_i p_i
          log(p_i/q_i) of the n-gram frequency distributions p and q of
          the given text and the reference profile.

     '"_K_L_J"': the Kullback-Leibler J-divergence J(p, q) = sum_i (p_i -
          q_i) log(p_i/q_i), the symmetrized variant I(p, q) + I(q, p)
          of the I-divergences.

     '"_J_S"': the Jensen-Shannon divergence between the n-gram frequency
          distributions.
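
     As an illustration of the rank-based idea behind the
     Cavnar/Trenkle measure, the following base R sketch sums, over
     the n-grams of a text profile, how far each n-gram is out of
     place in a reference profile.  The toy profiles, the helper name
     'out_of_place' and the choice of maximum penalty for absent
     n-grams are assumptions made for this example; it is not
     necessarily how the package computes '"CT"' internally.

```r
# Toy rank-ordered profiles, most frequent n-gram first.
# (Hypothetical data, not taken from any textcat profile db.)
text_profile <- c("th", "he", "in", "er")
ref_profile  <- c("he", "th", "an", "in")

# Hypothetical helper: out-of-place distance between two ranked profiles.
out_of_place <- function(text, ref, max_penalty = length(ref)) {
  pos <- match(text, ref)           # rank of each text n-gram in ref (NA if absent)
  d <- abs(seq_along(text) - pos)   # how far out of place each n-gram is
  d[is.na(d)] <- max_penalty        # assumed penalty for n-grams missing from ref
  sum(d)
}

out_of_place(text_profile, ref_profile)
```

     A smaller value means the two rank orders agree more closely;
     identical profiles give a distance of 0.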

     For the measures based on distances of frequency distributions,
     n-grams in the text and the reference profile are combined, and
     missing n-grams are given a small positive absolute frequency
     (currently, 1e-6).
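
     The combination step and the divergence formulas above can be
     sketched in base R as follows.  The toy frequencies and the
     helper names ('fill', 'kl_i', 'kl_j', 'js') are assumptions for
     illustration, as is normalizing after filling in the missing
     n-grams; the Jensen-Shannon form shown is the common textbook
     definition and may differ from the package's scaling.

```r
# Hypothetical absolute n-gram frequencies for a text and a reference profile.
text_freq <- c(th = 3, he = 2, er = 1)
ref_freq  <- c(th = 4, he = 1, an = 2)

# Combine the n-grams and give missing ones a small positive frequency.
eps <- 1e-6
grams <- union(names(text_freq), names(ref_freq))
fill <- function(f) {
  out <- f[grams]             # NA where an n-gram is absent from f
  out[is.na(out)] <- eps
  names(out) <- grams
  out / sum(out)              # normalize to a frequency distribution
}
p <- fill(text_freq)
q <- fill(ref_freq)

kl_i <- function(p, q) sum(p * log(p / q))          # I(p, q)
kl_j <- function(p, q) sum((p - q) * log(p / q))    # J(p, q)
js <- function(p, q) {                              # textbook Jensen-Shannon
  m <- (p + q) / 2
  (kl_i(p, m) + kl_i(q, m)) / 2
}

c(KLI = kl_i(p, q), KLJ = kl_j(p, q), JS = js(p, q))
```

     Without the small positive frequency, any n-gram present in only
     one of the two profiles would produce log(0) or a division by
     zero in the I- and J-divergences.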

     For each given text, its n-gram profile is computed using the
     options in the reference profile db.  Then, the distance between
     the profile and the reference profiles is computed, and the text
     is categorized into the category of the closest profile (if this
     is not unique, 'NA' is obtained).
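
     The final selection step can be pictured with a small base R
     sketch.  The distance values and the helper name 'closest' are
     hypothetical; the sketch only mirrors the rule stated above
     (unique minimum wins, ties give 'NA').

```r
# Hypothetical distances from one text to each reference profile.
d <- c(english = 12, german = 7, french = 7)

# Return the category with the smallest distance, or NA on a tie.
closest <- function(d) {
  best <- names(d)[d == min(d)]
  if (length(best) == 1L) best else NA_character_
}

closest(d)                            # german and french tie
closest(c(english = 3, german = 9))   # unique minimum
```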

     Unless the profile db uses bytes rather than characters, the texts
     in 'x' should be encoded in UTF-8.
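
     If the input comes from a source in another encoding, it can be
     converted before categorization.  A minimal sketch using base
     R's 'iconv', assuming the original bytes are Latin-1 (the sample
     bytes are made up for this example):

```r
# Three Latin-1 bytes (a-umlaut, o-umlaut, u-umlaut), built from raw values.
x <- rawToChar(as.raw(c(0xe4, 0xf6, 0xfc)))
validUTF8(x)                  # the raw Latin-1 bytes are not valid UTF-8

# Convert to UTF-8, stating the source encoding explicitly.
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(x_utf8)
```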

_R_e_f_e_r_e_n_c_e_s:

     W. B. Cavnar and J. M. Trenkle (1994), N-Gram-Based Text
     Categorization. In ``Proceedings of SDAIR-94, 3rd Annual Symposium
     on Document Analysis and Information Retrieval'', 161-175.

_E_x_a_m_p_l_e_s:

     textcat(c("This is an english sentence.",
               "Das ist ein deutscher satz."))

