weightings                package:lsa                R Documentation

_W_e_i_g_h_t_i_n_g _S_c_h_e_m_e_s (_M_a_t_r_i_c_e_s)

_D_e_s_c_r_i_p_t_i_o_n:

     Calculates a weighted document-term matrix according to the chosen
     local and/or global weighting scheme.

_U_s_a_g_e:

         lw_tf(m)
         lw_logtf(m)
         lw_bintf(m)
         gw_normalisation(m)
         gw_idf(m)
         gw_gfidf(m)
         entropy(m)
         gw_entropy(m)

_A_r_g_u_m_e_n_t_s:

       m: a document-term matrix.

_D_e_t_a_i_l_s:

     A local and a global weighting scheme can be combined and applied
     to a given textmatrix 'm' via dtm = lw(m) * gw(m), where

        *  m is the given document-term matrix,

        *  lw(m) is one of the local weight functions 'lw_tf()',
           'lw_logtf()', 'lw_bintf()', and

        *  gw(m) is one of the global weight functions
           'gw_normalisation()', 'gw_idf()', 'gw_gfidf()', 'entropy()',
           'gw_entropy()'.
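
     The element-wise combination described above can be sketched in
     base R. This is a minimal sketch, not the package's source:
     'local_w' and 'global_w' are hypothetical stand-ins for the
     result of a local and a global weight function, and the
     orientation (terms in rows, documents in columns) is an
     assumption.

```r
# Toy matrix: 3 terms (rows) x 2 documents (columns); orientation is assumed.
m <- matrix(c(2, 0, 1,
              0, 3, 1), nrow = 3,
            dimnames = list(c("t1", "t2", "t3"), c("d1", "d2")))

# Hypothetical local weight: one value per cell.
local_w <- log(m + 1)

# Hypothetical global weight: one made-up value per term (row).
global_w <- c(t1 = 2, t2 = 2, t3 = 1)

# '*' multiplies element by element; the per-term vector is recycled
# down each column, so every cell in row i is scaled by global_w[i].
dtm <- local_w * global_w
```

     The result keeps the dimensions of 'm'. The recycling only lines
     up with the terms when there is one global value per row.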

     This set of weighting schemes includes the local weightings (lw)
     raw, log, and binary, and the global weightings (gw) normalisation,
     two versions of the inverse document frequency (idf), and entropy,
     both in Shannon's original form and in a slightly modified, more
     common version:

     'lw_tf()' returns a completely unmodified n times m matrix
     (placebo function).

     'lw_logtf()' returns the logarithmised n times m matrix.
     log(m_{i,j}+1) is applied to every cell.

     'lw_bintf()' returns binary values of the n times m matrix. Every
     cell is assigned 1 if the term frequency is not equal to 0, and 0
     otherwise.
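
     The two cell-wise transformations just described can be written
     in base R as follows. This is a sketch of the stated formulas,
     not the package's code; the names 'log_tf' and 'bin_tf' are
     illustrative, and the natural logarithm is an assumption, since
     the text does not name the base.

```r
# Toy frequency matrix with a few zero cells.
m <- matrix(c(0, 1, 3,
              2, 0, 0), nrow = 3)

# log(frequency + 1), natural logarithm assumed; zero counts stay zero.
log_tf <- log(m + 1)

# 1 where the term frequency is non-zero, 0 elsewhere.
bin_tf <- (m != 0) * 1
```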

     'gw_normalisation()' returns a normalised n times m matrix. Every
     cell equals 1 divided by the square root of the document vector
     length.

     'gw_idf()' returns the inverse document frequency in an n times m
     matrix. Every cell is 1 plus the logarithm of the ratio of the
     number of documents to the number of documents in which the term
     appears.
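
     As an illustration of this formula, the following base-R sketch
     computes one idf value per term. It is not the package's
     implementation: the orientation (terms in rows, documents in
     columns), the base-2 logarithm, and the names 'n_docs' and
     'doc_freq' are assumptions made for the example.

```r
# Terms in rows, documents in columns (assumed orientation).
m <- cbind(d1 = c(1, 1, 0), d2 = c(0, 2, 0), d3 = c(0, 1, 4))
rownames(m) <- c("rare", "common", "other")

n_docs <- ncol(m)            # number of documents
doc_freq <- rowSums(m != 0)  # documents in which each term appears

# 1 + log(N / df); base 2 is an assumption, the text does not name the base.
idf <- 1 + log2(n_docs / doc_freq)
```

     A term that occurs in every document gets the minimum value 1.
     In this sketch a term that occurs in no document yields 'Inf',
     so such rows would need to be removed first.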

     'gw_gfidf()' returns the global frequency multiplied by idf.
     Every cell equals the sum of the frequencies of one term divided
     by the number of documents in which the term appears.
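
     The ratio described here can be sketched in base R. Again this
     is an illustration under assumptions (terms in rows, documents
     in columns; the names 'global_freq' and 'doc_freq' are made up),
     not the package's source.

```r
# Terms in rows, documents in columns (assumed orientation).
m <- cbind(d1 = c(4, 1), d2 = c(0, 1), d3 = c(2, 1))
rownames(m) <- c("bursty", "even")

global_freq <- rowSums(m)    # total occurrences of each term
doc_freq <- rowSums(m != 0)  # documents containing each term

# Average number of occurrences per document in which the term appears.
gfidf <- global_freq / doc_freq
```

     A term concentrated in few documents ("bursty", 6 occurrences in
     2 documents) scores higher than one spread evenly ("even", 3
     occurrences in 3 documents).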

     'entropy()' returns the entropy (as defined by Shannon).

     'gw_entropy()' returns one plus entropy.
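
     For orientation, Shannon entropy of a term's distribution over
     the documents can be sketched in base R as below. This is only
     an illustration: the base-2 logarithm, the orientation (terms in
     rows, documents in columns), and all variable names are
     assumptions, and the sketch does not attempt to reproduce any
     scaling or sign convention the package may apply.

```r
# Terms in rows, documents in columns (assumed orientation).
m <- cbind(d1 = c(2, 4), d2 = c(2, 0))
rownames(m) <- c("spread", "focused")

# Share of each term's occurrences that falls into each document.
p <- m / rowSums(m)

# p * log2(p) per cell, treating 0 * log(0) as 0.
cells <- ifelse(p > 0, p * log2(p), 0)

h <- -rowSums(cells)  # Shannon entropy per term, in bits (assumed base 2)
one_plus_h <- 1 + h   # the "one plus entropy" variant
```

     A term split evenly over two documents has entropy 1 bit here,
     while a term confined to a single document has entropy 0.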

_V_a_l_u_e:

     Returns the weighted textmatrix of the same size and format as the
     input matrix.

_A_u_t_h_o_r(_s):

     Fridolin Wild fridolin.wild@wu-wien.ac.at

_R_e_f_e_r_e_n_c_e_s:

     Dumais, S. (1992) _Enhancing Performance in Latent Semantic
     Indexing (LSI) Retrieval_. Technical Report, Bellcore.

     Nakov, P., Popova, A., and Mateev, P. (2001) _Weight functions
     impact on LSA performance_. In: Proceedings of the Recent Advances
     in Natural Language Processing, Bulgaria, pp. 187-193.

     Shannon, C. (1948) _A Mathematical Theory of Communication_. In:
     The Bell System Technical Journal 27(July), pp. 379-423.

_E_x_a_m_p_l_e_s:

     ## use the logarithmised term frequency as local weight and 
     ## the inverse document frequency as global weight.

     vec1 = c( 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 )
     vec2 = c( 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0 )
     vec3 = c( 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0 )
     # use a name that does not mask base::matrix()
     tm = cbind(vec1, vec2, vec3)
     weighted = lw_logtf(tm) * gw_idf(tm)

