lsa                   package:lsa                   R Documentation

_C_r_e_a_t_e _a _v_e_c_t_o_r _s_p_a_c_e _w_i_t_h _L_a_t_e_n_t _S_e_m_a_n_t_i_c _A_n_a_l_y_s_i_s (_L_S_A)

_D_e_s_c_r_i_p_t_i_o_n:

     Calculates a latent semantic space from a given document-term
     matrix.

_U_s_a_g_e:

        lsa( x, dims=dimcalc_share() )

_A_r_g_u_m_e_n_t_s:

       x: a document-term matrix (recommended to be of class
          textmatrix), containing documents in columns, terms in rows
          and occurrence frequencies in the cells.

    dims: either the number of dimensions or a configuring function.

_D_e_t_a_i_l_s:

     LSA combines the classical vector space model, well known in
     text mining, with a Singular Value Decomposition (SVD), a two-mode
     factor analysis. Bag-of-words representations of texts can
     thereby be mapped into a modified vector space that is assumed to
     reflect semantic structure.

     With 'lsa()' a new latent semantic space can be constructed over a
     given document-term matrix. To ease comparisons of terms and
     documents with common correlation measures, the space can be
     converted into a textmatrix of the same format as 'x' by calling
     'as.textmatrix()'.
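
     As a minimal sketch of such comparisons, the base R code below
     uses a small hand-made matrix as a stand-in for the output of
     'as.textmatrix()' (terms in rows, documents in columns). The
     values and the names 'm', 'doc_cor', 'doc_cos' and 'term_cor' are
     assumptions for illustration only.

```r
# Hypothetical stand-in for as.textmatrix(space): terms x documents.
# The numbers are invented for illustration.
m <- matrix(c(0.9, 0.1, 0.8,
              0.7, 0.0, 0.2,
              0.8, 0.9, 0.1,
              0.1, 0.8, 0.0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("dog", "cat", "mouse", "ham"),
                            c("D1", "D2", "D3")))

# Pearson correlation between documents (columns)
doc_cor <- cor(m)

# Cosine similarity between documents (columns)
norms <- sqrt(colSums(m^2))
doc_cos <- crossprod(m) / outer(norms, norms)

# Pearson correlation between terms (rows)
term_cor <- cor(t(m))
```

     Because terms and documents share one matrix format, the same
     measure can be applied to either by transposing the matrix.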

     Further documents or queries can be added to this latent semantic
     space without influencing the original factor distribution (i.e.,
     the latent semantic structure calculated from a primary text
     corpus): they can be 'folded in' later on (with the function
     'fold_in()').

     Background information (see also Deerwester et al., 1990): 

     A document-term matrix M is constructed with 'textmatrix()' from
     a given text base of n documents containing m terms. This matrix
     M of size m times n is then decomposed via a singular value
     decomposition into the term vector matrix T (constituting the
     left singular vectors), the document vector matrix D
     (constituting the right singular vectors), both orthonormal, and
     the diagonal matrix S (constituting the singular values).

     M = T S t(D)
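
     The decomposition can be sketched with base R's 'svd()' on a toy
     matrix. This is an illustration under assumed data, not
     necessarily how 'lsa()' computes it internally; the names 'Tmat',
     'Smat' and 'Dmat' are chosen here to avoid masking R's 'T' and
     'D'.

```r
# Toy term-by-document matrix (rows = terms, columns = documents);
# the counts are made up for illustration.
M <- matrix(c(1, 0, 1,
              1, 0, 0,
              1, 1, 0,
              0, 1, 0,
              0, 1, 0,
              0, 0, 2),
            nrow = 6, byrow = TRUE,
            dimnames = list(c("dog", "cat", "mouse", "ham", "sushi", "pet"),
                            c("D1", "D2", "D3")))

dec  <- svd(M)        # base R singular value decomposition
Tmat <- dec$u         # left singular vectors (term vectors)
Smat <- diag(dec$d)   # singular values on the diagonal
Dmat <- dec$v         # right singular vectors (document vectors)

# Multiplying the three factors recovers M (up to rounding error)
M_rebuilt <- Tmat %*% Smat %*% t(Dmat)
all.equal(M_rebuilt, M, check.attributes = FALSE)

# The columns of Tmat and Dmat are orthonormal
all.equal(crossprod(Tmat), diag(3))
all.equal(crossprod(Dmat), diag(3))
```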

     These matrices are then reduced to the given number of dimensions
     k=dims, resulting in the truncated matrices Tk, Sk and Dk: the
     latent semantic space.

     Mk = T[,1:k] S[1:k,1:k] t(D[,1:k])

     If these matrices Tk, Sk, Dk were multiplied, they would give a
     new matrix Mk (of the same format as M, i.e., rows are the same
     terms, columns are the same documents), which is the least-squares
     best-fit approximation of M with k singular values.
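
     The truncation and the least-squares property can be checked in
     base R on a toy matrix. The data and the choice k = 2 are
     assumptions for illustration; the sketch relies on the standard
     result that the Frobenius-norm error of the rank-k approximation
     equals the root of the summed squares of the discarded singular
     values.

```r
# Toy term-by-document matrix; the counts are made up for illustration.
M <- matrix(c(1, 0, 1,
              1, 0, 0,
              1, 1, 0,
              0, 1, 0,
              0, 1, 0,
              0, 0, 2),
            nrow = 6, byrow = TRUE,
            dimnames = list(c("dog", "cat", "mouse", "ham", "sushi", "pet"),
                            c("D1", "D2", "D3")))

dec <- svd(M)
k   <- 2                                  # number of dimensions to keep

Tk <- dec$u[, 1:k, drop = FALSE]          # truncated term vectors
Sk <- diag(dec$d[1:k], nrow = k)          # truncated singular values
Dk <- dec$v[, 1:k, drop = FALSE]          # truncated document vectors

# Rank-k approximation in the same format as M
Mk <- Tk %*% Sk %*% t(Dk)
dimnames(Mk) <- dimnames(M)

# Frobenius-norm error of the approximation ...
err <- sqrt(sum((M - Mk)^2))
# ... equals the norm of the discarded singular values
all.equal(err, sqrt(sum(dec$d[-(1:k)]^2)))
```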

     In the case of folding-in, i.e., multiplying new documents into a
     given latent semantic space, the matrices Tk and Sk remain
     unchanged and an additional Dk is created (without replacing the
     old one). All three are multiplied together to return a (new and
     appendable) document-term matrix Mnew in the term-order of M.
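
     A minimal base R sketch of folding-in follows, using the standard
     projection Dnew = t(v) Tk Sk^-1 for a new document vector v. The
     toy data, the new document 'D4' and the variable names are
     assumptions; 'fold_in()' may differ in detail.

```r
# Toy term-by-document matrix; the counts are made up for illustration.
M <- matrix(c(1, 0, 1,
              1, 0, 0,
              1, 1, 0,
              0, 1, 0,
              0, 1, 0,
              0, 0, 2),
            nrow = 6, byrow = TRUE,
            dimnames = list(c("dog", "cat", "mouse", "ham", "sushi", "pet"),
                            c("D1", "D2", "D3")))

dec <- svd(M)
k   <- 2
Tk  <- dec$u[, 1:k, drop = FALSE]
Sk  <- diag(dec$d[1:k], nrow = k)

# Hypothetical new document over the same terms, in the term order of M
v <- matrix(c(1, 0, 0, 0, 0, 1), ncol = 1,
            dimnames = list(rownames(M), "D4"))

# Additional document vectors; Tk and Sk stay unchanged
Dnew <- t(v) %*% Tk %*% solve(Sk)

# Multiply all three together to get the folded-in document
Mnew <- Tk %*% Sk %*% t(Dnew)
dimnames(Mnew) <- dimnames(v)

# Mnew can be appended to the rank-k approximation column-wise
Mk <- Tk %*% Sk %*% t(dec$v[, 1:k, drop = FALSE])
dimnames(Mk) <- dimnames(M)
combined <- cbind(Mk, Mnew)
```

     Under this projection Mnew equals Tk t(Tk) v, that is, the new
     document projected onto the space spanned by the retained term
     vectors.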

_V_a_l_u_e:

LSAspace: a list with components (Tk, Sk, Dk), representing the latent
          semantic space.

_A_u_t_h_o_r(_s):

     Fridolin Wild fridolin.wild@wu-wien.ac.at

_R_e_f_e_r_e_n_c_e_s:

     Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and
     Harshman, R. (1990) _Indexing by Latent Semantic Analysis_. In:
     Journal of the American Society for Information Science 41(6), pp.
     391-407.

     Landauer, T., Foltz, P., and Laham, D. (1998) _Introduction to
     Latent Semantic Analysis_. In: Discourse Processes 25, pp.
     259-284.

_S_e_e _A_l_s_o:

     'as.textmatrix', 'fold_in', 'textmatrix', 'gw_idf',
     'dimcalc_share'

_E_x_a_m_p_l_e_s:

     # create some files
     td = tempfile()
     dir.create(td)
     write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
     write( c("ham", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
     write( c("dog", "pet", "pet"), file=paste(td, "D3", sep="/") )

     # LSA
     data(stopwords_en)
     myMatrix = textmatrix(td, stopwords=stopwords_en)
     myMatrix = lw_logtf(myMatrix) * gw_idf(myMatrix)
     myLSAspace = lsa(myMatrix, dims=dimcalc_share())
     as.textmatrix(myLSAspace)

     # clean up
     unlink(td, recursive=TRUE)

