textmatrix                package:lsa                R Documentation

_T_e_x_t_m_a_t_r_i_x (_M_a_t_r_i_c_e_s)

_D_e_s_c_r_i_p_t_i_o_n:

     Creates a document-term matrix from all textfiles in a given
     directory.

_U_s_a_g_e:

     textmatrix( mydir, stemming=FALSE, language="german", 
        minWordLength=2, minDocFreq=1, stopwords=NULL, vocabulary=NULL )
     textvector( file, stemming=FALSE, language="german", 
        minWordLength=2, minDocFreq=1, stopwords=NULL, vocabulary=NULL )

_A_r_g_u_m_e_n_t_s:

    file: filename (may include path).

   mydir: the directory path (e.g., '"corpus/texts/"').

stemming: boolean indicating whether to reduce all terms to their
          wordstem.

language: specifies language for the stemming / stop-word-removal.

minWordLength: words with fewer than minWordLength characters will be
          ignored.

minDocFreq: words appearing fewer than minDocFreq times within a
          document will be ignored.

stopwords: a stopword list containing terms that will be ignored.

vocabulary: if specified, only words in this term list will be used for
          building the matrix (`controlled vocabulary').

_D_e_t_a_i_l_s:

     All documents in the specified directory are read and a matrix is
     composed. Each cell of the matrix contains the exact number of
     appearances (i.e., the term frequency) of a word in a document.
     If specified, simple text preprocessing mechanisms are applied
     (stemming, stopword filtering, word-length cutoffs).
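The effect of the preprocessing options can be sketched as follows (a minimal example, assuming package 'lsa' and a stemmer backend are installed; file names and contents are made up for illustration):

```r
# Sketch: stemming collapses inflected forms into one row, and
# minWordLength drops tokens shorter than the cutoff.
library(lsa)

td <- tempfile()
dir.create(td)
write( c("dogs", "dog", "walking", "at"), file = paste(td, "D1", sep = "/") )

# "dogs" and "dog" should collapse to a single stem; "at" falls
# below the three-character cutoff and is ignored.
tm <- textmatrix(td, stemming = TRUE, language = "english", minWordLength = 3)
tm

unlink(td, recursive = TRUE)
```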

     Stemming uses Porter's Snowball stemmer (from package 'Rstem').

     There are two stopword lists included (for English and for
     German), which are loaded on demand into the variables
     'stopwords_de' and 'stopwords_en'. They can be activated by
     calling 'data(stopwords_de)' or 'data(stopwords_en)'. Note that
     the stopword list must already be loaded when 'textmatrix()' is
     called.
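The required loading order can be sketched like this (a minimal example, assuming package 'lsa' is installed; the file contents are made up for illustration):

```r
# Sketch: the stopword list must be loaded via data() *before*
# textmatrix() is called, then passed via the stopwords argument.
library(lsa)

td <- tempfile()
dir.create(td)
write( c("the", "dog", "and", "cat"), file = paste(td, "D1", sep = "/") )

data(stopwords_en)   # load the list first
tm <- textmatrix(td, stopwords = stopwords_en, language = "english")
tm                   # common function words such as "the" are filtered out

unlink(td, recursive = TRUE)
```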

     'textvector()' is a support function that creates a list of
     term-in-document occurrences.
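A quick sketch of the support function (assuming package 'lsa' is installed; the file name and contents are made up for illustration):

```r
# Sketch: textvector() processes a single file and returns its
# term-in-document occurrences, one entry per term.
library(lsa)

td <- tempfile()
dir.create(td)
fn <- paste(td, "D1", sep = "/")
write( c("dog", "cat", "dog"), file = fn )

tv <- textvector(fn)
tv   # "dog" should appear with a frequency of 2

unlink(td, recursive = TRUE)
```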

     Every generated matrix gets its own environment added as an
     attribute; it holds the triples that are stored by 'setTriple()'
     and can be retrieved with 'getTriple()'.
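The triple store can be sketched as follows (a hypothetical example, assuming package 'lsa' is installed and the signatures setTriple(matrix, subject, predicate, object) and getTriple(matrix, subject, predicate); the predicate and value are made up for illustration):

```r
# Sketch: attach metadata triples to a document-term matrix and
# read them back from the matrix's attribute environment.
library(lsa)

td <- tempfile()
dir.create(td)
write( c("dog", "cat"), file = paste(td, "D1", sep = "/") )
tm <- textmatrix(td)

setTriple(tm, "D1", "has_author", "Anne")   # store a triple for document D1
getTriple(tm, "D1", "has_author")           # retrieve it again

unlink(td, recursive = TRUE)
```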

_V_a_l_u_e:

textmatrix: the document-term matrix (incl. row and column names).

_A_u_t_h_o_r(_s):

     Fridolin Wild fridolin.wild@wu-wien.ac.at

_S_e_e _A_l_s_o:

     'wordStem', 'stopwords_de', 'stopwords_en', 'setTriple',
     'getTriple'

_E_x_a_m_p_l_e_s:

     # create some files
     td = tempfile()
     dir.create(td)
     write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
     write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
     write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/") )

     # read them, create a document-term matrix
     textmatrix(td)

     # read them, drop german stopwords
     data(stopwords_de)
     textmatrix(td, stopwords=stopwords_de)

     # read them based on a controlled vocabulary
     voc = c("dog", "mouse")
     textmatrix(td, vocabulary=voc, minWordLength=1)

     # clean up
     unlink(td, recursive=TRUE)

