predictive.distribution {lda} | R Documentation

Description:

This function takes a fitted LDA-type model and computes a predictive distribution for new words in a document. This is useful for making predictions about held-out words.
Usage:

predictive.distribution(document_sums, topics, alpha, eta)
Arguments:

document_sums
    A K \times D matrix where each entry is a numeric proportional
    to the probability of seeing a topic (row) conditioned on a
    document (column); this entry is sometimes denoted \theta_{d,k}
    in the literature (see Details). Either the document_sums field
    or the document_expects field from the output of
    lda.collapsed.gibbs.sampler can be used.
topics
    A K \times V matrix where each entry is a numeric proportional
    to the probability of seeing a word (column) conditioned on a
    topic (row); this entry is sometimes denoted \beta_{w,k} in the
    literature (see Details). The column names should correspond to
    the words in the vocabulary. The topics field from the output of
    lda.collapsed.gibbs.sampler can be used.
alpha
    The scalar value of the Dirichlet hyperparameter for topic
    proportions. See References for details.
eta
    The scalar value of the Dirichlet hyperparameter for topic
    multinomials. See References for details.
Details:

The formula used to compute the predictive probability is

    p_d(w) = \sum_k (\theta_{d,k} + \alpha) (\beta_{w,k} + \eta).
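This formula can be evaluated for every word and document at once as a single matrix product. A minimal sketch on toy count matrices (the data below are illustrative, not from a real corpus, and the formula is applied exactly as stated, with no extra normalization):

```r
## Toy dimensions: K = 2 topics, V = 3 words, D = 1 document.
document_sums <- matrix(c(4, 6), nrow = 2)  # K x D topic counts
topics <- matrix(c(1, 2, 3,
                   3, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("learning", "model", "paper")))
alpha <- 0.1
eta <- 0.1

## p_d(w) = sum_k (theta_{d,k} + alpha) * (beta_{w,k} + eta),
## computed for all words and documents at once as a V x D matrix.
predictions <- t(topics + eta) %*% (document_sums + alpha)
```

Note how the row names of the result are inherited from the column names of topics, matching the documented shape and naming of the returned matrix.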
Value:

A V \times D matrix of the probability of seeing a word (row) in a document (column). The row names of the matrix are set to the column names of topics.
Author(s):

Jonathan Chang (jcone@princeton.edu)
References:

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
See Also:

lda.collapsed.gibbs.sampler for the format of topics and
document_sums and details of the model.

top.topic.words demonstrates another use for a fitted topic matrix.
Examples:

## Fit a model (from demo(lda)).
data(cora.documents)
data(cora.vocab)

K <- 10  ## Num clusters
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      K,           ## Num clusters
                                      cora.vocab,
                                      25,          ## Num iterations
                                      0.1,         ## alpha
                                      0.1)         ## eta
## Predict new words for the first two documents
predictions <- predictive.distribution(result$document_sums[, 1:2],
                                       result$topics,
                                       0.1, 0.1)
## Use top.topic.words to show the top 5 predictions in each document.
top.topic.words(t(predictions), 5)
##      [,1]         [,2]
## [1,] "learning"   "learning"
## [2,] "algorithm"  "paper"
## [3,] "model"      "problem"
## [4,] "paper"      "results"
## [5,] "algorithms" "system"
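Because the inputs are only proportional to probabilities, the columns of the returned matrix need not sum to one. If normalized per-document word distributions are wanted, one possible post-processing step is sketched below (normalize_columns is a hypothetical helper, not part of the lda package):

```r
## Hypothetical helper (not part of the lda package): rescale each
## column of a matrix so that it sums to one.
normalize_columns <- function(m) sweep(m, 2, colSums(m), "/")

## Toy V x D matrix standing in for the output of predictive.distribution.
toy <- matrix(c(2, 3, 5,
                1, 1, 2), nrow = 3)
probs <- normalize_columns(toy)
## Each column of `probs` now sums to 1.
```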