---
title: "Lab 2 - Unsupervised text classification"
author: "Adrien Guille"
date: "10/10/2018"
output: html_document
---

```{r}
library(text2vec)  # provides functions for vectorizing text
library(Matrix)    # adds support for sparse matrices
library(NNLM)      # implements NMF
library(stopwords) # provides lists of stopwords
```

# Part 2 - Unsupervised text classification with R

## Load the corpus

```{r}
corpus <- read.csv('../Data/reviews.csv', stringsAsFactors=FALSE)
```

## Compute the pruned vocabulary

We remove words that occur in more than half of the reviews or fewer than 20 times overall. We also remove common English stopwords from the vocabulary, via the `stopwords()` function from the *stopwords* package:

```{r}
iterator <- itoken(corpus$text, preprocessor=tolower, tokenizer=word_tokenizer, progressbar=FALSE)
vocabulary <- create_vocabulary(iterator, stopwords=stopwords(language='en', source='snowball'))
vocabulary <- prune_vocabulary(vocabulary, doc_proportion_max=0.5, term_count_min=20)
nrow(vocabulary)
```

## Vectorize the corpus

```{r}
vectorizer <- vocab_vectorizer(vocabulary)
dtm <- create_dtm(iterator, vectorizer)
dim(dtm)
```

## Apply tf-idf weighting

We transform the document-term matrix (i.e. the dtm):

```{r}
tfidf <- TfIdf$new()
dtm <- fit_transform(dtm, tfidf)
```

## Compute the decomposition via NMF with 50 topics

We set the seed for reproducibility (i.e. so that the pseudo-random initialization of NMF is always the same):

```{r}
set.seed(42)
decomp <- nnmf(as.matrix(dtm), 50, rel.tol = 1e-5)
dim(decomp$W)
dim(decomp$H)
```

## Print the top words for each topic

We sort the vocabulary according to the weights for each topic, in decreasing order.
The weight distribution over words for topic $i$ is given by the row vector $H_{i}$:

```{r}
for (i in 1:50) {
  topic <- decomp$H[i, ]
  top_words <- vocabulary[order(-topic), ][1:10, ]
  cat("Topic", i, "-", paste(top_words$term, collapse = ', '), "\n")
}
```

## Print the top documents for topic 3

### Print the first document for topic 3

We sort the reviews according to their weights for topic $j$. The weight distribution over documents for topic $j$ is given by the column vector $W_{j}$:

```{r}
j <- 3
coefficients <- decomp$W[, j]
top_document <- corpus$text[order(-coefficients)[1]]
print(top_document)
```

### Print the 10th document for topic 3

```{r}
j <- 3
coefficients <- decomp$W[, j]
top_document <- corpus$text[order(-coefficients)[10]]
print(top_document)
```
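Both ranking steps above rely on the same base-R idiom: `order(-x)` returns the indices that sort a weight vector in decreasing order, which we then use to index the vocabulary (or the corpus). A minimal, self-contained sketch of this idiom, using a tiny synthetic $H$ matrix and made-up terms (not the lab data):

```r
# Synthetic H: 2 topics x 3 terms, illustrative values only
H <- matrix(c(0.1, 0.7, 0.2,
              0.5, 0.0, 0.9), nrow = 2, byrow = TRUE)
terms <- c("plot", "acting", "score")

for (i in 1:nrow(H)) {
  # order(-H[i, ]) gives term indices sorted by decreasing topic weight
  ranked <- terms[order(-H[i, ])]
  cat("Topic", i, "-", paste(ranked, collapse = ", "), "\n")
}
# Topic 1 - acting, score, plot
# Topic 2 - score, plot, acting
```

The same pattern applied to a column of $W$ ranks documents instead of terms.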