library(text2vec) # provides functions for vectorizing text
library(Matrix) # adds support for sparse matrices
library(NNLM) # implements NMF
library(stopwords) # provides lists of stopwords

Part 2 - Unsupervised text classification with R

Load the corpus

corpus <- read.csv('../Data/reviews.csv', stringsAsFactors=FALSE)
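
As a quick sanity check (a minimal sketch, assuming the file has a text column, as used below), we can count the reviews and preview the beginning of the first one:

# Hypothetical sanity check: number of reviews and a preview of the first one
nrow(corpus)
substr(corpus$text[1], 1, 200)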

Compute the pruned vocabulary

We remove words that occur in more than half of the reviews or fewer than 20 times overall. We also remove common English stopwords from the vocabulary, using the stopwords() function from the stopwords package:

iterator <- itoken(corpus$text,
                   preprocessor=tolower,
                   tokenizer=word_tokenizer,
                   progressbar=FALSE)
vocabulary <- create_vocabulary(iterator, stopwords=stopwords(language='en', source='snowball'))
vocabulary <- prune_vocabulary(vocabulary, doc_proportion_max=0.5, term_count_min=20)
nrow(vocabulary)
## [1] 5500
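
To get a feel for what survives the pruning, we can look at the most frequent remaining terms (a quick sketch; create_vocabulary returns a data frame with term, term_count and doc_count columns):

# Most frequent terms kept after pruning
head(vocabulary[order(vocabulary$term_count, decreasing=TRUE), ])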

Vectorize the corpus

vectorizer <- vocab_vectorizer(vocabulary)
dtm <- create_dtm(iterator, vectorizer)
dim(dtm)
## [1] 2000 5500
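
The document-term matrix is stored as a sparse matrix from the Matrix package. As a rough check (a minimal sketch), we can compute the fraction of non-zero entries:

# Fraction of non-zero entries in the sparse document-term matrix
Matrix::nnzero(dtm) / prod(dim(dtm))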

Apply tf-idf weighting

We apply TF-IDF weighting to the document-term matrix (dtm), which downweights terms that appear in many reviews:

tfidf <- TfIdf$new()
dtm <- fit_transform(dtm, tfidf)
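
After the weighting, the largest values in a row correspond to the terms that are most specific to that review. As an illustration (a sketch, assuming create_dtm kept the terms as column names, which it does by default), we can list the highest-weighted terms of the first review:

# Top tf-idf weighted terms of the first review (hypothetical inspection step)
first_review <- dtm[1, ]
head(sort(first_review, decreasing=TRUE), 10)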

Compute the decomposition via NMF with 50 topics

We set the seed for reproducibility (i.e. so that the pseudo-random initialization of NMF is always the same):

set.seed(42)
decomp <- nnmf(as.matrix(dtm), 50, rel.tol = 1e-5)
dim(decomp$W)
## [1] 2000   50
dim(decomp$H)
## [1]   50 5500
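
W gives the weight of each topic in each review, and H gives the weight of each term in each topic. As a quick way to interpret the result (a minimal sketch, assuming the columns of H follow the columns of dtm, which is how nnmf lays out its factors), we can list the top terms of a topic and assign each review to its dominant topic:

# Ten highest-weighted terms of the first topic; columns of H follow the columns of dtm
terms <- colnames(dtm)
head(terms[order(decomp$H[1, ], decreasing=TRUE)], 10)

# Dominant topic of each review, read off the rows of W
dominant_topic <- apply(decomp$W, 1, which.max)
head(dominant_topic)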