---
title: "Lab 2 - Unsupervised text classification"
author: "Adrien Guille"
date: "10/10/2018"
output: html_document
---

```{r}
library(text2vec)  # provides functions for vectorizing text
library(Matrix)    # adds support for sparse matrices
library(NNLM)      # implements NMF
library(stopwords) # provides lists of stopwords
```

# Part 2 - Unsupervised text classification with R

## Load the corpus

```{r}
corpus <- read.csv('../Data/reviews.csv', stringsAsFactors=FALSE)
```

## Compute the pruned vocabulary

We remove words that occur in more than half of the reviews or fewer than 20 times overall. We also remove common English stopwords from the vocabulary, via the `stopwords()` function from the *stopwords* package:

```{r}
iterator <- itoken(corpus$text, preprocessor=tolower, tokenizer=word_tokenizer, progressbar=FALSE)
vocabulary <- create_vocabulary(iterator, stopwords=stopwords(language='en', source='snowball'))
vocabulary <- prune_vocabulary(vocabulary, doc_proportion_max=0.5, term_count_min=20)
nrow(vocabulary)
```

## Vectorize the corpus

```{r}
vectorizer <- vocab_vectorizer(vocabulary)
dtm <- create_dtm(iterator, vectorizer)
dim(dtm)
```

## Apply tf-idf weighting

We transform the document-term matrix (i.e. the dtm):

```{r}
tfidf <- TfIdf$new()
dtm <- fit_transform(dtm, tfidf)
```

## Compute the decomposition via NMF with 50 topics

We set the seed for reproducibility (i.e. so that the pseudo-random initialization of NMF is always the same):

```{r}
set.seed(42)
decomp <- nnmf(as.matrix(dtm), 50, rel.tol = 1e-5)
dim(decomp$W)
dim(decomp$H)
```

## Print the top words for each topic

We sort the vocabulary according to the weights for each topic, in decreasing order.
The weight distribution over words for topic $i$ is given by the row vector $H_{i}$:

```{r}
for (i in 1:50) {
  topic <- decomp$H[i, ]
  top_words <- vocabulary[order(-topic), ][1:10, ]
  cat("Topic", i, "-", paste(top_words$term, collapse = ', '), "\n")
}
```

## Print the top documents for topic 3

### Print the first document for topic 3

We sort the reviews according to their weights for topic $j$. The weight distribution over documents for topic $j$ is given by the column vector $W_{j}$:

```{r}
j <- 3
coefficients <- decomp$W[, j]
top_document <- corpus$text[order(-coefficients)[1]]
print(top_document)
```

### Print the 10th document for topic 3

```{r}
j <- 3
coefficients <- decomp$W[, j]
top_document <- corpus$text[order(-coefficients)[10]]
print(top_document)
```
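Both ranking steps above rely on the same base-R idiom: `order(-x)` returns the indices that sort a weight vector in decreasing order, which we then use to index the vocabulary (or the corpus). A minimal, self-contained sketch of this idiom, using a tiny synthetic $H$ matrix and made-up terms (not the lab data):

```r
# Synthetic H: 2 topics x 3 terms, illustrative values only
H <- matrix(c(0.1, 0.7, 0.2,
              0.5, 0.0, 0.9), nrow = 2, byrow = TRUE)
terms <- c("plot", "acting", "score")

for (i in 1:nrow(H)) {
  # order(-H[i, ]) gives term indices sorted by decreasing topic weight
  ranked <- terms[order(-H[i, ])]
  cat("Topic", i, "-", paste(ranked, collapse = ", "), "\n")
}
# Topic 1 - acting, score, plot
# Topic 2 - score, plot, acting
```

The same pattern applied to a column of $W$ ranks documents instead of terms.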