---
title: "Lab 1 - Manipulating text data with R"
author: "Adrien Guille"
date: "03/10/2018"
output:
  pdf_document: default
  html_document: default
---

For this lab, we need two packages, text2vec, to compute the vocabulary and vectorize the corpus, and Matrix, to manipulate the sparse matrices generated with text2vec.

```{r}
library(text2vec)
library(Matrix)
```

# Part 1 - Analyzing a collection of movie reviews

## 1 - Load the corpus

The corpus is a collection of movie reviews written in English. First, we load the content of the CSV file into a data frame:

```{r}
corpus <- read.csv('../Data/reviews.csv', stringsAsFactors=FALSE)
colnames(corpus)
```

There are three fields in this CSV: doc_id, text and sentiment. For this lab, we only need the content of the reviews. Let's look at the 50 first characters of the first review:

```{r}
substr(corpus$text[1], 1, 50)
```

Even though we don't care about the sentiment of the reviews for now, we can still look at how many positive reviews there are:

```{r}
cat("There are", nrow(corpus), "reviews, out of which",
    nrow(corpus[which(corpus$sentiment=='pos'), ]), "are positive reviews.")
```

## 2 - Compute the vocabulary

We instanciate an iterator to transform the text into a sequence of lowercased unigrams and then compute the vocabulary:

```{r}
iterator <- itoken(corpus$text,
                   preprocessor=tolower, # replace capital letters
                   tokenizer=word_tokenizer, # split the text into unigrams
                   progressbar=FALSE)
vocabulary <- create_vocabulary(iterator)
n_words <- nrow(vocabulary)
n_tokens <- sum(vocabulary$term_count)
cat("Number of word types:", n_words, "\nNumber of tokens:", n_tokens)
```

The vocabulary is a table; each row consist of a word (i.e. term), its overall frequency (i.e. term_count) and the number of documents it occurs in (i.e. doc_count):

```{r}
head(vocabulary)
```

### Identify the 10 most common words

We sort the vocabulary in decreasing order w.r.t word frequency (i.e. term_count) and print the first 10 entries:

```{r}
ordered_vocabulary <- vocabulary[order(-vocabulary$term_count), ]
head(ordered_vocabulary, 10)
```

We get the usual stop-words, which occur in almost all documents.

### Plot the distribution of word frequency

For the sake of readability, we select the sub-vocabulary of words that occur at most 20 times, then plot the histogram of word frequency:

```{r}
vocabulary_20 <- vocabulary[which(vocabulary$term_count <= 20), ]
histogram <- hist(vocabulary_20$term_count, 
                  breaks=20, 
                  main='Word frequency distribution', 
                  xlab='Word frequency', 
                  ylab='Frequency of word frequency')
```

## 3 - Plot word frequency versus rank

First, we plot word frequency versus word rank (i.e. position in the ordered vocabulary) for the 200 most frequent words:

```{r}
frequency <- ordered_vocabulary$term_count[1:200]
plot(frequency, 
     main='Word frequency versus rank', 
     xlab='Word rank', 
     ylab='Word frequency')
```

Then, we plot the same data with logarithmic axes. We observe kind of a straight-line, which is typical of power law relationships:

```{r}
plot(frequency, 
     main='Word frequency versus rank', 
     xlab='Word log-rank', 
     ylab='Word log-frequency', 
     log='xy')
```

## 4 - Fit Zipf's law

Zipf's law models the relationship between the frequency of a word, $f_r$, and its rank, $r$:

$$ 
f_r \simeq f_\text{max}\frac{1}{r^{-k}}
$$
In the log space, it becomes:

$$
\log(f_r) \simeq \log(f_\text{max}) + k \log(r)
$$

### Estimate the parameters of the power law

We estimate the parameters via least-square fitting, via the $\text{lm}$ function:

```{r}
log_frequency <- log(frequency)
log_rank <- log(c(1:200))
model <- lm(log_frequency ~ log_rank)
model$coefficients
```

We get a value of $k$ close to -1, which is typical for English.

### Plot the estimation of frequency vs. rank

We write a function that returns the estimation of log the frequency of a word according to this model, given its rank:

```{r}
estimate_log_frequency <- function(r){
    return(as.numeric(model$coefficients[1]) 
           + as.numeric(model$coefficients[2]) * log(r))
}
```

We plot the estimation in the original space, thus we take the exponential of the estimation: 

```{r}
estimated_frequency <- exp(sapply(c(1:200), estimate_log_frequency))
plot(frequency, 
     main='Word frequency versus rank', 
     xlab='Word rank', 
     ylab='Word frequency')
lines(estimated_frequency, col='red')
```

## 5 - Vectorize the corpus

### Prune the vocabulary

In order to contain the dimension of the document-term matrix (i.e. dtm), we prune the vocabulary:

```{r}
pruned_vocabulary <- prune_vocabulary(vocabulary, 
                                      doc_proportion_max=0.5, 
                                      term_count_min=10)
nrow(pruned_vocabulary)
```

Only 10 000 word types satisfy the pruning conditions.

### Construct the document-term matrix

We instanciate a vetorizer based on the pruned vocabulary and create the document-term matrix using the iterator we've defined previously:

```{r}
vectorizer = vocab_vectorizer(pruned_vocabulary)
dtm = create_dtm(iterator, vectorizer)
```

## 6 - Implement cosine similarity

The cosine similarity between two documents, $d_1$ and $d_2$, is given by:

$$
\text{cosine similarity}(d_1, d_2) = \frac{d_1 \cdot d_2}{||d_1|| ~ ||d_2||}
$$

```{r}
cosine_similarity <- function(d1, d2){
  dot_product <- d1 %*% d2
  norm_prod <- sqrt(sum(d1**2)) * sqrt(sum(d2**2))
  return(as.numeric(dot_product/norm_prod))
}
```

## 7 - Assess the impact of tf-idf weighting

We apply tf-idf weighting to the original document-term matrix:

```{r}
tfidf <- TfIdf$new(smooth_idf = TRUE, sublinear_tf = TRUE)
tfidf_dtm <- tfidf$fit_transform(dtm)
```

We measure the cosine similarity between the first review and the next five ones in the corpus, using both document-term matrices:

```{r}
raw_sim <- numeric(5)
tfidf_sim <- numeric(5)
for(i in 2:6){
  raw_sim[i-1] <- cosine_similarity(dtm[1, ], dtm[i, ])
  tfidf_sim[i-1] <- cosine_similarity(tfidf_dtm[1, ], tfidf_dtm[i, ])
}
```

We sort the five reviews w.r.t their similarity with the first review:

```{r}
order(-raw_sim)
order(-tfidf_sim)
```

Even though the closest review is the same in both cases, the rest of the rankings are quite different.

# Part 2 - Analyzing random text

## 1 - Generate random text following Li's procedure

We generate random text from an alphabet of 4 letters, plus the space character to delimit tokens. The maximum length of a word is fixed to 4 characters. We generate a biased random character sequence according to the probability distribution suggested by Li (p(a)=0.5, p(b)=0.13, p(c)=0.1, p(d)=0.07, p(space)=0.2):

```{r}
M <- 4
alphabet <- letters[1:M]
alphabet[M+1] <- ' '
probabilities <- c(0.5, 0.13, 0.1, 0.07, 0.2)
random_text <- ""
max_word_length <- 4
current_word_length <- 0

generate_random_text <- function(n_char){
  for(i in 1:n_char){
      next_character <- sample(alphabet, 1, prob=probabilities)
      if(next_character == ' '){
          current_word_length <- 0
      }else{
          current_word_length <- current_word_length + 1
      }
      if(current_word_length > max_word_length){
          next_character <- ' '
          current_word_length <- 0
      }
      random_text <- paste(random_text, next_character, sep='')
  }
  return(random_text)
}
```

We generate a sequence of 200 characters to see what a random text looks like: 

```{r}
generate_random_text(200)
```

We generate a sequence of 10 000 characters and compute the vocabulary:

```{r}
random_text <- generate_random_text(10**4)
iterator_r <- itoken(random_text,
                     tokenizer = word_tokenizer)
vocabulary_r <- create_vocabulary(iterator_r)
n_words_r <- nrow(vocabulary_r)
n_tokens_r <- sum(vocabulary_r$term_count)
cat("Number of distinct words: ", 
    n_words_r, 
    "\nNumber of tokens: ", 
    n_tokens_r)
```

## 2 - Fit Zipf's law on this text

As in Part 1, we fit Zipf's law parameters in log-space:

```{r}
frequency_r <- rev(vocabulary_r$term_count)[1:200]
log_frequency_r <- log(frequency_r)
log_rank_r <- log(c(1:200))
model_r <- lm(log_frequency_r ~ log_rank_r)
model_r$coefficients
```

Again, we get a value close to -1. Finally, we plot the estimation on top of the data:

```{r}
estimate_log_frequency_r <- function(r){
    return(as.numeric(model_r$coefficients[1]) 
           + as.numeric(model_r$coefficients[2]) * log(r))
}
estimated_frequency_r <- exp(sapply(c(1:200), estimate_log_frequency_r))
plot(frequency_r, 
     main='Word frequency versus rank (Random text)', 
     xlab='Word rank', 
     ylab='Word frequency')
lines(estimated_frequency_r, col='red')
```