For this lab, we need two packages: text2vec, to compute the vocabulary and vectorize the corpus, and Matrix, to manipulate the sparse matrices that text2vec generates.
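
Both packages must be attached before the rest of the code will run. A minimal setup sketch, assuming both are meant to be installed from CRAN; the variable names `required`, `is_installed` and `not_installed` are illustrative:

```r
# Packages named in the introduction
required <- c("text2vec", "Matrix")

# requireNamespace() returns FALSE (instead of failing) when a package is absent
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!is_installed]

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "),
          ". Install them with install.packages().")
} else {
  library(text2vec)
  library(Matrix)
}
```

Checking first avoids a hard error in the middle of the lab when a package is missing.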


Part 1 - Analyzing a collection of movie reviews

1 - Load the corpus

The corpus is a collection of movie reviews written in English. First, we load the content of the CSV file into a data frame:

corpus <- read.csv('../Data/reviews.csv', stringsAsFactors=FALSE)
names(corpus)
## [1] "doc_id"    "text"      "sentiment"

There are three fields in this CSV: doc_id, text and sentiment. For this lab, we only need the content of the reviews. Let’s look at the first 50 characters of the first review:

substr(corpus$text[1], 1, 50)
## [1] "plot : two teen couples go to a church party , dri"

Even though we don’t care about the sentiment of the reviews for now, we can still look at how many positive reviews there are:

cat("There are", nrow(corpus), "reviews, out of which",
    sum(corpus$sentiment == 'pos'), "are positive reviews.")
## There are 2000 reviews, out of which 1000 are positive reviews.
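
To see the counts for every label at once, `table()` is a common alternative. A minimal sketch on a small made-up data frame (`toy_corpus` is hypothetical and only mimics the `sentiment` column of the real corpus):

```r
# Hypothetical stand-in for the real corpus: five documents, two labels
toy_corpus <- data.frame(
  doc_id = 1:5,
  sentiment = c("pos", "neg", "pos", "pos", "neg"),
  stringsAsFactors = FALSE
)

# Number of documents per sentiment label
sentiment_counts <- table(toy_corpus$sentiment)

# Number of positive documents only
n_positive <- sum(toy_corpus$sentiment == "pos")
```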

2 - Compute the vocabulary

We instantiate an iterator to transform the text into a sequence of lowercased unigrams and then compute the vocabulary:

iterator <- itoken(corpus$text,
                   preprocessor=tolower, # convert the text to lowercase
                   tokenizer=word_tokenizer) # split the text into unigrams
vocabulary <- create_vocabulary(iterator)
n_words <- nrow(vocabulary)
n_tokens <- sum(vocabulary$term_count)
cat("Number of word types:", n_words, "\nNumber of tokens:", n_tokens)
## Number of word types: 42392 
## Number of tokens: 1309372
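
To make the distinction between word types, tokens and document counts concrete, here is a minimal base-R sketch of the same computation on a three-document toy corpus. The corpus, the regular expression used for splitting and all `toy_` names are assumptions made for illustration; the split is only a rough stand-in for `word_tokenizer`, not a reimplementation of text2vec:

```r
# Hypothetical toy corpus (not taken from reviews.csv)
toy_docs <- c("The plot is thin", "the cast is good", "A good plot")

# Lowercase, then split on runs of non-alphanumeric characters
toy_tokens <- strsplit(tolower(toy_docs), "[^[:alnum:]]+")
toy_tokens <- lapply(toy_tokens, function(x) x[x != ""])

# term_count: total number of occurrences over all documents
term_count <- table(unlist(toy_tokens))

# doc_count: number of documents that contain the term at least once
doc_count <- table(unlist(lapply(toy_tokens, unique)))

toy_vocabulary <- data.frame(
  term = names(term_count),
  term_count = as.integer(term_count),
  doc_count = as.integer(doc_count[names(term_count)]),
  stringsAsFactors = FALSE
)

toy_n_words <- nrow(toy_vocabulary)             # number of word types
toy_n_tokens <- sum(toy_vocabulary$term_count)  # number of tokens
```

A word type is counted once however often it appears, whereas every occurrence counts as a token, which is why the number of tokens is much larger than the number of word types.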

The vocabulary is a table; each row consists of a word (i.e. term), its overall frequency (i.e. term_count) and the number of documents it occurs in (i.e. doc_count):

head(vocabulary)
## Number of docs: 2000 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##           term term_count doc_count
## 1:       liken          1         1
## 2:  injections          1         1
## 3:  centrifuge          1         1
## 4: overkilling          1         1
## 5:     flossed          1         1
## 6:   artillary          1         1

Identify the 10 most common words

We sort the vocabulary in decreasing order of word frequency (i.e. term_count) and print the first 10 entries:

ordered_vocabulary <- vocabulary[order(-vocabulary$term_count), ]
head(ordered_vocabulary, 10)
## Number of docs: 2000 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##     term term_count doc_count
##  1:  the      76562      1999
##  2:    a      38104      1996
##  3:  and      35576      1998
##  4:   of      34123      1998
##  5:   to      31937      1997
##  6:   is      25195      1995
##  7:   in      21821      1994
##  8: that      15129      1957
##  9:   it      12352      1935
## 10:   as      11378      1920

We get the usual stop-words, which occur in almost all documents.
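
To get a sense of how dominant these words are, one can compute the share of all tokens they account for. The sketch below simply re-enters the ten counts and the token total printed above; the variable names are illustrative:

```r
# term_count of the 10 most frequent words, copied from the output above
top10_counts <- c(76562, 38104, 35576, 34123, 31937,
                  25195, 21821, 15129, 12352, 11378)

# Total number of tokens, copied from the vocabulary summary
total_tokens <- 1309372

# Fraction of all tokens covered by the 10 most frequent words
top10_share <- sum(top10_counts) / total_tokens
round(100 * top10_share, 1)  # percentage, about 23.1
```

With the full vocabulary in memory, the same quantity would presumably be `sum(ordered_vocabulary$term_count[1:10]) / n_tokens`.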

Plot the distribution of word frequency

For the sake of readability, we select the sub-vocabulary of words that occur at most 20 times, then plot the histogram of word frequency:

vocabulary_20 <- vocabulary[which(vocabulary$term_count <= 20), ]
histogram <- hist(vocabulary_20$term_count, 
                  main='Word frequency distribution', 
                  xlab='Word frequency', 
                  ylab='Frequency of word frequency')
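
Because word frequencies are integers, the default bins chosen by `hist()` may merge neighbouring values into one bar. A minimal sketch on made-up counts (`toy_counts` is hypothetical) showing two ways to get exactly one count per frequency value:

```r
# Hypothetical term counts for ten word types
toy_counts <- c(1, 1, 1, 1, 2, 2, 3, 5, 5, 20)

# Option 1: tabulate how many word types share each frequency
freq_of_freq <- table(toy_counts)

# Option 2: histogram with one bin per integer from 1 to 20
# (plot = FALSE only computes the bins, without drawing)
h <- hist(toy_counts, breaks = seq(0.5, 20.5, by = 1), plot = FALSE)
h$counts[1:5]  # word types with frequency 1, 2, 3, 4, 5
```

Placing the breaks at half-integers centres each bar on an integer frequency, so the height of the first bar is the number of words that occur exactly once.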

3 - Plot word frequency versus rank

First, we plot word frequency versus word rank (i.e. position in the ordered vocabulary) for the 200 most frequent words:

frequency <- ordered_vocabulary$term_count[1:200]
plot(frequency,
     main='Word frequency versus rank',
     xlab='Word rank',
     ylab='Word frequency')

Then, we plot the same data with logarithmic axes. The points fall roughly on a straight line, which is typical of power-law relationships:

plot(frequency,
     main='Word frequency versus rank', 
     xlab='Word log-rank', 
     ylab='Word log-frequency', 