For this lab, we need two packages: text2vec, to compute the vocabulary and vectorize the corpus, and Matrix, to manipulate the sparse matrices generated by text2vec.
library(text2vec)
library(Matrix)
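To make the role of each package concrete, here is a minimal sketch of the kind of pipeline they support, run on a tiny made-up corpus (the three toy documents and the toy_* object names are purely illustrative and are not part of the lab; the real corpus is loaded in the next step):
# Toy corpus of three short, made-up documents (illustration only).
toy_docs <- c("a good movie", "a bad movie", "good plot , bad acting")
toy_iterator <- itoken(toy_docs, preprocessor=tolower, tokenizer=word_tokenizer)
toy_vocabulary <- create_vocabulary(toy_iterator)
toy_vectorizer <- vocab_vectorizer(toy_vocabulary)
toy_dtm <- create_dtm(toy_iterator, toy_vectorizer) # sparse document-term matrix
dim(toy_dtm)      # 3 documents x number of word types
colSums(toy_dtm)  # term counts, computed on the sparse matrix provided by Matrix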
The corpus is a collection of movie reviews written in English. First, we load the content of the CSV file into a data frame:
corpus <- read.csv('../Data/reviews.csv', stringsAsFactors=FALSE)
colnames(corpus)
## [1] "doc_id" "text" "sentiment"
There are three fields in this CSV: doc_id, text and sentiment. For this lab, we only need the content of the reviews. Let’s look at the first 50 characters of the first review:
substr(corpus$text[1], 1, 50)
## [1] "plot : two teen couples go to a church party , dri"
Even though we don’t care about the sentiment of the reviews for now, we can still look at how many positive reviews there are:
cat("There are", nrow(corpus), "reviews, out of which",
nrow(corpus[which(corpus$sentiment=='pos'), ]), "are positive reviews.")
## There are 2000 reviews, out of which 1000 are positive reviews.
We instantiate an iterator to transform the text into a sequence of lowercased unigrams and then compute the vocabulary:
iterator <- itoken(corpus$text,
preprocessor=tolower, # convert the text to lowercase
tokenizer=word_tokenizer, # split the text into unigrams
progressbar=FALSE)
vocabulary <- create_vocabulary(iterator)
n_words <- nrow(vocabulary)
n_tokens <- sum(vocabulary$term_count)
cat("Number of word types:", n_words, "\nNumber of tokens:", n_tokens)
## Number of word types: 42392
## Number of tokens: 1309372
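To see concretely what the preprocessor and tokenizer produce, we can apply them by hand to the beginning of the first review (a quick sanity check, not part of the vocabulary computation):
# Lowercase the first 50 characters of the first review and tokenize them;
# word_tokenizer keeps only word tokens, so the punctuation visible above is dropped.
word_tokenizer(tolower(substr(corpus$text[1], 1, 50)))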
The vocabulary is a table; each row consists of a word (i.e. term), its overall frequency (i.e. term_count) and the number of documents it occurs in (i.e. doc_count):
head(vocabulary)
## Number of docs: 2000
## 0 stopwords: ...
## ngram_min = 1; ngram_max = 1
## Vocabulary:
## term term_count doc_count
## 1: liken 1 1
## 2: injections 1 1
## 3: centrifuge 1 1
## 4: overkilling 1 1
## 5: flossed 1 1
## 6: artillary 1 1
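The first rows displayed happen to be words that occur only once in the whole corpus (hapax legomena). If we wanted to know how many word types are in that situation, we could count them directly:
# Number of word types that occur exactly once in the corpus.
sum(vocabulary$term_count == 1)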
We sort the vocabulary in decreasing order with respect to word frequency (i.e. term_count) and print the first 10 entries:
ordered_vocabulary <- vocabulary[order(-vocabulary$term_count), ]
head(ordered_vocabulary, 10)
## Number of docs: 2000
## 0 stopwords: ...
## ngram_min = 1; ngram_max = 1
## Vocabulary:
## term term_count doc_count
## 1: the 76562 1999
## 2: a 38104 1996
## 3: and 35576 1998
## 4: of 34123 1998
## 5: to 31937 1997
## 6: is 25195 1995
## 7: in 21821 1994
## 8: that 15129 1957
## 9: it 12352 1935
## 10: as 11378 1920
We get the usual stop-words, which occur in almost all documents.
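If we wanted to filter such terms out, text2vec provides prune_vocabulary(), which drops terms based on frequency thresholds; a minimal sketch follows (the thresholds are arbitrary and chosen only for illustration):
# Drop terms that appear in more than half of the documents (stop-word-like)
# and terms that occur fewer than 5 times overall (very rare words).
pruned_vocabulary <- prune_vocabulary(vocabulary,
                                      term_count_min=5,
                                      doc_proportion_max=0.5)
nrow(pruned_vocabulary)  # number of remaining word types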
For the sake of readability, we select the sub-vocabulary of words that occur at most 20 times, then plot the histogram of word frequency:
vocabulary_20 <- vocabulary[which(vocabulary$term_count <= 20), ]
histogram <- hist(vocabulary_20$term_count,
breaks=20,
main='Word frequency distribution',
xlab='Word frequency',
ylab='Frequency of word frequency')
First, we plot word frequency versus word rank (i.e. position in the ordered vocabulary) for the 200 most frequent words:
frequency <- ordered_vocabulary$term_count[1:200]
plot(frequency,
main='Word frequency versus rank',
xlab='Word rank',
ylab='Word frequency')
Then, we plot the same data with logarithmic axes. We observe an approximately straight line, which is typical of power-law relationships:
plot(frequency,
main='Word frequency versus rank',
xlab='Word log-rank',
ylab='Word log-frequency',
log='xy')
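As a rough, purely illustrative check of this power-law behaviour, we can fit a straight line to the log-transformed data with lm(); for word frequency versus rank (Zipf’s law) the slope is expected to be close to -1. Note that ordinary least squares on log-log data is only a crude way to estimate a power-law exponent:
# Regress log(frequency) on log(rank) for the 200 most frequent words;
# the slope of the fit approximates the exponent of the power law.
rank <- seq_along(frequency)
fit <- lm(log(frequency) ~ log(rank))
coef(fit)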