library(text2vec) # provides functions for vectorizing text
library(Matrix) # adds support for sparse matrices
library(NNLM) # implements NMF
library(stopwords) # provides lists of stopwords
corpus <- read.csv('../Data/reviews.csv', stringsAsFactors=FALSE)
We remove words that occur in more than half of the reviews or fewer than 20 times overall. We also remove common English stopwords from the vocabulary, using the stopwords function from the stopwords package:
iterator <- itoken(corpus$text,
                   preprocessor=tolower,
                   tokenizer=word_tokenizer,
                   progressbar=FALSE)
vocabulary <- create_vocabulary(iterator, stopwords=stopwords(language='en', source='snowball'))
vocabulary <- prune_vocabulary(vocabulary, doc_proportion_max=0.5, term_count_min=20)
nrow(vocabulary)
## [1] 5500
vectorizer <- vocab_vectorizer(vocabulary)
dtm <- create_dtm(iterator, vectorizer)
dim(dtm)
## [1] 2000 5500
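To make the pruning rule concrete, here is a minimal base-R sketch on a toy corpus. The four documents and the thresholds (at least 2 occurrences overall, at most half of the documents) are hypothetical, and treating both thresholds as inclusive is an assumption of this sketch rather than a statement about how prune_vocabulary is implemented:

```r
# Toy corpus: four tiny "reviews" (hypothetical data, not from reviews.csv).
docs <- c("good plot good cast",
          "bad plot",
          "good film",
          "plot twist")

# Lowercase and split on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Number of occurrences of each term across the whole corpus.
term_count <- vapply(vocab, function(w) sum(unlist(tokens) == w), numeric(1))

# Number of documents that contain each term at least once.
doc_count <- vapply(vocab, function(w) {
  sum(vapply(tokens, function(d) w %in% d, logical(1)))
}, numeric(1))

# Keep terms seen at least twice overall and in at most half of the documents.
keep <- vocab[term_count >= 2 & doc_count / length(docs) <= 0.5]
print(keep)
```

In this toy corpus, "plot" is frequent enough but appears in three of the four documents, so it is dropped, while "good" passes both filters.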
We reweight the document-term matrix (i.e. dtm) with TF-IDF, which down-weights terms that occur in many reviews:
tfidf <- TfIdf$new()
dtm <- fit_transform(dtm, tfidf)
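As an illustration of what a TF-IDF weighting does, here is a minimal base-R sketch of one common variant (row-normalized term frequency times log inverse document frequency). The counts are hypothetical, and the exact formula used by TfIdf in text2vec (smoothing, normalization) may differ from this sketch:

```r
# Toy document-term counts (hypothetical): 3 documents x 3 terms.
counts <- matrix(c(2, 1, 0,
                   0, 1, 0,
                   1, 1, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("alien", "movie", "tarzan")))

# Term frequency: each row divided by the document length.
tf <- counts / rowSums(counts)

# Inverse document frequency:
# log(number of documents / number of documents containing the term).
idf <- log(nrow(counts) / colSums(counts > 0))

# Scale each column of tf by the idf of its term.
tfidf_toy <- sweep(tf, 2, idf, `*`)
print(round(tfidf_toy, 3))
```

Because "movie" occurs in every toy document, its inverse document frequency is log(1) = 0 and its column is zeroed out, whereas "tarzan", which occurs in a single document, receives the largest idf.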
We set the seed for reproducibility (i.e. so that the pseudo-random initialization of NMF is always the same):
set.seed(42)
decomp <- nnmf(as.matrix(dtm), 50, rel.tol = 1e-5)
dim(decomp$W)
## [1] 2000 50
dim(decomp$H)
## [1] 50 5500
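The decomposition approximates the document-term matrix as a product \(A \approx WH\) of two non-negative factors, with one row of \(W\) per review and one row of \(H\) per topic. The following base-R sketch illustrates the idea on a small random matrix using the classic multiplicative update rules; this is a different, simpler algorithm than the one NNLM uses, and the matrix sizes, iteration count, and eps constant are arbitrary choices for illustration:

```r
set.seed(1)

# Toy non-negative matrix (hypothetical): 6 "documents" x 5 "terms".
A <- matrix(runif(30), nrow = 6, ncol = 5)
k <- 2  # number of topics

# Random non-negative initialization of both factors.
W <- matrix(runif(6 * k), nrow = 6)
H <- matrix(runif(k * 5), nrow = k)

# Frobenius norm of the reconstruction error A - WH.
frobenius_error <- function(A, W, H) sqrt(sum((A - W %*% H)^2))
error_before <- frobenius_error(A, W, H)

eps <- 1e-9  # guards against division by zero
for (iteration in 1:200) {
  # Multiplicative updates keep every entry of W and H non-negative.
  H <- H * (t(W) %*% A) / (t(W) %*% W %*% H + eps)
  W <- W * (A %*% t(H)) / (W %*% H %*% t(H) + eps)
}
error_after <- frobenius_error(A, W, H)

dim(W)  # 6 x 2: one row per document, one column per topic
dim(H)  # 2 x 5: one row per topic, one column per term
cat("error before:", error_before, "- error after:", error_after, "\n")
```

The shapes mirror the real decomposition above: with 2000 reviews, 5500 terms, and 50 topics, \(W\) is 2000 x 50 and \(H\) is 50 x 5500.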
For each topic, we sort the vocabulary by weight, in decreasing order, and print the ten highest-weighted terms. The weights of topic \(i\) over the vocabulary are given by the row vector \(H_i\):
for (i in 1:50) {
  topic <- decomp$H[i, ]
  top_words <- vocabulary[order(-topic), ][1:10, ]
  cat("Topic", i, "-", paste(top_words$term, collapse = ', '), "\n")
}
## Topic 1 - alien, ripley, aliens, weaver, species, 3, ellie, earth, planet, mib
## Topic 2 - elizabeth, queen, shakespeare, catholic, england, fiennes, upset, geoffrey, joseph, rush
## Topic 3 - tarzan, jane, ape, apes, jungle, gorilla, disney, lost, march, africa
## Topic 4 - melvin, brooks, simon, carol, jack, dog, simon's, nicholson, lemmon, felix
## Topic 5 - scream, horror, killer, urban, 2, slasher, craven, stab, williamson, sidney
## Topic 6 - life, family, love, mother, husband, father, daughter, man, town, wife
## Topic 7 - squad, claire, danes, ribisi, rate, wood, 70s, block, trio, protagonists
## Topic 8 - van, damme, jean, claude, knock, hong, twin, schneider, species, action
## Topic 9 - robocop, troopers, starship, verhoeven, rico, bugs, military, showgirls, dien, johnny
## Topic 10 - sandler, wedding, singer, julia, robbie, deuce, adam, barrymore, nostalgia, romantic
## Topic 11 - waste, words, going, reese, review, lemmon, sutherland, armageddon, comet, lake
## Topic 12 - spice, girls, melanie, emma, moore, grant, richard, songs, world, power
## Topic 13 - stahl, stiller, midnight, heroin, drug, habit, garofalo, ben, week, jerry
## Topic 14 - 10, 8, 7, 4, 9, 5, 6, classic, 3, 1
## Topic 15 - jakob, patch, williams, liar, adams, ghetto, medical, robin, hunting, flubber
## Topic 16 - guido, benigni, beautiful, dora, holocaust, camp, son, italian, life, slapstick
## Topic 17 - bruce, willis, comet, armageddon, cole, asteroid, impact, jackal, deep, footage
## Topic 18 - chucky, plastic, bride, doll, jennifer, possessed, married, crisis, murderer, mid
## Topic 19 - toy, antz, bug's, animation, moses, animated, woody, bugs, disney, allen
## Topic 20 - flynt, larry, vs, harrelson, courtney, freedom, speech, courtroom, norton, court
## Topic 21 - jackie, chan, chan's, fu, hong, kung, tucker, martial, chinese, drunken
## Topic 22 - austin, powers, dr, myers, evil, mike, mini, spy, fat, bastard
## Topic 23 - harry, palmetto, harry's, shue, noir, harrelson, woody, twilight, florida, framed
## Topic 24 - lebowski, dude, coen, bowling, jeff, fargo, bunny, bridges, coens, goodman
## Topic 25 - mulan, disney, dragon, army, disney's, animated, emperor, china, animation, asian
## Topic 26 - bulworth, beatty, political, platt, politics, wrestling, senator, warren, rap, wcw
## Topic 27 - col, bridge, nicholson, japanese, british, river, war, military, shall, officers
## Topic 28 - trek, insurrection, ba, star, ku, picard, enterprise, data, contact, series
## Topic 29 - paulie, bird, marie, mohr, cat, jay, elderly, speech, kids, research
## Topic 30 - ryan, hanks, war, private, spielberg, tom, saving, battle, mail, soldier
## Topic 31 - bad, action, guy, gibson, really, jones, plot, think, know, movies
## Topic 32 - pie, school, kissed, american, biggs, high, sex, football, teen, comedy
## Topic 33 - joe, gorilla, mighty, joe's, pitt, hopkins, meg, volcano, ape, theron
## Topic 34 - batman, schwarzenegger, robin, clooney, schumacher, spawn, arnold, freeze, ivy, blah
## Topic 35 - shrek, donkey, fiona, fairy, princess, myers, murphy, lord, diaz, tale
## Topic 36 - murphy, eddie, goldblum, g, ricky, metro, holy, shopping, kelly, bowfinger
## Topic 37 - godzilla, broderick, park, jurassic, york, ebert, beast, creature, maria, mayor
## Topic 38 - pokemon, animation, psychic, million, bad, qualities, 3, abilities, friends, titles
## Topic 39 - carrey, ace, ventura, carrey's, jim, pet, detective, funny, bat, l
## Topic 40 - matrix, reeves, existenz, keanu, reality, computer, neo, effects, fishburne, virtual
## Topic 41 - nbsp, files, x, carter, television, series, mulder, duchovny, fans, fbi
## Topic 42 - sid, kenneth, carry, jacques, williams, joan, james, sir, charles, connor
## Topic 43 - ship, horizon, event, 1900, titanic, crew, virus, sci, fi, horror
## Topic 44 - truman, truman's, carrey, show, weir, jim, harris, pleasantville, world, ed
## Topic 45 - vampire, vampires, blade, carpenter, crow, woods, snipes, blood, baldwin, jack
## Topic 46 - west, wild, smith, gordon, kline, brenner, branagh, jesse, black, jim
## Topic 47 - jay, bob, silent, damon, smith, hunting, smith's, affleck, kevin, dogma
## Topic 48 - wars, phantom, jedi, lucas, menace, jar, star, obi, anakin, wan
## Topic 49 - 54, shane, myers, studio, phillippe, disco, julie, christopher, mike, campbell
## Topic 50 - mars, ghosts, carpenter, planet, apes, mission, carpenter's, red, society, martian
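The ranking step used to produce the list above can be sketched on a toy example. The matrix H_toy, the four terms, and the helper top_terms are hypothetical names introduced only for this illustration:

```r
# Toy topic-term weights (hypothetical values): 2 topics x 4 terms.
terms <- c("alien", "jungle", "planet", "tarzan")
H_toy <- matrix(c(0.9, 0.0, 0.6, 0.1,
                  0.0, 0.7, 0.1, 0.8),
                nrow = 2, byrow = TRUE)

# Return the n highest-weighted terms for each row (topic) of H.
top_terms <- function(H, terms, n) {
  lapply(seq_len(nrow(H)), function(i) {
    terms[order(H[i, ], decreasing = TRUE)][seq_len(n)]
  })
}

top <- top_terms(H_toy, terms, n = 2)
for (i in seq_along(top)) {
  cat("Topic", i, "-", paste(top[[i]], collapse = ", "), "\n")
}
```

Here the first toy topic is led by "alien" and "planet" and the second by "tarzan" and "jungle". Returning a list rather than printing directly makes it easy to reuse the top terms, for example to label topics.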
We can also sort the reviews by their weight for a given topic. The weights of topic \(j\) over the reviews are given by the column vector \(W_{j}\). For topic 3, we print the top-ranked review and then the tenth-ranked review:
j <- 3
coefficients <- decomp$W[, j]
# review with the highest weight for topic j
top_document <- corpus$text[order(-coefficients)[1]]
print(top_document)
## [1] " tarzan and the lost city is one of the most anemic movies to come out in quite a while . not only it is poorly written , badly acted , and generally incompetent in all cinematic areas , it is thoroughly uninspired and insipid . unfortunately , it's not bad in the way great , colossal misfires like heaven's gate ( 1980 ) or ishtar ( 1987 ) were bad . instead , it literally drips off the screen like a movie nobody wanted to be associated with , which begs the question of why it was made in the first place . with all the good scripts lying around hollywood un-produced , how does needless drek like this make its way to the big screen ? of course , tarzan is one of the most filmed characters in all of motion picture history - he has appeared in over forty films , which have ranged from the very good ( 1984's greystoke : the legend of tarzan , lord of the apes ) down to the really bad ( 1981's tarzan , the ape man with bo derek ) . most of these films were just cheapie b-movies made in the thirties and forties , starring ex-olympic athletes and a lot of cutsie chimps . therefore , if another tarzan movie is to be made , one might assume that it would have something new to offer - a different angle , an original storyline , anything to set it apart from all the others . greystoke added a never-before-seen level of realism to the pulpy tale , and even tarzan , the ape man at least had the mis-guided audacity to sexualize the story as a vehicle for bo derek's bare breasts . tarzan and the lost city , on the other hand , has absolutely nothing to offer but a bunch of recycled storylines and bad dialogue . the script , by bayard johnson and j . anderson black is about as formulaic and generic as they come . comic books have better plots than this . the movie is so bad , in fact , that it retains that ridiculous tarzan call that was so tirelessly mocked in last summer's comedy george of the jungle . 
didn't the producers think to leave that back in the old weissmuller pictures where it belongs ? the story starts with the legend of tarzan already firmly established : a quick opening narration tells of tarzan ( casper van dien ) being found in the jungle after having been raised by apes , and his return to england where he assumes his greystoke heritage . when the movie starts in 1913 , he is a civilized english gentleman ( without an english accent ) , and he is to marry jane ( jane march ) in less than a week . however , when a wicked archeologist/grave-robber named nigel ravens ( steve waddington ) begins hunting for the fabled lost city of opar , one of africa's last great secrets , the witch doctor of an ancient african tribe summons tarzan back to the jungle . at first , jane refuses to go , pouting about how it will interfere with their wedding ; but after tarzan leaves she changes her mind and tracks him down , therefore assuring lots of lame smooch scenes between her and her ape-man . once the film gets going ( in its own sluggish way ) , it delves into a series of jungle adventures , as tarzan , jane , and the natives attempt the thwart ravens and his crew from discovering the city . most of the so-called adventures are cheesy , predictable , and unexciting , with no pace , tension , or action to speak of . there are sequences stolen from innumerable recent adventure movies , ranging from raiders of the lost ark ( 1981 ) to the goonies ( 1985 ) . when the movie is running short on action , it includes a few greenpeace-friendly scenes of tarzan freeing caged animals , releasing a baby elephant from a trap , and throwing ivory tusks into the river . the movie is also lacking even a remote hint of reality . for instance , when tarzan - who was raised in the jungle - is bit by a cobra , he doesn't even attempt to suck the venom out like any semi-experienced weekend backpacker would do . 
instead , he ties a tourniquet around his arm and stumbles off into the jungle with no plan for survival . of course , one can't help but notice how fundamentally misleading the title is . not to ruin the ending or anything , but there is no lost city . there is , however , a lost pyramid , which i suppose is all the resource-strapped fx department could come up with ( the special effects are not worthy of a made-for-tv movie ) . which also brings up the question of why the treasure hunters had to slog through numerous underground caverns to get to the lost pyramid , when it's sitting right out in the middle of an open field ? strictly speaking , tarzan and the lost city isn't even bad enough to have camp quality , although casper van dien's laughably stiff performance comes real close . this movie proves what starship troopers only hinted at : he cannot act , but he sure looks well-groomed , even in the deepest heart of the african jungle . van dien is much too much of a pretty-boy to be an effective tarzan ; he's a calvin klein model in a loin cloth . i also wondered what the make-up department was thinking when it outfitted him with that awful circa-1983 steve perry haircut . waddington makes a decent villain , although he's like a charmless version of belloq from raiders of the lost ark . as jane , the ex-model jane march has little to do but smile and look pretty next to tarzan . she does fire off a gun at the evil treasure hunters a time or two , but whenever a snake comes into the picture , she is reduced to a hysterical mess . however , amidst all this complaining , i do have one piece of good news . tarzan and the lost city is so lacking in ideas both new and old , that it is unable to fill even an hour and a half of celluloid . so , we can say this much for it : at least it had the decency to be short . "
j <- 3
coefficients <- decomp$W[, j]
# review with the tenth-highest weight for topic j
top_document <- corpus$text[order(-coefficients)[10]]
print(top_document)
## [1] " you damn dirty apes ! that's just one of the inadvertenty hilarious lines from planet of the apes that's taken on a comedic context over time . no one back then seemed to realize how over-the- top charlton heston's acting style was , but it shows now , particularly in this mystery science theater 3000 wannabe that was taken for a film masterpiece in its time , actually winning one oscar ( for makeup , no less ) and being nominated for a couple others . it also spawned multiple sequels like beneath the planet of the apes , escape from the planet of the apes , return of the planet of the apes , beneath the escape from the return of the planet of the apes , planet of the apes : the next generation , police academy of the apes . . . the list goes on . heston is an american astronaut who spends a few thousand light years in space with his three companions and ends up on a planet not too dissimilar from earth . the thing is , on this planet humans can't talk or think and the guys in the gorilla masks are the dominant species . heston's companions are killed or turned into vegetables by the apon is imprisoned . he surprises them all with his gift of speech , making two primate scientists ( roddy mcdowall and kim hunter ) believe heston is the missing link between ape and man . believe me , we movie critics have been thinking the same thing for years . when the two apes present the idea before a judicial counsel ( the head ape being shakespearean actor maurice evans ) , it is received as heresy , for all good monkeys know god created ape in his image . but heston has already seen a cave that contains evidence that humans were originally the dominant species , before apes ever gained the ability to speak and run for president . and he takes them there , holding up a baby doll and yelling , if humans couldn't speak , then how do you account for this talking doll ? ! and how do you account for your acting ability , mr . heston ? 
the absolute most laughable scene comes with the movie's surprise conclusion . i won't reveal the details except to say it involves heston falling to his knees on a beach and yelling god damn you all to hell ! several times in succession . the movie is atrocious and should only be viewed by those members of society who like to watch bad movies and laugh at them . what makes planet of the apes even more amusing is that it was supposed to function as some sort of social irony , a condemnation of fundamentals who reject the theories of evolution . but let me tell you , if darwin could see the ape masks and hear the rotten dialogue exchanges ( heston [to female ape] : may i kiss you before i go ? ape : but . . . you're so . . . ugly . ) , he'd convert to creationism on the spot . luckily for us , science-fiction movies have evolved over time to the point at which some of them are actually good . "
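Rather than repeating the lookup for each rank, the same ordering can return several top reviews at once. The sketch below uses a toy matrix; W_toy, texts, and the helper top_documents are hypothetical names introduced for illustration:

```r
# Toy document-topic weights (hypothetical values): 4 documents x 2 topics.
W_toy <- matrix(c(0.1, 0.8,
                  0.7, 0.0,
                  0.4, 0.3,
                  0.9, 0.2),
                nrow = 4, byrow = TRUE)
texts <- c("review a", "review b", "review c", "review d")

# Return the indices of the n documents with the largest weight for a topic.
top_documents <- function(W, topic, n) {
  head(order(W[, topic], decreasing = TRUE), n)
}

idx <- top_documents(W_toy, topic = 1, n = 3)

# Show rank, text, and weight side by side.
ranked <- data.frame(rank = seq_along(idx),
                     text = texts[idx],
                     weight = W_toy[idx, 1])
print(ranked)
```

Printing the weight next to each review helps judge how quickly relevance falls off: a tenth-ranked review with a small weight may be only loosely related to the topic.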