Text Analytics
2022-06-17
Session 1
1.1 Regex Basics
Some links of interest
This link’s contents are described in what follows. http://www.regular-expressions.info/quickstart.html
This is a regex app for practice. https://regex101.com/
This is a regex primer for R in particular: http://www.regular-expressions.info/rlanguage.html
This primer’s objective is only to learn as much regex as we need for what follows.
Students are encouraged to explore the links above as a starting point for any deeper study.
1.1.1 Metacharacters
Twelve characters have special meanings in regular expressions:
- \ (the backslash) - the escape character, for literal reading of metacharacters (see below for examples)
- ^ (the caret) - matches at the start of the string; also negates a character class (see below for examples)
- $ (the dollar sign) - matches at the end of the string
- . (the period or dot) - matches ANY character (except the newline)
- | (the vertical bar or pipe symbol) - OR logic gate
- ? (the question mark) - makes the preceding token optional
- * (the asterisk or star) - zero or more occurrences/repetitions
- + (the plus sign) - one or more matches/repetitions
- ( and ) (the opening and closing parentheses) - these group a set of characters together
- [ (the opening square bracket) - opens a character class
- { (the opening curly brace) - the number of repetitions can be precisely specified
These special characters are often called “metacharacters”. Most of them are errors when used alone.
If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash.
E.g., to match 1+1=2, the correct regex is 1\+1=2.
You escape a backslash with another backslash. In R strings the backslash itself must also be escaped, which is why regex shorthands are written with a double backslash, e.g., \\s for the regex \s.
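To see escaping in action inside R, here is a minimal sketch using base R's grepl(); the strings are invented purely for illustration.
# the regex 1\+1=2 is typed as "1\\+1=2" inside an R string (the backslash itself is escaped)
grepl("1\\+1=2", "we know that 1+1=2") # TRUE: the + is read literally
grepl("1+1=2", "we know that 1+1=2") # FALSE: an unescaped + means 'one or more 1s'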
1.1.2 Character Class
A “character class” matches only one out of several characters.
Use a hyphen inside a character class to specify a range of characters. E.g., [0-9]
matches a single digit between 0 and 9.
Typing a caret (^
) after the opening square bracket negates the character class. E.g., [^0-9]
matches any character except numeric digits.
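A quick sketch of character classes in action, using stringr (which we load later in this session anyway); the example strings are invented.
library(stringr)
str_extract_all("room 4B, floor 7", "[0-9]") # "4" "7" : digits only
str_extract_all("room 4B, floor 7", "[^0-9]") # every character that is not a digit
str_detect(c("gray", "grey"), "gr[ae]y") # TRUE TRUE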
1.1.3 Shorthand Character Classes
\\d
matches a single character that is a digit,
\\w
matches a “word character” (alphanumeric characters plus underscore), and
\\s
matches a whitespace character (includes tabs and line breaks).
\\t
matches a tab character (ASCII 0x09),
\\r
carriage return (0x0D) and
\\n
for line feed (0x0A).
The dot matches a single character, except line-break characters.
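For instance (again a small stringr-based sketch; the example string is invented):
x <- "Order #42 shipped\ton 7 May"
str_extract_all(x, "\\d+") # the digit runs: "42" and "7"
str_extract_all(x, "\\w+") # word characters: alphanumerics plus underscore
str_extract_all(x, "\\s") # whitespace characters, including the tab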
1.1.4 Anchors
Anchors do not match any characters. They match a position.
^
matches at the start of the string, and $ matches at the end of the string.
E.g. ^b
matches only the first b in bob.
\\b
matches at a word boundary. A word boundary is a position between a character that can be matched by \\w
and a character that cannot be matched by \\w
.
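A small sketch of anchors and word boundaries (stringr again; the strings are invented):
str_extract_all("bob", "^b") # only the first b
str_extract_all("bob", "b$") # only the last b
str_extract_all("cat catalog", "\\bcat\\b") # the standalone "cat", not the "cat" inside "catalog"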
1.1.5 Alternation
Alternation is the regular expression equivalent of “or”. E.g., cat|dog
matches cat in About cats and dogs.
Alternation has the lowest precedence of all regex operators. cat|dog
food matches cat or dog food.
To create a regex that matches cat food or dog food, you need to group the alternatives: (cat|dog) food
.
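A small sketch of the precedence point (stringr; example strings invented):
str_detect("About cats and dogs", "cat|dog") # TRUE
str_extract("I bought cat food", "cat|dog food") # "cat" - the alternation binds loosely
str_extract("I bought cat food", "(cat|dog) food") # "cat food" - grouping fixes it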
1.1.6 Repetition
The question mark makes the preceding token in the regular expression optional. E.g., colou?r
matches colour or color.
The asterisk or star tells the engine to attempt to match the preceding token zero or more times.
The plus tells the engine to attempt to match the preceding token once or more.
E.g., <[A-Za-z][A-Za-z0-9]*>
matches an HTML tag without any attributes. <[A-Za-z0-9]+>
is easier to write but matches invalid tags such as <1>
.
Use curly braces to specify a specific amount of repetition.
E.g., Use \\b[1-9][0-9]{3}\\b
to match a number between 1000 and 9999.
\\b[1-9][0-9]{2,4}\\b
matches a number between 100 and 99999.
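And a sketch of the repetition operators (stringr; strings invented):
str_detect(c("color", "colour", "colouur"), "colou?r") # TRUE TRUE FALSE
str_extract("<html> and <1>", "<[A-Za-z][A-Za-z0-9]*>") # "<html>" but not "<1>"
str_extract_all("7 42 1984 123456", "\\b[1-9][0-9]{3}\\b") # only "1984"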
1.2 Regex Exercise
How about a quick exercise, eh?
Go to https://regex101.com/
Now copy-paste the text below into the regex app.
ISB has been ranked 27th in the world in the 2017 Financial Times Global MBA Rankings.[3] It is the first business school in Indian subcontinent to be accredited by the Association to Advance Collegiate Schools of Business.[4] In 2008, it became the youngest institution to find a place in global MBA rankings when it was ranked 20.[5] Indian School of Business accepts both GMAT and GRE scores for the admission process.
This is an exclamatory sentence!! Is this one an interrogative one?? This one has an ellipsis… And this is a declarative sentence. My phone number is 12345 67890 whereas that of the school is +91-040-2318-7000.
1.3 Tokenization from First Principles
1.3.1 Introduction
Tokenization is the process of breaking down written text into distinct units of analysis - typically words.
However, the unit of analysis can also be smaller, such as characters or syllables, or larger, such as combinations of words (n-grams), chunks of words (e.g., phrases), or sentences and paragraphs (at which point the task shades into parsing).
In what follows, we use functions from the stringr
package to construct a simple tokenizer.
First load the libraries we need
suppressPackageStartupMessages({
if (! require(stringr)){install.packages("stringr", dependencies = TRUE)}
library(stringr)
})
The plan is to start with tokenizing on a single, simple sentence. We’ll then extend it to larger collections of sentences.
1.3.2 Some basic string ops
# Example Commands
my_string <- "My PHONE number is +91 40 2318 7106, is it?!"
my_string
## [1] "My PHONE number is +91 40 2318 7106, is it?!"
lower_string <- tolower(my_string) # makes lower case
lower_string
## [1] "my phone number is +91 40 2318 7106, is it?!"
second_string <- "OK folks, Second sentence, coming up."; second_string
## [1] "OK folks, Second sentence, coming up."
my_string <- paste(my_string, # try `?paste`
                   second_string, sep = " ")
my_string
## [1] "My PHONE number is +91 40 2318 7106, is it?! OK folks, Second sentence, coming up."
So far, so good. Let’s start splitting the strings next.
# Now, split our string up into a number of strings using the str_split() func
my_string_vector <- str_split(my_string, "!")[[1]]
my_string_vector
## [1] "My PHONE number is +91 40 2318 7106, is it?"
## [2] " OK folks, Second sentence, coming up."
Notice in my_string_vector
above that the splitting character ("!") gets deleted and that a list object is returned. Hence the use of the list operator ([[1]]) to get the first entry.
In what follows, we’ll see the following functions in action. grep()
and grepl()
to find a particular text pattern using regex if necessary. We also have str_replace_all
and str_extract_all
coming up.
# search for string in my_string_vector that contains a `?` using the `grep()` command
grep("\\?", my_string_vector) # gives location of vector that contains match
## [1] 1
# using a conditional statement with logical grep, `grepl()`, may be useful
grepl("\\?", my_string_vector[1]) # logical grep() for binary T/F output
## [1] TRUE
# str_replace_all() func replaces all instances of some characters with another character
str_replace_all(my_string, "e","___")
## [1] "My PHONE numb___r is +91 40 2318 7106, is it?! OK folks, S___cond s___nt___nc___, coming up."
# using regex to extract all numbers from a string
str_extract_all(my_string, "[0-9]+") # doesn't match decimals though
## [[1]]
## [1] "91" "40" "2318" "7106"
# alternately
str_extract_all(my_string, "[^a-zA-Z\\s]+") # every char not in a-z, A-Z or space \\s
## [[1]]
## [1] "+91" "40" "2318" "7106," "?!" "," "," "."
str_extract_all(my_string, "[\\d]+") # every char that is a digit
## [[1]]
## [1] "91" "40" "2318" "7106"
OK, finally. Let’s write our first, simple tokenizer.
1.3.3 A Simple Word Tokenizer
# string split taking space as token-separator.
tokenized_vec = str_split(my_string, "\\s")
# basic tokenization using spaces
tokenized_vec
## [[1]]
## [1] "My" "PHONE" "number" "is" "+91" "40"
## [7] "2318" "7106," "is" "it?!" "OK" "folks,"
## [13] "Second" "sentence," "coming" "up."
We now know that, with the appropriate regex patterns, we can narrow the above down to include only words and drop the numbers.
1.3.4 Functionizing the Tokenizer Workflow
Given what we have seen above, how about we functionize (so to say) the steps we just walked through?
If we write a function that automates the steps above, we could repeatedly invoke the function wherever required.
That’s where we are going next. I’ll define a simple func called clean_string()
(note naming convention for funcs) that will pre-process messy input text.
## Cleaning Text and Tokenization from First Principles (using Regex) ##
clean_string <- function(string){

  require(stringr)

  temp <- tolower(string) # Lowercase

  # Replace everything that is not a letter or a space with a space
  temp <- stringr::str_replace_all(temp, "[^a-zA-Z\\s]", " ")

  # collapse one or more spaces into one space using the `+` regex for repeats > 1
  temp <- stringr::str_replace_all(temp, "[\\s]+", " ")

  # Split string using space as separator
  temp <- stringr::str_split(temp, " ")[[1]]

  # Get rid of trailing "" if necessary
  indexes <- which(temp == "")
  if(length(indexes) > 0){ temp <- temp[-indexes] }

  return(temp)
} # clean_string func ends
Everyone with me so far? Go line-by-line later and ensure everything there is clear!
Now, let’s apply the function to a simple example and see.
Below is a line from Shakespeare's As You Like It.
<- "All the world's a stage, and all the men and women merely players: they have their exits and their entrances; and one man in his time plays many parts.' "
sentence
<- clean_string(sentence) # invoking the function
clean_sentence print(clean_sentence) # view func output
## [1] "all" "the" "world" "s" "a" "stage"
## [7] "and" "all" "the" "men" "and" "women"
## [13] "merely" "players" "they" "have" "their" "exits"
## [19] "and" "their" "entrances" "and" "one" "man"
## [25] "in" "his" "time" "plays" "many" "parts"
unique(clean_sentence) # de-duplicated version using `unique()`.
## [1] "all" "the" "world" "s" "a" "stage"
## [7] "and" "men" "women" "merely" "players" "they"
## [13] "have" "their" "exits" "entrances" "one" "man"
## [19] "in" "his" "time" "plays" "many" "parts"
While the above was neat and all, it was still just one sentence. But what about large corpora with many, many documents, each with many, many sentences?
What we can do next is to use the above function as a subroutine in a larger function that will loop over sentences in documents. See the code below.
## === Now extend to multiple documents. Function to clean text blocks (or corpora)
clean_text_block <- function(text){

  # Check to see if there is any text at all with another conditional
  if(length(text) == 0){ cat("There was no text in this document! \n"); stop }

  # Get rid of blank lines
  indexes <- which(text == "")
  if(length(indexes) > 0){ text <- text[-indexes] }

  # Loop through the lines in the text and use the `append()` function
  clean_text <- clean_string(text[1])
  if (length(text) > 1){
    for(i in 2:length(text)){
      # add them to a vector
      clean_text <- append(clean_text, clean_string(text[i]))
    } # i loop ends
  } # if condn ends

  # Calc number of tokens and unique tokens. Return them in a named list object.
  num_tok <- length(clean_text)
  num_uniq <- length(unique(clean_text))

  to_return <- list(num_tokens = num_tok, # output is list with 3 elements
                    unique_tokens = num_uniq,
                    text = clean_text)

  return(to_return)
} # clean_text_block() func ends
Now, let’s try it on a sizeable dataset.
1.3.5 Tokenizing a real world dataset
The Amazon Nokia Lumia reviews dataset is an old dataset with about 120 consumer reviews. Not too large and will do for our purposes.
You can either read the file in using readLines(file.choose())
or directly read it off my Github page.
text <- readLines('https://github.com/sudhir-voleti/sample-data-sets/raw/master/text%20analysis%20data/amazon%20nokia%20lumia%20reviews.txt')
system.time({
  clean_speech <- clean_text_block(text[1:10]) # invoking the func
}) # didn't take long now, did it?
## user system elapsed
## 0.022 0.004 0.027
str(clean_speech) # view the structure of the output
## List of 3
## $ num_tokens : int 2831
## $ unique_tokens: int 786
## $ text : chr [1:2831] "i" "have" "had" "samsung" ...
clean_speech$text[1:40] # view first 40 tokens from the dataset
## [1] "i" "have" "had" "samsung" "phones" "where"
## [7] "the" "screens" "are" "nice" "but" "the"
## [13] "plastic" "bodies" "come" "apart" "all" "of"
## [19] "the" "time" "and" "feel" "flimsy" "i"
## [25] "have" "had" "iphones" "where" "the" "user"
## [31] "interface" "is" "just" "so" "boring" "even"
## [37] "with" "all" "those" "apps"
With basic tokenization done, we can now proceed and, from first principles alone, build a lot of text analysis functionality.
However, that would be duplicating a lot of work already done in R. There are many excellent packages in R that do the above - tokenizing etc and thereafter building data structures for further analysis.
We will follow that route. But it is important to know that, if need be, we could build all that functionality from first principles if we wanted to.
Sudhir Voleti
1.4 Introductory Text-An with Tidytext
1.4.1 Basic Intro to Tidytext
Tidytext is the text-an part of the 'tidyverse', a set of popular packages that work on a common set of coding principles called the 'tidy data framework', originally created by the legendary (in R circles, at least) Hadley Wickham.
We’ll see some of these tidy data handling procedures as we proceed ahead. Tidytext (Silge and Robinson 2016) is described thus by its creators:
“Treating text as data frames of individual words allows us to manipulate, summarize, and visualize the characteristics of text easily and integrate natural language processing into effective workflows we were already using.”
So let’s jump right in.
suppressPackageStartupMessages({
if (!require(tidyverse)) {install.packages("tidyverse")}
if (!require(tidytext)) {install.packages("tidytext")}
library(tidyverse)
library(tidytext)
library(ggplot2)
})
1.4.2 Tidytext Analysis on the Nokia dataset
I return to an old favorite here: the 2013 Amazon Nokia Lumia reviews dataset of 120 consumer reviews.
For this example, we can read the data directly off my github page. Behold.
# Example reading in an amazon nokia corpus
nokia = readLines('https://github.com/sudhir-voleti/sample-data-sets/raw/master/text%20analysis%20data/amazon%20nokia%20lumia%20reviews.txt')
text <- nokia
text = gsub("<.*?>", " ", text) # regex for removing HTML tags
length(text) # 120 documents
## [1] 120
1.4.3 Tidytext Tokenization with unnest_tokens()
First thing to note before invoking tidytext - get the text into a data_frame
format, called a tibble
, which has nice printing properties - e.g., it never floods the console.
We will then convert textdf into a one-token-per-row-per-document format.
Now time to see tidytext’s neat tokenizing capabilities at the level of:
- words,
- sentences,
- paragraphs, and
- ngrams,
using the unnest_tokens()
function.
require(tibble)
textdf = tibble(text = text) # yields 120x1 tibble. i.e., each doc = 1 row here.
textdf # view the tibble DF obj
## # A tibble: 120 × 1
## text
## <chr>
## 1 "I have had Samsung phones, where the screens are nice, but the plastic bodi…
## 2 "This phone is great! It's the best phone I've ever owned. I have been a ded…
## 3 "I just bought this phone a few weeks ago and it is gorgeous. The screen jus…
## 4 "Hi, I bought a cell phone by Amazon with the following description: \"Noki…
## 5 "The Nokia Lumia 900 was my first smartphone. Im a programmer, so I really a…
## 6 "I am a tmobile customer and with an unlocked Nokia 900 Lumia from ATT .Let …
## 7 "I bought this phone because I wanted to use it on TMobile. I've only been f…
## 8 "I received my phone before the expected arrival date and in excellent condi…
## 9 "I bought the phone and it was nice but has one problem that my husband did …
## 10 "I purchased the Nokia Lumina 900 in black from BLUTECUSA and was really thr…
## # … with 110 more rows
# Tokenizing ops. Words first.
system.time({
  textdf %>% unnest_tokens(word, text) # try ?unnest_tokens
})
## user system elapsed
## 0.006 0.000 0.007
So, how many words are there in the Nokia corpus?
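One quick way to answer that (a sketch reusing the textdf object from above) is to count the rows of the word-tokenized tibble:
textdf %>% unnest_tokens(word, text) %>% nrow() # total word tokens in the corpus
textdf %>% unnest_tokens(word, text) %>% distinct(word) %>% nrow() # unique word types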
Below, we can see sentence tokenization using the token = "sentences"
parm in unnest_tokens
.
# Tokenizing into sentences.
system.time({
  sent_tokenized = textdf %>%
    unnest_tokens(sentence, text, token = "sentences") # 905 rows/120 docs ~ 8 sents per doc?
}) # 0 secs
## user system elapsed
## 0.003 0.000 0.003
# explore output object a bit
class(sent_tokenized)
## [1] "tbl_df" "tbl" "data.frame"
dim(sent_tokenized)
## [1] 905 1
head(sent_tokenized)
## # A tibble: 6 × 1
## sentence
## <chr>
## 1 i have had samsung phones, where the screens are nice, but the plastic bodies…
## 2 i have had iphones, where the user interface is just so boring, even with all…
## 3 the nokia has wonderful nokia apps already installed, such as nokia drive, wh…
## 4 you get nokia music, which is a free music service with 1500 playlists put to…
## 5 you get espn, nokia city lens, which is where you hold your camera up while y…
## 6 and there are other free nokia apps as well, like camera extras, that let you…
Sentence detection appears to be good, above.
Tokenizing DF into bigrams, below. Try different ngrams & see.
system.time({
  ngram_tokenized = textdf %>%
    unnest_tokens(ngram, text, token = "ngrams", n = 2) # yields (#tokens-1) bigrams
}) # 0.03 secs
## user system elapsed
## 0.007 0.000 0.008
#explore the output object
class(ngram_tokenized)
## [1] "tbl_df" "tbl" "data.frame"
dim(ngram_tokenized)
## [1] 15275 1
head(ngram_tokenized)
## # A tibble: 6 × 1
## ngram
## <chr>
## 1 i have
## 2 have had
## 3 had samsung
## 4 samsung phones
## 5 phones where
## 6 where the
1.4.4 Grouping, Aggregation and Join ops in tidytext
Intuitively, the funcs we’ll call in tidytext are named group_by()
, count()
and join()
. Behold.
# use count() to see most common words
system.time({
  nokia_words = textdf %>%
    unnest_tokens(word, text) %>% # tokenized words in df in 'word' colm
    count(word, sort = TRUE) %>% # counts & sorts no. of occurrences of each item in 'word' column
    rename(count = n) # renames the count column from 'n' (default name) to 'count'.
}) # 0.11 secs
## user system elapsed
## 0.037 0.000 0.037
nokia_words %>% head(., 10) # view top 10 rows in the nokia_words df
## # A tibble: 10 × 2
## word count
## <chr> <int>
## 1 the 682
## 2 i 510
## 3 and 448
## 4 to 427
## 5 phone 388
## 6 a 362
## 7 it 358
## 8 is 299
## 9 for 202
## 10 this 201
Unsurprisingly, the most common words are the scaffolding of grammar - articles, prepositions, conjunctions etc. Not exactly the meaty stuff.
Is there anything we can do about it? Yes, we can. (Pun intended). We could filter the words using a stopwords list from earlier.
Conveniently, tidytext comes with its own stop_words
list.
data(stop_words)
# use anti_join() to de-merge stopwords from the df
system.time({
  nokia_new = textdf %>%
    unnest_tokens(word, text) %>%
    count(word, sort = TRUE) %>%
    rename(count = n) %>%
    anti_join(stop_words) # try ?anti_join
}) # 0.03 secs
## Joining, by = "word"
## user system elapsed
## 0.044 0.000 0.045
nokia_new %>% head(., 10)
## # A tibble: 10 × 2
## word count
## <chr> <int>
## 1 phone 388
## 2 nokia 87
## 3 windows 77
## 4 apps 70
## 5 lumia 62
## 6 900 45
## 7 unlocked 45
## 8 phones 42
## 9 battery 34
## 10 android 32
Aha. With the stopwords filtered out, the scene changes quite a bit.
How about some visualization via bar-charts, just for completeness? See below.
# First, build a dataframe
tidy_nokia <- textdf %>%
  unnest_tokens(word, text) %>% # word tokenization
  anti_join(stop_words) # try ?dplyr::anti_join
## Joining, by = "word"
# Visualize the commonly used words using ggplot2.
library(ggplot2)
tidy_nokia %>%
  count(word, sort = TRUE) %>%
  filter(n > 20) %>% # n is the wordcount colname
  mutate(word = reorder(word, n)) %>% # reorder() sorts the word factor levels by count
  ggplot(aes(word, n)) +
  geom_bar(stat = "identity", col = "red", fill = "red") +
  xlab(NULL) +
  coord_flip()
More Bigram ops with tidytext
Bigrams in particular, and ngrams in general, will have several uses down the line, as we'll see. In a sense, they link the bag-of-words (BOW) view with NLP.
First, let's build and view the bigrams using the token = "ngrams", n = 2 arguments in the unnest_tokens() func.
nokia_bigrams <- textdf %>%
  unnest_tokens(bigram, text,
                token = "ngrams", n = 2)
nokia_bigrams
## # A tibble: 15,275 × 1
## bigram
## <chr>
## 1 i have
## 2 have had
## 3 had samsung
## 4 samsung phones
## 5 phones where
## 6 where the
## 7 the screens
## 8 screens are
## 9 are nice
## 10 nice but
## # … with 15,265 more rows
Now use separate() to split each bigram into its two constituent words.
require(tidyr)
# separate bigrams
bigrams_separated <- nokia_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
bigrams_separated
## # A tibble: 15,275 × 2
## word1 word2
## <chr> <chr>
## 1 i have
## 2 have had
## 3 had samsung
## 4 samsung phones
## 5 phones where
## 6 where the
## 7 the screens
## 8 screens are
## 9 are nice
## 10 nice but
## # … with 15,265 more rows
There are very many stopwords in there, obscuring meaning. So let's remove them by successively applying filter() on the first and second words.
# filtering the bigrams to remove stopwords
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigrams_filtered
## # A tibble: 1,969 × 2
## word1 word2
## <chr> <chr>
## 1 samsung phones
## 2 plastic bodies
## 3 feel flimsy
## 4 user interface
## 5 wonderful nokia
## 6 nokia apps
## 7 nokia drive
## 8 talking attendant
## 9 super accurate
## 10 nokia music
## # … with 1,959 more rows
Other ops like counting and sorting should be standard by now.
# New bigram counts:
bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
bigram_counts
## # A tibble: 1,631 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 lumia 900 34
## 2 windows phone 31
## 3 nokia lumia 23
## 4 battery life 14
## 5 cell phone 12
## 6 lumia 920 12
## 7 sim card 12
## 8 windows 8 11
## 9 data plan 7
## 10 gorilla glass 7
## # … with 1,621 more rows
tidyr’s unite()
function is the inverse of separate(), and lets us recombine the columns into one.
Thus, “separate / filter / count / unite” functions let us find the most common bigrams not containing stop-words.
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigrams_united
## # A tibble: 1,969 × 1
## bigram
## <chr>
## 1 samsung phones
## 2 plastic bodies
## 3 feel flimsy
## 4 user interface
## 5 wonderful nokia
## 6 nokia apps
## 7 nokia drive
## 8 talking attendant
## 9 super accurate
## 10 nokia music
## # … with 1,959 more rows
Suppose we want to identify all those bigrams wherein the words “cloud”, “windows” or “solution” appeared either in word1 or in word2.
In other words, we want to match arbitrary word strings with bigram components. See code below.
# filter for bigrams which contain the word windows, solution, or cloud
arbit_filter = bigrams_filtered %>%
  filter(word1 %in% c("windows", "solution", "cloud") | word2 %in% c("windows", "solution", "cloud")) %>%
  count(word1, word2, sort = TRUE)
arbit_filter %>% filter(word1 == "windows" | word2 == "cloud") # try for cloud in word2 etc.
## # A tibble: 22 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 windows phone 31
## 2 windows 8 11
## 3 windows phones 7
## 4 windows 7.5 3
## 5 windows platform 3
## 6 windows 7 2
## 7 windows 7.8 2
## 8 windows live 2
## 9 free cloud 1
## 10 frees cloud 1
## # … with 12 more rows
1.4.5 Casting tidy data into DTMs
Nice that tidy data can do so much. But how about building standard text-an data objects like DTMs etc?
Sure, tidytext can very well do that. First, we convert the text corpus into the tidy format, essentially, one-token-per-row-per-document. See the code below.
# First convert Nokia corpus to tidy format, i.e. a tibble with token and count per row per document.
tidy_nokia = textdf %>%
  mutate(doc = row_number()) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  group_by(doc) %>%
  count(word, sort=TRUE)
## Joining, by = "word"
tidy_nokia
## # A tibble: 4,196 × 3
## # Groups: doc [120]
## doc word n
## <int> <chr> <int>
## 1 25 phone 27
## 2 59 lumia 20
## 3 2 phone 15
## 4 19 phone 12
## 5 8 phone 11
## 6 25 sms 11
## 7 25 view 11
## 8 59 920 11
## 9 59 windows 11
## 10 25 option 10
## # … with 4,186 more rows
Now we can cast this tidy object into a DTM, and also into a regular (if sparse) matrix. See the code below.
# cast into a Document-Term Matrix
nokia_dtm = tidy_nokia %>%
  cast_dtm(doc, word, n)
nokia_dtm # yields a sparse matrix format by default
## <<DocumentTermMatrix (documents: 120, terms: 1964)>>
## Non-/sparse entries: 4196/231484
## Sparsity : 98%
## Maximal term length: 17
## Weighting : term frequency (tf)
Below, we convert tidy_nokia
into a regular matrix.
# cast into a Matrix object
nokia_dtm <- tidy_nokia %>% cast_sparse(doc, word, n)
# class(nokia_dtm) # Matrix
nokia_dtm[1:6, 1:6]
## 6 x 6 sparse Matrix of class "dgCMatrix"
## phone lumia sms view 920 windows
## 25 27 7 11 11 . 1
## 59 9 20 . . 11 11
## 2 15 2 . . . 4
## 19 12 1 . . . 5
## 8 11 2 . . . 2
## 9 9 . . . . .
1.4.6 The TFIDF transformation
Recall we mentioned there are two weighting schemes for token frequencies in the DTM.
What we saw above was the simpler, first one, namely, term frequency or TF.
Now we introduce the second one - TFIDF. Which stands for term frequency-inverse document frequency.
Time to move to the slides and the board before returning here. Tidytext code for the TFIDF transformation is below.
P.S. Let \(tf(term)\) denote the term frequency of some term of interest. Then \(tfidf(term)\) is defined as \(tf(term) \times idf(term)\),
where \(idf(term) = ln\frac{n_{documents}}{n_{documents.with.term}}\). Various other weighting schemes have been proposed.
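For intuition: a term that appears in 12 of the 120 Nokia reviews gets \(idf = ln\frac{120}{12} \approx 2.30\), whereas a term that appears in every single review gets \(idf = ln\frac{120}{120} = 0\) and is weighted down to zero, however frequent it is.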
nokia_dtm_idf = tidy_nokia %>%
  group_by(doc) %>%
  count(word, sort=TRUE) %>% ungroup() %>%
  bind_tf_idf(word, doc, n) %>%
  cast_sparse(doc, word, tf_idf)
nokia_dtm_idf[1:8, 1:8] # view a few rows
## 8 x 8 sparse Matrix of class "dgCMatrix"
## 1080p 1500 7.5 720p 7gb accurate action
## 1 0.03198707 0.03740228 0.02482855 0.03740228 0.03198707 0.03740228 0.03740228
## 2 . . . . 0.03411954 . .
## 3 . . . . . . .
## 4 . . . . . . .
## 5 . . . . . . .
## 6 . . . . . . .
## 7 . . . . . . .
## 8 . . . . . . .
## adjust
## 1 0.03198707
## 2 .
## 3 .
## 4 .
## 5 .
## 6 .
## 7 .
## 8 .
Last but not least is the ability to explicitly leverage tidytext’s tidy data format to do some cool stuff. Below is only if we have time.
1.4.7 Applying Tidy Principles on Text data
The Nokia corpus has 120 documents. Each document has several sentences. And each sentence has several words in it. Here’s a Q for you now.
Suppose you want to know how many sentences are there in each document in the corpus. And how many words are there in each sentence. How would you go about this?
dplyr
functions inside tidytext get us there easily because the data are arranged as per tidy principles.
I present a few simple examples below. These operators, while simple, enable us to build fairly complex structures by recombining and chaining them together.
Let’s answer the question set above. I’ll use functions row_number()
and group_by()
. See the code below.
### Create a document id and group_by it
textdf_doc = textdf %>%
  mutate(doc = row_number()) %>% # creates an id number for each row, i.e., doc.
  select(doc, text)
  # group_by(doc) # internally groups docs together
textdf_doc
## # A tibble: 120 × 2
## doc text
## <int> <chr>
## 1 1 "I have had Samsung phones, where the screens are nice, but the plasti…
## 2 2 "This phone is great! It's the best phone I've ever owned. I have been…
## 3 3 "I just bought this phone a few weeks ago and it is gorgeous. The scre…
## 4 4 "Hi, I bought a cell phone by Amazon with the following description: …
## 5 5 "The Nokia Lumia 900 was my first smartphone. Im a programmer, so I re…
## 6 6 "I am a tmobile customer and with an unlocked Nokia 900 Lumia from ATT…
## 7 7 "I bought this phone because I wanted to use it on TMobile. I've only …
## 8 8 "I received my phone before the expected arrival date and in excellent…
## 9 9 "I bought the phone and it was nice but has one problem that my husban…
## 10 10 "I purchased the Nokia Lumina 900 in black from BLUTECUSA and was real…
## # … with 110 more rows
So I have doc id and corresponding text. What I next need is a sentence identifier within each doc.
Then, a word identifier for each word within each sentence. Finally, we can simply merge the dataframes. See below.
### How many sentences in each doc?
textdf_sent = textdf_doc %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  group_by(doc) %>%
  mutate(sent_id = row_number()) %>% # create sentence id by enumeration
  summarise(sent_doc = max(sent_id)) %>% # max sent_id is the num of sents in doc
  select(doc, sent_doc) # retain colms in order for clean display
textdf_sent # first few rows, view.
## # A tibble: 120 × 2
## doc sent_doc
## <int> <int>
## 1 1 17
## 2 2 28
## 3 3 21
## 4 4 6
## 5 5 19
## 6 6 27
## 7 7 5
## 8 8 25
## 9 9 7
## 10 10 4
## # … with 110 more rows
So, on average how many sentences does a product reviewer write on Amazon for this corpus? The above DF gives the answer.
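For instance, a one-liner such as the following (a sketch reusing textdf_sent from above) pulls out that average:
mean(textdf_sent$sent_doc) # average no. of sentences per review in this corpus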
Next, to find average doc length, we will need to know how many words there are in each document. See below.
### How many words in each document?
textdf_word = textdf_doc %>% # using textdf_doc and not textdf_sent!
  unnest_tokens(word, text) %>%
  group_by(doc) %>%
  mutate(word_id = row_number()) %>%
  summarise(word_doc = max(word_id)) %>%
  select(doc, word_doc)
textdf_word[1:10,]
## # A tibble: 10 × 2
## doc word_doc
## <int> <int>
## 1 1 415
## 2 2 521
## 3 3 322
## 4 4 92
## 5 5 348
## 6 6 459
## 7 7 65
## 8 8 336
## 9 9 164
## 10 10 58
The Qs continue. So I have textdf_sent that maps sentences to documents and textdf_word that maps wordcounts to document.
Can I combine the two dataframes? Yes we can. Try ?inner_join
and see.
### can we merge the above 2 tables together?
# colnames(textdf_sent) # doc, sent_doc
# colnames(textdf_word) # doc, word_doc
doc_sent_word = inner_join(textdf_sent, # first df
                           textdf_word, # second df
                           by = "doc") %>% # inner_join by 'doc' colm
  mutate(words_per_sent = word_doc/sent_doc) # avg no. of words per sentence as a new variable
doc_sent_word[1:10,] # view first few rows of the new merged table
## # A tibble: 10 × 4
## doc sent_doc word_doc words_per_sent
## <int> <int> <int> <dbl>
## 1 1 17 415 24.4
## 2 2 28 521 18.6
## 3 3 21 322 15.3
## 4 4 6 92 15.3
## 5 5 19 348 18.3
## 6 6 27 459 17
## 7 7 5 65 13
## 8 8 25 336 13.4
## 9 9 7 164 23.4
## 10 10 4 58 14.5
Neat, eh? One last thing. We’ve mapped sentences to documents and words to docs. Can we now map words to sentences?
In other words, how many words are there per sentence? See below.
### how many words in each sentence?
textdf_sent_word = textdf %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(senten = row_number()) %>% # building an index for sentences now
  select(senten, sentence) %>%
  unnest_tokens(word, sentence) %>%
  mutate(word_id = row_number()) %>% # creating an intermed variable for word-counting
  select(senten, word_id) %>%
  group_by(senten) %>% # grouping by sent id
  summarise(words_in_sent = max(word_id)) %>% # summarizing
  select(senten, words_in_sent) # retain only relevant cols
textdf_sent_word[11:20,]
## # A tibble: 10 × 2
## senten words_in_sent
## <int> <int>
## 1 11 301
## 2 12 319
## 3 13 340
## 4 14 373
## 5 15 392
## 6 16 407
## 7 17 415
## 8 18 419
## 9 19 426
## 10 20 455
Can we merge the document number also above? Yes. But I’ll leave that as a practice exercise for you.
We now know enough to code basic text-an functions in tidytext. And further to head to sentiment-an in the next session.
Sudhir
1.5 Basic Text-An funcs in Tidytext
Hi all,
First things first. Here is the setup code chunk. If any of the libraries are missing in your system, well, you know how to install them.
suppressPackageStartupMessages({
if (!require(tm)) {install.packages("tm")}
if (!require(wordcloud)) {install.packages("wordcloud")}
if (!require(igraph)) {install.packages("igraph")}
if (!require(ggraph)) {install.packages("ggraph")}
library(tm)
library(tidyverse)
library(tidytext)
library(wordcloud)
library(igraph)
library(ggraph)
})
1.5.1 Read in a real dataset (IBM) for basic text-an
Below is a transcript of IBM's Q3 2016 analyst call. What kind of content would you expect to see in an analyst call transcript?
Can we quickly text-an the same and figure out what the content is saying?
## reading in IBM analyst call data from my git
ibm = readLines('https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/International%20Business%20Machines%20(IBM)%20Q3%202016%20Results%20-%20Earnings%20Call%20Transcript.txt') # IBM Q3 2016 analyst call transcript
# ibm = readLines(file.choose()) # read from local file on disk
head(ibm, 5) # view a few lines
## [1] "International Business Machines Corporation. (NYSE:IBM)"
## [2] "Q3 2016 Results Earnings Conference Call"
## [3] "October 17, 2016, 05:00 PM ET"
## [4] "Executives"
## [5] "Patricia Murphy - Vice President-Investor Relations"
1.5.2 Basic Text-An funcs for Session 1
We’ll now quickly cover a few basic text-an ops which I will present in functionized form. What basic ops are these? Let’s see:
- Func 1: Cleaning input data using a text.clean() func
- Func 2: Constructing DTMs, both TF and TF-IDF based, using the dtm_build() func
- Func 3: Our first text display aid, build_wordcloud()
- Func 4: Next display aid, plot.barchart()
- Func 5: Third display aid is the co-occurrence graph or COG, via distill.cog()
- Func 6: Final display aid is the combo of COG and wordcloud, via build_cog_ggraph()
Plan is to use the IBM analyst call data we have to test drive these funcs we are building.
P.S. Pay attention to the way the funcs are defined, argument construction and their defaults, etc.
Ultimately, the plan is for you to write your own custom funcs, host them on GitHub for collaborative improvement, and source them into R when working on live projects.
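For example, once such a file is up on GitHub, it can be pulled into any live session with source(). The URL below is a hypothetical placeholder, not a real file.
# hypothetical raw-GitHub path - replace with the actual location of your own funcs file
source("https://raw.githubusercontent.com/<your-id>/<your-repo>/master/my_text_an_funcs.R")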
1.5.3 Func 1: Text Cleaning
text.clean = function(x, # x = text corpus
                      remove_numbers=TRUE, # whether to drop numbers? Default is TRUE
                      remove_stopwords=TRUE) # whether to drop stopwords? Default is TRUE
{
  library(tm)
  x = gsub("<.*?>", " ", x) # regex for removing HTML tags
  x = iconv(x, "latin1", "ASCII", sub="") # keep only ASCII characters
  x = gsub("[^[:alnum:]]", " ", x) # keep only alphanumeric characters
  x = tolower(x) # convert to lower case characters

  if (remove_numbers) { x = removeNumbers(x) } # removing numbers

  x = stripWhitespace(x) # removing white space
  x = gsub("^\\s+|\\s+$", "", x) # remove leading and trailing white space. Note regex usage

  # evaluate condn
  if (remove_stopwords){
    # read std stopwords list from my git
    stpw1 = readLines('https://raw.githubusercontent.com/sudhir-voleti/basic-text-analysis-shinyapp/master/data/stopwords.txt')
    # tm package stop word list; the tokenizer package has a function of the same name, hence 'tm::'
    stpw2 = tm::stopwords('english')
    comn = unique(c(stpw1, stpw2)) # union of the two lists
    stopwords = unique(gsub("'", " ", comn)) # final stop word list after removing punctuation

    x = removeWords(x, stopwords) # removing stopwords created above
  } # if condn ends

  x = stripWhitespace(x) # removing white space
  # x = stemDocument(x) # can stem doc if needed. For later.

  return(x)
} # func ends
How well does this func work? Can we check? Let's test-drive it with the ibm data we downloaded. See below.
# check with ibm data
system.time({ ibm.clean = text.clean(ibm, remove_numbers=FALSE) }) # 0.26 secs
## user system elapsed
## 0.100 0.000 0.574
1.5.4 Func 2: DTM builder (whether TF or IDF)
We saw how to build DTMs. Let us functionize that code in general terms so that we can repeatedly invoke the func where required.
dtm_build <- function(raw_corpus, tfidf=FALSE)
{ # func opens
  require(tidytext); require(tibble); require(tidyverse)

  # converting raw corpus to tibble to tidy DF
  textdf = tibble(text = raw_corpus)

  tidy_df = textdf %>%
    mutate(doc = row_number()) %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>%
    group_by(doc) %>%
    count(word, sort=TRUE)

  # evaluating IDF wala DTM
  if (tfidf == "TRUE") {
    textdf1 = tidy_df %>%
      group_by(doc) %>%
      count(word, sort=TRUE) %>% # ungroup() %>%
      bind_tf_idf(word, doc, n) %>% # 'n' is default colm name
      rename(value = tf_idf)
  } else {
    textdf1 = tidy_df %>% rename(value = n)
  }

  dtm = textdf1 %>% cast_sparse(doc, word, value)

  # order rows and colms putting max mass on the top-left corner of the DTM
  colsum = apply(dtm, 2, sum)
  col.order = order(colsum, decreasing=TRUE)
  row.order = order(rownames(dtm) %>% as.numeric())

  dtm1 = dtm[row.order, col.order]

  return(dtm1)
} # func ends
# testing func 2 on ibm data
system.time({ dtm_ibm_tf = dtm_build(ibm) }) # 0.02 secs
## Joining, by = "word"
## user system elapsed
## 0.064 0.000 0.064
system.time({ dtm_ibm_idf = dtm_build(ibm, tfidf=TRUE) }) # 0.05 secs
## Joining, by = "word"
## user system elapsed
## 0.09 0.00 0.09
1.5.5 Func 3: wordcloud building
build_wordcloud <- function(dtm,
                            max.words1=150, # max no. of words to accommodate
                            min.freq=5, # min.freq of words to consider
                            plot.title="wordcloud"){ # write within double quotes

  require(wordcloud)

  if (ncol(dtm) > 20000){ # if dtm is overly large, break into chunks and solve

    tst = round(ncol(dtm)/100) # divide DTM's cols into 100 manageable parts
    a = rep(tst, 99)
    b = cumsum(a); rm(a)
    b = c(0, b, ncol(dtm))

    ss.col = c(NULL)
    for (i in 1:(length(b)-1)) {
      tempdtm = dtm[, (b[i]+1):(b[i+1])]
      s = colSums(as.matrix(tempdtm))
      ss.col = c(ss.col, s)
      print(i)
    } # i loop ends

    tsum = ss.col

  } else { tsum = apply(dtm, 2, sum) }

  tsum = tsum[order(tsum, decreasing = T)] # terms in decreasing order of freq
  head(tsum); tail(tsum)

  # windows() # opens a new plot window when active
  wordcloud(names(tsum), tsum, # words, their freqs
            scale = c(3.5, 0.5), # range of word sizes
            min.freq, # min.freq of words to consider
            max.words = max.words1, # max #words
            colors = brewer.pal(8, "Dark2")) # plot results in a word cloud
  title(sub = plot.title) # title for the wordcloud display

} # func ends
# test-driving func 3 via IBM data
system.time({ build_wordcloud(dtm_ibm_tf, plot.title="IBM TF wordcloud") }) # 0.4 secs
## Warning in wordcloud(names(tsum), tsum, scale = c(3.5, 0.5), min.freq, max.words
## = max.words1, : business could not be fit on page. It will not be plotted.
## user system elapsed
## 0.521 0.009 0.436
And now, test driving the IDF one…
system.time({ build_wordcloud(dtm_ibm_idf, plot.title="IBM IDF wordcloud", min.freq=2) }) # 0.09 secs
## user system elapsed
## 0.321 0.007 0.214
1.5.6 Func 4: Simple Bar.charts of top tokens
Self-explanatory. And simple. But just for completeness sake, making a func out of it.
plot.barchart <- function(dtm, num_tokens=15, fill_color="Blue")
{
  a0 = apply(dtm, 2, sum)
  a1 = order(a0, decreasing = TRUE)
  tsum = a0[a1]

  # plot barchart for top tokens
  test = data.frame(term = names(tsum)[1:num_tokens], freq = round(tsum[1:num_tokens], 0)); test

  # windows() # new plot window
  require(ggplot2)
  p = ggplot(test, aes(x = term, y = freq)) +
    geom_bar(stat = "identity", fill = fill_color) +
    geom_text(aes(label = freq), vjust = -0.20) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

  plot(p)
} # func ends
# testing above func
system.time({ plot.barchart(dtm_ibm_tf) }) # 0.1 secs
## user system elapsed
## 0.089 0.000 0.087
# system.time({ plot.barchart(dtm_ibm_idf, num_tokens=12, fill_color="Red") }) # 0.11 secs
1.5.7 Func 5: Co-occurrence graphs (COGs)
COGs, as the name suggests, connect those tokens that most often co-occur within documents, using a network graph wherein the nodes are tokens of interest.
This is admittedly a slightly long-winded func. Also introduces network visualization concepts. If you’re unfamiliar with this, pls execute the func’s content line-by-line to see what each line does.
distill.cog = function(dtm, # input dtm
                       title="COG", # title for the graph
                       central.nodes=4, # no. of central nodes
                       max.connexns = 5){ # max no. of connections

  # first convert dtm to an adjacency matrix
  dtm1 = as.matrix(dtm) # need it as a regular matrix for matrix ops like %*% to apply
  adj.mat = t(dtm1) %*% dtm1 # making a square symmetric term-term matrix
  diag(adj.mat) = 0 # no self-references. So diag is 0.
  a0 = order(apply(adj.mat, 2, sum), decreasing = T) # order cols by descending colSum
  mat1 = as.matrix(adj.mat[a0[1:50], a0[1:50]])

  # now invoke network plotting lib igraph
  library(igraph)

  a = colSums(mat1) # collect colsums into a vector obj a
  b = order(-a) # nice syntax for ordering vector in decr order

  mat2 = mat1[b, b] # order both rows and columns along vector b
  diag(mat2) = 0

  ## +++ go row by row and find top k adjacencies +++ ##
  wc = NULL

  for (i1 in 1:central.nodes){
    thresh1 = mat2[i1,][order(-mat2[i1, ])[max.connexns]]
    mat2[i1, mat2[i1,] < thresh1] = 0 # neat. didn't need 2 use () in the subset here.
    mat2[i1, mat2[i1,] > 0] = 1
    word = names(mat2[i1, mat2[i1,] > 0])
    mat2[(i1+1):nrow(mat2), match(word, colnames(mat2))] = 0
    wc = c(wc, word)
  } # i1 loop ends

  mat3 = mat2[match(wc, colnames(mat2)), match(wc, colnames(mat2))]
  ord = colnames(mat2)[which(!is.na(match(colnames(mat2), colnames(mat3))))] # remove any NAs from the list
  mat4 = mat3[match(ord, colnames(mat3)), match(ord, colnames(mat3))]

  # building and plotting a network object
  graph <- graph.adjacency(mat4, mode = "undirected", weighted=T) # create network object
  graph = simplify(graph)
  V(graph)$color[1:central.nodes] = "green"
  V(graph)$color[(central.nodes+1):length(V(graph))] = "pink"
  graph = delete.vertices(graph, V(graph)[degree(graph) == 0]) # delete singletons?

  plot(graph,
       layout = layout.kamada.kawai,
       main = title)
} # distill.cog func ends
# testing COG on ibm data
system.time({ distill.cog(dtm_ibm_tf, "COG for IBM TF") }) # 0.27 secs
## Warning in vattrs[[name]][index] <- value: number of items to replace is not a
## multiple of replacement length
## user system elapsed
## 0.412 0.012 0.418
# system.time({ distill.cog(dtm_ibm_idf, "COG for IBM IDF", 5, 5) }) # 0.57 secs
1.5.8 Func 6 - wordcloud + COG combo
The two major display aids we have seen so far - the COG and the wordcloud - each have their pros and cons. Can we somehow combine them and get the best of both worlds, so to say? Read on.
build_cog_ggraph <- function(corpus, # text colmn only
                             max_edges = 150,
                             drop.stop_words=TRUE,
                             new.stopwords=NULL){

  # invoke libraries
  library(tidyverse)
  library(tidytext)
  library(widyr)
  library(ggraph)

  # build df from corpus
  corpus_df = data.frame(docID = seq(1:length(corpus)), text = corpus, stringsAsFactors=FALSE)

  # eval stopwords condn
  if (drop.stop_words == TRUE) {
    stop.words = unique(c(stop_words$word, new.stopwords)) %>%
      as_tibble() %>% rename(word = value)
  } else { stop.words = stop_words[2,] }

  # build word-pairs
  tokens <- corpus_df %>%
    # tokenize, drop stop_words etc
    unnest_tokens(word, text) %>% anti_join(stop.words)

  # pairwise_count() counts #token-pairs co-occurring in docs
  word_pairs = tokens %>% pairwise_count(word, docID, sort = TRUE, upper = FALSE) # %>% head()

  word_counts = tokens %>% count(word, sort = T) %>% dplyr::rename(wordfr = n)

  word_pairs = word_pairs %>% left_join(word_counts, by = c("item1" = "word"))

  row_thresh = min(nrow(word_pairs), max_edges)

  # now plot
  set.seed(1234)
  # windows()
  plot_d <- word_pairs %>%
    filter(n >= 3) %>%
    top_n(row_thresh) %>% igraph::graph_from_data_frame()

  dfwordcloud = data_frame(vertices = names(V(plot_d))) %>%
    left_join(word_counts, by = c("vertices" = "word"))

  plot_obj = plot_d %>% # graph object built!
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
    # geom_node_point(size = 5) +
    geom_node_point(size = log(dfwordcloud$wordfr)) +
    geom_node_text(aes(label = name), repel = TRUE,
                   point.padding = unit(0.2, "lines"),
                   size = 1 + log(dfwordcloud$wordfr)) +
    theme_void()

  return(plot_obj) # must return func output
} # func ends
# quick example for above using amazon nokia corpus
nokia = readLines('https://github.com/sudhir-voleti/sample-data-sets/raw/master/text%20analysis%20data/amazon%20nokia%20lumia%20reviews.txt')
system.time({ b0 = build_cog_ggraph(nokia) }) # 0.36 secs
## Joining, by = "word"
## Warning: `distinct_()` was deprecated in dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Selecting by wordfr
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## user system elapsed
## 0.279 0.003 0.281
b0
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Clearly, the most frequently occurring token above has taken epi-central node status. What if we dropped it? What new patterns might emerge? Points to ponder…
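One way to explore that is to use the new.stopwords argument the func already exposes; a sketch below, assuming 'phone' is the epi-central token (as the earlier word counts suggest):
b1 = build_cog_ggraph(nokia, new.stopwords = c("phone")) # drop the dominant token and re-plot
b1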
1.6 Tidytext funcs based exercise
Creatively apply our basic text-an funcs to the file ‘ice-cream data.txt’ available at this link:
https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/Ice-cream-dataset.txt
Now answer these Qs:
- What was the size of the DTM finally built?
- Which stand-alone (or single) flavours seem to be the most popular?
- Which flavour combinations (2 or more) seem to be popular?
- What display aids did you use to answer the above questions?
Well, that’s it for now. I’m sure I have run out of time. If so, will pickup in the next session from where we leave off.
Sudhir
1.7 Homebrewing DTMs and n-grams
1.7.1 Introduction to this markdown
We saw how to home-brew a simple tokenizer from first principles. P.S. Coders sometimes refer to cooking up one’s own functions as home-brewing.
After tokenization, certain other steps in text-an follow. Such as, building Document Term Matrices (DTMs), n-grams of interest, etc.
Now, R has several libraries that perform these tasks. But before we go there, the aim is to home-brew these critical data structures once, quickly, since doing so builds and extends our understanding as well as our confidence.
1.7.2 Invoking the Home-brewed tokenizer
Let’s quickly read-in the tokenizers we built. Then we’ll extend / build on top of them and arrive at home-brewed DTMs and n-grams.
# first things first
rm(list = ls()) # clears workspace
library(stringr)
Now the two funcs themselves - Tokenize_String()
and Tokenize_Text_Block()
- see code below, borrowed from the last RMD.
P.S. Note the func naming convention here: I am using capitalized phrases for func names, unlike previously. Feel free to follow any naming convention you want to, as long as it is logical and consistent.
# Re-read the two 'home-brewed' tokenizer functions we'd built.
Tokenize_String <- function(string){

  temp <- tolower(string) # Lowercase
  temp <- stringr::str_replace_all(temp, "[^a-zA-Z\\s]", " ") # replace non-alphabets with a space
  temp <- stringr::str_replace_all(temp, "[\\s]+", " ") # collapse >1 spaces into 1 space
  temp <- stringr::str_split(temp, " ")[[1]]

  indexes <- which(temp == "")
  if(length(indexes) > 0){ temp <- temp[-indexes] }

  return(temp)
} # Tokenize_String func ends
We defined a unit func above, for single strings.
Next, we’ll wrap the unit func into a larger ‘wrapper’ func defined over collections of strings. Behold.
# This below is the Tokenizer for text blocks
Tokenize_Text_Block <- function(text){

  # Check to see if there is any text at all with another conditional
  if(length(text) == 0){ cat("There was no text in this document! \n"); stop }

  # Get rid of blank lines
  indexes <- which(text == "")
  if(length(indexes) > 0){ text <- text[-indexes] }

  # Loop through the lines in the text and use the append() function to
  # add them to a vector
  clean_text <- Tokenize_String(text[1])
  if (length(text) > 1){
    for(i in 2:length(text)){
      clean_text <- append(clean_text, Tokenize_String(text[i]))
    } # i loop ends
  } # if condn ends

  # Calc number of tokens and unique tokens. Return them in a named list object.
  num_tok <- length(clean_text)
  num_uniq <- length(unique(clean_text))

  to_return <- list(num_tokens = num_tok, # output is list with 3 elements
                    unique_tokens = num_uniq,
                    text = clean_text)

  return(to_return)
} # Tokenize_Text_Block func ends
As an example, we consider a small text corpus. Specifically the first 2 paragraphs from the Wikipedia page on ISB.
# Reading in an ISB example (first two paras from Wiki)
<- "Indian School of Business (ISB), is one of the prestigious business schools in India and is rated amongst the best in the world .The institute has various management programs with the PGP as its flagship course .Admission to the coveted management program is rigorous and has also the distinction of having one of the most stringent selectivity ratio internationally . The student cohort has a diverse mix with students coming in from top Indian and International Universities like the IITs , BITS Pilani, NITs , SRCC , ISI etc. There are students who are doctors , people from defence establishments ,sportsman and who have excelled in various other professions .ISB has integrated campuses at Mohali, Punjab and Hyderabad, Telangana, and is a non profit organization.[2] The school was founded by two senior executives of McKinsey & Company with the backing of government and is governed by a board comprising both Indian and non-Indian businessmen."
text.para1 text.para1
## [1] "Indian School of Business (ISB), is one of the prestigious business schools in India and is rated amongst the best in the world .The institute has various management programs with the PGP as its flagship course .Admission to the coveted management program is rigorous and has also the distinction of having one of the most stringent selectivity ratio internationally . The student cohort has a diverse mix with students coming in from top Indian and International Universities like the IITs , BITS Pilani, NITs , SRCC , ISI etc. There are students who are doctors , people from defence establishments ,sportsman and who have excelled in various other professions .ISB has integrated campuses at Mohali, Punjab and Hyderabad, Telangana, and is a non profit organization.[2] The school was founded by two senior executives of McKinsey & Company with the backing of government and is governed by a board comprising both Indian and non-Indian businessmen."
This below is Para 2.
<- "ISB has been ranked 27th in the world in the 2017 Financial Times Global MBA Rankings.[3] It is the first business school in Indian subcontinent to be accredited by the Association to Advance Collegiate Schools of Business.[4] In 2008, it became the youngest institution to find a place in global MBA rankings when it was ranked 20.[5] Indian School of Business accepts both GMAT and GRE scores for the admission process."
text.para2 text.para2
## [1] "ISB has been ranked 27th in the world in the 2017 Financial Times Global MBA Rankings.[3] It is the first business school in Indian subcontinent to be accredited by the Association to Advance Collegiate Schools of Business.[4] In 2008, it became the youngest institution to find a place in global MBA rankings when it was ranked 20.[5] Indian School of Business accepts both GMAT and GRE scores for the admission process."
Now, we invoke the funcs above and tokenize these paragraphs.
# tokenize these paragraphs
out.para1 = Tokenize_Text_Block(text.para1)
out.para2 = Tokenize_Text_Block(text.para2)
str(out.para1) # check the structure or str() of the output
## List of 3
## $ num_tokens : int 149
## $ unique_tokens: int 97
## $ text : chr [1:149] "indian" "school" "of" "business" ...
1.8 Building DTM from first principles
We know that text corpora are organized along broad units called ‘documents’ which we can define. These could be individual reviews, tweets, paragraphs in an unbroken text article etc.
A DTM or document-term matrix is a matrix with one row per document and one column per unique token. Each cell [i, j] in the DTM carries the number of occurrences of token j in document i.
From the above, we know how to tokenize text into word-tokens. Now let’s home-brew a DTM on that basis.
# Merge the two token sets
merged.token.sets = unique(c(out.para1$text, out.para2$text))
length(merged.token.sets)
## [1] 125
Next we pre-define the DTM object. Give some thought to why we should do this.
In R, it’s better to pre-define objects’ sizes so that R can allocate memory at the beginning in a static, stable way.
Functions such as rbind()
or cbind()
dynamically append more rows or columns (respectively) to data objects requiring dynamic memory allocation, which R isn’t great at.
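A tiny sketch of the two styles, with toy sizes invented just for illustration:
# growing dynamically: rbind() re-allocates the whole matrix on every iteration
grow.mat = NULL
for (i in 1:100){ grow.mat = rbind(grow.mat, rep(0, 125)) }
# pre-allocating: define the full-size object once, then fill it in place
prealloc.mat = matrix(0, nrow = 100, ncol = 125)
for (i in 1:100){ prealloc.mat[i, ] = rep(0, 125) }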
The DTM will have as many rows as there are docs (2 in our ISB example) and as many columns as there are unique tokens. So…
# define the dtm object to populate
dtm.fp = matrix(0, nrow = 2, ncol = length(merged.token.sets))
row.names(dtm.fp) = seq(1:2)
colnames(dtm.fp) = merged.token.sets
dim(dtm.fp)
## [1] 2 125
# define rows in a list
docs.list = list(out.para1$text, out.para2$text)
Now it's time to populate the DTM.
There’re many ways to achieve this. I’m using a simple approach - for
loops - that’s intuitive but not efficient. See code below.
# populate the dtm
for (i1 in 1:length(merged.token.sets)){ # looping over tokens
  for (i2 in 1:2){ # loop over documents
    dtm.fp[i2, i1] = length(grep(merged.token.sets[i1], docs.list[[i2]]))
  }
} # both i1, i2 loops end

dtm.fp[, 1:15] # view first 15 cols of the DTM
## indian school of business isb is one the prestigious schools in india and
## 1 4 3 8 3 2 12 2 14 1 1 23 5 8
## 2 2 3 2 3 1 3 0 6 0 1 16 2 1
## rated amongst
## 1 2 1
## 2 0 0
The interpretation of the resulting table is straightforward. The DTM above says that the tokens ‘indian’ and ‘school’ occurred 4 and 3 times respectively in document 1, and {2, 3} times respectively in document 2. And so on.
DTMs are so fundamental to text-An that almost every text-an library in R (or Py, for that matter) will have in-built tokenizer plus DTM builder functions.
And now that we have seen how to home-brew DTMs, we can (i) open up these functions to see their innards, and (ii) change things as required by particular problem contexts.
1.8.1 Building a simple n-gram tokenizer
Next we home-brew n-grams. A single word, say “cream”, is a unigram.
But two (or more) words such as “ice” and “cream” that tend to occur together form a phrase (or part thereof) and can be identified as unique tokens in their own right.
Thus, “ice cream” is a bi-gram whereas “ice cream cone” would be a tri-gram.
Mind you, n-grams aren’t just any 2 or 3 words that occurred somewhere once together but are repeated (partial) phrases that occur multiple times through the corpus.
# build matrix to hold (n-1) bigrams
n = length(out.para1$text); n
## [1] 149
bigram.mat = matrix(0, nrow = (n-1), ncol = 2)
A sentence with 5 words - say, “My name is Amitabh Bachchan” - has 4 sets of consecutive word-pairs in it, namely “My name”, “name is”, “is Amitabh”, and “Amitabh Bachchan”.
More generally, any text with n tokens can have at most (n-1) bigrams in it. Likewise, (n-2) trigrams and so on.
# loop over all tokens now
for (i1 in 1:(n-1)){
  bigram.mat[i1, 1] = out.para1$text[i1]
  bigram.mat[i1, 2] = out.para1$text[i1+1]
}
head(bigram.mat)
## [,1] [,2]
## [1,] "indian" "school"
## [2,] "school" "of"
## [3,] "of" "business"
## [4,] "business" "isb"
## [5,] "isb" "is"
## [6,] "is" "one"
And now it’s easy to stitch up the two unigram components into a bigram using paste()
, thus:
stitched = paste(bigram.mat[,1], bigram.mat[,2], sep=".")
bigram_df = data.frame(unigram1 = bigram.mat[,1], unigram2 = bigram.mat[,2], bigram = stitched)
head(bigram_df)
## unigram1 unigram2 bigram
## 1 indian school indian.school
## 2 school of school.of
## 3 of business of.business
## 4 business isb business.isb
## 5 isb is isb.is
## 6 is one is.one
While we see many word-pairs, how many are true bi-grams? How about we run a de-duplication on the matrix’s rows and find out?
a0 = unique.matrix(bigram.mat); dim(a0) # de-duplicated word-pairs; compare with (n-1) rows to see how many repeated
## [1] 142 2
And the same applies to trigrams as well. See code below.
# the same for trigrams is just as easy
trigram.mat = matrix(0, nrow = (n-2), ncol = 3)

for (i1 in 1:(n-2)){
  trigram.mat[i1, 1] = out.para1$text[i1]
  trigram.mat[i1, 2] = out.para1$text[i1+1]
  trigram.mat[i1, 3] = out.para1$text[i1+2]
}
head(trigram.mat)
## [,1] [,2] [,3]
## [1,] "indian" "school" "of"
## [2,] "school" "of" "business"
## [3,] "of" "business" "isb"
## [4,] "business" "isb" "is"
## [5,] "isb" "is" "one"
## [6,] "is" "one" "of"
With that, I’ll stop home-brewing here. We’ll next go to standard R libraries and directly use their DTM and n-gram functions here on.
Sudhir