Therefore, just as you transform a categorical color variable, having values such as red, green, and blue, into three binary variables, each one representing one of the three colors, so you can transform a phrase like “The quick brown fox jumps over the lazy dog” into nine binary variables, one for each distinct word that appears in the text (“The” is considered distinct from “the” because of its initial capital letter). This is the Bag of Words (BoW) form of representation.
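A minimal sketch of the idea in plain Python, before bringing in any library, looks like the following (the variable names phrase, tokens, and flags are just illustrative, not part of any package):
phrase = 'The quick brown fox jumps over the lazy dog'
tokens = phrase.split()                      # nine words; 'The' differs from 'the'
vocabulary = sorted(set(tokens))             # one binary variable per distinct word
flags = {word: int(word in tokens) for word in vocabulary}
print(flags)
{'The': 1, 'brown': 1, 'dog': 1, 'fox': 1, 'jumps': 1,
 'lazy': 1, 'over': 1, 'quick': 1, 'the': 1}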
In its simplest form, BoW shows whether a certain word is present in the text by flagging a specific feature in the dataset. Take a look at an example using Python and its Scikit-learn package.
The input data is three phrases, text_1, text_2, and text_3, placed in a list, corpus. A corpus is a set of homogeneous documents put together for NLP analysis:
text_1 = 'The quick brown fox jumps over the lazy dog.'
text_2 = 'My dog is quick and can jump over fences.'
text_3 = 'Your dog is so lazy that it sleeps all the day.'
corpus = [text_1, text_2, text_3]
When you need to analyze text using a computer, you load the documents from disk or scrape them from the web and place each of them into a string variable. If you have multiple documents, you store them all in a list, the corpus. When you have a single document, you can split it into chapters, paragraphs, or simply lines. After splitting the document, place all its parts into a list and apply analysis as if the list were a corpus of documents, as sketched below.
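As a small illustration of that last step (the document string here is invented for the example), you can split on blank lines to obtain paragraphs and treat the resulting list as a corpus:
document = 'First paragraph of the text.\n\nSecond paragraph of the text.'
corpus = [part for part in document.split('\n\n') if part.strip()]
print(corpus)
['First paragraph of the text.', 'Second paragraph of the text.']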
Now that you have a corpus, you use a class from the feature_extraction module in Scikit-learn, CountVectorizer, which easily transforms texts into BoW like this:
from sklearn.feature_extraction import text
vectorizer = text.CountVectorizer(binary=True).fit(corpus)
vectorized_text = vectorizer.transform(corpus)
print(vectorized_text.todense())
[[0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 1 0 0 0 1 0]
[0 1 0 1 0 1 1 0 1 0 1 0 0 1 1 1 0 0 0 0 0]
[1 0 0 0 1 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1 1]]
The CountVectorizer class learns the corpus content using the fit method and then turns it (using the transform method) into a list of lists, which the todense call prints. A list of lists is nothing more than a matrix in disguise, so what the class returns is actually a matrix made of three rows (the three documents, in the same order as the corpus) and 21 columns, one for each distinct word found in the corpus.
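You can confirm those dimensions directly from the objects already created; the shape attribute reports the matrix size, and in recent Scikit-learn versions (1.0 and later) get_feature_names_out lists the 21 words that act as columns:
print(vectorized_text.shape)
(3, 21)
print(len(vectorizer.get_feature_names_out()))
21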
The BoW representation turns words into the column features of a document matrix, and these features have a nonzero value when they are present in the processed text. For instance, consider the word dog. The following code prints the vocabulary learned from the corpus, which tells you which column represents dog in the BoW:
print(vectorizer.vocabulary_)
{'day': 4, 'jumps': 11, 'that': 18, 'the': 19, 'is': 8,
'fences': 6, 'lazy': 12, 'and': 1, 'quick': 15, 'my': 13,
'can': 3, 'it': 9, 'so': 17, 'all': 0, 'brown': 2,
'dog': 5, 'jump': 10, 'over': 14, 'sleeps': 16,
'your': 20, 'fox': 7}
Asking the CountVectorizer to print the vocabulary learned from the text reports that it associates dog with the number five, which means that dog corresponds to column index 5 in the BoW representations (counting from zero). In fact, in the obtained BoW, the element at index 5 of each document list always has a value of 1 because dog is the only word present in all three documents.
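As a quick check, you can look up the column index for dog in the learned vocabulary and slice that column out of the dense matrix; every row of the result shows a 1:
dog_column = vectorizer.vocabulary_['dog']
print(vectorized_text.todense()[:, dog_column])
[[1]
 [1]
 [1]]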
Storing documents in a document matrix form can be memory intensive because you must represent each document as a vector of the same length as the dictionary that created it. The dictionary in this example is quite limited, but when you use a larger corpus, you discover that a dictionary of the English language contains well over a million terms. The solution is to use sparse matrices. A sparse matrix is a way to store a matrix in your computer’s memory without having zero values occupying memory space.
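The vectorized_text object returned by transform is already stored in sparse form. Printing it directly (rather than calling todense) lists only the coordinates and values of the nonzero entries, so the many zero columns of a large vocabulary take up no memory; the exact layout of the printout depends on your SciPy version:
print(vectorized_text)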