Understanding How Machines Read

Recognizing text is an important part of machine learning. Before a computer can do anything with text, it must be able to read the text in some manner. Categorical data is a kind of short text that you represent using binary variables, that is, variables coded as one or zero according to whether a certain value is present in the categorical variable. Not surprisingly, you can represent complex text using the same logic.

Therefore, just as you transform a categorical color variable with values such as red, green, and blue into three binary variables, each representing one of the three colors, you can transform a phrase like “The quick brown fox jumps over the lazy dog” into nine binary variables, one for each distinct word in the text (“The” counts as distinct from “the” because of its initial capital letter; note, though, that the Scikit-learn example below lowercases words by default, so there the two merge into one feature). This is the Bag of Words (BoW) form of representation.
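To make the parallel concrete, here is a minimal sketch of that color encoding in plain Python (the variable names are illustrative, not from any library):

# Each distinct color becomes its own binary (one-hot) column.
colors = ['red', 'green', 'blue', 'green']
vocabulary = sorted(set(colors))        # ['blue', 'green', 'red']
one_hot = [[int(value == color) for color in vocabulary]
           for value in colors]
print(one_hot)   # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]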

In its simplest form, BoW shows whether a certain word is present in the text by flagging a specific feature in the dataset. Take a look at an example using Python and its Scikit-learn package.

The input data is three phrases, text_1, text_2, and text_3, placed in a list, corpus. A corpus is a set of homogeneous documents put together for NLP analysis:

text_1 = 'The quick brown fox jumps over the lazy dog.'
text_2 = 'My dog is quick and can jump over fences.'
text_3 = 'Your dog is so lazy that it sleeps all the day.'
corpus = [text_1, text_2, text_3]

When you need to analyze text using a computer, you load the documents from disk or scrape them from the web and place each of them into a string variable. If you have multiple documents, you store them all in a list, the corpus. When you have a single document, you can split it using chapters, paragraphs, or simply the end of each line. After splitting the document, place all its parts into a list and apply analysis as if the list were a corpus of documents.
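For instance, here is a minimal sketch that turns a single document into a corpus by splitting on blank lines, so that each paragraph becomes one entry (the file name novel.txt is hypothetical):

# Read one long document from disk (hypothetical file name).
with open('novel.txt', encoding='utf-8') as f:
    document = f.read()

# Treat each paragraph (separated by a blank line) as its own document.
corpus = [part.strip() for part in document.split('\n\n') if part.strip()]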

Now that you have a corpus, you use the CountVectorizer class from the feature_extraction module in Scikit-learn, which easily transforms texts into a BoW representation, like this:

from sklearn.feature_extraction import text

vectorizer = text.CountVectorizer(binary=True).fit(corpus)
vectorized_text = vectorizer.transform(corpus)
print(vectorized_text.todense())

[[0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 1 0 0 0 1 0]
 [0 1 0 1 0 1 1 0 1 0 1 0 0 1 1 1 0 0 0 0 0]
 [1 0 0 0 1 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1 1]]

The CountVectorizer class learns the corpus vocabulary using the fit method and then encodes each document (using the transform method) as a row of numbers. The transform method actually returns a sparse matrix, which is why the example calls todense() before printing. The displayed result is a matrix with three rows (the three documents, in the same order as the corpus) and 21 columns, one for each distinct word in the corpus.
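As a quick check, you can confirm those dimensions directly by continuing the same session:

# One row per document, one column per distinct word.
print(vectorized_text.shape)          # (3, 21)
print(len(vectorizer.vocabulary_))    # 21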

The BoW representation turns words into the column features of a document matrix, and these features have a nonzero value when the corresponding word is present in the processed text. For instance, consider the word dog. The following code prints the vocabulary that the vectorizer learned, which tells you which column each word occupies:

print(vectorizer.vocabulary_)

{'day': 4, 'jumps': 11, 'that': 18, 'the': 19, 'is': 8,
 'fences': 6, 'lazy': 12, 'and': 1, 'quick': 15, 'my': 13,
 'can': 3, 'it': 9, 'so': 17, 'all': 0, 'brown': 2,
 'dog': 5, 'jump': 10, 'over': 14, 'sleeps': 16,
 'your': 20, 'fox': 7}

Asking the CountVectorizer to print the vocabulary learned from the text reports that it associates dog with the number 5, which means that dog occupies column 5 of the BoW representation (the sixth column, because Python counts from zero). In fact, in the obtained BoW, the element at index 5 of each document row always has a value of 1, because dog is the only word present in all three documents.
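You can verify this by slicing the dog column out of the matrix, continuing the session above. Looking up the column index in the learned vocabulary avoids hard-coding the number 5:

# Find the column that CountVectorizer assigned to 'dog'.
dog_column = vectorizer.vocabulary_['dog']          # 5
# Slice that column out of the document matrix: one entry per document.
print(vectorized_text[:, dog_column].todense())

[[1]
 [1]
 [1]]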

Storing documents in document-matrix form can be memory intensive because you must represent each document as a vector of the same length as the dictionary that created it. The dictionary in this example is quite limited, but when you use a larger corpus, the vocabulary grows quickly; estimates of the full English vocabulary run to a million terms or more. The solution is to use sparse matrices. A sparse matrix is a way to store a matrix in your computer’s memory in which zero values occupy no memory space.
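Conveniently, Scikit-learn already does this for you: the transform method returns a SciPy sparse matrix (typically in CSR, compressed sparse row, format), and the earlier example called todense() only to make the output readable. Continuing the same session, you can inspect the compact storage directly:

# The BoW matrix is stored sparsely: only the coordinates and values
# of the nonzero entries occupy memory.
print(type(vectorized_text))   # a scipy.sparse CSR matrix
print(vectorized_text)         # one '(row, col)  value' line per stored nonzero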
