You can discover the topics in a document in various ways. The simplest approach starts from the observation that when a group of people talks or writes about a topic, they tend to draw on a limited vocabulary because they refer or relate to the same subject. When you share some meaning or belong to the same group, you tend to use the same language.
Consequently, if you have a collection of texts and don't know what topics they reference, you can reverse the reasoning: simply look for groups of words that tend to associate, so that the groups newly formed by dimensionality reduction hint at the topics you'd like to know about. This is a typical unsupervised learning task, and it's a perfect application for the singular value decomposition (SVD) family of algorithms, because by reducing the number of columns, the features (which, in a document, are the words) gather in dimensions, and you can discover the topics by checking the high-scoring words. SVD and Principal Components Analysis (PCA) allow features to relate both positively and negatively to the newly created dimensions.
So a resulting topic may be expressed by the presence of a word (high positive value) or by its absence (high negative value), which makes interpretation both tricky and counterintuitive for humans. The Scikit-learn package includes the Non-Negative Matrix Factorization (NMF) decomposition class, which allows an original feature to relate only positively to the resulting dimensions, as the following sketch illustrates.
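A minimal sketch, assuming a tiny made-up corpus (the toy_docs lines and all variable names here are illustrative, not part of the running example), makes the difference visible: PCA loadings carry both signs, while NMF loadings never go below zero.
from sklearn.decomposition import PCA, NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative toy corpus, not the 20newsgroups data used later
toy_docs = ['cheap car for sale', 'car stereo included',
            'hard drive and ram', 'floppy drive for sale']
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)
toy_pca = PCA(n_components=2).fit(toy_matrix.toarray())
toy_nmf = NMF(n_components=2, random_state=101).fit(toy_matrix)
print((toy_pca.components_ < 0).any())  # typically True: mixed signs
print((toy_nmf.components_ < 0).any())  # always False: non-negative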
This example starts a new experiment by loading the 20newsgroups dataset, a collection of newsgroup postings scraped from the web, selecting only the posts regarding objects for sale and automatically removing headers, footers, and quotes. When working with this code, you may receive a warning message to the effect of WARNING:sklearn.datasets.twenty_newsgroups:Downloading dataset from …, with the URL of the site used for the download.
import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True,
                             categories=['misc.forsale'],
                             remove=('headers', 'footers', 'quotes'),
                             random_state=101)
print('Posts: %i' % len(dataset.data))
Posts: 585
The TfidfVectorizer class is imported and set up to remove stop words (common words such as the or and) and to keep only distinctive words, producing a matrix whose columns point to distinct words.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                             stop_words='english')
tfidf = vectorizer.fit_transform(dataset.data)

from sklearn.decomposition import NMF
n_topics = 5
nmf = NMF(n_components=n_topics, random_state=101).fit(tfidf)
Term frequency-inverse document frequency (TF-IDF) is a simple calculation based on the frequency of a word in a document, weighted by the rarity of that word across all the available documents. Weighting words this way is an effective means of ruling out words that can't help you classify or identify the document when processing text. For example, you can eliminate common parts of speech or other common words.
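As a rough sketch of the idea (this uses the classic tf × idf formulation on a toy corpus; scikit-learn's actual formula adds smoothing and normalization), a rare word earns a high weight while a ubiquitous word drops to zero:
import math

# Illustrative toy corpus of pre-tokenized documents
toy_docs = [['car', 'for', 'sale'], ['hard', 'drive', 'for', 'sale']]

def tfidf_weight(term, doc, corpus):
    tf = doc.count(term) / len(doc)      # frequency within the document
    df = sum(term in d for d in corpus)  # documents containing the term
    return tf * math.log(len(corpus) / df)

print(tfidf_weight('car', toy_docs[0], toy_docs))   # rare word: positive
print(tfidf_weight('sale', toy_docs[0], toy_docs))  # in every doc: 0.0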
As with other algorithms from the sklearn.decomposition module, the n_components parameter indicates the number of desired components. If you'd like to look for more topics, you use a higher number. As the requested number of topics increases, the reconstruction_err_ attribute reports lower error rates. It's up to you to decide when to stop, given the trade-off between more time spent on computations and more topics; the sketch below shows one way to watch that error.
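For instance, a short sketch (the candidate values here are arbitrary choices), reusing the tfidf matrix computed earlier, lets you watch the error shrink as the number of topics grows:
# Fit NMF for a few candidate topic counts and report the error
for n in (2, 5, 10, 15):
    model = NMF(n_components=n, random_state=101).fit(tfidf)
    print('n_topics=%2d reconstruction_err_=%.3f'
          % (n, model.reconstruction_err_))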
The last part of the script outputs the resulting five topics. By reading the printed words, you can decide on the meaning of the extracted topics, thanks to product characteristics (for instance, the words drive, hard, card, and floppy refer to computers) or the exact product (for instance, comics, car, stereo, games).
feature_names = vectorizer.get_feature_names()
n_top_words = 15
for topic_idx, topic in enumerate(nmf.components_):
    print("Topic #%d:" % (topic_idx + 1))
    print(" ".join([feature_names[i] for i in
                    topic.argsort()[:-n_top_words - 1:-1]]))
Topic #1:
drive hard card floppy monitor meg ram disk motherboard vga scsi brand
color internal modem
Topic #2:
00 50 dos 20 10 15 cover 1st new 25 price man 40 shipping comics
Topic #3:
condition excellent offer asking best car old sale good new miles 10 000
tape cd
Topic #4:
email looking games game mail interested send like thanks price package
list sale want know
Topic #5:
shipping vcr stereo works obo included amp plus great volume vhs unc mathes
gibbs radley
You can explore the resulting model by looking into the components_ attribute of the trained NMF model. It consists of a NumPy ndarray holding positive values for the words connected to each topic. By using the argsort method, you can get the indexes of the top associations, whose high values indicate that they are the most representative words.
# Gets the indexes of the top words for topic 0
print(nmf.components_[0, :].argsort()[:-n_top_words - 1:-1])
[1337 1749 889 1572 2342 2263 2803 1290 2353 3615 3017 806 1022 1938
2334]
Decoding the words' indexes into readable strings is a matter of looking them up in the array returned by the get_feature_names method of the previously fitted TfidfVectorizer.
# Transforms index 1337 back to text
print(vectorizer.get_feature_names()[1337])
drive
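As a closing sketch (choosing the first post is just an illustrative assumption), you can also project a document into the topic space with the fitted model's transform method; each column of the result reports how strongly that post relates to one of the five topics:
scores = nmf.transform(tfidf[:1])  # topic weights for the first post
print(scores.round(3))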