Using Machine Learning to Analyze Reviews from E-Commerce

Updated: 2016-10-06 19:15:24
Machine learning will have a significant impact on e-commerce, for example by analyzing the sentiment of customer reviews. Sentiment is difficult to capture because people use the same words to express opposite sentiments. The sentiment you convey depends on how you construct a phrase, not simply on the words you use. Dictionaries of positive and negative words exist and are helpful, but they aren’t decisive because word context matters. You can use these dictionaries to enrich textual features, but you have to rely more on machine learning if you want to achieve good results.

It’s a good idea to see how positive and negative word dictionaries work. The AFINN-111 dictionary contains 2,477 positive and negative words and phrases. Another good choice is the larger opinion lexicon by Hu and Liu. Both dictionaries contain English words.
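To make the dictionary approach concrete, here is a minimal sketch of lexicon-based scoring. The tiny LEXICON below is a hypothetical stand-in with made-up scores, not the actual AFINN-111 or Hu and Liu entries, and lexicon_score is an illustrative helper name.

```python
import re

# Hypothetical mini-lexicon with made-up scores (NOT real AFINN-111 values)
LEXICON = {"cool": 2, "good": 2, "love": 3,
           "waste": -2, "wasted": -2, "poor": -2}


def lexicon_score(phrase):
    """Sum the lexicon scores of the words in a phrase; unknown words count as 0."""
    words = re.findall(r"[a-z']+", phrase.lower())
    return sum(LEXICON.get(word, 0) for word in words)


for phrase in ["It was so cool.",
               "Wasted two hours.",
               "Waste your money on this game."]:
    print(phrase, "->", lexicon_score(phrase))
```

The last phrase gets a negative score because the word waste dominates, whatever the surrounding words are. A word-by-word lookup has no way to take context into account, which is the limitation that motivates learning from labeled examples.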

Using labeled examples that associate phrases with sentiments can create more effective predictors. In this example, you create a machine learning model based on a dataset containing reviews from Amazon, Yelp, and IMDB that you can find at the UCI Machine Learning Repository.

This dataset was created for the paper “From Group to Individual Labels Using Deep Features,” by Kotzias et al., for KDD 2015. The dataset contains 3,000 labeled reviews divided equally among the three sources, and the data has a simple structure: each line holds the review text, followed by a tab and a binary sentiment label, where 1 is a positive sentiment and 0 a negative one. You can download the dataset and place it in your Python working directory using the following commands:

import io
import os
import zipfile

import requests

# Location of the zipped dataset on the UCI repository
UCI_url = ('https://archive.ics.uci.edu/ml/'
           'machine-learning-databases/00331/sentiment%20'
           'labelled%20sentences.zip')

# Download the archive into memory and open it as a zip file
response = requests.get(UCI_url)
compressed_file = io.BytesIO(response.content)
z = zipfile.ZipFile(compressed_file)
print('Extracting in %s' % os.getcwd())
for name in z.namelist():
    filename = name.split('/')[-1]
    # Skip directory entries and macOS metadata files
    nameOK = ('MACOSX' not in name and '.DS' not in name)
    if filename and nameOK:
        newfile = os.path.join(os.getcwd(),
                               os.path.basename(filename))
        with open(newfile, 'wb') as f:
            f.write(z.read(name))
        print('\tunzipping %s' % newfile)

If the previous script doesn’t work, you can download the data (in zip format) and expand it using your favorite unzipper. You’ll find the imdb_labelled.txt file inside the newly created sentiment labelled sentences directory. After downloading the files, you can load the IMDB file into a pandas DataFrame by using the read_csv function.

import numpy as np
import pandas as pd

dataset = 'imdb_labelled.txt'
# Each line holds a review, a tab, and a 0/1 sentiment label
data = pd.read_csv(dataset, header=None, sep=r"\t",
                   engine='python')
data.columns = ['review', 'sentiment']
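The file layout described earlier (review text, a tab, then a 0 or 1) is simple enough to parse by hand. The following is a minimal sketch that uses only the standard library; parse_labelled_lines is a hypothetical helper, and the sample lines are illustrative strings rather than a guaranteed excerpt of the file.

```python
def parse_labelled_lines(lines):
    """Split 'text<TAB>label' lines into (review, sentiment) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the LAST tab so a tab inside the text does not break parsing
        review, label = line.rsplit("\t", 1)
        pairs.append((review.strip(), int(label)))
    return pairs


sample = ["Wasted two hours.\t0\n", "It was so cool.\t1\n", "\n"]
for review, sentiment in parse_labelled_lines(sample):
    print(sentiment, review)
```

To apply the same helper to the real file, you could pass it an open file object, because iterating over a text file yields one line at a time.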

Exploring the textual data is quite interesting. You’ll find short phrases such as “Wasted two hours” or “It was so cool.” Some are clearly ambiguous for a computer, such as “Waste your money on this game.” Even though waste has a negative meaning, the imperative makes the phrase sound positive. A machine learning algorithm can learn to decipher ambiguous phrases like these only after seeing many variants. The next step is to build the model by splitting the data into training and test sets.

from sklearn.model_selection import train_test_split

# Hold out 25 percent of the reviews as a test set
corpus, test_corpus, y, yt = train_test_split(
    data.iloc[:, 0], data.iloc[:, 1],
    test_size=0.25, random_state=101)
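To show what a train/test split does, here is a minimal standard-library sketch of the idea. The simple_train_test_split function is hypothetical and does not reproduce scikit-learn's shuffling, so the same seed produces a different partition than the code above.

```python
import random


def simple_train_test_split(items, labels, test_size=0.25, seed=101):
    """Shuffle indices reproducibly, then hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [items[i] for i in train_idx], [labels[i] for i in train_idx]
    test = [items[i] for i in test_idx], [labels[i] for i in test_idx]
    return train, test


reviews = ["review %d" % i for i in range(8)]
labels = [i % 2 for i in range(8)]
(train_x, train_y), (test_x, test_y) = simple_train_test_split(reviews, labels)
print(len(train_x), "training examples,", len(test_x), "test examples")
```

Fixing the seed keeps the partition stable between runs, which is also the purpose of random_state=101 in the article's code.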

After splitting the data, the code transforms the text using several NLP techniques: token counts, unigrams and bigrams, stop-word removal, text length normalization, and TF-IDF transformation.

from sklearn.feature_extraction import text

# Count unigrams and bigrams, ignoring English stop words
vectorizer = text.CountVectorizer(ngram_range=(1,2),
                stop_words='english').fit(corpus)
# Reweight the counts by TF-IDF; fit on the training set only
TfidF = text.TfidfTransformer()
X = TfidF.fit_transform(vectorizer.transform(corpus))
Xt = TfidF.transform(vectorizer.transform(test_corpus))
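The transformations listed above can be sketched in plain Python to show what each step contributes. This is a simplified illustration, not scikit-learn's implementation: the stop-word list is a hypothetical mini list, the tokenizer is a rough approximation, and the IDF uses the smoothed form ln((1 + n) / (1 + df)) + 1, which the scikit-learn documentation gives as the default for TfidfTransformer.

```python
import math
import re
from collections import Counter

# Hypothetical mini stop-word list (the real English list is much longer)
STOP_WORDS = {"it", "was", "so", "the", "a", "this", "is"}


def ngram_counts(doc):
    """Count unigrams and bigrams after lowercasing and stop-word removal."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", doc.lower())
              if t not in STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    counts = [ngram_counts(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for c in counts for term in c)
    rows = []
    for c in counts:
        row = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, tf in c.items()}
        # Length normalization: scale each document vector to unit L2 norm
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({t: w / norm for t, w in row.items()})
    return rows


docs = ["It was so cool", "The movie was cool", "A wasted movie"]
for doc, row in zip(docs, tfidf(docs)):
    print(doc, "->", {t: round(w, 3) for t, w in sorted(row.items())})
```

Terms that appear in fewer documents, such as the bigram "wasted movie", receive a larger weight than terms shared across documents, such as "movie".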

After the text for both the training and test sets is ready, the algorithm can learn sentiment using a linear support vector machine. This kind of support vector machine uses L2 regularization, whose strength is controlled by the C parameter, so the code searches for the best C value using the grid search approach.

from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# Candidate values for the regularization parameter C
param_grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}
clf = GridSearchCV(LinearSVC(loss='hinge',
                   random_state=101), param_grid)
# Cross-validate every candidate and refit on the best one
clf = clf.fit(X, y)
print("Best parameters: %s" % clf.best_params_)

Best parameters: {'C': 1.0}
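To see why C needs tuning, it helps to look at the quantity a linear SVM minimizes. The following is a minimal sketch, assuming the standard soft-margin objective of an L2 penalty on the weights plus C times the total hinge loss. The function names and toy numbers are hypothetical, the bias term is omitted for brevity, and labels are coded as +1 and -1 rather than 1 and 0.

```python
def hinge_loss(w, x, y):
    """Hinge loss for one example; y must be +1 or -1."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)


def svm_objective(w, X, y, C):
    """L2 penalty on the weights plus C times the total hinge loss."""
    penalty = 0.5 * sum(wi * wi for wi in w)
    data_loss = sum(hinge_loss(w, xi, yi) for xi, yi in zip(X, y))
    return penalty + C * data_loss


# Toy data: two features, labels in {+1, -1}
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, -1, 1]
w = [1.0, -1.0]

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print("C=%6.2f  objective=%.2f" % (C, svm_objective(w, X, y, C)))
```

A small C lets the penalty term dominate, which favors small weights even at the cost of training errors. A large C makes training errors expensive, which risks overfitting. The grid search picks the value that works best on held-out folds.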

Now that the code has determined the best hyper-parameter for the problem, you can test performance on the test set using the accuracy measure, which is the percentage of times the code guesses the correct sentiment.

from sklearn.metrics import accuracy_score

# Predict the sentiment of the held-out reviews and score the result
solution = clf.predict(Xt)
print("Achieved accuracy: %0.3f" %
      accuracy_score(yt, solution))

Achieved accuracy: 0.816
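The accuracy measure itself is simple to compute by hand. Here is a minimal sketch with a hypothetical accuracy helper and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print("Accuracy: %0.3f" % accuracy(y_true, y_pred))
```

Assuming all 1,000 IMDB reviews load, the test set holds 250 reviews (25 percent), so an accuracy of 0.816 corresponds to 204 correct predictions and 46 wrong ones.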

The results indicate an accuracy higher than 80 percent, but it’s also interesting to determine which phrases tricked the algorithm into making a wrong prediction. You can print the misclassified texts and consider what the learning algorithm is missing in terms of learning from text.

print(test_corpus[yt!=solution])

601    There is simply no excuse for something this p...
32     This is the kind of money that is wasted prope...
887    At any rate this film stinks, its not funny, a...
668    Speaking of the music, it is unbearably predic...
408    It really created a unique feeling though.
413    The camera really likes her in this movie.
138    I saw "Mirrormask" last night and it was an un...
132    This was a poor remake of "My Best Friends Wed...
291    Rating: 1 out of 10.
904    I’m so sorry but I really can't recommend it t...
410    A world better than 95% of the garbage in the ...
55     But I recommend waiting for their future effor...
826    The film deserves strong kudos for taking this...
100    I don't think you will be disappointed.
352    It is shameful.
171    This movie now joins Revenge of the Boogeyman ...
814    You share General Loewenhielm's exquisite joy ...
218    It's this pandering to the audience that sabot...
168    Still, I do like this movie for it's empowerme...
479    Of course, the acting is blah.
31     Waste your money on this game.
805    The only place good for this film is in the ga...
127    my only problem is I thought the actor playing...
613    Go watch it!
764    This movie is also revealing.
107    I love Lane, but I’ve never seen her in a movi...
674    Tom Wilkinson broke my heart at the end... and...
30     There are massive levels, massive unlockable c...
667    It is not good.
823    I struggle to find anything bad to say about i...
739    What on earth is Irons doing in this film?
185    Highly unrecommended.
621    A mature, subtle script that suggests and occa...
462    Considering the relations off screen between T...
595    Easily, none other cartoon made me laugh in a ...
8      A bit predictable.
446    I like Armand Assante & my cable company's sum...
449    I won't say any more - I don't like spoilers, ...
715    Im big fan of RPG games too, but this movie, i...
241    This would not even be good as a made for TV f...
471    At no point in the proceedings does it look re...
481    And, FINALLY, after all that, we get to an end...
104    Too politically correct.
522    Rating: 0/10 (Grade: Z) Note: The Show Is So B...
174    This film has no redeeming features.
491    This movie creates its own universe, and is fa...
Name: review, dtype: object
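Several of the misclassified phrases, such as “It is not good.” and “This film has no redeeming features.”, rely on a negation. One possible contributor, offered here as an assumption rather than something the results above establish, is that stop-word removal can discard negations before the model ever sees them. The sketch below uses a hypothetical mini stop-word list to illustrate the effect; the actual list used by a given library may differ.

```python
import re

# Hypothetical mini stop-word list; real lists differ, but many include negations
STOP_WORDS = {"it", "is", "not", "no", "this", "the", "i", "you", "will", "be"}


def content_words(phrase):
    """Lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z']+", phrase.lower())
    return [t for t in tokens if t not in STOP_WORDS]


for phrase in ["It is not good.",
               "It is good.",
               "This film has no redeeming features."]:
    print(repr(phrase), "->", content_words(phrase))
```

With this list, “It is not good.” and “It is good.” reduce to the same tokens, so no classifier trained on them could tell the two apart. Keeping negations out of the stop-word list is one experiment worth trying.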

About This Article

This article is from the book: 

About the book author:

John Paul Mueller is a freelance author and technical editor. He has writing in his blood, having produced 100 books and more than 600 articles to date. The topics range from networking to home security and from database management to heads-down programming. John has provided technical services to both Data Based Advisor and Coast Compute magazines.

Luca Massaron is a data scientist specialized in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques. Because of his job as a quantitative marketing consultant and marketing researcher, he has been involved in quantitative data since 2000 with different clients and in various industries, and is one of the top 10 Kaggle data scientists.