Group 4 MovieReview



Course code: 15CS84


Submitted By

Rachana R 1SG15CS075

Ramya R 1SG15CS078

Rashmi 1SG15CS081

Rashmi RV 1SG15CS082

Sushma D 1SG15CS113

JAN 2019-FEB 2019

6th MARCH 2019

Analyzing movie reviews and classifying each one as positive or
negative based on the opinion the user expresses.

The dataset consists of 50000 movie reviews from IMDb. The data is
split evenly with 25k reviews intended for training and 25k for testing your
classifier. Moreover, each set has 12.5k positive and 12.5k negative reviews. IMDb
lets users rate movies on a scale from 1 to 10. To label these reviews the curator of
the data labeled anything with ≤ 4 stars as negative and anything with ≥ 7 stars as
positive. Reviews with 5 or 6 stars were left out.
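
The labeling rule described above can be sketched as a small function. This is only an illustration of the thresholds stated in the text; the name `label_from_rating` is hypothetical and is not part of the dataset's own tooling.

```python
def label_from_rating(stars):
    """Map a 1-10 star rating to a sentiment label (hypothetical helper)."""
    if not 1 <= stars <= 10:
        raise ValueError("rating must be between 1 and 10")
    if stars <= 4:
        return "negative"
    if stars >= 7:
        return "positive"
    return None  # 5 and 6 stars are left out of the dataset


ratings = [1, 4, 5, 6, 7, 10]
labels = [label_from_rating(s) for s in ratings]
print(labels)
```

Reviews that map to `None` would simply be dropped before training.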


The data is usually in text format. People express their opinions on
movies through reviews. These reviews can be a single word, a line, or a
paragraph. We remove stop words such as “if” and “actually” from the comments
to improve the model and make prediction easier.


The dataset we are using contains two parts:

• Training data
• Testing data

For our first iteration we did very basic text processing, such as removing
punctuation and HTML tags and making everything lower-case. We can clean
things up further by removing stop words and normalizing the text.
To make these transformations we’ll use libraries from the Natural Language
Toolkit (NLTK), a very popular NLP library for Python. Data preprocessing
also involves converting the given data into numerical values so that the
machine can work with it. This process is called vectorization.
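
The basic cleaning steps mentioned above (stripping HTML tags and punctuation, lower-casing, dropping stop words) might look roughly like the sketch below. The regular expressions and the tiny stop-word set are assumptions chosen for illustration; they are not the exact patterns or the NLTK stop-word list used in the project.

```python
import re

# Assumed patterns: any <...> span counts as an HTML tag, and every character
# other than a lower-case letter, digit, or whitespace counts as punctuation.
HTML_TAG = re.compile(r"<[^>]+>")
NON_WORD = re.compile(r"[^a-z0-9\s]")
STOP_WORDS = {"if", "actually", "the", "a", "of"}  # illustrative, not NLTK's list


def clean_review(text):
    text = HTML_TAG.sub(" ", text)   # replace HTML tags with a space
    text = text.lower()              # make everything lower-case
    text = NON_WORD.sub("", text)    # remove punctuation
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)


sample = "Actually, the plot was GREAT!<br />A must watch."
cleaned = clean_review(sample)
print(cleaned)
```

Applying `clean_review` to every review would give lists comparable to the `reviews_train_clean` and `reviews_test_clean` variables used in the later snippets.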


In order for this data to make sense to our machine learning algorithm we’ll
need to convert each review to a numeric representation, which is called
vectorization. The simplest form of this is to create one very large matrix with
one column for every unique word in the corpus (where the corpus is all 50k
reviews in our case). Then we transform each review into one row containing 0s
and 1s, where 1 means that the word in the corpus corresponding to that column
appears in that review. As a result, each row of the matrix will be very sparse
(mostly zeros). This process is also known as one hot encoding.

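from sklearn.feature_extraction.text import CountVectorizer

# binary=True records only whether a word occurs in a review, not how often
cv = CountVectorizer(binary=True)
cv.fit(reviews_train_clean)  # learn the vocabulary from the training reviews
X = cv.transform(reviews_train_clean)
X_test = cv.transform(reviews_test_clean)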
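
To make the 0/1 matrix concrete, here is a plain-Python sketch of the same idea on a three-review toy corpus. It only illustrates the encoding; it is not how `CountVectorizer` is implemented, and the helper names are hypothetical.

```python
def build_vocabulary(corpus):
    """Assign a column index to every unique word, in sorted order."""
    words = sorted({word for review in corpus for word in review.split()})
    return {word: index for index, word in enumerate(words)}


def one_hot_rows(corpus, vocabulary):
    """Return one row per review with 1 where the word occurs, else 0."""
    rows = []
    for review in corpus:
        row = [0] * len(vocabulary)
        for word in review.split():
            if word in vocabulary:  # words outside the vocabulary are ignored
                row[vocabulary[word]] = 1
        rows.append(row)
    return rows


toy_corpus = ["good movie", "bad movie", "good good acting"]
vocabulary = build_vocabulary(toy_corpus)
matrix = one_hot_rows(toy_corpus, vocabulary)
print(vocabulary)
print(matrix)
```

Note that the repeated word in the third review still produces a single 1, which is the effect of `binary=True` above.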

A common next step in text preprocessing is to normalize the words in your
corpus by trying to convert all of the different forms of a given word into one. Two
methods that exist for this are Stemming and Lemmatization.

Stemming is considered to be the more crude/brute-force approach to
normalization. There are several algorithms, but in general they all use basic
rules to chop off the ends of words.

def get_stemmed_text(corpus):
    from nltk.stem.porter import PorterStemmer
    stemmer = PorterStemmer()
    # stem each whitespace-separated word, then rejoin the review
    return [' '.join([stemmer.stem(word) for word in review.split()]) for review in corpus]

stemmed_reviews = get_stemmed_text(reviews_train_clean)
Lemmatization works by identifying the part-of-speech of a given word and
then applying more complex rules to transform the word into its true root.

def get_lemmatized_text(corpus):
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    # lemmatize each whitespace-separated word, then rejoin the review
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]

lemmatized_reviews = get_lemmatized_text(reviews_train_clean)


We’ve chosen to represent each review as a very sparse vector with a slot for
every unique n-gram in the corpus. Linear classifiers typically perform better than
other algorithms on data that is represented in this way.
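
For readers unfamiliar with the term, an n-gram is a run of n consecutive words. The sketch below lists the unigrams and bigrams of a short phrase; the function `word_ngrams` is a hypothetical helper written for illustration and only mirrors the idea behind the `ngram_range` setting used later.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """List all word n-grams of text for every n in the inclusive range."""
    tokens = text.split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


features = word_ngrams("not a good movie", ngram_range=(1, 2))
print(features)
```

Bigrams such as "not a" and "a good" keep some word order that single words lose, at the cost of a much larger and sparser feature space.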

Support Vector Machines (SVM)

Linear classifiers tend to work well on very sparse datasets. Another
algorithm that can produce great results with a quick training time is the
Support Vector Machine with a linear kernel. In machine learning, support-vector
machines are supervised learning models with associated learning algorithms that
analyze data used for classification and regression analysis. Given a set of training
examples, each marked as belonging to one or the other of two categories, an SVM
training algorithm builds a model that assigns new examples to one category or the
other, making it a non-probabilistic binary linear classifier.
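
As a rough illustration of what "linear classifier" means here, the sketch below scores a feature vector with a weight vector and a bias, assigns the class by the sign of the score, and computes the hinge loss, the standard soft-margin SVM penalty. The weights are made-up numbers, not values learned from the reviews, and the function names are hypothetical.

```python
def decision_value(weights, bias, features):
    """Signed score w.x + b; its sign decides the predicted class."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict(weights, bias, features):
    """Return 1 (positive) if the score is non-negative, else 0 (negative)."""
    return 1 if decision_value(weights, bias, features) >= 0 else 0


def hinge_loss(weights, bias, features, label):
    """Hinge loss for a label in {0, 1}, internally mapped to {-1, +1}."""
    y = 1 if label == 1 else -1
    return max(0.0, 1.0 - y * decision_value(weights, bias, features))


weights = [2.0, -1.5, 0.0]  # made-up weights for three features
bias = -0.25
print(predict(weights, bias, [1, 0, 1]))
print(hinge_loss(weights, bias, [0, 1, 0], 1))
```

A correctly classified example that lies outside the margin incurs zero loss, while a misclassified one is penalized in proportion to how far it sits on the wrong side.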

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# unigrams and bigrams, recorded as present/absent
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(reviews_train_clean)
X = ngram_vectorizer.transform(reviews_train_clean)
X_test = ngram_vectorizer.transform(reviews_test_clean)

# target holds the labels: 1 for the first 12,500 reviews, 0 for the rest
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size=0.75)

# try several values of C on the validation split
for c in [0.001, 0.005, 0.01, 0.05, 0.1]:
    svm = LinearSVC(C=c)
    svm.fit(X_train, y_train)
    print("Accuracy for C=%s: %s" % (c, accuracy_score(y_val, svm.predict(X_val))))

# retrain on the full training set with the chosen C, then score the test set
final_svm_ngram = LinearSVC(C=0.01)
final_svm_ngram.fit(X, target)
print("Final Accuracy: %s" % accuracy_score(target, final_svm_ngram.predict(X_test)))

Final Model
We found that removing a small set of stop words along with an n-gram range
from 1 to 3 and a linear support vector classifier gave us the best results.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

stop_words = ['in', 'of', 'at', 'a', 'the']

# unigrams to trigrams, with the small stop-word list removed
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3), stop_words=stop_words)
ngram_vectorizer.fit(reviews_train_clean)
X = ngram_vectorizer.transform(reviews_train_clean)
X_test = ngram_vectorizer.transform(reviews_test_clean)

X_train, X_val, y_train, y_val = train_test_split(X, target, train_size=0.75)

for c in [0.001, 0.005, 0.01, 0.05, 0.1]:
    svm = LinearSVC(C=c)
    svm.fit(X_train, y_train)
    print("Accuracy for C=%s: %s" % (c, accuracy_score(y_val, svm.predict(X_val))))

final = LinearSVC(C=0.01)
final.fit(X, target)
print("Final Accuracy: %s" % accuracy_score(target, final.predict(X_test)))

Now that we have transformed our dataset into a format suitable for modeling,
we can start building a classifier. Logistic Regression is a good baseline model
for us to use for several reasons:
(1) linear models are easy to interpret
(2) they tend to perform well on sparse datasets like this one
(3) they learn very fast compared to other algorithms.
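
The interpretability point can be illustrated with a minimal sketch of how a logistic regression model turns a review's feature vector into a probability: a weighted sum passed through the sigmoid function. The weights below are made-up numbers for three imaginary words, not coefficients learned from the reviews, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def positive_probability(weights, bias, features):
    """Probability of the positive class for one 0/1 feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


weights = [1.2, -2.0, 0.3]  # made-up: one weight per vocabulary word
bias = 0.0
p_first = positive_probability(weights, bias, [1, 0, 0])
p_second = positive_probability(weights, bias, [0, 1, 0])
print(p_first, p_second)
```

A word with a positive weight pushes the probability above 0.5 and a word with a negative weight pushes it below, which is why the learned coefficients can be read directly as word-level sentiment.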

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# label the first 12,500 reviews 1 and the remaining 12,500 reviews 0
target = [1 if i < 12500 else 0 for i in range(25000)]

X_train, X_val, y_train, y_val = train_test_split(X, target, train_size=0.75)

# try several values of C on the validation split
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print("Accuracy for C=%s: %s" % (c, accuracy_score(y_val, lr.predict(X_val))))


Now that we have found the optimal value for C, we should train a model using
the entire training set and evaluate our accuracy on the 25k test reviews.

final_model = LogisticRegression(C=0.05)
final_model.fit(X, target)
print("Final Accuracy: %s" % accuracy_score(target, final_model.predict(X_test)))

We’ve gone over several options for transforming text that can improve the
accuracy of an NLP model. Which combination of these techniques yields the
best results will depend on the task, data representation, and algorithms you choose.
It’s always a good idea to try out many different combinations to see what works.
Our model was tested against Multinomial Naïve Bayes, Support Vector Machine,
and Random Forest models and obtained better results than those models.
When trying to find the right model and data transformation for this project, we
found that removing a small set of stop words along with an n-gram range from 1 to
3 and a linear support vector classifier gave us the best results.

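Accuracy is the fraction of instances that are classified correctly over the
total number of instances.
Accuracy can be defined as:
Accuracy = (tp + tn) / (tp + tn + fp + fn)
where tp, tn, fp, and fn are the numbers of true positives, true negatives,
false positives, and false negatives.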
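
As a worked example of the accuracy formula, the sketch below counts the four outcome types for a small set of made-up labels and predictions and then applies the definition. The helper names and the sample data are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # made-up labels
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # made-up predictions
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn, accuracy(tp, tn, fp, fn))
```

Here six of the eight predictions are correct, so the accuracy is 0.75.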

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

stop_words = ['in', 'of', 'at', 'a', 'the']
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3), stop_words=stop_words)
ngram_vectorizer.fit(reviews_train_clean)
X = ngram_vectorizer.transform(reviews_train_clean)
X_test = ngram_vectorizer.transform(reviews_test_clean)

X_train, X_val, y_train, y_val = train_test_split(X, target, train_size=0.75)

for c in [0.001, 0.005, 0.01, 0.05, 0.1]:
    svm = LinearSVC(C=c)
    svm.fit(X_train, y_train)
    print("Accuracy for C=%s: %s" % (c, accuracy_score(y_val, svm.predict(X_val))))

# Accuracy for C=0.001: 0.88784
# Accuracy for C=0.005: 0.89456
# Accuracy for C=0.01: 0.89376
# Accuracy for C=0.05: 0.89264
# Accuracy for C=0.1: 0.8928

final = LinearSVC(C=0.01)
final.fit(X, target)
print("Final Accuracy: %s" % accuracy_score(target, final.predict(X_test)))
# Final Accuracy: 0.90064

Precision is the fraction of retrieved instances that are relevant.
Precision can be defined as:
Precision = tp / (tp + fp)


final_model = LogisticRegression(C=0.05)
final_model.fit(X, target)
print("Final Accuracy: %s" % accuracy_score(target, final_model.predict(X_test)))
# Final Accuracy: 0.88128

Recall is the fraction of relevant instances that have been retrieved over the
total amount of relevant instances.

Recall can be defined as:
Recall = tp / (tp + fn)
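
Using the standard definitions, precision = tp / (tp + fp) and recall = tp / (tp + fn), both metrics can be computed from the outcome counts as in the sketch below. The counts are made-up numbers, and returning 0.0 when a denominator is zero is a convention assumed here, not something the report specifies.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention for 0/0


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0  # assumed convention for 0/0


tp, fp, fn = 30, 10, 20  # made-up counts
p = precision(tp, fp)
r = recall(tp, fn)
print(p, r)
```

With these counts the model is right about three quarters of the reviews it calls positive, but it finds only three fifths of the reviews that really are positive.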


Discussion and Conclusion

According to the above results, it has become clear that the model implemented
with the support of Stanford Core NLP has been much more promising than the
Vowpal Wabbit based approach. With the Stanford Core NLP based method, beyond
the normal classification process, some important data preprocessing steps were
carried out on the movie review data using tools available within the Stanford
Core NLP library, such as the Stanford Tokenizer and Stanford Lemmatizer, in
order to further improve performance. Moreover, the Stanford Sentiment Analyzer
module significantly supported the model-building process of this approach,
which considerably increased the accuracy of the resulting model. The discussion
therefore makes clear that proper preprocessing of data based on natural
language processing approaches, together with incorporating existing models from
the domain of sentiment analysis and an appropriate classification process, can
improve the performance of a model for multiclass classification of movie
reviews.

To conclude, we can say that our model performed well in classification, while
Multinomial Naive Bayes and Random Forest stand next in performance and SVM
performed poorly.
