Group 4 MovieReview
Course code: 15CS84
TCS ION
Submitted By
Rachana R 1SG15CS075
Ramya R 1SG15CS078
Rashmi 1SG15CS081
Rashmi RV 1SG15CS082
Sushma D 1SG15CS113
DATA
The dataset consists of 50,000 movie reviews from IMDb. The data is
split evenly, with 25k reviews intended for training and 25k for testing the
classifier. Moreover, each set has 12.5k positive and 12.5k negative reviews. IMDb
lets users rate movies on a scale from 1 to 10. To label these reviews, the curator of
the dataset marked anything with ≤ 4 stars as negative and anything with ≥ 7 stars as
positive. Reviews with 5 or 6 stars were left out.
Training data
Testing data
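To make this concrete, one way to load the reviews is sketched below. This is a minimal, illustrative version that assumes the standard aclImdb directory layout (train/pos, train/neg, test/pos, test/neg) from the public IMDb release; the exact loading code we used is not shown here.

import os

def load_reviews(split_dir):
    # Read every review file from the pos/ and neg/ subfolders;
    # label 1 = positive, 0 = negative.
    reviews, labels = [], []
    for label_name, label in (("pos", 1), ("neg", 0)):
        folder = os.path.join(split_dir, label_name)
        for fname in os.listdir(folder):
            with open(os.path.join(folder, fname), encoding="utf-8") as f:
                reviews.append(f.read())
            labels.append(label)
    return reviews, labels

reviews_train, target = load_reviews("aclImdb/train")
reviews_test, target_test = load_reviews("aclImdb/test")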
TEXT PREPROCESSING
For our first iteration we did very basic text processing: removing
punctuation and HTML tags and making everything lower-case. We can clean
things up further by removing stop words and normalizing the text.
To make these transformations we use libraries from the Natural Language
Toolkit (NLTK), a very popular NLP library for Python; a sketch of the cleaning
step follows. Preprocessing also involves converting the cleaned text into
numerical values so that the machine can work with them. This step is called
vectorization.
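A minimal sketch of the cleaning step described above (the regular expressions here are illustrative, not necessarily the exact ones we used):

import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

english_stop_words = set(stopwords.words('english'))

def clean_review(text):
    text = text.lower()                    # make everything lower-case
    text = re.sub(r'<[^>]+>', ' ', text)   # remove HTML tags such as <br />
    text = re.sub(r'[^a-z\s]', ' ', text)  # remove punctuation and digits
    return ' '.join(w for w in text.split() if w not in english_stop_words)

reviews_train_clean = [clean_review(r) for r in reviews_train]
reviews_test_clean = [clean_review(r) for r in reviews_test]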
VECTORIZATION
In order for this data to make sense to our machine learning algorithm, we
need to convert each review to a numeric representation, which is called
vectorization. The simplest form of this is to create one very large matrix with
one column for every unique word in the corpus (where the corpus is all 50k reviews in
our case). We then transform each review into one row containing 0s and 1s, where
1 means that the word in the corpus corresponding to that column appears in
that review. As a result, each row of the matrix will be very sparse (mostly
zeros). This process is also known as one-hot encoding.
from sklearn.feature_extraction.text import CountVectorizer

# binary=True records only presence/absence (0/1) of each word,
# not its count, matching the 0/1 matrix described above
cv = CountVectorizer(binary=True)
cv.fit(reviews_train_clean)  # learn the vocabulary from the training reviews
X = cv.transform(reviews_train_clean)
X_test = cv.transform(reviews_test_clean)
NORMALIZATION
A common next step in text preprocessing is to normalize the words in your
corpus by trying to convert all of the different forms of a given word into one. Two
common methods for this are stemming and lemmatization.
STEMMING
Stemming is considered to be the more crude/brute-force approach to
normalization. There are several algorithms, but in general they all use basic rules to
chop off the ends of words.
from nltk.stem.porter import PorterStemmer

def get_stemmed_text(corpus):
    # Replace every word with its Porter stem, review by review
    stemmer = PorterStemmer()
    return [' '.join([stemmer.stem(word) for word in review.split()]) for review in corpus]
stemmed_reviews = get_stemmed_text(reviews_train_clean)
LEMMATIZATION
Lemmatization works by using the part-of-speech of a given word and
applying more complex rules to transform the word into its dictionary form (lemma).
Note that NLTK's WordNetLemmatizer assumes each word is a noun unless a
part-of-speech tag is passed in.
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

def get_lemmatized_text(corpus):
    # Replace every word with its WordNet lemma, review by review
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]
lemmatized_reviews = get_lemmatized_text(reviews_train_clean)
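To see the difference between the two methods on a concrete word (outputs noted in comments):

from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

print(PorterStemmer().stem("movies"))           # movi  (a crude chopped stem)
print(WordNetLemmatizer().lemmatize("movies"))  # movie (a valid dictionary form)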
ALGORITHMS
We’ve chosen to represent each review as a very sparse vector with a slot for
every unique n-gram in the corpus. Linear classifiers typically perform better than
other algorithms on data that is represented in this way.
Algorithm:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

final_svm_ngram = LinearSVC(C=0.01)
final_svm_ngram.fit(X, target)
# the test reviews are arranged the same way as the training labels,
# so the same `target` list is reused for scoring
print("Final Accuracy: %s" % accuracy_score(target, final_svm_ngram.predict(X_test)))
Final Model
We found that removing a small set of stop words along with an n-gram range
from 1 to 3 and a linear support vector classifier gave us the best results.
final = LinearSVC(C=0.01)
final.fit(X, target)
print("Final Accuracy: %s" % accuracy_score(target, final.predict(X_test)))
BUILD CLASSIFIER
Now that we have transformed our dataset into a format suitable for modeling, we can
start building a classifier. Logistic regression is a good baseline model for us to use
for several reasons:
(1) it is easy to interpret;
(2) linear models tend to perform well on sparse datasets like this one;
(3) it learns very fast compared to other algorithms.
from sklearn.linear_model import LogisticRegression

# try a few regularization strengths (the C values here are illustrative)
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print("Accuracy for C=%s: %s"
          % (c, accuracy_score(y_val, lr.predict(X_val))))
Summary
We’ve gone over several options for transforming text that can improve the
accuracy of an NLP model. Which combination of these techniques yields the
best results depends on the task, the data representation, and the algorithms you
choose, so it’s always a good idea to try out many different combinations and see
what works. Our model was tested against Multinomial Naïve Bayes, support
vector machine, and random forest models and obtained better results than those
models. When trying to find the right model and data transformation for this
project, we found that removing a small set of stop words along with an n-gram
range from 1 to 3 and a linear support vector classifier gave us the best results.
Accuracy
Accuracy is the fraction of instances that are classified correctly out of the total
number of instances.
Accuracy can be defined as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
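For example, if a classifier produces 40 true positives, 45 true negatives, 10 false positives, and 5 false negatives, its accuracy is (40 + 45) / 100 = 0.85.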
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# Remove a small set of stop words and use unigrams through trigrams
stop_words = ['in', 'of', 'at', 'a', 'the']
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3),
                                   stop_words=stop_words)
ngram_vectorizer.fit(reviews_train_clean)
X = ngram_vectorizer.transform(reviews_train_clean)
X_test = ngram_vectorizer.transform(reviews_test_clean)

# Hold out 25% of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size=0.75
)

# Tune the regularization strength C on the validation split
for c in [0.001, 0.005, 0.01, 0.05, 0.1]:
    svm = LinearSVC(C=c)
    svm.fit(X_train, y_train)
    print("Accuracy for C=%s: %s"
          % (c, accuracy_score(y_val, svm.predict(X_val))))

# Retrain on the full training set with the best C and score on the test set
final = LinearSVC(C=0.01)
final.fit(X, target)
print("Final Accuracy: %s"
      % accuracy_score(target, final.predict(X_test)))
# Final Accuracy: 0.90064
Precision
Precision is the fraction of retrieved instances that are relevant, i.e. the fraction of
predicted positives that are actually positive. Precision can be defined as:
Precision = TP / (TP + FP)
from sklearn.linear_model import LogisticRegression

final_model = LogisticRegression(C=0.05)
final_model.fit(X, target)
print("Final Accuracy: %s"
      % accuracy_score(target, final_model.predict(X_test)))
# Final Accuracy: 0.88128
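The snippet above reports accuracy again; to compute precision itself, one option is scikit-learn's precision_score. This is a sketch that assumes the fitted model and the validation split from the earlier listing:

from sklearn.metrics import precision_score

# Precision = TP / (TP + FP), computed over the positive class
print("Precision: %s" % precision_score(y_val, final_model.predict(X_val)))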
Recall
Recall is the fraction of relevant instances that are retrieved out of the total number
of relevant instances, i.e. the fraction of actual positives that are correctly identified.
Recall can be defined as:
Recall = TP / (TP + FN)
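As with precision, recall can be computed with scikit-learn; this sketch makes the same assumptions as the precision example above:

from sklearn.metrics import recall_score

# Recall = TP / (TP + FN), computed over the positive class
print("Recall: %s" % recall_score(y_val, final_model.predict(X_val)))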