
2019 International Conference on Information Technology (ICIT)

Classification of Airline Tweet using Naïve-Bayes classifier for Sentiment Analysis


Nand Kishore Sharma
Ph.D. Scholar, Amity School of Engineering & Technology
Amity University Chhattisgarh
Raipur, India
[email protected]

Prof. (Dr.) Surendra Rahamatkar
Amity School of Engineering & Technology
Amity University Chhattisgarh
Raipur, India
[email protected]

Sachin Sharma
Department of Information Technology
Chameli Devi Group of Institutions, Indore (M.P.) India
[email protected]

Abstract—A wide range of customers, such as families, business travellers, sportspersons and young people, travel by air, so the feedback of these customers matters a great deal. Direct customer feedback may be positive or negative, and analyzing the tweets is important for improving the service. Analyzing each tweet individually becomes critical when the volume is high. Tweets are often ambiguous, which depends on the nature of the customer: a positive person tends to write positive tweets, whereas negative tweets tend to come from a negative person. Ultimately, our work finds the sentiment of descriptive tweets from their words and expressions and reports in quantitative form whether the customers are happy or not. The novel factor of this work is that it examines ambiguous tweets and neutralizes them according to the proposed algorithm.
The complete work is based on a Twitter dataset of US airlines, on which different levels of mining and processing are performed to obtain the most accurate results. An improved sentiment analysis model based on the Naïve Bayes classifier is proposed to classify tweets by sentiment and to resolve ambiguous tweets as positive or negative.
Keywords—Sentiment analysis; Ambiguous words; Airline dataset; Naïve-Bayes classifier

I. INTRODUCTION
Sentiment analysis is an innovative technique for investigating the thoughts and opinions of any kind of user. It can be defined as: “Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes.”
Twitter data and other social networking sites reveal the behavior and opinions of users. They help to track users' interests and how these relate to their viewpoints. Consider a feedback system, in which the opinions and viewpoints of users can be observed from their feedback. A feedback system based on marks allows relationships to be calculated across various points, but descriptive feedback can also play an important role in understanding opinions. Descriptive or subjective feedback helps to understand demand and need, because it lets users express their thoughts and views in their own words. Thus, it may be considered a great source for analysis.
This work examines different existing solutions and their data classification algorithms, which are the major approaches for extracting features and deriving the sentiments and opinions of different users. Here, the Naïve Bayes classifier is used to classify user opinions (tweets).
A sentiment-based approach is presented in this paper for calculating the polarity of ambiguous data. Sentiment analysis, or opinion mining, is an important task of Natural Language Processing (NLP) and has gained wide attention in recent years. This paper tackles the sentiment polarity categorization problem, which is the fundamental issue of sentiment analysis, and proposes an experimental approach to it.
Contextual ambiguity complicates polarity calculation in sentiment analysis. Opinion keywords take different polarities in different contexts, which is a big challenge for researchers in this area. Term-level features alone do not resolve this polarity issue effectively.
Earlier work on opinion mining progressed across different scenarios by developing document-level analysis [1][2][3]. Many issues in sentiment analysis arise in subjectivity detection and sentiment classification.
In general, sentiment analysis focuses on determining the writer’s attitude with respect to the complete context of a document by calculating polarity. This attitude reflects the judgment or emotional state of the speaker or writer at the time of writing. An important task of sentiment analysis is classifying the polarity of text at the document level and at the feature level, i.e., whether the expressed opinion is positive, negative or neutral. Sentiment classification by emotional state can use classes such as sad, happy or angry.

II. LITERATURE SURVEY
A. Existing Work
Trupthi et al. [1] described sentiment analysis and opinion mining, which explore users’ sentiments and thinking. The authors use a Naïve Bayes classifier for examination of real Twitter data and for the

978-1-7281-6052-8/19/$31.00 ©2019 IEEE 70


DOI 10.1109/ICIT48102.2019.00019
Authorized licensed use limited to: Somaiya University. Downloaded on December 28,2023 at 13:29:24 UTC from IEEE Xplore. Restrictions apply.
classification of data and feature extraction, a unigram approach is used.
• Opinion mining and sentiment analysis are used to study user thinking and eventual sentiments.
• Twitter data is considered the main source of real data, with unigrams used for data classification and feature extraction together with Naïve Bayes.
• The sentiment of a word and its polarity are calculated by considering positive tweets, positive words, the degree of positive words and complete tweets.
B. Related Work
Cambria, Olsher, Schuller et al. [2][3][4] report effective progress in opinion mining over earlier approaches. Many different scenarios have been developed in recent years to achieve document-level sentiment analysis. Sentiment analysis still faces many issues in subjectivity detection and classification. Turney [5] worked on the bag-of-words method, in which relationships among the words in a sentence are not considered. A sentence is treated as a collection of words: the sentiment of every single word is determined, and the values are aggregated with some function to obtain the sentiment of the complete sentence.
Divya Sehgal et al. [6] researched data analytics for Twitter data, in which sentiment analysis is used broadly. The authors cover useful aspects such as the large amount of Twitter data used for business and social purposes and the data processing it requires. This data is very large and grows every second, so it is handled and analyzed using Hadoop. Data efficiency and improvements in data analysis depend on the data values. The project is expanded on the basis of tweets, comments and blogs from social media, and its accuracy depends entirely on these values.
M. Mazhar Rathore et al. [7] discuss geosocial networks in the context of government responsibility for safety from disasters, which includes proper management facilities and reducing the risk of the spread of infection. Such systems can provide citizens with recommendations, for example through recommender and healthcare systems, and new products can be launched in different fields by monitoring the geosocial data of a specific location. Exploiting the significant volume of data generated by geosocial networks benefits the analysis, and high computing capabilities and advanced technology make better analysis possible. For this reason, the authors propose a system for better planning, proper management and safety from disasters. The system handles high-speed data from geosocial networks and also processes and analyzes it to make decisions. The authors used Twitter data and processed it on Hadoop using Spark.
Amit Gupte et al. [8] compare the Random Forest, Maximum Entropy, Boosted Tree and Naïve Bayes approaches. The Random Forest classifier shows greater performance and accuracy in sentiment classification. It is easy to understand and improves over time. Decision trees provide a high rate of accuracy improvement, but require more processing power and training time. If only accuracy is considered, the Random Forest classifier takes priority. When processing power and memory are limited, the Naïve Bayes classifier is used. The Maximum Entropy classifier is used when a large processing time and large memory are available and a small training time is needed. The authors conclude that SVM is better because of its higher accuracy.
Suchita V Wawre et al. [9] compare the SVM and Naïve Bayes classifiers, both of which are supervised machine-learning algorithms for sentiment analysis. The comparison between them is not meaningful when only a few tweets are available. Alessia D’Andrea et al. [10] introduce different tools and approaches for sentiment classification. The hybrid approach, which combines machine learning with lexicon-based methods, is used for higher classification performance, and machine learning capabilities form a fundamental part of the design. Hailong Zhang et al. [11] compare the machine-learning approach with the lexicon-based approach for sentiment analysis, and also cover cross-domain and cross-lingual methods. Both approaches provide different benefits and are competitive in their own way. Machine-learning techniques provide high accuracy, whereas the lexicon-based approach involves manual document labeling. Turney et al. [12] explain word polarity disambiguation for ambiguous sentiment words in real-world applications. The authors designed an algorithm based on PMI (pointwise mutual information) that calculates the sentiment of sentiment words.

III. PROBLEM DOMAIN
Citizens' behavior and opinions are an important part of the field of sentiment analysis. They help to establish a strong association between user desire and viewpoint, and are a good way to explore even minute scope for improvement. Tweets help in knowing the view of users on any incident or event. Strong sentiment analysis can lead to improvement in current trend scenarios. For such reasons, sentiment analysis of user viewpoints becomes essential.

Although existing systems provide a good way to explore user sentiments, they still suffer from certain limitations:
• The existing solution treats words individually and tokenizes each sentence into words.
• The authors state, “Future enhancement to this work might be to use n-gram classification rather than limiting to uni-gram” to overcome this issue.
• The existing solution uses a unigram Naïve Bayes classifier, which considers individual word probabilities for both training and testing.
• Sentiment analysis of the whole sentence can come closer to the user's viewpoint.
• At classification time, the existing solution considers individual words rather than the actual sentence. The authors also suggest using an n-gram Naïve Bayes classifier rather than a unigram one.
• The authors do not consider ambiguity a serious problem and concentrate only on improving the performance of sentiment analysis.
• The existing solution may fail on taunting tweets or indirect comments, which use positive words with a negative intent.
Using different algorithms, phrases can be annotated as positive or negative based on their frequency of occurrence, from which a weight can be calculated for each phrase. Tweets are online social media data and contain a variety of flaws that can hinder sentiment analysis.
One limitation is the quality of opinions: because anyone can post freely, online spammers produce meaningless content. Another is truth: only opinions are taken as truth, and these can be positive, negative or neutral. Sentiment polarity is classified at the sentence level and the tweet level, where the sentence level defines sentiment through positive and negative terms. Sentence-level analysis requires ground-truth facts to interpret the sentence, and obtaining this ground truth is the significant difficulty.

IV. PROPOSED METHODOLOGY
The proposed work consists of several modules, and the flow of the work is explained through the proposed architecture. The steps involved in the proposed methodology are as follows.
Functionality and Design:
• Data Acquisition
• Human labelling
• Feature Extraction
• Classification
Step 1: Data Collection and Preparation:
Data can be collected either directly from users or from an existing system. In the proposed work, we have used the Airline Tweet Dataset from the Kaggle data repository. A link to the detailed description is given below:
https://www.kaggle.com/crowdflower/twitter-airline-sentiment#Tweets.csv
Step 2: Data Pre-processing:
a. Data Cleaning: Data cleaning removes redundant data, irrelevant tweets, image short paths and unwanted links.
b. Lemmatization: The Stanford lemmatizer library is used. Lemmatization uses the vocabulary and morphological analysis of words to strip inflected forms and reduce each word to its dictionary form.
c. Tokenization: Each tweet is split into individual words, which are represented as tokens.
Step 3: Proposed Estimation:
a. Ambiguity-based polarity estimation: The polarity of each ambiguous word is checked, i.e., whether the word is positive, negative, partially positive, partially negative or neutral.
b. Tweets are initially classified into three categories: positive, negative and ambiguous. Examples are given below:
Ambiguous tweet: “This is good airline service but comes with high tariff. They do not provide good service. Best part of this airline is they are always available.”
Positive tweet: “This one is the best airline comes with awesome services and facility. 100 % recommendation for new once”
Negative tweet: “Worst services experience. Air hostess don’t know about their responsibilities”
c. The ambiguous tweets are investigated and the sentiment of every word is observed. After observing the sentiment of an individual word, the sentiment of the previous and next words is also checked. For an ambiguous word, the polarity is fixed according to whether the previous and next words are positive or negative.
d. The complete tweets are forwarded to the Naïve Bayes classifier.
Step 4: Classifier:
a. Naïve Bayes classifier: It is suitable for large datasets. The Naïve Bayes classifier assumes that features are independent of each other given the class label. It is applied with both unigram and n-gram features, and the result is concluded from both techniques.
Step 5: Tweets Classification:
a. Sentiment calculation: The polarity of the sentiments is calculated.
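As an illustration only, the neighbour-based rule of Step 3(c) could be sketched in Python as follows. The small polarity lexicon, the function names `resolve_ambiguous` and `classify_tweet`, the order in which neighbours are consulted (previous word first, then next word) and the fallback to negative are all assumptions made for this sketch; the paper does not specify them.

```python
# Hypothetical polarity lexicon; the paper does not list the one it uses.
POLARITY = {
    "good": "positive",
    "best": "positive",
    "awesome": "positive",
    "worst": "negative",
    "delay": "negative",
    "high": "ambiguous",
    "cheap": "ambiguous",
}


def resolve_ambiguous(tokens):
    """Return one polarity label per token.

    An ambiguous word takes the polarity of its previous word if that
    word is polar, otherwise the polarity of its next word.
    """
    labels = [POLARITY.get(tok.lower(), "neutral") for tok in tokens]
    resolved = list(labels)
    for i, label in enumerate(labels):
        if label != "ambiguous":
            continue
        prev_label = labels[i - 1] if i > 0 else "neutral"
        next_label = labels[i + 1] if i + 1 < len(labels) else "neutral"
        if prev_label in ("positive", "negative"):
            resolved[i] = prev_label
        elif next_label in ("positive", "negative"):
            resolved[i] = next_label
        else:
            # Assumption: with no polar neighbour, fall back to negative.
            resolved[i] = "negative"
    return resolved


def classify_tweet(tokens):
    """Compare positive and negative word counts of a tokenized tweet."""
    labels = resolve_ambiguous(tokens)
    pos = labels.count("positive")
    neg = labels.count("negative")
    # Assumption: ties are labelled negative.
    return "positive" if pos > neg else "negative"


print(resolve_ambiguous("good cheap flight".split()))
# ['positive', 'positive', 'neutral']
print(classify_tweet("high delay".split()))
# negative
```

Here the word counts stand in for the tweet "weight"; a weighted lexicon could be substituted without changing the structure of the rule.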

Fig. 1. Proposed Architecture

The complete work is represented in pseudo code, which is shown below:
1. Consider the airline dataset as the source of input
2. Input: Set variables tweet_id, airline_sentiment, airline_sentiment_confidence, negativereason, negativereason_confidence, airline, airline_sentiment_gold, name, negativereason_gold, retweet_count, text, tweet_coord, tweet_created, tweet_location, user_timezone
3. Define Array Stopwords[] = {is, am, are, an, this}
4. Remove stop words
5. Perform lemmatization
6. Perform tokenization
7. Set variables pre_word, post_word, curr_word
8. Check if (polarity[curr_word] == ambiguous)
9. Recheck if (polarity[pre_word] == positive)
{ set: curr_word = positive }
else
{ set: curr_word = negative }
10. Replace each word by its polarity and estimate the weight of each individual tweet for ambiguous tweet classification
11. If (positive tweet weight > negative tweet weight)
{ tweet is positive } else { tweet is negative }
12. Set threshold value, apply Naïve Bayes classification and run unigram
13. Set threshold value, apply Naïve Bayes classification and run n-gram
14. Calculate the average of steps 12 and 13 and classify tweets
15. Classification of positive and negative tweets

V. EXPERIMENT ANALYSIS
The complete result examination was performed on five different data samples. The proposed solution was examined and compared with previous results to justify that it performs better than existing solutions. The sizes of the data samples are shown in Table I.

TABLE I. DATASET

Title      Size      Total Number of Tweets
Input 1    100 KB    1325
Input 2    200 KB    2859
Input 3    300 KB    4256
Input 4    500 KB    7256
Input 5    1 MB      13569

To observe the accuracy of the proposed solution, precision and recall were examined. The total, recommended and retrieved tweets were counted to estimate the precision and recall of the proposed solution. The following formulas are used to estimate precision, recall and F-score.

\[ \text{Precision} = \frac{\text{Total Recommended Tweets}}{\text{Total Number of Tweets Retrieved}} \]

\[ \text{Recall/Accuracy} = \frac{\text{Total Recommended Tweets}}{\text{Total Number of Relevant Tweets Retrieved}} \]

\[ \text{F-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

The complete performance observations are shown in Tables II, III and IV and the relevant graphs.

TABLE II. PRECISION

Title      Precision
Input 1    0.74
Input 2    0.75
Input 3    0.77
Input 4    0.84
Input 5    0.87
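The precision, recall and F-score definitions used in the experiment analysis can be checked with a few lines of Python. This is a minimal sketch: the function names and the example counts are assumptions chosen for illustration and are not results from the paper.

```python
def precision(recommended: int, retrieved: int) -> float:
    """Total recommended tweets / total number of tweets retrieved."""
    return recommended / retrieved if retrieved else 0.0


def recall(recommended: int, relevant: int) -> float:
    """Total recommended tweets / total number of relevant tweets."""
    return recommended / relevant if relevant else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * (p * r) / (p + r) if (p + r) else 0.0


# Hypothetical counts, not taken from the paper's experiments.
p = precision(recommended=80, retrieved=100)
r = recall(recommended=80, relevant=90)
f = f_score(p, r)
print(round(p, 2), round(r, 2), round(f, 2))
# 0.8 0.89 0.84
```

Because the F-score is a harmonic mean, it always lies between the precision and the recall, which gives a quick sanity check on reported values.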

Fig. 2. Precision Comparison graph

TABLE III. RECALL / ACCURACY

Title      Recall
Input 1    0.89
Input 2    0.91
Input 3    0.92
Input 4    0.93
Input 5    0.96

Fig. 3. Recall Comparison graph

TABLE IV. F-SCORE

Title      F-Score
Input 1    0.79
Input 2    0.81
Input 3    0.82
Input 4    0.88
Input 5    0.90

Fig. 4. F-Score Comparison graph

TABLE V. COMPUTATION TIME

Title      Computation Time (ms)
Input 1    1568
Input 2    2459
Input 3    3569
Input 4    4963
Input 5    8050

Fig. 5. Computation Time Comparison graph

VI. CONCLUSION

The complete work is based on the Twitter dataset for airline sentiments, which is used as the source input and is analyzed and preprocessed to find the sentiments of words. This work proposes a sentiment analysis model to observe the positive and negative viewpoints of different users. The data is pre-processed and cleaned, and the sentiments of airline tweets are then estimated using the Naïve Bayes classifier. To overcome the unigram issue of the existing work, where only single words of a tweet are considered, the proposed work applies Naïve Bayes to the overall tweet to find the number of words in the tweet and their polarity.
The following conclusions have been drawn from the complete experimental analysis.
1. Inputs of different sizes give different precision and recall, but all can be placed on a single scale and show an increasing trend.
2. The proposed algorithm performs better on high-volume tweet sets than on smaller datasets.
3. The proposed solution is able to classify ambiguous tweets and neutralize them according to their weight.
4. It applies unigram and n-gram techniques and classifies positive and negative tweets based on their average value.

5. The worst F-score observed is 0.79 and the best is 0.90, which denotes good performance on ambiguous tweets.

VII. FUTURE WORK
The following future work is envisaged for the proposed solution.
1. The proposed solution can be implemented using big data technology for large datasets.
2. The proposed algorithm can be used to classify tweets from categories other than airlines.
3. The proposed algorithm can be implemented with other classification techniques such as SVM and KNN to achieve better performance.

REFERENCES

[1] Trupti Dange, Pankaj Bhalerao, “A Novel Approach for Interpreting Public Sentiment Variations on Twitter: A Tweets”, International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, Volume 3 Issue 12, December 2014.
[2] Cambria E, Hussain A. Sentic computing: techniques, tools, and applications. Dordrecht: Springer; 2012.
[3] Cambria E, Olsher D, Rajagopal D. SenticNet 3: a common and common sense knowledge base for cognition-driven sentiment analysis. In: AAAI. Quebec City; 2014. p. 1515–21.
[4] Cambria E, Schuller B, Xia Y, Havasi C. New avenues in opinion mining and sentiment analysis. IEEE Intell Syst. 2013;28(2):15–21.
[5] P. D. Turney, “Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews,” in Proceedings of the 40th annual meeting on association for computational linguistics, pp. 417–424, Association for Computational Linguistics, 2002.
[6] Divya Sehgal and Dr. Ambuj Kumar Agarwal, “Sentiment Analysis of Big Data Applications using Twitter Data with the Help of HADOOP Framework”, 5th International Conference on System Modeling & Advancement in Research Trends, 2016, IEEE.
[7] M. Mazhar Rathore, Anand Paul, Awais Ahmad, “Big Data Analytics of Geosocial Media for Planning and Real-Time Decisions”, IEEE ICC 2017 SAC Symposium Big Data Networking Track.
[8] Amit Gupte, Sourabh Joshi, Pratik Gadgul, Akshay Kadam, “Comparative Study of Classification Algorithms used in Sentiment Analysis”, International Journal of Computer Science and Information Technologies, Vol. 5 (5), 2014, 6261-6264.
[9] Suchita V Wawre, Sachin N Deshmukh, “Sentiment Classification using Machine Learning Techniques”, International Journal of Science and Research (IJSR), Volume 5 Issue 4, April 2016.
[10] Alessia D’Andrea, Fernando Ferri, Patrizia Grifoni, Tiziana Guzzo, “Approaches, Tools and Applications for Sentiment Analysis Implementation”, International Journal of Computer Applications (0975-8887), Volume 125, No. 3, September 2015.
[11] Hailong Zhang, Wenyan Gan, Bo Jiang, “Machine Learning and Lexicon based Methods for Sentiment Classification: A Survey”, 978-1-4799-5727-9/14 $31.00 © 2014 IEEE, DOI 10.1109/WISA.2014.55.
[12] Turney PD, Littman ML. Measuring praise and criticism: inference of semantic orientation from association. ACM Trans Inf Syst. 2003;21(4):315–46.
