A Study of How Different Algorithms Perform on Sentiment Analysis with Datasets of Different Sizes

ABSTRACT

Our task

Our task is to predict the sentiment (positive or negative) of a tweet from its text alone using several different algorithms, and then to compare the results and accuracy of these algorithms. In this project, we analyzed the sentiment distribution and tried to recognize sentiment features from text, based on a tweet sentiment analysis dataset. We apply different ML algorithms and visualize their performance under different conditions.

Why the task is important

Sentiment analysis of social media is useful for shaping market strategy, improving customer service, tracking business KPIs, and so on. Related datasets are also easy to obtain from GitHub, Kaggle, or Twitter.

Learners we use

We use Logistic Regression, Support Vector Machine, Decision Tree, Nearest Neighbor, LSTM, and Naive Bayes. To convert text into high-dimensional vectors we use the bag-of-words method, so the features are the words in the word vector. To find the best representation of tweet text, we tried word vectors of different dimensions.
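
The bag-of-words conversion can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes whitespace tokenization and assumes that the vector dimension is capped by keeping only the most frequent words. The function names, the `max_features` parameter, and the example tweets are hypothetical.

```python
from collections import Counter

def build_vocabulary(texts, max_features):
    """Return the max_features most frequent words, mapped to column indices."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: index for index, (word, _) in enumerate(ranked[:max_features])}

def vectorize(text, vocabulary):
    """Convert one text into a count vector with one entry per vocabulary word."""
    vector = [0] * len(vocabulary)
    for word in text.lower().split():
        index = vocabulary.get(word)
        if index is not None:  # words outside the vocabulary are dropped
            vector[index] += 1
    return vector

# Hypothetical example tweets, for illustration only.
tweets = ["good good day", "bad day", "good movie"]
vocabulary = build_vocabulary(tweets, max_features=3)
vectors = [vectorize(tweet, vocabulary) for tweet in tweets]
```

Under these assumptions, changing `max_features` (for example from 1000 to 2000) changes the dimension of every tweet vector, which is the setting varied in our comparison.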

Key results

1. The effect of different bag-of-words dimensions: Generally, the accuracy with a 1000-dimensional bag of words is higher than with a 2000-dimensional bag of words, because the extra dimensions add noise.
2. The effect of different learners: LSTM has the highest accuracy (about 75%), while Naive Bayes has the lowest accuracy (about 66.5%).