EXPLORING RICH FEATURES FOR SENTIMENT ANALYSIS WITH VARIOUS MACHINE LEARNING MODELS

Li, Shuyang

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01jq085n41j

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Li, Xiaoyan	-
dc.contributor.author	Li, Shuyang	-
dc.date.accessioned	2016-06-24T14:40:50Z	-
dc.date.available	2016-06-24T14:40:50Z	-
dc.date.created	2016-04-12	-
dc.date.issued	2016-06-24	-
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/dsp01jq085n41j	-
dc.description.abstract	Understanding sentiment is an important task in natural language processing. In this paper we investigate the use of rich features to extend the bag-of-words model for sentiment analysis using machine learning in the movie review domain. We focus on the areas of subjectivity analysis, negation handling, and aggregate document features, and we investigate three ensemble methods and four singular classifiers. Our experimental results show that AdaBoost performs best among all classifiers on the simple unigram feature set, while the Maximum Entropy classifier provides the best performance on our enhanced feature sets. Stochastic Gradient Descent is nearly as accurate as AdaBoost and significantly faster. We also examine 128 commonly misclassified reviews and identify additional challenges to NLP in the movie review domain. We have been able to increase classifier performance through the addition of aggregate document polarity and purity features and summary sentence features based on manual subjectivity and summary sentence extraction. From this, we see potential to improve classification accuracy through improved automatic subjectivity analysis methods and summarization. Additional gains may be made by using a domain-specific polarity lexicon to generate aggregate features. We created a manually labeled set of subjective and summary sentences for each review in our corpus. This may serve as a useful benchmark dataset for future work in subjectivity analysis. Using the manually labeled corpus solely to restrict the feature space reduces classifier performance, while using it as a base to generate aggregate features improves accuracy. We also see that using manual subjectivity analysis for both feature restriction and aggregate feature generation further improves classification performance. This suggests that subjectivity analysis is useful for generating rich features as well as for feature space restriction.	en_US
dc.format.extent	86 pages	*
dc.language.iso	en_US	en_US
dc.title	EXPLORING RICH FEATURES FOR SENTIMENT ANALYSIS WITH VARIOUS MACHINE LEARNING MODELS	en_US
dc.type	Princeton University Senior Theses	-
pu.date.classyear	2016	en_US
pu.department	Operations Research and Financial Engineering	en_US
pu.pdf.coverpage	SeniorThesisCoverPage	-
Appears in Collections:	Operations Research and Financial Engineering, 2000-2020

Files in This Item:

File	Size	Format
Li_Shuyang_final_thesis.pdf	2.37 MB	Adobe PDF
