Classifying News, Satire, and "Fake News": An SVM and Deep Learning Approach

Hare, Adam

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp012n49t445m

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Kornhauser, Alain	-
dc.contributor.author	Hare, Adam	-
dc.date.accessioned	2018-08-17T19:33:35Z	-
dc.date.available	2018-08-17T19:33:35Z	-
dc.date.created	2018-04-16	-
dc.date.issued	2018-08-17	-
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/dsp012n49t445m	-
dc.description.abstract	The problem of false news articles has recently surged to the front of political discussion, particularly in the United States. Misinformation comes from a wide range of sources, and major social media companies such as Facebook and Google have taken steps towards reducing the spread of so-called “fake news.” By their nature, many such deceptive news articles are difficult for humans to identify, as there may be conflicting reports, different interpretations of information, or a widespread distortion of the facts as the story circulates. While the notion of objective truth is one best left to the philosophers, it may be possible to make inroads by studying a related category of articles: satire. Satirical articles often arise as part of the discussion when they are taken as fact by their readers and shared in a way that confuses a large part of the public. Many guides to “fake news” contain mostly websites that claim to be satirical. This is often not easy to verify, as the satirical disclaimer may be hidden deep in the website's description. The ability to reliably and automatically separate this subset of articles from the main body of news would be a useful tool to warn readers not to accept the article as fact. As standards and norms for politics and society change, so too must standards for news and satire. For this reason, this thesis considers a number of subsets of the data, based on date of publication. By considering both the corpus as a whole and the subsets individually, the goals of this paper are to A) develop an effective machine learning approach to identifying satire and B) see if a changing political and social climate has affected news and satire in a way discernible to a machine learning algorithm.
For the purposes of labeling the data, articles coming from sites explicitly claiming to be satirical are labeled as “satire,” and the rest “serious.” The problem of separating satire from serious news is analogous to separating valid email from spam. For this reason, this paper uses many of the most common techniques from the field of spam filtering, such as a Support Vector Machine with a linear kernel. The SVM uses a number of features established in other works, chiefly a bag of words, and two new features based on links to Twitter and other websites. This thesis also implements a deep learning approach with a C-LSTM. The SVM with all features consistently achieved over 99% accuracy, 95% precision, and 96% recall when comparing satirical articles to serious ones. The C-LSTM achieved just under 99% accuracy with about 90% precision and recall. It was found that the “fake news” category is easier to separate from serious news but may share similarities with satire. Lastly, this thesis found that the date of publication is a significant factor in identifying satirical articles and that serious news from the past two years may be more similar to older satirical articles than previously.	en_US
dc.format.mimetype	application/pdf	-
dc.language.iso	en	en_US
dc.title	Classifying News, Satire, and "Fake News": An SVM and Deep Learning Approach	en_US
dc.type	Princeton University Senior Theses	-
pu.date.classyear	2018	en_US
pu.department	Operations Research and Financial Engineering	en_US
pu.pdf.coverpage	SeniorThesisCoverPage	-
pu.contributor.authorid	960963873	-
pu.certificate	Applications of Computing Program	en_US
Appears in Collections:	Operations Research and Financial Engineering, 2000-2020
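
The abstract describes a linear-kernel SVM over bag-of-words features, evaluated by accuracy, precision, and recall. The sketch below illustrates that general setup only; it is not the thesis's implementation. The toy headlines are invented, the pure-Python subgradient trainer is a stand-in for a real SVM library, the hyperparameters (`epochs`, `lr`, `lam`) are arbitrary assumptions, and the link-based features and the C-LSTM are omitted.

```python
import re
from collections import Counter

# Invented toy headlines, NOT data from the thesis.
# Label +1 = satire, -1 = serious.
TOY_DOCS = [
    ("area man wins argument with toaster", 1),
    ("nation shocked local man reads entire article", 1),
    ("area cat elected mayor of small town", 1),
    ("senate passes budget bill after long debate", -1),
    ("city council approves new transit budget", -1),
    ("court rules on election funding case", -1),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocab(texts):
    """Map every distinct token in the texts to a feature index."""
    words = sorted({w for t in texts for w in tokenize(t)})
    return {w: i for i, w in enumerate(words)}


def bag_of_words(text, vocab):
    """Return a word-count vector; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(tokenize(text)).items():
        if word in vocab:
            vec[vocab[word]] = float(count)
    return vec


def decision(x, w, b):
    """Linear decision function w.x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, epochs=100, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss (assumed settings)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * decision(xi, w, b)
            # Regularization step: shrink the weights slightly.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Hinge-loss step: only examples inside the margin push the weights.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(x, w, b):
    """Classify as satire (+1) or serious (-1)."""
    return 1 if decision(x, w, b) >= 0 else -1


def scores(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


texts = [t for t, _ in TOY_DOCS]
labels = [label for _, label in TOY_DOCS]
vocab = build_vocab(texts)
X = [bag_of_words(t, vocab) for t in texts]
w, b = train_linear_svm(X, labels)
train_pred = [predict(x, w, b) for x in X]
print(scores(labels, train_pred))
```

Precision and recall are reported alongside accuracy because satire is presumably the minority class in such a corpus: a classifier can score high accuracy by mostly predicting "serious," while precision and recall on the satire class expose how well the minority class is actually handled. The scores printed here are on the training examples themselves and say nothing about generalization.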

Files in This Item:

File	Description	Size	Format
HARE-ADAM-THESIS.pdf		3.37 MB	Adobe PDF	Request a copy
