Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp012n49t445m
Title: | Classifying News, Satire, and "Fake News": An SVM and Deep Learning Approach |
Authors: | Hare, Adam |
Advisors: | Kornhauser, Alain |
Department: | Operations Research and Financial Engineering |
Certificate Program: | Applications of Computing Program |
Class Year: | 2018 |
Abstract: | The problem of false news articles has recently surged to the front of political discussion, particularly in the United States. Misinformation comes from a wide range of sources, and major social media companies such as Facebook and Google have taken steps towards reducing the spread of so-called “fake news.” By their nature, many such deceptive news articles are difficult for humans to identify, as there may be conflicting reports, different interpretations of information, or a widespread distortion of the facts as the story circulates. While the notion of objective truth is one best left to the philosophers, it may be possible to make inroads by studying a related category of articles: satire. Satirical articles often arise as part of the discussion when they are taken as fact by their readers and shared in a way that confuses a large part of the public. Many guides to “fake news” contain mostly websites that claim to be satirical. This is often not easy to verify, as the satirical disclaimer may be hidden deep in the website's description. The ability to reliably and automatically separate this subset of articles from the main body of news would be a useful tool to warn readers not to accept the article as fact. As standards and norms for politics and society change, so too must standards for news and satire. For this reason, this thesis considers a number of subsets of the data, based on date of publication. By considering both the corpus as a whole and the subsets individually, the goals of this paper are to A) develop an effective machine learning approach to identifying satire and B) see if a changing political and social climate has affected news and satire in a way discernible to a machine learning algorithm.
For the purposes of labeling the data, articles coming from sites explicitly claiming to be satirical are labeled as “satire,” and the rest as “serious.” The problem of separating satire from serious news is analogous to separating valid email from spam. For this reason, this paper uses many of the most common techniques from the field of spam filtering, such as a Support Vector Machine (SVM) with a linear kernel. The SVM uses a number of features established in other works, chiefly a bag of words, and two new features based on links to Twitter and other websites. This thesis also implements a deep learning approach with a C-LSTM. The SVM with all features consistently achieved over 99% accuracy, 95% precision, and 96% recall when comparing satirical articles to serious ones. The C-LSTM achieved just under 99% accuracy with about 90% precision and recall. It was found that the “fake news” category is easier to separate from serious news but may share similarities with satire. Lastly, this thesis found that the date of publication is a significant factor in identifying satirical articles and that serious news from the past two years may be more similar to older satirical articles than serious news was previously. |
URI: | http://arks.princeton.edu/ark:/88435/dsp012n49t445m |
Type of Material: | Princeton University Senior Theses |
Language: | en |
Appears in Collections: | Operations Research and Financial Engineering, 2000-2020 |
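The SVM approach described in the abstract (a bag of words plus two link-based features, fed to a linear-kernel SVM and scored by accuracy, precision, and recall) can be illustrated with a minimal sketch. This is not the thesis's code. The four toy articles, all function names, the hyperparameters, and the choice to represent the link features as simple counts are assumptions made for illustration; the abstract does not say how the link features are computed, and the thesis may well have used a library SVM rather than the hand-written subgradient training shown here.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")

def tokenize(text):
    """Lowercase word tokens, with URLs stripped out first."""
    return re.findall(r"[a-z']+", URL_RE.sub(" ", text.lower()))

def build_vocab(texts):
    """Sorted list of every token seen in the given texts."""
    return sorted({tok for t in texts for tok in tokenize(t)})

def featurize(text, vocab):
    """Bag-of-words counts followed by [Twitter link count, other link count].

    Counting links is an assumption; the abstract only says the two new
    features are 'based on' links to Twitter and other websites.
    """
    counts = Counter(tokenize(text))
    links = URL_RE.findall(text)
    twitter = sum("twitter.com" in u.lower() for u in links)
    bag = [float(counts[w]) for w in vocab]
    return bag + [float(twitter), float(len(links) - twitter)]

def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=50):
    """Linear SVM fitted by subgradient descent on the L2-regularised hinge loss.

    Labels must be +1 (satire) or -1 (serious). Hyperparameters are arbitrary.
    """
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - lr * lam) for wi in w]  # L2 shrinkage step
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b

def predict(x, w, b):
    """Return +1 (satire) or -1 (serious) from the sign of the decision value."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def metrics(y_true, y_pred):
    """Accuracy, precision, and recall with satire (+1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == -1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == -1 for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up toy corpus: +1 = satire, -1 = serious.
TEXTS = [
    "area man declares himself king of the local pizza shop http://twitter.com/fakeuser",
    "nation's cats announce plan to ignore everyone forever",
    "senate passes budget bill after lengthy debate http://example.com/report",
    "city council approves new transit funding plan",
]
LABELS = [1, 1, -1, -1]

vocab = build_vocab(TEXTS)
X = [featurize(t, vocab) for t in TEXTS]
w, b = train_linear_svm(X, LABELS)
preds = [predict(x, w, b) for x in X]
print(metrics(LABELS, preds))  # scores on the training data only
```

With satire as the rare positive class, accuracy can sit well above precision and recall, which is consistent with the gap between those figures reported in the abstract. The scores printed here are measured on the training data and say nothing about generalization.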
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
HARE-ADAM-THESIS.pdf | | 3.37 MB | Adobe PDF | Request a copy |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.