Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01n009w514c
Title: Methods for Labeling Big Data: Active Learning and Data Programming
Authors: Zimmer, Jacob
Advisors: Stewart, Brandon; Engelhardt, Barbara
Department: Computer Science
Certificate Program: Center for Statistics and Machine Learning
Class Year: 2019
Abstract: Advances in machine learning algorithms have made the adage about the unreasonable effectiveness of data truer than ever, but as datasets grow it becomes increasingly difficult to create a training set of sufficient size and quality to take full advantage of algorithmic gains. A variety of approaches using various levels of supervision have been proposed to bridge this divide, but the two most promising are Active Learning and Data Programming. In this work, the performance of these strategies and of a hybrid approach that incorporates elements of both is compared. It is found that, given access to enough data, Active Learning can outperform either Data Programming or the hybrid approach, but in situations where labeling is expensive it can be worthwhile to use Data Programming. A case study analyzing the political content of emails from the Enron Corporation confirms this finding.
URI: http://arks.princeton.edu/ark:/88435/dsp01n009w514c
Type of Material: Princeton University Senior Theses
Language: en
Appears in Collections: Computer Science, 1988-2020

Files in This Item:
File                      Size      Format
ZIMMER-JACOB-THESIS.pdf   1.18 MB   Adobe PDF

