Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01hq37vr63d
Full metadata record
dc.contributor.advisor: Russakovsky, Olga
dc.contributor.author: Yoon, Phillip
dc.date.accessioned: 2020-10-01T21:26:25Z
dc.date.available: 2020-10-01T21:26:25Z
dc.date.created: 2020-05-02
dc.date.issued: 2020-10-01
dc.identifier.uri: http://arks.princeton.edu/ark:/88435/dsp01hq37vr63d
dc.description.abstract: This paper presents a self-supervised model for sound separation and localization that capitalizes on the natural correspondence between the audio and visual modalities of video. Because the auditory and visual components of a video are temporally aligned, our deep learning-based approach fuses the two signals to learn separation and localization jointly. For every pixel region in a video, the network predicts a binary mask that is overlaid on a spectrogram representation of the input audio to estimate the sound coming from that region. To train the network, we employ a mix-and-separate framework that synthetically creates training examples from our dataset of stabilized videos. The joint audio-visual model achieves high performance, demonstrating that the proposed architecture can separate and localize sound in videos.
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.title: Improving Sound Separation and Localization Using Audio-Visual Scene Analysis
dc.type: Princeton University Senior Theses
pu.date.classyear: 2020
pu.department: Computer Science
pu.pdf.coverpage: SeniorThesisCoverPage
pu.contributor.authorid: 920058657
Appears in Collections: Computer Science, 1988-2020
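The mix-and-separate training scheme and the binary spectrogram masks described in the abstract can be illustrated with a minimal sketch. This is a hypothetical illustration, not the thesis code: the array shapes, the random stand-in spectrograms, and the use of an ideal (oracle) mask in place of a network prediction are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "ground-truth" magnitude spectrograms (frequency bins x time frames).
# In the thesis these would come from two separate videos; here they are
# random stand-ins with an assumed shape of 256 x 100.
source_a = rng.random((256, 100))
source_b = rng.random((256, 100))

# Mix-and-separate: sum the two sources to synthetically create a training
# mixture whose individual components are known.
mixture = source_a + source_b

# A network would predict a binary mask for each pixel region; as a stand-in,
# the ideal binary mask marks each time-frequency bin where source A
# dominates source B.
ideal_mask = (source_a > source_b).astype(np.float32)

# Overlaying the mask on the mixture spectrogram yields an estimate of the
# sound attributed to source A; masked-out bins are zeroed.
estimate_a = ideal_mask * mixture
```

Because the mixture is built from known sources, the known components serve as supervision targets, which is what makes the scheme self-supervised: no human annotation of the audio is required.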

Files in This Item:
File: YOON-PHILLIP-THESIS.pdf
Size: 2.33 MB
Format: Adobe PDF
Access: Request a copy


Items in DataSpace are protected by copyright, with all rights reserved, unless otherwise indicated.