Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp011j92gb18r
Title: Speech Synthesis for Text-Based Editing of Audio Narration
Authors: Jin, Zeyu
Advisors: Finkelstein, Adam
Contributors: Computer Science Department
Subjects: Computer science
Issue Date: 2018
Publisher: Princeton, NJ : Princeton University
Abstract: Recorded audio narration plays a crucial role in many contexts, including online lectures, documentaries, demo videos, podcasts, and radio. However, editing audio narration with conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems allow the editor to perform select, cut, copy, and paste operations in the text transcript of the narration and apply the changes to the waveform accordingly. However, such interfaces cannot synthesize new words that do not appear in the transcript. While it is possible to build a high-fidelity speech synthesizer from samples of a new voice, such synthesizers typically require a large amount of voice data as input, as well as substantial manual annotation, to work well. This thesis presents a speech synthesizer tailored for text-based editing of narrations. The basic idea is to synthesize the input word in a different voice using a standard pre-built speech synthesizer and then transform that voice into the desired voice using voice conversion. Unfortunately, conventional voice conversion does not produce synthesis of sufficient quality for the stated application. Hence, this thesis introduces new voice conversion techniques that synthesize words with high individuality and clarity. Three methods are proposed. The first, CUTE, is a data-driven voice conversion method based on frame-level unit selection and exemplar features. The second, VoCo, builds on CUTE with several improvements that help the synthesized word blend more seamlessly into the context where it is inserted. Both CUTE and VoCo select sequences of audio frames from the voice samples and stitch them together to approximate the voice being converted. The third method improves on VoCo with deep neural networks. It involves two networks: FFTNet generates high-quality waveforms from acoustic features, and TimbreNet transforms the acoustic features of the generic synthesizer voice into those of a human voice.
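
Note: To illustrate the frame-level unit selection that the abstract says CUTE and VoCo build on, the following is a minimal sketch, not code from the thesis. The function name, the Euclidean target cost, and the fixed concatenation penalty are illustrative assumptions; the actual methods use richer exemplar features and blending. The sketch simply shows the core idea: pick, for each frame of the query, a frame of the target speaker's recordings, trading off feature similarity against a penalty for stitching together non-adjacent audio.

import numpy as np

def select_frames(query, corpus, concat_penalty=1.0):
    """Illustrative frame-level unit selection (hypothetical helper).

    query:  (T, D) acoustic features of the word rendered by a generic synthesizer.
    corpus: (N, D) acoustic features of frames from the target speaker's narration.
    Returns a list of T corpus indices whose frames approximate the query.
    """
    T, _ = query.shape
    N, _ = corpus.shape
    idx = np.arange(N)

    # Target cost: distance between each query frame and each corpus frame.
    target_cost = np.linalg.norm(query[:, None, :] - corpus[None, :, :], axis=-1)  # (T, N)

    # Dynamic programming over frame choices (Viterbi-style).
    dp = target_cost[0].copy()           # best cost ending at each corpus frame, step 0
    back = np.zeros((T, N), dtype=int)   # backpointers

    for t in range(1, T):
        # Concatenation cost: free if the previous corpus frame is the immediate
        # predecessor (contiguous audio), a fixed penalty otherwise.
        concat = np.full((N, N), concat_penalty)   # (prev, cur)
        concat[idx[:-1], idx[1:]] = 0.0            # contiguous transitions cost nothing
        total = dp[:, None] + concat               # (prev, cur)
        back[t] = np.argmin(total, axis=0)
        dp = total[back[t], idx] + target_cost[t]

    # Trace back the lowest-cost sequence of corpus frames.
    path = [int(np.argmin(dp))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

The selected frames would then be concatenated (with smoothing at the joins) to approximate the query word in the target speaker's voice, which is the stitching step the abstract describes.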
URI: http://arks.princeton.edu/ark:/88435/dsp011j92gb18r
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections:Computer Science

Files in This Item:
File: Jin_princeton_0181D_12635.pdf
Size: 4.83 MB
Format: Adobe PDF

