High-throughput methods for in silico discovery of peptides, proteins, and post-translational modifications in proteomics

Baliban, Richard Christopher

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01ns0646068

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Floudas, Christodoulos A	en_US
dc.contributor.author	Baliban, Richard Christopher	en_US
dc.contributor.other	Chemical and Biological Engineering Department	en_US
dc.date.accessioned	2012-11-15T23:57:35Z	-
dc.date.available	2012-11-15T23:57:35Z	-
dc.date.issued	2012	en_US
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/dsp01ns0646068	-
dc.description.abstract	The field of proteomics seeks to address a grand problem in biology where large-scale determination of the gene and cellular function of an organism is directly analyzed at the protein level. Over the last decade, liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has emerged as a prominent tool within the field due to the capacity for high-throughput and high-sensitivity experimental designs. The resulting output from LC-MS/MS systems often includes thousands of MS/MS spectra, each of which is a complex piece of data that must be analyzed to extract relevant information about the proteins contained in a cellular sample. These data sets are often noisy and therefore require sophisticated and robust tools that are capable of efficiently processing the information. This thesis presents several mathematical models and algorithms that address three major areas of open research problems in proteomics: (1) post-translational modification (PTM) identification at the peptide level, (2) unmodified and modified protein identification, and (3) determination of optimal biomarker combinations. When conducting an LC-MS/MS experiment, the prime objective is the identification of a complete list of sample proteins along with all identified PTMs. This is a major challenge due to the vast increase in computational complexity that results from introducing over 900 modifications to a typical 20-amino-acid universe. Two novel algorithms were developed based on integer linear optimization for (1) the identification of a comprehensive list of all proteins and (2) the untargeted identification of all modifications along a template peptide sequence. Existing peptide identification algorithms are utilized to initially determine all unmodified peptides, which are input to the protein identification algorithm to determine the list of all sample proteins.
An untargeted search for all modified amino acid sites within the protein list is then performed using a universal set of all PTMs. These algorithms demonstrate superior accuracy on both small- and large-scale data sets when benchmarked against existing state-of-the-art methods. The complete suite of algorithms was fully integrated into a webtool that was made freely available to the scientific community. Using the above algorithms, gingival crevicular fluid (GCF) samples were analyzed to identify novel biomarker combinations of proteins that could effectively diagnose individuals who are either periodontally healthy (PH) or afflicted with chronic periodontitis (CP). A training set of 12 PH and 12 CP samples identified 432 human and 30 bacterial proteins, 150 of which were not previously identified in large-scale proteomics analysis. GCF samples were obtained from 72 additional subjects, and a mixed-integer optimization model was developed to identify the optimal combination of biomarkers for diagnosis of PH or CP individuals. A thorough cross-validation of the model capability was performed on a training set of 55 samples, and greater than 99% accuracy was consistently achieved. The model was then tested on two blind test sets, and using an optimal combination of 7 human proteins and 3 bacterial proteins, the model was able to correctly predict 40 out of 41 PH and CP samples.	en_US
dc.language.iso	en	en_US
dc.publisher	Princeton, NJ : Princeton University	en_US
dc.relation.isformatof	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu	en_US
dc.subject	Biomarkers	en_US
dc.subject	Gingival Crevicular Fluid	en_US
dc.subject	Integer Linear Optimization	en_US
dc.subject	Post-translational Modifications	en_US
dc.subject	Protein Identification	en_US
dc.subject	Tandem Mass Spectrometry	en_US
dc.subject.classification	Chemical engineering	en_US
dc.subject.classification	Molecular biology	en_US
dc.subject.classification	Bioinformatics	en_US
dc.title	High-throughput methods for in silico discovery of peptides, proteins, and post-translational modifications in proteomics	en_US
dc.type	Academic dissertations (Ph.D.)	en_US
pu.projectgrantnumber	690-2143	en_US
Appears in Collections:	Chemical and Biological Engineering

Files in This Item:

File	Description	Size	Format
Baliban_princeton_0181D_10425.pdf		1.26 MB	Adobe PDF
