Statistical Methods for Robust Estimation of Differential Protein Expression Open Access

Wijayawardana, Sameera Rukshan (2011)

Permanent URL: https://etd.library.emory.edu/concern/etds/qn59q4759?locale=en%255D

Published

Abstract

Proteomics studies yield multi-layered data that pose challenges for statistical analysis, owing in part to the inherent complexity of the proteomes of organisms and in part to the variability of the mass spectrometry-based methods that form the backbone of modern proteomics. An active area of research in proteomics is the assessment of differential expression of proteins across biological samples. To date, little attention has been paid to ensuring the robustness of the statistical results of proteomics data analyses. Nor have there been rigorous attempts to adjust statistical results for the high technical variability found in proteomics data. There is also a lack of methods that address missing values in a model-based framework.

In this dissertation, we develop an estimator for the overall relative protein expression using a variant of the minimum norm quadratic unbiased estimation method. By assuming different distributional choices for a two-groups model underlying the mechanisms generating the relative expression values, we develop a robust and flexible finite mixture modeling approach for estimating the posterior probability that each protein is non-differentially expressed. In addition, we investigate the utility of several non-standard statistical distributions (the skew-normal, the skew Student's t, and the generalized hyperbolic) as candidate distributions for the fitted mixture components.
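
For orientation, the two-groups model and the posterior null probability mentioned above can be written in their standard form. This is a sketch in conventional notation rather than a quotation from the dissertation: f0(z) and f(z) are the null and full densities named in the table of contents, while the symbols π0 (proportion of null proteins), f1(z) (non-null density), and z (a protein's relative expression statistic) are assumed here for illustration.

```latex
\begin{align*}
f(z) &= \pi_0\, f_0(z) + (1 - \pi_0)\, f_1(z), \\
\operatorname{fdr}(z) &= \Pr(\text{null} \mid z) = \frac{\pi_0\, f_0(z)}{f(z)}.
\end{align*}
```

Under this formulation, fitting a finite mixture to f(z) and a separate distribution to f0(z) is enough to evaluate the local false discovery rate for each protein; f1(z) never needs to be estimated on its own.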

We account for latent error-generating processes in proteomics data in three ways: by conducting a reliability analysis to remove the subset of the original data deemed less reliable, by using a peptide ion current area-based method to estimate relative protein expression at the peptide level, and by using novel class-preserving nested resampling strategies to construct a bootstrap partial likelihood estimator of the overall relative expression level of each protein.

Furthermore, we illustrate the application of model-based estimation strategies when proteomics data are assumed to be missing at random, using multivariate t and bivariate normal models to robustly estimate the mean and covariance matrix of incompletely observed peptide-level data.
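
To make the bivariate normal case concrete, the following is a minimal sketch of the classical factored-likelihood maximum likelihood estimates for monotone-missing bivariate normal data, where one variable is fully observed and the other has gaps. It is not necessarily the estimator used in the dissertation. The function name, the variable names y1 and y2, and the use of NaN to mark missing values are all assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def monotone_bvn_mle(y1: Sequence[float], y2: Sequence[float]) -> Dict[str, float]:
    """ML estimates for a bivariate normal when y1 is complete and y2 has gaps.

    Hypothetical helper: missing y2 values are encoded as float('nan').
    """
    n = len(y1)
    # Complete cases: units where y2 was observed.
    cc = [(a, b) for a, b in zip(y1, y2) if not math.isnan(b)]
    r = len(cc)
    if r < 2:
        raise ValueError("need at least two complete cases")

    # Marginal distribution of y1: use all n units (ML divisor n).
    mu1 = sum(y1) / n
    sigma11 = sum((a - mu1) ** 2 for a in y1) / n

    # Complete-case means and (co)variances (ML divisor r).
    m1 = sum(a for a, _ in cc) / r
    m2 = sum(b for _, b in cc) / r
    s11 = sum((a - m1) ** 2 for a, _ in cc) / r
    s12 = sum((a - m1) * (b - m2) for a, b in cc) / r
    s22 = sum((b - m2) ** 2 for _, b in cc) / r
    if s11 == 0:
        raise ValueError("y1 is constant among complete cases")

    # Regression of y2 on y1, estimated from the complete cases only.
    beta = s12 / s11

    # Combine the two factors of the likelihood to recover the joint parameters.
    mu2 = m2 + beta * (mu1 - m1)
    sigma22 = s22 + beta ** 2 * (sigma11 - s11)
    sigma12 = beta * sigma11
    return {"mu1": mu1, "mu2": mu2, "sigma11": sigma11,
            "sigma12": sigma12, "sigma22": sigma22}


if __name__ == "__main__":
    nan = float("nan")
    print(monotone_bvn_mle([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, nan, nan]))
```

The estimate of the mean of y2 is adjusted by the regression on y1, so units with a missing y2 still contribute information through their observed y1. This is what distinguishes the approach from a complete-case analysis when the data are missing at random.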

Contents

1 Introduction

1.1 Overview

1.2 An Introduction to Proteins, the Proteome and Proteomics

1.3 Mass Spectrometry (MS) based Proteomics

1.4 Statistical Methods in Proteomics Data Analysis

1.5 Proteomics - Analytical Challenges

1.6 Motivating Examples

1.6.1 Data Sets

1.7 Proposed Research

1.7.1 Robust Estimation of Labeling Based High-Throughput Relative Protein Expressions

1.7.2 Resampling Based Methods for Identifying Differentially Expressed Proteins Using XIC Area

1.7.3 Estimating Relative Protein Expression Levels from Incomplete Data

2 Background

2.1 Mass Spectrometric Methods for Protein Identification

2.2 Mass Spectrometric Methods for Protein Expression Profiling

2.3 Quantification based on Stable Isotope Labeling

2.3.1 In Vitro Labeling via Chemical Incorporation

2.3.2 In Vivo Labeling via Metabolic Incorporation

2.4 Statistical Methods in Preprocessing Proteomics Data

2.4.1 Peak Selection

2.4.2 Peak Alignment

2.5 Statistical Methods in Identification of Peptides/Proteins

2.5.1 SEQUEST (Eng et al., (1994))

2.5.2 MASCOT (Perkins et al., (1999))

2.5.3 Other Methods

3 Robust Estimation of Labeling Based High-Throughput Relative Protein Expression

3.1 Introduction

3.2 Identifying Differentially Expressed Proteins in Non-replicated Experiments

3.2.1 Data Structure

3.2.2 A Random Effects Model for Estimating Relative Protein Expression

3.2.3 Estimation of Model Parameters

3.2.4 Simultaneous Testing of Relative Protein Expression Levels

3.2.4.1 The Two-Groups Model

3.2.4.2 Local False Discovery Rate

3.2.5 Proposed Two-Groups Models

3.2.6 Fitting a Two-Groups Model

3.2.6.1 Identifying the Null Region

3.2.6.2 Proportion of null proteins

3.2.6.3 Evaluating the goodness of fit of fitted distributions

3.2.6.4 Selecting the number of mixture components

3.2.6.5 EM algorithms for finite mixtures

3.2.6.6 Identifiability of Mixture Distributions

3.2.6.7 Estimating the local false discovery rate

3.3 Nmix - Tmix Model

3.3.1 Estimating f0(z) and f(z)

3.4 sN - sTmix Model

3.4.1 The Skew-Normal (sN) Distribution

3.4.2 The Doubly Truncated Skew-Normal (DTsN) Distribution

3.4.3 The Skew-t (sT) Distribution

3.4.4 Estimating f0(z) and f(z)

3.5 sNmix-GH Model

3.5.1 The Generalized Hyperbolic (GH) Distribution

3.5.2 Estimating f0(z) and f(z)

3.6 Results

3.6.1 Fitting the Null Distribution, f0(z)

3.6.2 Fitting the Full Distribution, f(z)

3.6.3 Number of Mixture Components and Goodness of Fit

3.6.4 Local False Discovery Rate

3.6.5 False Positive and False Negative Rates

3.6.6 Robustness of Results

3.7 Discussion

3.8 Future Work

3.8.1 Bayesian Hierarchical Modeling of Replicated SILAC Data

3.8.2 Estimation of Model Parameters

3.8.3 Estimating the local fdr

4 Resampling Based Methods for Identifying Differentially Expressed Proteins using XIC Area

4.1 Introduction

4.2 Reliability Analysis of SILAC Data

4.3 Evaluation of the Protein Relative Expression Ratio using Extracted Ion Current (XIC) Area

4.3.0.1 The Savitzky-Golay smoothing filter

4.3.0.2 Estimating the relative expression ratio using XIC area

4.4 Resampling Based Estimation of Overall Protein Relative Expression using XIC area

4.4.1 Estimation of Relative Protein Expression using a Bootstrap Partial Maximum Likelihood Estimator (BPMLE)

4.4.2 Estimation of Relative Protein Expression using a Model-based Bootstrap

4.4.2.1 Robust regression using M-estimation

4.4.2.2 Influence of covariates on protein expression estimation

4.4.3 p-value Estimation and FDR

4.4.3.1 A p-value based on the nested-bootstrap samples

4.4.3.2 Local False Discovery Rate Estimation

4.5 Results

4.5.1 Estimation of Relative Protein Expression using a Bootstrap Partial Maximum Likelihood Estimator (BPMLE)

4.5.2 Estimation of Relative Protein Expression using a Model-based Bootstrap

4.6 Discussion

5 Estimating Relative Protein Expression Levels from Incomplete Data

5.1 Introduction

5.1.1 Setup of the data

5.1.2 Types of Missing Data Patterns and Mechanisms

5.2 Estimating Relative Protein Expression Levels from Incomplete Peptide Data

5.2.1 A Test of MCAR for Multivariate Data

5.2.2 A likelihood Ratio Based Test of MCAR

5.2.3 A Multivariate General-MAR Model for Incomplete Peptide Data

5.2.4 A Robust Alternative to the Multivariate Normal Estimation

5.2.5 Estimating the True Relative Protein Expression Ratio

5.3 A Missing Data Model for Single Peptide Proteins

5.3.1 Setup of the data

5.3.2 A Test of MCAR for Bivariate Normal Monotone-Missing Data

5.3.3 A Bivariate Normal Monotone-MAR Model

5.3.4 Small sample inference

5.4 Results

5.4.1 Estimating Relative Protein Expression from Incomplete Peptide Data

5.4.2 Estimating Relative Protein Expression from Single Peptide Data

5.4.2.1 Small Sample Confidence Intervals

5.5 Discussion

5.6 Future Work

5.6.1 A Pattern Mixture Model (PMM) for Single Peptide Proteins

5.6.2 Choice of λ

5.6.3 Sensitivity Analysis

Appendices

5.1 Appendix

5.2 Chapter 5 - Appendices

5.2.1 Appendix A: Parameter Estimates of the Multivariate t Models Fitted to YGR192C

5.2.2 Appendix B: Posterior Distribution and Draws of µh, σhh, σlh, and R-hat

Bibliography

About this Dissertation

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Laney Graduate School
Department	Biostatistics
Degree	PhD
Submission	Dissertation
Language	English
Research Field	Statistics
Keyword	Finite mixture models; Bootstrap partial likelihood; Robust estimation; Generalized distributions; Proteomics; Non-ignorable missingness
Committee Chair / Thesis Advisor	Hanfelt, John, Emory University; Yu, Tianwei, Emory University
Committee Members	Manatunga, Amita, Emory University; Peng, Junmin, Emory University

Primary PDF

Title	Date Uploaded
Statistical Methods for Robust Estimation of Differential Protein Expression	2018-08-28 15:28:25 -0400
