Statistical Methods for Robust Estimation of Differential Protein Expression Open Access
Wijayawardana, Sameera Rukshan (2011)
Abstract
Proteomics studies yield multi-layered data that pose challenges for statistical analysis, partly because of the inherent complexity of the proteomes of organisms and partly because of the variability of the mass spectrometry based methods that form the backbone of modern proteomics. An active area of research in proteomics is the assessment of differential expression of proteins across biological samples. To date, little attention has been paid to ensuring the robustness of the statistical results of proteomics data analyses, nor have there been rigorous attempts to adjust those results for the high technical variability found in proteomics data. There is also a lack of methods that address missing values in a model-based framework.
In this dissertation, we develop an estimator for the overall relative protein expression using a variant of the minimum norm quadratic unbiased estimation method. By assuming different distributional choices for a two-groups model underlying the mechanisms that generate the relative expression values, we develop a robust and flexible finite mixture modeling approach for estimating the posterior probability that each protein is not differentially expressed. In addition, we investigate the utility of several non-standard statistical distributions (the skew-normal, the skew Student's t, and the generalized hyperbolic) as candidate distributions for the fitted mixture components.
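As a minimal illustration of the quantity described above, the posterior probability of being non-differentially expressed under a two-groups model can be written as fdr(z) = π0 f0(z) / f(z), with f(z) = π0 f0(z) + (1 − π0) f1(z). The sketch below assumes plain normal densities for both groups and a fixed null proportion; these choices, and every name and number in the code, are illustrative assumptions and not the dissertation's fitted skew-normal, skew-t, or generalized hyperbolic mixtures.

```python
import math


def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a normal distribution with the given mean and sd."""
    u = (z - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(z, pi0, f0, f1):
    """Posterior probability that a score z comes from the null group.

    pi0 is the null proportion; f0 and f1 are the null and non-null
    densities (assumed known here, estimated in practice).
    """
    null_part = pi0 * f0(z)
    mixture = null_part + (1.0 - pi0) * f1(z)
    return null_part / mixture


# Hypothetical settings: 90% null proteins, N(0, 1) null, N(3, 1) non-null.
PI0 = 0.9
f0 = lambda z: normal_pdf(z, 0.0, 1.0)
f1 = lambda z: normal_pdf(z, 3.0, 1.0)

for z in (0.0, 1.5, 3.0):
    print(z, round(local_fdr(z, PI0, f0, f1), 3))
```

Swapping `f0` and `f1` for other densities leaves `local_fdr` unchanged, which is the sense in which the mixture components are interchangeable candidates.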
We account for latent error-generating processes in proteomics data in three ways: by conducting a reliability analysis that removes the subset of the original data deemed less reliable; by using a peptide ion current area based method to estimate relative protein expression at the peptide level; and by using novel class-preserving nested resampling strategies to construct a bootstrap partial likelihood estimator of the overall relative expression level of each protein.
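The abstract does not spell out the resampling scheme, so the following is only a generic sketch of a nested bootstrap that keeps observations grouped by class. It assumes that the "class" is the peptide, resamples peptides first and then measurements within each selected peptide, and summarizes each replicate by a median log-ratio. The function name, the data, and the use of a median in place of the partial likelihood estimator are all hypothetical.

```python
import random
import statistics


def nested_bootstrap_median(peptide_log_ratios, n_boot=500, seed=0):
    """Two-level bootstrap of a protein-level median log-ratio.

    peptide_log_ratios maps a peptide id to its list of log-ratios.
    Outer level: resample peptides with replacement.
    Inner level: resample measurements within each chosen peptide,
    so values are never mixed across peptides during resampling.
    """
    rng = random.Random(seed)
    peptides = list(peptide_log_ratios)
    estimates = []
    for _ in range(n_boot):
        chosen = [rng.choice(peptides) for _ in peptides]
        pooled = []
        for pep in chosen:
            values = peptide_log_ratios[pep]
            pooled.extend(rng.choice(values) for _ in values)
        estimates.append(statistics.median(pooled))
    return estimates


# Hypothetical log-ratios for one protein with three peptides.
data = {
    "pepA": [0.9, 1.1, 1.0],
    "pepB": [1.4, 1.6],
    "pepC": [0.7, 0.8, 0.75, 0.85],
}
reps = sorted(nested_bootstrap_median(data, n_boot=200, seed=1))
lower, upper = reps[int(0.025 * len(reps))], reps[int(0.975 * len(reps)) - 1]
print("approximate 95% percentile interval:", (lower, upper))
```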
Furthermore, we illustrate the application of model-based estimation strategies when proteomics data are assumed to be missing at random, using multivariate t and bivariate normal models to robustly estimate the mean and covariance matrix of incompletely observed peptide-level data.
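To make the bivariate normal case concrete, the sketch below uses the standard factored-likelihood maximum likelihood estimates for a monotone missing pattern, where the first variable is always observed and the second is missing for some units. It is a textbook illustration under the missing-at-random assumption, not the dissertation's code; the function name and the toy data are hypothetical, and missing values are represented by `None`.

```python
def monotone_bvn_mle(y1, y2):
    """ML estimates of the mean and covariance of (y1, y2).

    y1 is fully observed; y2 holds None where it is missing.
    The likelihood factors into the marginal of y1 (all units) and the
    regression of y2 on y1 (complete units only).
    """
    n = len(y1)
    complete = [(a, b) for a, b in zip(y1, y2) if b is not None]
    r = len(complete)

    # Marginal of y1 from all n units (ML divisor n).
    mu1 = sum(y1) / n
    s11 = sum((a - mu1) ** 2 for a in y1) / n

    # Regression of y2 on y1 from the r complete units.
    m1c = sum(a for a, _ in complete) / r
    m2c = sum(b for _, b in complete) / r
    s11c = sum((a - m1c) ** 2 for a, _ in complete) / r
    s12c = sum((a - m1c) * (b - m2c) for a, b in complete) / r
    s22c = sum((b - m2c) ** 2 for _, b in complete) / r
    beta = s12c / s11c
    resid_var = s22c - beta * s12c

    # Recombine the two factors into the joint parameters.
    mu2 = m2c + beta * (mu1 - m1c)
    s12 = beta * s11
    s22 = resid_var + beta ** 2 * s11
    return {"mu": (mu1, mu2), "sigma": ((s11, s12), (s12, s22))}


# Toy data: y2 is missing for the three units with the largest y1.
est = monotone_bvn_mle([1, 2, 3, 4, 5, 6], [2, 4, 6, None, None, None])
print(est["mu"])
```

In this toy example the complete-case mean of the second variable is 4, while the estimate that borrows strength from the fully observed first variable is larger, because the units with missing values have larger first-variable values.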
Table of Contents
1 Introduction
1.1 Overview
1.2 An Introduction to Proteins, the Proteome and Proteomics
1.3 Mass Spectrometry (MS) based Proteomics
1.4 Statistical Methods in Proteomics Data Analysis
1.5 Proteomics - Analytical Challenges
1.6 Motivating Examples
1.6.1 Data Sets
1.7 Proposed Research
1.7.1 Robust Estimation of Labeling Based High-Throughput Relative Protein Expressions
1.7.2 Resampling Based Methods for Identifying Differentially Expressed Proteins Using XIC Area
1.7.3 Estimating Relative Protein Expression Levels from Incomplete Data
2 Background
2.1 Mass Spectrometric Methods for Protein Identification
2.2 Mass Spectrometric Methods for Protein Expression Profiling
2.3 Quantification based on Stable Isotope Labeling
2.3.1 In Vitro Labeling via Chemical Incorporation
2.3.2 In Vivo Labeling via Metabolic Incorporation
2.4 Statistical Methods in Preprocessing Proteomics Data
2.4.1 Peak Selection
2.4.2 Peak Alignment
2.5 Statistical Methods in Identification of Peptides/Proteins
2.5.1 SEQUEST (Eng et al., 1994)
2.5.2 MASCOT (Perkins et al., 1999)
2.5.3 Other Methods
3 Robust Estimation of Labeling Based High-Throughput Relative Protein Expression
3.1 Introduction
3.2 Identifying Differentially Expressed Proteins in Non-replicated Experiments
3.2.1 Data Structure
3.2.2 A Random Effects Model for Estimating Relative Protein Expression
3.2.3 Estimation of Model Parameters
3.2.4 Simultaneous Testing of Relative Protein Expression Levels
3.2.4.1 The Two-Groups Model
3.2.4.2 Local False Discovery Rate
3.2.5 Proposed Two-Groups Models
3.2.6 Fitting a Two-Groups Model
3.2.6.1 Identifying the Null Region
3.2.6.2 Proportion of null proteins
3.2.6.3 Evaluating the goodness of fit of fitted distributions
3.2.6.4 Selecting the number of mixture components
3.2.6.5 EM algorithms for finite mixtures
3.2.6.6 Identifiability of Mixture Distributions
3.2.6.7 Estimating the local false discovery rate
3.3 Nmix - Tmix Model
3.3.1 Estimating f0(z) and f(z)
3.4 sN - sTmix Model
3.4.1 The Skew-Normal (sN) Distribution
3.4.2 The Doubly Truncated Skew-Normal (DTsN) Distribution
3.4.3 The Skew-t (sT) Distribution
3.4.4 Estimating f0(z) and f(z)
3.5 sNmix-GH Model
3.5.1 The Generalized Hyperbolic (GH) Distribution
3.5.2 Estimating f0(z) and f(z)
3.6 Results
3.6.1 Fitting the Null Distribution, f0(z)
3.6.2 Fitting the Full Distribution, f(z)
3.6.3 Number of Mixture Components and Goodness of Fit
3.6.4 Local False Discovery Rate
3.6.5 False Positive and False Negative Rates
3.6.6 Robustness of Results
3.7 Discussion
3.8 Future Work
3.8.1 Bayesian Hierarchical Modeling of Replicated SILAC Data
3.8.2 Estimation of Model Parameters
3.8.3 Estimating the local fdr
4 Resampling Based Methods for Identifying Differentially Expressed Proteins using XIC Area
4.1 Introduction
4.2 Reliability Analysis of SILAC Data
4.3 Evaluation of the Protein Relative Expression Ratio using Extracted Ion Current (XIC) Area
4.3.0.1 The Savitzky-Golay smoothing filter
4.3.0.2 Estimating the relative expression ratio using XIC area
4.4 Resampling Based Estimation of Overall Protein Relative Expression using XIC area
4.4.1 Estimation of Relative Protein Expression using a Bootstrap Partial Maximum Likelihood Estimator (BPMLE)
4.4.2 Estimation of Relative Protein Expression using a Model-based Bootstrap
4.4.2.1 Robust regression using M-estimation
4.4.2.2 Influence of covariates on protein expression estimation
4.4.3 p-value Estimation and FDR
4.4.3.1 A p-value based on the nested-bootstrap samples
4.4.3.2 Local False Discovery Rate Estimation
4.5 Results
4.5.1 Estimation of Relative Protein Expression using a Bootstrap Partial Maximum Likelihood Estimator (BPMLE)
4.5.2 Estimation of Relative Protein Expression using a Model-based Bootstrap
4.6 Discussion
5 Estimating Relative Protein Expression Levels from Incomplete Data
5.1 Introduction
5.1.1 Setup of the data
5.1.2 Types of Missing Data Patterns and Mechanisms
5.2 Estimating Relative Protein Expression Levels from Incomplete Peptide Data
5.2.1 A Test of MCAR for Multivariate Data
5.2.2 A Likelihood Ratio Based Test of MCAR
5.2.3 A Multivariate General-MAR Model for Incomplete Peptide Data
5.2.4 A Robust Alternative to the Multivariate Normal Estimation
5.2.5 Estimating the True Relative Protein Expression Ratio
5.3 A Missing Data Model for Single Peptide Proteins
5.3.1 Setup of the data
5.3.2 A Test of MCAR for Bivariate Normal Monotone-Missing Data
5.3.3 A Bivariate Normal Monotone-MAR Model
5.3.4 Small sample inference
5.4 Results
5.4.1 Estimating Relative Protein Expression from Incomplete Peptide Data
5.4.2 Estimating Relative Protein Expression from Single Peptide Data
5.4.2.1 Small Sample Confidence Intervals
5.5 Discussion
5.6 Future Work
5.6.1 A Pattern Mixture Model (PMM) for Single Peptide Proteins
5.6.2 Choice of λ
5.6.3 Sensitivity Analysis
Appendices
5.1 Appendix
5.2 Chapter 5 - Appendices
5.2.1 Appendix A: Parameter Estimates of the Multivariate t Models Fitted to YGR192C
5.2.2 Appendix B: Posterior Distribution and Draws of µh, σhh, σlh, and R-hat
Bibliography
About this Dissertation
Primary PDF
Title | Date Uploaded |
---|---|
Statistical Methods for Robust Estimation of Differential Protein Expression | 2018-08-28 15:28:25 -0400 |