Algorithmic Approaches to Classifying Biological Networks Pubblico
Bray, Margaret Justice (2015)
Abstract
As technology has become more advanced, the ease with which data can be collected has improved. This has left researchers with copious amounts of information, so much information that previous analytical techniques fall short. This has led to an increase in the popularity of representing data with networks. However, this data often have errors. This makes any conclusion gleamed from the analysis of a network unreliable. One type of network for which the inaccuracies are a particular issue is the protein-protein interaction (PPI) network. Researchers would like to use these networks to detect and diagnosis diseases by identifying specific interactions. Unfortunately, the errors in the networks make this impossible. One way to fix this is classify the empirical network into a category of model graph. By doing so, we will be able to mathematically predict which interactions are legitimate, and which are not. In this dissertation, we begin by testing the classification accuracy of five algorithms: degree distribution distance (DDD), characteristic curve (CC), relative graphlet frequency (RGF), graphlet degree distribution using arithmetic mean (GDD (A)) and using geometric mean (GDD (G)). Overall accuracies were poor, ranging from 68% for the GDD (A) down to 47% for the DDD. With accuracies this low, it is difficult to trust the classification results for an empirical network of unknown origin. Therefore, we propose two solutions. First, we provide several modifications to both versions of the GDD. The reformulated GDD is more accurate, classifying 76% of known graphs correctly, while also performing the analysis with increased speed. Second, we present a new classification algorithm: cross scoring. This novel method works by comparing networks based on a pre-selected group of network measures. Each type of model graph is ranked by how close its measure value falls to the empirical value compared to the other model types considered. Points are awarded and the model type with the fewest points at the end of the comparisons is considered the best fit. Accuracy across twelve trials was 82.9% with a standard deviation of 0.98%. These results are an obvious improvement over the five original algorithms considered.
Table of Contents
1 Introduction to Graphs and Graph Theory 1
1.1 Review of Essential Graph Theory Elements 3
1.2 Network Measures 5
1.2.1 Small-World and Scale-Free 9
1.2.2 Centrality 13
1.3 Discussion 15
2 Introduction to the Structure of Biomolecular Networks 17
2.1 Protein-Protein Interactions 17
2.2 Sacchraomyces cerevisiae Protein-Protein Interaction Network 18
2.3 Motivation for Network Classification 21
3 Introduction to Model Graphs 23
3.1 Model Graph Descriptions 23
3.1.1 RandomStatic 25
3.1.2 Small-World 25
3.1.3 3D-Geometric 26
3.1.4 Linear Preferential Attachment 26
3.1.5 Random Growing 27
3.1.6 Aging Vertex 28
3.1.7 Duplication-Mutation-Complementation and Duplication-Mutation using Random Mutation 28
3.1.8 Stickiness Model 29
3.2 Methods 30
3.3 Results 30
3.3.1 Numbers of Nodes and Edges 30
3.3.2 Density 34
3.3.3 Proportion of Nodes in the Giant Component 36
3.3.4 Diameter, Radius, and Average Shortest Path Length 38
3.3.5 Average Degree and Assortativity 42
3.3.6 S-metric 45
3.3.7 Clustering 46
3.3.8 Centralities: Betweenness, Closeness, Degree, and Eigenvector 49
3.4 Discussion 53
4 Measure Based Comparison of Model Graphs v Saccharomyces cerevisiae PPI Network 57
4.1 Methods 57
4.2 Results 58
4.2.1 Size Measures 58
4.2.2 Distance Measures 63
4.2.3 Centrality Measures 67
4.2.4 Connection Measures 69
4.2.5 Biologically Significant Measures 72
4.2.6 Summary of Measure Based Comparison Broken Down by Category 74
4.3 Discussion 75
5 Introduction to Network Classification Methods 78
5.1 Network Classification Methods 79
5.1.1 Relative Graphlet Frequency and Graphlet Degree Distribution 79
5.1.2 Characteristic Curve 82
5.1.3 Degree Distribution Distance 85
5.2 Limitations of Previous Work 86
6 Random Graph Classification 87
6.1 Methods 87
6.2 Results 90
6.3 Discussion 94
7 Model Graph Classification 96
7.1 Methods 96
7.2 Results 99
7.2.1 Filtering Out Large Graphs 99
7.2.2 Degree Distribution Distance 100
7.2.3 Characteristic Curve 105
7.2.4 Relative Graphlet Frequency 109
7.2.5 Graphlet Degree Distribution 112
7.2.6 Comparison of Classification Accuracy Broken Down by Model Type and Method 117
7.2.7 Patterns in Statistical Performance 119
7.2.8 Treatment of DMC and DMR Model Graphs 121
7.3 Discussion 123
7.3.1 Strengths and Limitations 126
7.3.2 Next Steps 129
8 Saccharomyces cerevisiae PPI Network Classification 130
8.1 Methods 130
8.2 Results 132
8.2.1 Degree Distribution Distance 132
8.2.2 Characteristic Curve 134
8.2.3 Relative Graphlet Frequency 135
8.2.4 Graphlet Degree Distribution 137
8.2.5 Kendall's W Comparison of Ranking Lists 139
8.3 Discussion 140
9 Relative Graphlet Frequency
9.1 Formula Error 142
9.2 Methods 143
9.3 Results 144
9.3.1 Random Graph Classification 144
9.3.2 Model Graph Classification 144
9.3.3 Saccharomyces cerevisiae PPI Network Classification 148
9.4 Conclusions 151
10 Reformulations of the Graphlet Degree Distribution 153
10.1 Graphlet Degree Distribution Issues 154
10.1.1 Geometric Mean 155
10.1.2 Contradictory Outcomes 158
10.1.3 Scaling and Normalization 164
10.2 Methods 165
10.2.1 Version 1 166
10.2.2 Version 2 166
10.2.3 Version 3 167
10.2.4 Analysis of Performance 168
10.3 Results 168
10.3.1 Model Graph Classification 168
10.3.2 Comparison of the Original Graphlet Degree Distribution to the Reformulated Versions 176
10.3.3 Saccharomyces cerevisiae PPI Network Classification 179
10.4 Conclusions 181
11 Designing the Cross Scoring Algorithm 182
11.1 Methods 182
11.1.1 Measures of Center and Spread 183
11.1.2 Nonlinear Scoring 183
11.1.3 Zeroing 184
11.1.4 Approximations 185
11.1.5 Tie Breaking 185
11.2 Data 186
11.3 Results 187
11.3.1 Mean Results 188
11.3.2 Median Results 189
11.3.3 Comparison of Mean and Median Results 192
11.4 Discussion 193
12 Determining the Cross Scoring Measure List 195
12.1 Methods 195
12.1.1 Macro- v Micro-Scoring 196
12.1.2 Importance-Scoring 196
12.2 Data 196
12.3 Results 198
12.3.1 Macro-lists 199
12.3.2 Micro-list 199
12.3.3 Importance-Scoring 201
12.4 Discussion 201
13 Applying the Cross Scoring Algorithm 203
13.1 Methods 203
13.2 Data 204
13.3 Results 204
13.3.1 Measure Selection 204
13.3.2 Macro-Lists 206
13.3.3 Macro-Scoring Performance 208
13.3.4 Micro-Scoring Performance 213
13.3.5 Macro- v Micro-Scoring Results 218
13.3.6 Classification of the S. cerevisiae PPI Network 220
13.3.7 Importance-Scoring 223
13.3.8 Robustness 225
13.3.9 Comparison of All Classifiers 225
13.4 Discussion 227
13.4.1 Biological Implications and their Effect on the Cross Scoring Design 229
13.4.2 Strengths and Limitations 230
14 Summary 232
14.1 Overview 232
14.1.1 Which Model Type is the Best Fit for the Saccharomyces cerevisiae PPI Network? 235
14.1.2 Do PPI Networks Exhibit Scale-Free Properties? 236
14.2 Future Work 237
14.2.1 Extension of Analyses 237
14.2.2 Redesign of DMC and DMR Growth Mechanisms 238
14.2.3 Does the Growth Mechanism Define the Model Graph Type? 239
A 240
About this Dissertation
School | |
---|---|
Department | |
Degree | |
Submission | |
Language |
|
Research Field | |
Parola chiave | |
Committee Chair / Thesis Advisor | |
Committee Members |
Primary PDF
Thumbnail | Title | Date Uploaded | Actions |
---|---|---|---|
Algorithmic Approaches to Classifying Biological Networks () | 2018-08-28 16:13:49 -0400 |
|
Supplemental Files
Thumbnail | Title | Date Uploaded | Actions |
---|