Algorithmic Approaches to Classifying Biological Networks Open Access

Bray, Margaret Justice (2015)

Permanent URL: https://etd.library.emory.edu/concern/etds/2z10wr05h?locale=en
Published

Abstract

As technology has become more advanced, the ease with which data can be collected has improved. This has left researchers with copious amounts of information, so much information that previous analytical techniques fall short. This has led to an increase in the popularity of representing data with networks. However, this data often have errors. This makes any conclusion gleamed from the analysis of a network unreliable. One type of network for which the inaccuracies are a particular issue is the protein-protein interaction (PPI) network. Researchers would like to use these networks to detect and diagnosis diseases by identifying specific interactions. Unfortunately, the errors in the networks make this impossible. One way to fix this is classify the empirical network into a category of model graph. By doing so, we will be able to mathematically predict which interactions are legitimate, and which are not. In this dissertation, we begin by testing the classification accuracy of five algorithms: degree distribution distance (DDD), characteristic curve (CC), relative graphlet frequency (RGF), graphlet degree distribution using arithmetic mean (GDD (A)) and using geometric mean (GDD (G)). Overall accuracies were poor, ranging from 68% for the GDD (A) down to 47% for the DDD. With accuracies this low, it is difficult to trust the classification results for an empirical network of unknown origin. Therefore, we propose two solutions. First, we provide several modifications to both versions of the GDD. The reformulated GDD is more accurate, classifying 76% of known graphs correctly, while also performing the analysis with increased speed. Second, we present a new classification algorithm: cross scoring. This novel method works by comparing networks based on a pre-selected group of network measures. Each type of model graph is ranked by how close its measure value falls to the empirical value compared to the other model types considered. Points are awarded and the model type with the fewest points at the end of the comparisons is considered the best fit. Accuracy across twelve trials was 82.9% with a standard deviation of 0.98%. These results are an obvious improvement over the five original algorithms considered.

Table of Contents

1 Introduction to Graphs and Graph Theory 1

1.1 Review of Essential Graph Theory Elements 3

1.2 Network Measures 5

1.2.1 Small-World and Scale-Free 9

1.2.2 Centrality 13

1.3 Discussion 15

2 Introduction to the Structure of Biomolecular Networks 17

2.1 Protein-Protein Interactions 17

2.2 Sacchraomyces cerevisiae Protein-Protein Interaction Network 18

2.3 Motivation for Network Classification 21

3 Introduction to Model Graphs 23

3.1 Model Graph Descriptions 23

3.1.1 RandomStatic 25

3.1.2 Small-World 25

3.1.3 3D-Geometric 26

3.1.4 Linear Preferential Attachment 26

3.1.5 Random Growing 27

3.1.6 Aging Vertex 28

3.1.7 Duplication-Mutation-Complementation and Duplication-Mutation using Random Mutation 28

3.1.8 Stickiness Model 29

3.2 Methods 30

3.3 Results 30

3.3.1 Numbers of Nodes and Edges 30

3.3.2 Density 34

3.3.3 Proportion of Nodes in the Giant Component 36

3.3.4 Diameter, Radius, and Average Shortest Path Length 38

3.3.5 Average Degree and Assortativity 42

3.3.6 S-metric 45

3.3.7 Clustering 46

3.3.8 Centralities: Betweenness, Closeness, Degree, and Eigenvector 49

3.4 Discussion 53

4 Measure Based Comparison of Model Graphs v Saccharomyces cerevisiae PPI Network 57

4.1 Methods 57

4.2 Results 58

4.2.1 Size Measures 58

4.2.2 Distance Measures 63

4.2.3 Centrality Measures 67

4.2.4 Connection Measures 69

4.2.5 Biologically Significant Measures 72

4.2.6 Summary of Measure Based Comparison Broken Down by Category 74

4.3 Discussion 75

5 Introduction to Network Classification Methods 78

5.1 Network Classification Methods 79

5.1.1 Relative Graphlet Frequency and Graphlet Degree Distribution 79

5.1.2 Characteristic Curve 82

5.1.3 Degree Distribution Distance 85

5.2 Limitations of Previous Work 86

6 Random Graph Classification 87

6.1 Methods 87

6.2 Results 90

6.3 Discussion 94

7 Model Graph Classification 96

7.1 Methods 96

7.2 Results 99

7.2.1 Filtering Out Large Graphs 99

7.2.2 Degree Distribution Distance 100

7.2.3 Characteristic Curve 105

7.2.4 Relative Graphlet Frequency 109

7.2.5 Graphlet Degree Distribution 112

7.2.6 Comparison of Classification Accuracy Broken Down by Model Type and Method 117

7.2.7 Patterns in Statistical Performance 119

7.2.8 Treatment of DMC and DMR Model Graphs 121

7.3 Discussion 123

7.3.1 Strengths and Limitations 126

7.3.2 Next Steps 129

8 Saccharomyces cerevisiae PPI Network Classification 130

8.1 Methods 130

8.2 Results 132

8.2.1 Degree Distribution Distance 132

8.2.2 Characteristic Curve 134

8.2.3 Relative Graphlet Frequency 135

8.2.4 Graphlet Degree Distribution 137

8.2.5 Kendall's W Comparison of Ranking Lists 139

8.3 Discussion 140

9 Relative Graphlet Frequency

9.1 Formula Error 142

9.2 Methods 143

9.3 Results 144

9.3.1 Random Graph Classification 144

9.3.2 Model Graph Classification 144

9.3.3 Saccharomyces cerevisiae PPI Network Classification 148

9.4 Conclusions 151

10 Reformulations of the Graphlet Degree Distribution 153

10.1 Graphlet Degree Distribution Issues 154

10.1.1 Geometric Mean 155

10.1.2 Contradictory Outcomes 158

10.1.3 Scaling and Normalization 164

10.2 Methods 165

10.2.1 Version 1 166

10.2.2 Version 2 166

10.2.3 Version 3 167

10.2.4 Analysis of Performance 168

10.3 Results 168

10.3.1 Model Graph Classification 168

10.3.2 Comparison of the Original Graphlet Degree Distribution to the Reformulated Versions 176

10.3.3 Saccharomyces cerevisiae PPI Network Classification 179

10.4 Conclusions 181

11 Designing the Cross Scoring Algorithm 182

11.1 Methods 182

11.1.1 Measures of Center and Spread 183

11.1.2 Nonlinear Scoring 183

11.1.3 Zeroing 184

11.1.4 Approximations 185

11.1.5 Tie Breaking 185

11.2 Data 186

11.3 Results 187

11.3.1 Mean Results 188

11.3.2 Median Results 189

11.3.3 Comparison of Mean and Median Results 192

11.4 Discussion 193

12 Determining the Cross Scoring Measure List 195

12.1 Methods 195

12.1.1 Macro- v Micro-Scoring 196

12.1.2 Importance-Scoring 196

12.2 Data 196

12.3 Results 198

12.3.1 Macro-lists 199

12.3.2 Micro-list 199

12.3.3 Importance-Scoring 201

12.4 Discussion 201

13 Applying the Cross Scoring Algorithm 203

13.1 Methods 203

13.2 Data 204

13.3 Results 204

13.3.1 Measure Selection 204

13.3.2 Macro-Lists 206

13.3.3 Macro-Scoring Performance 208

13.3.4 Micro-Scoring Performance 213

13.3.5 Macro- v Micro-Scoring Results 218

13.3.6 Classification of the S. cerevisiae PPI Network 220

13.3.7 Importance-Scoring 223

13.3.8 Robustness 225

13.3.9 Comparison of All Classifiers 225

13.4 Discussion 227

13.4.1 Biological Implications and their Effect on the Cross Scoring Design 229

13.4.2 Strengths and Limitations 230

14 Summary 232

14.1 Overview 232

14.1.1 Which Model Type is the Best Fit for the Saccharomyces cerevisiae PPI Network? 235

14.1.2 Do PPI Networks Exhibit Scale-Free Properties? 236

14.2 Future Work 237

14.2.1 Extension of Analyses 237

14.2.2 Redesign of DMC and DMR Growth Mechanisms 238

14.2.3 Does the Growth Mechanism Define the Model Graph Type? 239

A 240

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research Field
Keyword
Committee Chair / Thesis Advisor
Committee Members
Last modified

Primary PDF

Supplemental Files