Measurement and Analysis Methods of Performance Problems in Distributed Systems
Zhang, Lei (Fall 2021)
Abstract
Today's distributed systems invest significant computational and storage resources to accommodate large volumes of data, but more resources do not automatically improve performance. To deliver high performance, new large-scale solutions, such as the cloud computing and microservices paradigms, deploy loosely coupled components that perform well individually but, in the process, make it harder to maintain a global view of system performance. As system architectures grow more complex, diagnosing and understanding performance problems has become both critically important and highly challenging.
The aim of this thesis is to fill in some missing but significant pieces in monitoring and analyzing performance problems in distributed systems by asking: what are the performance bottlenecks of distributed systems, and how should we address them? First, my thesis proposes a novel retroactive tracing abstraction, built on an always-on distributed tracing system, in which full telemetry information about a distributed request can be retrieved "back in time" soon after a problem is detected, without unduly burdening any node in the system. Second, my thesis frames the challenges of data placement in modern memory hierarchies as a generalized paging model that goes beyond traditional assumptions, and provides an offline data placement algorithm that works toward optimal placement decisions. Last, my thesis derives a rule-of-thumb expression for cache warmup time, specifically how long caches in storage systems and CDNs need to be warmed up before their performance is deemed stable.
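As an illustration only, the retroactive idea described above can be sketched in a few lines of Python. This is not the thesis's implementation: the class `RetroactiveBuffer`, the function `trigger`, the `max_traces` limit, and the node and request names are all hypothetical, and the oldest-first eviction policy is an assumption chosen for brevity. The sketch shows each node cheaply buffering recent trace events in memory, with the events for one request being gathered only after a problem is detected.

```python
from collections import OrderedDict


class RetroactiveBuffer:
    """Hypothetical per-node buffer that keeps recent trace events in memory."""

    def __init__(self, max_traces):
        self.max_traces = max_traces
        # Maps trace_id -> list of events; insertion order runs oldest to newest.
        self._traces = OrderedDict()

    def record(self, trace_id, event):
        """Always-on path: store the event locally, evicting the oldest trace if full."""
        if trace_id not in self._traces:
            if len(self._traces) >= self.max_traces:
                self._traces.popitem(last=False)  # assumed policy: drop oldest trace
            self._traces[trace_id] = []
        self._traces[trace_id].append(event)

    def collect(self, trace_id):
        """Trigger path: hand back whatever is still buffered for this trace."""
        return self._traces.pop(trace_id, [])


def trigger(nodes, trace_id):
    """Gather one request's events from every node after a problem is detected."""
    return {name: buf.collect(trace_id) for name, buf in nodes.items()}


# Two hypothetical nodes, each able to hold two traces at a time.
nodes = {
    "frontend": RetroactiveBuffer(max_traces=2),
    "backend": RetroactiveBuffer(max_traces=2),
}
for trace_id in ("req-1", "req-2", "req-3"):
    nodes["frontend"].record(trace_id, "recv")
    nodes["backend"].record(trace_id, "query")
nodes["backend"].record("req-3", "slow-response")

report = trigger(nodes, "req-3")   # still buffered on both nodes
evicted = trigger(nodes, "req-1")  # already evicted, so nothing comes back
```

The second call returns empty lists because `req-1` was pushed out of both buffers, which illustrates why the abstract stresses retrieving data "soon after" a problem is detected: in a bounded buffer, the window for looking back is finite.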
Table of Contents
1 Introduction
1.1 Performance is Key
1.1.1 Challenges
1.1.2 Opportunities
1.2 Contributions
1.3 Thesis Overview
2 Background
2.1 Modern Distributed Systems
2.2 Performance Symptoms
2.2.1 Control Plane
2.2.2 Data Plane
2.3 Quantifying Performance
2.3.1 Measurement Methods
2.3.2 Analysis Methods
3 Tracing Edge Cases With Hindsight
3.1 Motivation
3.1.1 Trace Collection Infrastructure
3.1.2 Sampling Derails Edge-Case Analysis
3.2 Challenge
3.3 Retroactive Sampling
3.4 Design
3.4.1 Overview
3.4.2 API Compatibility
3.4.3 Data Coherence
3.4.4 Efficient Data Management
3.4.5 Divorcing Triggers from Traces
3.5 Implementation
3.5.1 Agent Data Management
3.5.2 Client Library
3.5.3 Agent
3.6 Results
3.6.1 Case Studies
3.6.2 Hindsight Tracing Performance
3.6.3 Retroactive Sampling
3.6.4 Comparison with the State-of-the-Art
3.7 Discussion
3.8 Related Work
3.9 Takeaway
4 Optimal Data Placement on Memory Hierarchy
4.1 Modern Memory Hierarchies
4.2 CHOPT: An Optimal Data Placement Algorithm
4.2.1 Generalized Model and Objective
4.2.2 Why Investigate Offline Performance
4.2.3 Algorithm Design
4.2.4 The Anatomy of CHOPT
4.2.5 Extending CHOPT to Multiple Layers
4.3 Analyzing Long Traces by Sampling
4.4 Results
4.4.1 Traces and Workloads
4.4.2 Experimental Setup
4.4.3 CHOPT Simulation Results
4.4.4 Spatial Sampling Accuracy Results
4.5 Discussion
4.6 Related Work
4.7 Takeaway
5 Estimation of Cache Warmup Time
5.1 Dynamic Cache Behavior
5.2 Understand Cache Warmup Process
5.3 Toward Rule-of-Thumb Cache Warmup Time Estimation
5.4 Results
5.5 Discussion
5.6 Related Work
5.7 Takeaway
6 Conclusion
6.1 Future Directions
Bibliography
Primary PDF
Title | Date Uploaded |
---|---|
Measurement and Analysis Methods of Performance Problems in Distributed Systems | 2021-11-19 12:39:08 -0500 |