Leveraging Distributed Tracing for Root Cause Localization in Microservice-Architectured Distributed Systems
Open Access
Bapat, Nikhil (Spring 2023)
Abstract
Developers and administrators of distributed systems lack a tracing and diagnostic framework that lets them quickly identify and address the causes of performance regressions. When diagnosing an issue in a distributed system, they can observe poorly performing requests and collect distributed trace data from the system's services, but they still struggle to parse the traces and localize the source of the problem. In this paper, we present a methodology that, given a set of traces, identifies and ranks the system components most likely to have caused a performance regression. This approach reduces the time system administrators and developers spend identifying root causes of poor performance, and it can pinpoint issues that might otherwise not be adequately diagnosed.
Table of Contents
1 Introduction
2 Motivation
3 Design
3.1 Problem Overview
3.2 Localization Metric
3.3 Example: Applying Localization Method to Mock System
3.4 Statistical Fault Localization Notation
3.5 Other Root Causes
4 Implementation
4.1 Basic Implementation
4.2 Sampling
4.3 Simulating Retroactive Sampling
5 Experiments
5.1 Experimentation Process
5.2 Latency Injection Process
5.3 Evaluation Metrics
5.4 Sampling
6 Results
7 Related Work
7.1 Distributed Traces
7.2 Retroactive Sampling
7.3 DeathStarBench
7.4 Fault Localization
7.5 Benchmarks
7.6 Distributed System Outage Detection
7.7 Bayes Factor
8 Conclusion and Future Work
Bibliography
About this Honors Thesis
Primary PDF
Title | Date Uploaded |
---|---|
Leveraging Distributed Tracing for Root Cause Localization in Microservice-Architectured Distributed Systems | 2023-04-10 12:23:24 -0400 |