Leveraging Distributed Tracing for Root Cause Localization in Microservice-Architectured Distributed Systems

Open Access

Bapat, Nikhil (Spring 2023)

Permanent URL: https://etd.library.emory.edu/concern/etds/xk81jm490?locale=en
Published

Abstract

Developers and administrators of distributed systems lack a tracing and diagnostic framework to quickly identify and address causes of performance regression. When attempting to diagnose an issue in a distributed system, developers and administrators can observe some poorly performing requests and collect distributed trace data from the services in the system, but they still struggle to parse the traces and localize the source of the problem. In this paper, we present a methodology that, given a set of traces, identifies and ranks system components that are most likely to be the cause of performance regression. This approach saves system administrators and developers time identifying root causes of poor performance and can more precisely pinpoint issues that may otherwise fail to be adequately diagnosed. 

Table of Contents

1 Introduction

2 Motivation

3 Design

3.1 Problem Overview

3.2 Localization Metric

3.3 Example: Applying Localization Method to Mock System

3.4 Statistical Fault Localization Notation

3.5 Other Root Causes

4 Implementation

4.1 Basic Implementation

4.2 Sampling

4.3 Simulating Retroactive Sampling

5 Experiments

5.1 Experimentation Process

5.2 Latency Injection Process

5.3 Evaluation Metrics

5.4 Sampling

6 Results

7 Related Work

7.1 Distributed Traces

7.2 Retroactive Sampling

7.3 DeathStarBench

7.4 Fault Localization

7.5 Benchmarks

7.6 Distributed System Outage Detection

7.7 Bayes Factor

8 Conclusion and Future Work

Bibliography

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English