Efficiently Optimizing HPC Application Design Across a Heterogeneous Hardware Environment
Restricted; Files Only
Chen, Si (Fall 2024)
Abstract
Efficiently developing high-performance computing (HPC) applications is essential for optimizing performance and reducing economic costs. However, the inherent complexity of these applications, along with diverse heterogeneous hardware environments, poses significant challenges in application optimization. Heterogeneity complicates optimization due to differing memory architectures, processing capabilities, and communication patterns. Researchers often rely on proxy applications and simulations to estimate performance on various hardware platforms, but proxy applications often lack rigorous quantitative evaluation of their fidelity, and cycle-level accurate simulation remains inefficient, even with acceleration tools.
This dissertation addresses these challenges through three contributions. First, we develop a robust toolkit for characterizing and quantifying behavior similarities between HPC proxy applications and their corresponding parent applications. This ensures high fidelity in performance estimation and enhances the reliability of proxy applications in representing complex HPC applications. By identifying the most important features, we reduce data collection time by up to 95% while maintaining accuracy in representation.
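As an illustration of what quantifying behavioral similarity can look like, the sketch below compares feature vectors built from hardware performance counters. It assumes cosine similarity as the metric and uses made-up counter rates; the abstract names neither the metric nor the counters, so both are assumptions rather than the dissertation's method.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)

# Hypothetical per-instruction counter rates (illustrative values only),
# e.g. loads, branch misses, L2 misses, floating-point operations.
parent = [0.30, 0.05, 0.012, 0.40]
proxy = [0.28, 0.06, 0.010, 0.42]
unrelated = [0.05, 0.30, 0.200, 0.01]

sim_proxy = cosine_similarity(parent, proxy)
sim_unrelated = cosine_similarity(parent, unrelated)
```

Under this assumed metric, a proxy that tracks its parent closely scores near 1, while an application with a different counter profile scores much lower.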
Second, we improve a widely used simulation acceleration tool by integrating advanced clustering methods. This achieves a 5x speedup in simulation time while maintaining accuracy, enabling efficient exploration of design spaces for HPC applications across various hardware configurations. The enhanced simulation capabilities give researchers a faster and more reliable means of evaluating application performance.
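The general idea behind clustering-based sampled simulation can be sketched as follows: cluster execution intervals by their signature vectors, simulate only one representative interval per cluster, and weight each result by cluster size. This is a minimal sketch of the generic approach using plain k-means; the interval signatures, CPI values, and function names are hypothetical, and it does not reproduce the clustering methods developed in the dissertation.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, init_centroids, iterations=20):
    """Plain Lloyd's k-means with fixed initial centroids."""
    centroids = [list(c) for c in init_centroids]  # copy, do not mutate input
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def pick_representatives(points, centroids, labels):
    """Return (interval index, weight) for the member nearest each centroid."""
    reps = []
    for c, centroid in enumerate(centroids):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            rep = min(members, key=lambda i: sq_dist(points[i], centroid))
            reps.append((rep, len(members) / len(points)))
    return reps

# Hypothetical 2-D interval signatures (two program phases) and per-interval CPI.
points = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 1.1]]
cpi = [1.0, 1.1, 0.9, 2.0, 2.1, 1.9]

centroids, labels = kmeans(points, [points[0], points[-1]])
reps = pick_representatives(points, centroids, labels)
# Only the representative intervals would need detailed simulation.
estimated_cpi = sum(weight * cpi[i] for i, weight in reps)
```

In this toy case two of the six intervals stand in for the whole run, which is where the simulation-time savings of sampled simulation come from.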
Third, we introduce a generalized model that combines meta-learning with architecture simulation to predict runtime across various applications and hardware systems. This approach facilitates rapid performance assessments and informed decision-making in HPC application design, achieving a 127x speedup in training time for additional tasks compared to traditional machine learning methods. The model's predictions can be practically applied to inform resource allocation and guide design choices in actual HPC workflows.
These three components address representation accuracy, simulation efficiency, and performance prediction. Together, they form a comprehensive framework for optimizing the HPC application design process, making it faster, more cost-effective, and more adaptable to heterogeneous computing environments. The framework is designed to evolve alongside advances in hardware, supporting new architectures and adapting to shifts in HPC workloads so that it remains relevant in future HPC ecosystems.
Table of Contents
1 Introduction
1.1 Motivation
1.2 Main Research Question
1.3 Research Contributions
1.3.1 Quantify the Fidelity of Proxy Applications
1.3.2 Accelerate Application Simulation
1.3.3 Generalize HPC Application Runtime Prediction
2 Background
2.1 HPC Application Characterization
2.1.1 Building Proxy Application
2.1.2 Proxy Application Characterization
2.2 Accelerated Simulation
2.2.1 Simulator and Sampling Method
2.2.2 SimPoint and Its Extensions
2.2.3 Recent Advancements in SimPoint
2.3 HPC Application Runtime Prediction
2.3.1 Application Specific Performance Evaluation
2.3.2 The Role of Machine Learning
2.3.3 Cross-platform Performance Prediction
2.3.4 Meta-learning
2.4 Conclusion
3 Beyond Guess and Check: Quantifying the Fidelity of Proxy Applications
3.1 Overview
3.2 Methods
3.2.1 Hardware Performance Counters
3.2.2 Feature Selection
3.2.3 Similarity and Distance
3.3 Experiment
3.3.1 Application Suite
3.3.2 System Platform
3.3.3 Data Collection and Preprocessing
3.4 Results
3.4.1 Similarity Matrix Comparison
3.4.2 Root Cause Analysis
3.4.3 Feature Selection and Feature sensitivity
3.4.4 Feature Standard Deviation
3.4.5 Subgroup Features
3.4.6 Evaluation on Network Counters
3.5 Discussion
4 SimPoint++: Advanced Sampled HPC Application Simulation
4.1 Overview
4.2 Background
4.2.1 Original SimPoint Workflow
4.2.2 Random Projection
4.2.3 K-means
4.2.4 Why do we need to replace BIC in SimPoint?
4.2.5 The Process of How SimPoint Finds the Optimal K
4.3 Method
4.3.1 Dimension Reduction
4.3.2 Optimized K-means Clustering
4.3.3 Spectral Clustering
4.4 Experiment
4.5 Results
4.5.1 Finding the Best K
4.5.2 Speedup and Accuracy
4.5.3 Comparison with Spectral Clustering
4.6 Discussion
5 METACAST: Generalizing HPC Application Runtime Prediction
5.1 Overview
5.2 Methods
5.2.1 Multi Task Data Collection
5.2.2 Meta-Model Training
5.2.3 Target Task Data Collection
5.2.4 Target Task Model Training
5.3 Experiment
5.3.1 Simulation Platform
5.3.2 Application Workload
5.3.3 Model Training
5.3.4 Hyperparameter Optimization
5.4 Results
5.4.1 Meta-model Accuracy for Benchmarks
5.4.2 Cross-Architecture Generalizability
5.4.3 Meta-model Accuracy for Real Applications
5.4.4 Time Efficiency in MetaCast versus Traditional Methods
5.5 Discussion
5.5.1 Accuracy Considerations
5.5.2 Future Directions
5.5.3 Conclusion
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future work
A Appendix
A.1 Real application information
A.2 Real system configuration per node
Bibliography
About this Dissertation
School | |
---|---|
Department | |
Degree | |
Submission | |
Language | |
Research Field | |
Keyword | |
Committee Chair / Thesis Advisor | |
Committee Members | |

Primary PDF
Thumbnail | Title | Date Uploaded | Actions |
---|---|---|---|
| | File download under embargo until 09 July 2025 | 2024-12-07 16:04:30 -0500 | File download under embargo until 09 July 2025 |