The Applications of Alternating Minimization Algorithms on Deep Learning Models Public
Wang, Junxiang (Fall 2022)
Abstract
Gradient Descent(GD) and its variants are the most popular optimizers for training deep learning models. However, they suffer from many challenges such as gradient vanishing and backward locking, which prevent their more widespread use. To address these intrinsic drawbacks, Alternating Minimization(AM) methods have attracted attention from researchers as a potential way to train deep learning models. Their idea is to decompose a neural network into a series of linear and nonlinear equality constraints, which generate multiple subproblems and they can be minimized alternately. Their empirical evaluations demonstrate good scalability and high accuracy. They also avoid gradient vanishing problems and allow for non-differentiable activation functions, as well as allowing for complex non-smooth regularization and the constraints that are increasingly important for neural network architectures.
This dissertation discusses the applications of AM methods on deep learning models, which can be categorized into three research topics: 1). AM methods on Multi-Layer Perceptron(MLP), which includes deep learning Alternating Direction Method of Multipliers(dlADMM), and monotonous Deep Learning Alternating Minimization(mDLAM). 2). AM methods on Graph Neural Network(GNN), which includes the Invertible Validity-aware Graph Diffusion(IVGD). 3). AM methods for distributed neural network training, which includes parallel deep learning Alternating Direction Method of Multipliers(pdADMM), pdADMM-G, and pdADMM-G-Q.
For the dlADMM algorithm, parameters in each layer are updated in a backward and forward fashion. The time complexity is reduced from cubic to quadratic in(latent) feature dimensions for subproblems by iterative quadratic approximations and backtracking. Finally, we provide the convergence guarantee of the dlADMM algorithm under mild conditions with a sublinear convergence rate $o(1/k)$.
For the mDLAM algorithm, our innovative inequality-constrained formulation infinitely approximates the original problem with non-convex equality constraints, enabling our convergence proof of the proposed mDLAM algorithm regardless of the choice of hyperparameters. Our mDLAM algorithm is shown to achieve a fast linear convergence by the Nesterov acceleration technique.
For the IVGD model, we unroll AM methods into GNN models for graph source localization. Specifically, first, to inversely infer sources of graph diffusion, we propose a graph residual scenario to make existing graph diffusion models invertible with theoretical guarantees; second, we develop a novel error compensation mechanism that learns to offset the errors of the inferred sources. Finally, to ensure the validity of the inferred sources, we unroll AM methods into the proposed validity-aware layers to project inferred sources to feasible regions by encoding validity constraints. A linearization technique is proposed to strengthen the efficiency of our proposed layers. The convergence of the proposed IVGD is proven theoretically.
For the pdADMM algorithm, we achieve model parallelism by breaking layer dependency: parameters in each layer of neural networks can be updated independently in parallel. The convergence of the proposed pdADMM to a stationary point is theoretically proven under mild conditions with a sublinear convergence rate $o(1/k)$.
For the pdADMM-G algorithm and the pdADMM-G-Q algorithm, in order to achieve model parallelism, we extend the proposed pdADMM algorithm to train the GA-MLP model, named the pdADMM-G algorithm. The extended pdADMM-G-Q algorithm reduces communication costs by introducing the quantization technique. Theoretical convergence to a (quantized) stationary point of two proposed algorithms is provided with a sublinear convergence rate $o(1/k)$.
Table of Contents
1 Introduction 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 The Organization of the dissertation . . . . . . . . . . . . . . . . . . 11
2 The dlADMM Algorithm 12
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 dlADMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Problem Transformation . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 the Proposed dlADMM Algorithm . . . . . . . . . . . . . . . 17
2.3.3 The Quadratic Approximation and Backtracking . . . . . . . . 20
2.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.2 Key Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 31
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3 The mDLAM Algorithm 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Model and Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Inequality Approximation for Deep Learning . . . . . . . . . . 41
3.3.2 Alternating Optimization . . . . . . . . . . . . . . . . . . . . 42
3.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.1 Convergence Properties . . . . . . . . . . . . . . . . . . . . . . 47
3.4.2 Convergence of the Proposed mDLAM Algorithm . . . . . . . 48
3.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.1 Datasets and Parameter Settings . . . . . . . . . . . . . . . . 50
3.5.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.4 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . 55
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4 The pdADMM Algorithm 59
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 pdADMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 Problem Reformulation . . . . . . . . . . . . . . . . . . . . . . 63
4.3.2 Solutions to All Subproblems . . . . . . . . . . . . . . . . . . 65
4.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5.2 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5.3 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5.4 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5 The pdADMM-G and pdADMM-G-Q Algorithms 81
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3 The pdADMM-G Algorithm . . . . . . . . . . . . . . . . . . . . . . . 84
5.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.2 The pdADMM-G Algorithm . . . . . . . . . . . . . . . . . . . 86
5.3.3 Quantization Extension of pdADMM-G(pdADMM-G-Q) . . . 89
5.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.5.1 Datasets and Settings . . . . . . . . . . . . . . . . . . . . . . 97
5.5.2 Comparison Methods . . . . . . . . . . . . . . . . . . . . . . . 99
5.5.3 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.5.4 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.5.5 Communication Overheads . . . . . . . . . . . . . . . . . . . . 104
5.5.6 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6 Conclusion and Future Works 109
6.1 Research Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.1.1 The dlADMM Algorithm . . . . . . . . . . . . . . . . . . . . . 111
6.1.2 The mDLAM Algorithm . . . . . . . . . . . . . . . . . . . . . 113
6.1.3 The pdADMM Algorithm . . . . . . . . . . . . . . . . . . . . 114
6.1.4 The pdADMM-G and pdADMM-G-Q Algorithms . . . . . . . 115
6.1.5 The IVGD Model . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.3 Current Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.3.1 Contributions of Published Papers Contributing to dissertation 119
6.3.2 Published Papers During My Ph.D. . . . . . . . . . . . . . . . 120
6.3.3 Papers Published Before Ph.D. . . . . . . . . . . . . . . . . . 123
6.3.4 Submitted and In-preparation Papers . . . . . . . . . . . . . . 124
6.4 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 125
6.4.1 Parallel Training of Graph Neural Networks on Large-Scale
Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.4.2 Stochastic AM Algorithms for Large-Scale Datasets . . . . . . 125
Appendix A Appendix of the dlADMM Algorithm 126
A.1 Algorithms to Update Wk+1
l and ak+1
l . . . . . . . . . . . . . . . . . . 126
A.2 Preliminary Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
A.3 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
A.4 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Appendix B Appendix of the mDLAM Algorithm 142
B.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
B.2 Preliminary Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
B.3 Main Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Appendix C Appendix of the pdADMM Algorithm 158
C.1 Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
C.2 Main Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Appendix D Appendix of the pdADMM-G Algorithm 170
D.1 Convergence Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
D.1.1 Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . . 170
D.1.2 Proof of Lemma 8 . . . . . . . . . . . . . . . . . . . . . . . . . 175
D.1.3 Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . . 175
D.1.4 Proof of Theorem 9 . . . . . . . . . . . . . . . . . . . . . . . . 176
D.1.5 The proof of Theorem 10 . . . . . . . . . . . . . . . . . . . . . 181
D.1.6 The proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . 181
D.1.7 The proof of Theorem 12 . . . . . . . . . . . . . . . . . . . . . 182
D.2 More Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 184
D.2.1 The Settings of All Hyperparameters . . . . . . . . . . . . . . 184
D.2.2 The Performance of Validation Sets . . . . . . . . . . . . . . . 184
Bibliography 186
About this Dissertation
| School | |
|---|---|
| Department | |
| Degree | |
| Submission | |
| Language | 
 | 
| Research Field | |
| Mot-clé | |
| Committee Chair / Thesis Advisor | |
| Committee Members | 
Primary PDF
| Thumbnail | Title | Date Uploaded | Actions | 
|---|---|---|---|
|  | The Applications of Alternating Minimization Algorithms on Deep Learning Models () | 2022-11-28 18:29:01 -0500 |  | 
Supplemental Files
| Thumbnail | Title | Date Uploaded | Actions | 
|---|