Synthetic Generation of Datasets With Complex Attributes Open Access
Jiang, Yunfan (Spring 2022)
Abstract
Synthetic data generation has proven useful in many domains where real data is unavailable, for example because of logistical, technical, or policy constraints such as privacy or security. However, little work has studied synthetic data generation for complex datasets that contain temporal features or categorical features with a large number of labels. This thesis focuses on identifying a robust generative method that produces accurate results for such datasets. We studied a university student dataset with the DataSynthesizer tool. We developed and tuned data preprocessing steps, data encodings, and model configurations to improve the quality of the synthesized results. Using both standard visualizations and student-specific statistics, we show that our approach can achieve feasible synthetic results that are as good as those from less complex datasets.
Table of Contents
1 Introduction
2 Background & Related Work
2.1 Input & Synthetic Datasets
2.2 Synthetic Data Generation Methods
2.2.1 Some Synthetic Data Generation Methods
2.2.2 Bayesian Network
2.3 Evaluation
2.4 Comparison to Our Work
3 Methodology
3.1 The DataSynthesizer Tool
3.2 Evaluation Method
4 Experiment & Results
4.1 Dataset
4.2 Experimental methodology
4.2.1 Preprocessing
4.2.2 Using DataSynthesizer
4.3 DataSynthesizer with Default Parameters
4.3.1 Experimental Setup
4.3.2 Experimental Results
4.3.3 Result Analysis
4.3.4 Summary
4.4 Incorporating a Split-encode Algorithm
4.4.1 Split-encode Method & Testing Methodology
4.4.2 Experimental Setup & Results
4.4.3 Result Analysis
4.4.4 Summary
4.5 DataSynthesizer with ID-encoded Data
4.5.1 Experimental Methodology & Setup
4.5.2 Results & Analysis
4.6 Evaluation statistics
5 Conclusion & Future Work
Appendix A More Experimental Details
Bibliography
About this Honors Thesis
Primary PDF
Title | Date Uploaded |
---|---|
Synthetic Generation of Datasets With Complex Attributes | 2022-04-12 17:11:19 -0400 |