Synthetic Generation of Datasets With Complex Attributes

Jiang, Yunfan (Spring 2022)

Permanent URL: https://etd.library.emory.edu/concern/etds/z603qz72f?locale=pt-BR
Published

Abstract

Synthetic data generation has proven useful in many domains where real data is unavailable, for example because of logistical, technical, or policy issues such as privacy or security. However, there has been little study of synthetic data generation for complex datasets that contain temporal features or categorical features with a large number of labels. This thesis focuses on identifying a robust generative method that produces accurate results for such datasets. We studied a university student dataset with the DataSynthesizer tool. We developed and tuned data preprocessing steps, data encodings, and model configurations to improve the quality of the synthesized results. Using both standard visualizations and student-specific statistics, we show that our approach can achieve synthetic results that are as good as those obtained from less complex datasets.

Table of Contents

1 Introduction

2 Background & Related Work
2.1 Input & Synthetic Datasets
2.2 Synthetic Data Generation Methods
2.2.1 Some Synthetic Data Generation Methods
2.2.2 Bayesian Network
2.3 Evaluation
2.4 Comparison to Our Work

3 Methodology
3.1 The DataSynthesizer Tool
3.2 Evaluation Method

4 Experiment & Results
4.1 Dataset
4.2 Experimental Methodology
4.2.1 Preprocessing
4.2.2 Using DataSynthesizer
4.3 DataSynthesizer with Default Parameters
4.3.1 Experimental Setup
4.3.2 Experimental Results
4.3.3 Result Analysis
4.3.4 Summary
4.4 Incorporating a Split-encode Algorithm
4.4.1 Split-encode Method & Testing Methodology
4.4.2 Experimental Setup & Results
4.4.3 Result Analysis
4.4.4 Summary
4.5 DataSynthesizer with ID-encoded Data
4.5.1 Experimental Methodology & Setup
4.5.2 Results & Analysis
4.6 Evaluation Statistics

5 Conclusion & Future Work

Appendix A More Experimental Details

Bibliography

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research Field
Keyword
Committee Chair / Thesis Advisor
Committee Members
Last modified
